1 Supporting Evacuation Missions with Ontology-Based SPARQL Federation Audun Stolpe, Jonas Halvorsen and Bjørn Jervell Hansen Norwegian Defence Research Establishment (FFI) P O Box 25 2027 Kjeller, Norway Email: {audun.stolpe | jonas.halvorsen | bjorn-jervell.hansen}@ffi.no Abstract—We study ontology-based SPARQL federation in sup- The main contribution of this paper consists in defining a port of coordinated action by deployed units in military oper- novel federation strategy specifically designed for domains ations. It is presumed that bandwidth is limited and unstable. that share the general characteristics of this case. The most Thus, we need an approach that generates few HTTP requests. important characteristics are firstly that bandwidth is limited Existing techniques employ join-order heuristics that may cause so the total communication costs induced by the number of requests to multiply as a factor of the number of joins in a query. This can easily lead to an amount of traffic that exceeds HTTP request is a non-negligible factor, and secondly that the network capacity. We propose an approach that builds an in- network topology is dynamic, i.e. sources may come and go. memory excerpt of the remote sources, sending one request to Our aim is thus to define a federation strategy that is sound and each source. A query is answered against this excerpt, which is a complete with respect to query answering, issues a minimal provably sound and complete representation of the sources wrt. number of HTTP requests, and is compatible with run-time query answering. The paper ends with a case study involving detection of sources. three military sources used for planning evacuation missions. The paper is organized as follow: In Section II we identify a list of tentative desiderata that our federation strategy should satisfy. The desiderata points to using an ontology-based data I. I NTRODUCTION access paradigm, which is explained in Section III. Section IV outlines our solution, which is based on querying against The planning of evacuation missions is a complex and impor- an excerpt, or cropping, of the remote sources relative to an tant process in military operations. One of the most challenging incoming query. The case study is presented in Section VI, and aspects, is making all necessary information available to the the main experiences drawn from the case study is presented decision makers. These information fragments will typically in Section VII. be distributed across different systems. The paper assumes familiarity with W3C’s Semantic Web This is particularly the case when the military force con- technology stack, in particular RDF, OWL, and SPARQL. ducts its operations according to network-based concepts, like Readers not familiar with these technologies are referred to NATO’s Network Enabled Capability, henceforth NNEC [1]. [2], [3], and [4] for an introduction. The primary objective when conducting operations according to this concept, is to support the creation of a high degree of shared situational awareness among decision makers in order II. C HARACTERISTICS OF THE DOMAIN to obtain increased mission effectiveness. A prerequisite for achieving this, is extensive information sharing and a robust The NNEC concept presupposes a network-based environment scheme for information integration, enabling decision makers in which information about own and enemy units is typically to retrieve and utilize all relevant information when needed. distributed across several autonomous data sources contributed So far, the emphasis of the technical work on NNEC has by coalition members. In order to support evacuation planning, been on how to make information available throughout the these information fragments need to be integrated in order for environment. However, in order for NNEC to be of use the decision makers to obtain the highest possible degree of sit- to decision makers, the challenge of establishing a robust uational awareness. This involves tackling some idiosyncratic scheme for information integration ultimately also needs to challenges: be addressed. This is the focus of the research reported on in this paper. • The information systems are in general semantically het- erogeneous, especially in coalition operations, and cannot We present an information integration approach that combines be accessed in a coherent and unified way, query rewriting with data federation, and we study it in relation • the underlying communication network often relies on IP to an example from military evacuation planning based on live radios, and is hampered by limited bandwith, latency, and reporting of incidents over IP radio networks. limited range, STIDS 2013 Proceedings Page 141 2 • the network topology is highly dynamic, meaning that To that end, sources are selected at query time based on the information systems can appear and disappear at any time, outcome of the reasoning process, by inspecting DNS records and that are multicasted in the network (cf. section VI.). The • the shared information is mission-critical, which makes entire federation process can thus be seen as comprised of it crucial that the integration scheme yields correct and two distinct steps. First, the query is rewritten into a query exhaustive data. expressed directly in terms of the data according to the domain model expressed by the ontology. Next, the rewritten query These characteristics means that we want to define an infor- is decomposed into sub-queries that yield a set of mutually mation integration approach that: exhaustive partial answers extracted from each of the selected sources. A allows a user to access available sources in a unified way, B utilizes the available bandwith efficiently, particularly by It is not automatically the case, however, that the above men- restricting the number of HTTP requests to the remote tioned steps are separable. That is, depending on the expressive sources, power of the ontology language, reasoning may require data- C allows the relevant sources to be discovered at run-time, access. This is not the case for the class of ontology languages and that are first-order rewritable (cf. [5], [6]). This notion was first D guarantees the soundness and completeness of query introduced by Calvanese et al. ([5]) in the context of the class answering. of ontology languages called description logics. A description logic L—more generally an ontology language—is first-order III. O NTOLOGY- BASED DATA ACCESS rewritable if, for every ontology ⌃ expressed in L and a query Q , Q can be compiled into a first-order query Q⌃ that A) Based on the desiderata from section II, we decided to use an compiles away all concepts from the ontology ⌃, and B) is ontology to mitigate heterogeneities and to provide uniform such that given a data repository R, Q⌃ evaluated over R access to the data. That is, we based our approach on the yields exactly the same result as Q evaluated against R and paradigm usually called ontology-based data access in which ⌃. a conceptual model—the ontology—is used to express the First-order rewritable ontology languages is a crucial presuppo- relationship between the content of the respective sources, and sition behind our approach. It is important precisely because it to act as a single query interface towards them. ensures that the reasoning process can be decoupled from data According to the W3C Web Ontology Working Group1 , an access. This has two hugely beneficial consequences: First, ontology defines a set of concepts or term used to describe the complexity of reasoning remains unaffected as data size and represent some domain of information in an abstract way increases. In other words the computation time allocated to that gives a formal semantics to the data in question. More reasoning will not vary with changes in the network topology specifically, an ontology gives the semantics of the data in the and/or availability of sources—recall that we do not assume form of a set of logical axioms that explicate the relationship these to stay fixed. Secondly, since reasoning can be decoupled between classes of data items, and it enables computers to from data access, it does not affect the federation process per reason over the data as part of the process of answering a se. That is, source selection can be performed independently of query. One particular form that this process can take, is that the reasoning process, which means, among other things, that in which a query formulated in terms of the concepts of the a query which is run repeatedly will only have to be rewritten ontology is successively refined until the query can be executed once. directly against the data. This is usually referred to as query As always there is a price to pay, though. The property of rewriting and forms the basis for our approach, as explained first-order rewritability imposes a serious constraint on the in the next section. expressivity of an ontology language, and, as explained in Ontology-based data access is useful in all scenarios in which section VII, must be very carefully selected in order to be able accessing data in a unified and coherent way is difficult. This capture the salient aspects of our case. In particular, it turns may happen for several reasons. The data sources may have out that none of the standard fragments of the W3C-endorsed been developed for different purposes by different agencies ontology language OWL will do. or institutions, may not have a coherent design, and may not As regards HTTP-minimality (by which we mean keeping the record similar types of information in the same manner. A number of required HTTP requests as low as possible), we well-designed ontology gives a unified view of the domain in found existing approaches to federated query processing not terms of the concepts that are of interest to the user. to be well suited. One of the main reasons is that they all rely on forms of join-order heuristics that tends to multiply the IV. O UTLINE OF APPROACH number of HTTP requests as a factor of the depth and number of joins: a standard distributed join algorithm will evaluate a Our federation engine is designed to be suitable for a dynamic query iteratively one triple at a time, while propagating values network topology, in accordance with our listed desiderata. in a nested loop join fashion. This multiplies HTTP request in 1 http://www.w3.org/2001/sw/WebOnt/ proportion to the result sets returned by evaluating each join STIDS 2013 Proceedings Page 142 3 Q ⌃ the federator component which performs service discovery at run-time (cf. Section VI) to identify live and relevant sources. Rewrite Relevance here means signature overlap, where a signature is understood as a set of RDF properties. The extent of the overlap between the signature of the query and the signature Q⌃ of a given endpoint determines a SPARQL CONSTRUCT query which will be routed to that endpoint. Distribute The CONSTRUCT queries are designed to adhere to a logical form which is sufficiently structured to enable us to guarantee ... the soundness and completeness of the query answering pro- R1 R2 Rn cess wrt. the set of sources R, as explained in more detail in the next sections. Taken together with the obvious minimality of our approach wrt. the number of HTTP requests—only a Crop single request is sent to each source—as well as the relevance- based per-query discovery of sources, we conclude that our approach meets all our tentative desiderata A) to D). Evaluate Answer V. P ROPERTIES OF THE CROPPING Figure 1. System overview In this section we formally define the notion of the cropping of a distributed set of sources R relative to a query Q, and we state its essential properties. We shall assume familiarity with argument.2 Admittedly, there are several improved versions of SPARQL syntax and semantics (cf. [12]). this algorithm on offer. The bound join technique implemented Notation. We use Ri , where i is in some index set I, to denote in FedX [9], for instance, groups several instances of a join RDF graphs—variably referred to as sources, repositories or argument in a single subquery using the SPARQL UNION endpoints. A SPARQL SELECT query is a pair hP, ~xi, where construct. This reduces the number of request with a factor P is a SPARQL graph pattern and ~x a vector of elements of equivalent to the the number of instances in the grouped query variables. Similarly, a CONSTRUCT query is a tuple hT, P i, (ibid.).3 Yet, experimental evaluation shows that the number where T is a basic graph pattern and P is a union of such. of HTTP request can still grow quite fast in the number T will be identified with the CONSTRUCT block of the query, of joins.45 . What is common to all these approaches is that aka. the template, and P with the WHERE block, aka. the query the number of HTTP request varies in the number of results pattern. We shall allow ourselves the convenience of blurring returned by the sub-queries. It is a design goal of our approach, the distinction between SPARQL queries on the one hand and in contrast, to make a factor of the size of the query only. sets and families of triple patterns on the other. Where Q := To that end, we designed our federation engine to evaluate hP, ~xi is a SELECT query and G an RDF graph we denote the query, not against the sources directly, but against an the result of evaluating Q against G as Q(G), and similarly excerpt, or cropping as we call it, that is pulled from the for CONSTRUCT queries. The proofs of the claims that follow sources by sending a single HTTP-request to each. Unlike can be found in technical report [13]. traditional warehousing strategies, however, our local copy is As mentioned in the previous section, our approach to federa- not persisted, but exists only in-memory for the duration of tion is signature-based in the sense that the RDF properties that the query execution process. It is essentially a snapshot of that are found in a query are used for routing different sub-queries part of the remote sources which is relevant for answering to different endpoints. This is a common strategy (cf. [8], [9]) the query in question. In realistic cases, the cropping is much for which we claim no originality. Now, given a query pattern smaller than the total amount of data that it is extracted from P the relevant subset of P in relation to a source Ri is defined (see Section V-A). as the maximal subset of P whose signature is contained in An overview of the resulting system, is shown in Figure 1: the the signature of Ri . We shall denote this set as ⇢(P, i). system takes as input a SPARQL query Q, and a collection Recapitulating briefly, our federation engine is designed to of aligned ontologies ⌃, which are used by the rewriter to be HTTP-minimal, as well as sound and complete wrt. to produce the query Q⌃ . This rewritten query is next handed to query answering over the selected sources. A strategy that 2 DARQ [7] and SPLENDID [8] both implement a version of this algorithm. supports all three is to execute the query against an in-memory 3 Similar techniques are considered in [8] and [10] representation of the remote sources rather than against the 4 See e.g. the results for FedX on Life Sciences 3 query from the FedBench sources themselves. More specifically our federation engine suite. routes a single CONSTRUCT query to each of the selected 5 Another notable optimization is the star-shaped pattern technique of [11]. sources—achieving HTTP minimality—whereas the logical Numbers of requests are however not reported in this study. form of this construct query is defined in such a manner as STIDS 2013 Proceedings Page 143 4 to guarantee that the answer to the query assembled from the for every ?x 2 dom( j ) \ dom( k ). selected sources is both correct and complete with respect to those sources. Here soundness and completeness means that if Our CONSTRUCT queries now become: R is a set of sources selected for federation, then the answer Definition 3: For a set ofSsources R := {Ri }i2I and a query that the federator provides to a query Q should be exactly the pattern P : C (P, i) = h f (s(P, i)), f (s(P, i))i where f is same as the one that would be obtained were Q to be evaluated some separation function for s(P, i). conventionally over a single repository holding the union of the data sets in R. To the best of our knowledge, our strategy Example 2: Suppose we have two endpoints JOCWatch and is currently the only one that guarantees that this is the case. MedWatch,6 and the following conjunctive graph pattern P : ?mission medics:missionType medics:Rescue. The logical form in question is in turn defined by distinguish- ?mission medics:jocWatchIncident ?incident. ing between exclusive and non-exclusive triples in a query ?incident jocw:status ?stat. pattern P . Exclusive triples are those that are satisfied, if at Suppose further that each property prefixed by medics all, at one endpoint only. Non-exclusive triples, on the other belongs to the signature of MedWatch, that each property hand, may be satisfied by two or more. Exclusive triples can prefixed by jocw belongs to the signature of JOCWatch, and safely be grouped together and executed against the source that the jocw:status property belongs to both. The queries for which it is exclusive in as a single conjunctive pattern. that are routed to the respective endpoints are then: Non-exclusive triples, however, must be shipped to the remote MedWatch: sources as separate UNION clauses. This holds even if a group CONSTRUCT { of triple patterns are relevant to exactly the same sources since ?_1 medics:missionType medics:Evac. ?_1 medics:jocwIncident ?_2. an answer to the original query may require joining triples ?_3 jocw:status ?_4. } WHERE { across these sources. This gives rise to the following definition { ?_1 medics:missionType medics:Evac. of the set of clauses induced by P and Ri : ?_1 medics:jocwIncident ?_2.} UNION Definition 1 (Clause set): For Ri a source and P a query { ?_3 jocw:status ?_4.}} pattern: s(P, i) := {✏(P, i)} [ {{t} : t 2 ⇢(P, i) \ ✏(P, i)} JOCWatch: CONSTRUCT { Here ✏(P, i) denotes the exclusive group of a pattern P relative ?_1 jocw:instigator ?_2. ?_3 jocw:status ?_4.} WHERE { to Ri . { ?_1 jocw:instigator ?_2.} UNION Now, the basic idea behind our federation strategy is to use the { ?_3 jocw:status ?_4.}} set of clauses induced by P and Ri to define a CONSTRUCT query that extrapolates the part of Ri that is relevant for The cropping may now be defined as follows: answering P . The most straightforward way to do that may seem to be to use the clause set itself as a query pattern, whilst Definition 4 (Cropping): S Put Q := hP, ~xi and R = {Ri }i2I . using the set-theoretic union of its elements as a template. Call Then AQR := i2I C (P, i)(Ri ). this the naive strategy. Interestingly, the naive strategy, whilst We now have: complete, is not sound. Consider the following rather abstract example: Theorem 1 (Soundness/Completeness): S Let R be any set of sources, then Q( R) = Q(AQR ) for any SELECT query Q. Example 1: Let G be the RDF graph containing only the two triples s := (c1, p, d1) and t := (c2, q, d2), and assume Note that here Q is the SELECT query that is being posed a clause-set {{(?s, p, ?o)}, {(?s, q, ?s)}}. The corresponding to the system, whereas AQR , i.e. the cropping, is the result naive CONSTRUCT query is: of assembling the results of the CONSTRUCT queries that are CONSTRUCT {?s q ?o. ?s p ?o.} required for providing an excerpt guaranteed to answer it. WHERE {{?s p ?o} UNION {?s q ?o}} As regards time complexity, since every CONSTRUCT query Executing this query against G will produce a graph containing C (P, i) is in union normal form, Corollary 1 of [12] im- the triple (c1, q, d1). mediately entails that the cropping can be built efficiently. The example shows that the naive strategy may create bindings In our actual implementation, the CONSTRUCT queries that in the resulting graph that do not exist in the graph that is are allocated to the respective endpoints are, moreover, all queried. To counteract this effect it is necessary to standardize executed in parallel, so time is not a precarious measure. apart the elements of the clause-sets before using taking the union and using it as a CONSTRUCT template. To this end we introduce the notion of a separation function: A. Restricting the size of the cropping Definition 2 (Separation function): Let S := {c1 , . . . , cn } be Although time is not a precarious measure, the size of result a clause set, and let i be a uniform substitution of variables sets quickly becomes an issue. The CONSTRUCT queries that for variables in ci . A separation function f for S is a function are passed around to the remote endpoints, if not constrained, s. t. 1) f (S) = { 1 (c1 ), . . . , n (cn )}, and 2) j (?x) 6= k (?x) 6 These systems are described in VI STIDS 2013 Proceedings Page 144 5 may well distribute a triple pattern (?a, rdf : type, ?b) to all VI. C ASE remote endpoints, in effect requesting huge chunks of the data contained in each. The case we use to exemplify our approach is based upon the following scenario: A military analyst is monitoring planned Now, there is no need in our approach for join-ordering medical evacuation flight missions, and is on the lookout for heuristics in the conventional sense, since, per the approach, missions that might be threatened by enemy activity. If this joins are either executed remotely, or executed locally by is the case, she is also interested in finding friendly units a standard query processing engine after the cropping has able to counter the particular threat. More specifically, the been built. Rather, what we do, is to build the cropping information requirement is the following: Find all medical incrementally by assessing the relative selectivity of triple evacuation missions and friendly units such that a) the mission patterns and processing the most selective ones first. We can can be classified as being threatened; and b) that the friendly only describe this procedure in general outline here: unit can handle the specific type of threat that the enemy poses. The selectivity of a triple pattern may be assessed along several Normally, in order to obtain an answer to this information dimensions. For instance, studies show that a triple pattern with requirement, the analyst has to keep an eye on several systems, a literal in object position will usually be more selective than as information about evacuation flights and information about one with a URL in the same position [14]. Moreover, triple enemy activity are usually not kept in the same system. With patterns can be ordered in a plausible sequence of decreasing the aid of a system as outlined so far in this paper, however, selectivity based on the distribution, and position, of variables the analyst can pose a query formulating what she is looking in the pattern (? denotes a variable): for and let the information integration system take care of the (s, p, o) (s, ?, o) (?, p, o) (s, p, ?) (?, ?, o) rest. (s, ?, ?) (?, p, ?) (?, ?, ?) To evaluate the approach and the case outlined above, we The entire set of heuristic rules that we have used in our conducted an experiment at NATO CWIX 20137 set as close solution can be found in [14]. as possible to the dynamic and multinational environment of NNEC. The experiment involved three operational information Now, the idea is to build the cropping in layers by employing systems: 1) JOCWatch, information on incidents of relevance the following three-step procedure: 1) construct the graph to the command in an event log, 2) MedWatch, a system for corresponding to the most selective patterns pertaining to each medical mission tracking designed to support the planning, endpoint 2) extract variable bindings from the cropping so far, logging and monitoring of medical evacuation missions, and and 3) pass them on to the next iteration as constraints for the 3) Track Source, a unit tracking service providing times- next round of queries. tamped geopositional information regarding friendly units in Step 2, the extraction of variable bindings, is realized by re- the field. The information in MedWatch and JOCWatch were using the triple patterns as SELECT queries that are evaluated made available through SPARQL endpoints by using D2R [15], against the cropping as it exists so far, whereas the propagation while the Track Source service had a native SPARQL interface. of values from one layer to the next is realized with the In addition, all sources were supplied with a service description VALUES feature of SPARQL 1.1, which allows a set of according to the SPARQL 1.1 specification, and each source bindings to be shipped with a query in order to constrain the made available its ontology at an URL described in the service answers. description. Our procedure is designed to treat each exclusive group as Our prototype system performed service discovery using an atomic unit, since exclusive groups are likely to be more • mDNS8 for broadcasting and discovering the presence of selective as a set. For the same reason, they are given maximum information sources, priority, That is, the first layer of the incremental construction • DNS-SD9 for high-level description of the source in of the cropping consists of the result of executing the exclusive terms of pointers to query endpoint location and content groups as CONSTRUCT queries. The subsequent layers are then description location, and constructed from the non-exclusive patterns by rating them • the SPARQL 1.1 service description and VoID10 vocabu- according to the heuristic criteria. laries for describing source content. This procedure is sufficient to ensure that very unconstrained patterns such as (?s, ?p, ?o) will be processed late, when This approach addressed the NNEC needs as outlined in bindings are available for some of its variables. It can also section II, and had the advantage that it was independent be tuned to give low priority to predicates from existing RDF of a central registry, thus eliminating the issue of network vocabularies that are known to have a low selectivity rate, 7 Coalition Warrior Interoperability eXploration, eXperimentation and eX- such as e.g. rdf : type or dcterms : title. The procedure amination, eXercise – an annual NATO event aimed at improving alliance-wide preserves the soundness and completeness of the cropping wrt. interoperability the underlying sources, and, although the number of HTTP 8 http://www.multicastdns.org/ request is no longer minimal it is constant in a small factor of 9 http://www.dns-sd.org/ the number of triple patterns in the original query. 10 http://www.w3.org/TR/void/ STIDS 2013 Proceedings Page 145 6 JOCWatch Concept Definition ThreatenedMission MedWatch missions that are related to a ThreateningIncident jw:Hostile ThreateningIncident All JOCWatch incidents that are related to a jw:instigator rdf:type ThreateningEvent rdf:type jw:incident ThreateningEvent All JOCWatch events that are both a MilitaryOperation (from the JOCWatch ontology) and a HostileEvent jw:SAFIRE HostileEvent All events that has a HostileInstigator HostileInstigator All event participants that are classified as being hostile. med:incident Relation Definition MedWatch canHandle A relation between a military unit type and the type of events those units types are equipped to rdf:type useront:canHandle handle. med:Mission hasEvent If a mission involves an incident, and there exists an event that belongs to the same incident (inverse property), then the event is also related to the mission TrackSource Table I. R ELEVANT DEFINITIONS IN THE USER ONTOLOGY nf:unit rdf:type nf:Artil nf:lon nf:posinf SELECT ?mission ?unit WHERE{ nf:lat ?mission a useront:ThreatenedMission. “49.12352” “51.76352” ?mission useront:hasEvent ?event. ?event a useront:ThreateningEvent. ?event wgs84:lat ?elat. ?event wgs84:long ?elong. Figure 2. Conceptual relationship between data sources ?unit useront:canHandle ?event. ?unit useront:hasPosition ?pos. ?pos wgs84:lat ?ulat. ?pos wgs84:long ?ulong. } Figure 3. SPARQL query representing the information request fragmentation. The relationship between the information sources in our ex- periment is illustrated in Figure 2. In the figure we see that earlier can now be expressed by the query in Figure 3. 11 MedWatch missions are (potentially) related to JOCWatch events through a shared incident. Furthermore, the JOCWatch Here ThreateningEvent, hasEvent, canHandle, and events are typed according to category e.g. as a SAFIRE event, hasPosition are terms specific to the user’s vocabulary, see which is an event that involves hostile surface-to-air fire. In the table VI. Posing this query to any of the information sources figure we also see that units in the Track Source are typed would not return any answers. In our experiment, this query (e.g. as Artillery), and that it contains positional data. was decomposed and distributed as per the approach outlined Additionally, in the figure, a stipulated line is drawn from earlier. Artillery to SAFIRE, indicating that, units of the former The main motivation behind this experiment was to test kind are equipped to counter those of the latter. This relation whether it is feasible to provide a decision maker with the does not actually belong to the data, but is defined in the user means to request information using her own terms and without ontology useront, and is key to the formulation of the user’s presupposing detailed knowledge about a fixed set of sources. information need. This creates a coupling between the requesting system and the The experiment included four main ontologies: a JOCWatch information sources that is loose enough to adapt to a changing ontology, a MedWatch ontology, a Track Source ontol- network topology, something that should be highly relevant in ogy, and an ontology containing the concepts used by the user the NNEC environment. Our strategy of combining once-per- of the system. query federation with rewriting worked well for our sample case, and proves that the idea is sound in general outline. To In this particular case, the user ontology is derived from and be sure, if the system is to to scale well—both in terms of expresses the data models of the respective sources. We are efficiency and usability—there are some serious issues that thus assuming that, although available sources may come and need to be addressed having to do with the expressiveness of go, we know their data models. This is not overly unrealistic, the ontology language and the complexity of reasoning in it. since NATO-wide standardization is part of the NNEC con- We record some findings in the next section. cept. The assumption means that we do not have to match ontologies at run-time. A user ontology less tightly coupled with the source ontologies and run-time ontology matching is 11 In reality, we apply filtering of friendly units based on distance from something we plan to look more into in future work. events using the haversine formula and a threshold value. As this does not contribute to understanding the general approach we have left it out of the Given the user ontology, the information requirement described example. STIDS 2013 Proceedings Page 146 7 VII. E XPERIENCES AND O BSERVATIONS backwards (i.e. traversing the inverse of the relation). As explained in section IV, it is an essential presupposition The problem with 1) is that it has a binary predicate in the of our approach to federation that the ontology that is used conclusion. For that reason, it cannot be expressed as a class to provide access to the underlying sources be expressed in inclusion axiom. A description logic axiom is either a class a first-order rewritable language. This is necessary in order axiom or a relationship axiom (aka. role axiom) but cannot be to separate reasoning from source selection, thus making the a mix of the two. Indeed, a class inclusion axiom—irrespective system able to adapt to a dynamic network topology by of the particular brand of description logic that is being used— selecting sources at run-time. cannot, by design, express cross-references between antecedent and consequent in two or more variables, as our axioms Yet, it is not given that the structure and relationship between require. the sources that we selected for our case-study, as illustrated in Figure 2, can in fact be expressed in a first-order rewritable Description logics typically allow us to state axioms like the language. following (in description logic notation): Choosing an ontology language in the DL-Lite family of • Artillery v 9canHandle.SAFFIRE description logics would have been natural for several reasons: • Artillery v 8canHandle.SAFFIRE First,these languages are specifically designed to stay within the boundaries of first-order rewritability. Secondly, DL-Lite At first glance, these may seem to come close to what we forms the basis of the W3C-endorsed QL language profile wish to say, but that is not really the case. The first says that of OWL 2, and so has an XML serialization, and enjoys the an artillery unit can handle some surface-to-air-fire event, but status of an official recommendation. Finally, several efficient it does not identify the event. The second says that an artillery rewriters already exist for the DL-Lite family of languages, unit can only handle surface-to-air-fire events, although it may which, if we could use them, would of course leverage the not be able to handle all of them. What we wish to say burden of implementing our own federation engine. though us that all artillery units can handle all surface-to-air- fire events. As it turns out, however, our case cannot be expressed in any of the standard OWL2 profiles, nor in any other description Taking stock, there is thus, to the best of our knowledge, no logic we are currently aware of. This is due to a combination first-order rewritable description logic capable of expressing of features exemplified by the structure of our sources. Refer- the structure and interrelationship between our selected sample ring to Figure 2 there are mainly two sources of expressive of military information sources. As it turns out, though, there complexity: is a different family of ontology languages altogether that is sufficiently expressive for our needs, namely the family 1) As indicated by the stipulated arrow labelled of general existential rules aka. existential datalog [6], more useront:canHandle connecting TrackSource specifically the language of weakly recursive datalog. Weakly to the JOCWatch database, we wish to add axioms recursive datalog is strictly more expressive than any first- to the ontology that classify which kind of vehicle order rewritable description logic—and more importantly, it or unit that is equipped to counter which kind of is sufficiently expressive to express 1) and 2) above, thus hostile event—for instance artillery in the case of a capturing the salient features of our case. Our federation engine surface-to-air attack. Encoding this knowledge in the is therefore equipped with a rewriter that expects an ontology ontology is necessary in order to enable the user to to be encoded in weakly recursive datalog. query the sources for available military support within Although, this choice is more or less forced upon us by a given diameter from a threatened position. However, the characteristics of the case, it does not mean of course it requires that we be able to state in the ontology that the choice of recursive datalog as our ontology language that certain combinations of unit types and events does not come with its own set of drawbacks. First of all, constitute sufficient conditions for the unit and event in existential datalog in general does not currently have the kind question to stand in the canHandle relation. Stated of institutionalized support that the OWL family of languages more formally, we need to have axioms of the form enjoys. Secondly, and much for the same reason, it has far 8x8y.Artillery(x)^SAF IRE(y) ! canHandle(x, y), less endorsement from the software industry in terms of tool for all the appropriate combinations of units and events. support. 2) Presupposing that the canHandle relation has been In fact, we could not find an existing rewriter for weakly axiomatized, we further need to express in the ontology recursive datalog, and therefore had to build one from scratch. that finding the position of a unit involves traversing the Alas, implementing a correct rewriter does not entail that one TrackSource graph from the reported latitudes and has implemented an efficient one, and although queries over longitudes through the relations nf:posinf, nf:unit weakly recursive datalog ontologies are first-order rewritable, and rdf:type via useront:canHandle to an as- the size of the rewriting itself may be exponential in the size sociated hostile event. This is a fairly long and intricate of the original query. Thus, without a considerable amount path that requires traversing relations forwards as well as of research being devoted to optimization, the rewriter is not STIDS 2013 Proceedings Page 147 8 likely to perform well for any large class of cases. Theoretical [2] W3C, “RDF Primer,” http://www.w3.org/TR/rdf-primer/, February results are encouraging, though. In particular, results from [16] 2004. shows that there is a minimal rewriting of any query over a set [3] ——, “OWL 2 Web Ontology Language Primer,” of weakly recursive datalog rules. Computing such a minimum, http://www.w3.org/TR/2009/REC-owl2-primer-20091027/, October 2009. however, remains a topic for future research. [4] ——, “SPARQL 1.1 Query Language,” http://www.w3.org/TR/sparql11-query/, March 2013. VIII. R ELATED W ORK [5] D. Calvanese, G. D. Giacomo, D. Lembo, M. Lenzerini, A. Poggi, M. Rodriguez-Muro, and R. Rosati, “Ontologies and Databases: The DL-Lite Approach,” in Semantic Technologies for Informations Systems. Several studies have addressed the problem of decomposing Springer, 2009. a SPARQL query into sub-queries that can be allocated to a [6] G. Gottlob, G. Orsi, and A. Pieris, “Ontological Queries: Rewriting distributed set of remote sources. Notable examples include and Optimization (Extended Version),” Computing Research Repository, [17], [7] [18], [19], [20], [9], [11] and [8]. All of these vol. abs/1112.0343, 2011. studies belong to what we would call the join-order heuristics [7] B. Quilitz and U. Leser, “Querying distributed RDF data sources paradigm, and, unlike the present paper, none gives particular with SPARQL,” in Proceedings of the 5th European Semantic Web attention to establishing framework that is both sound/com- Conference (ESWC ’08), 2008. plete and request-minimal. Moreover, the listed reports focus [8] O. Görlitz and S. Staab, “SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions,” in Proceedings of the 2nd International exclusively on federating queries that are expressed directly in Workshop on Consuming Linked Data (COLD 2011), 2011. terms of the data. To the best of our knowledge there are very [9] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt, few contributions that address the question of how to combine “FedX: Optimization Techniques for Federated Query Processing on query federation with reasoning, where reasoning cuts across Linked Data,” in Proceedings of the 10th International Semantic Web several sources. Conference (ISWC 2011), 2011. [10] J. Zemanek and S. Schenk, “Optimizing SPARQL Queries over Dis- parate RDF Data Sources through Distributed Semi-Joins,” in Inter- IX. C ONCLUSION national Semantic Web Conference (Posters and Demos), ser. CEUR Workshop Proceedings, vol. 401, 2008. In this paper we have established a sound, complete and [11] G. Montoya, M.-E. Vidal, and M. Acosta, “A heuristic-based approach request-minimal baseline for query federation. Our approach for planning federated sparql queries.” in COLD, ser. CEUR Workshop is signature-based and compatible with a run-time selection of Proceedings, J. Sequeda, A. Harth, and O. Hartig, Eds., vol. 905. sources. It is therefore particularly suitable for domains that CEUR-WS.org, 2012. are characterized by low bandwidth and a dynamic network [12] M. Arenas, C. Gutierrez, and J. Pérez, “Foundations of RDF Databases,” topology. We have further described an example from military in Reasoning Web. Semantic Technologies for Information Systems. Springer, 2009. evacuation planning to illustrate the usefulness of the approach. [13] B. J. Hansen, J. Halvorsen, and A. Stolpe, “Information integration In order to mitigate the heterogeneities between sources, as experiment at NATO CWIX 2012,” Norwegian Defence Research well as to present the data in a vocabulary that is familiar Establishment (FFI), FFI/RAPPORT-2012/01543, 2012. to the user, we found it expedient to use an ontology to [14] P. Tsialiamanis, L. Sidirourgos, I. Fundulaki, V. Christophides, and provide a unifying layer above the information sources. Any P. Boncz, “Heuristics-based query optimisation for sparql,” ser. Pro- incoming query is therefore rewritten according to the ontology ceedings of EDBT ’12. ACM, 2012. before being passed on to the sources. However, we have that [15] C. Bizer and R. Cyganiak, “Publishing Relational Databases on the ontology-based data access is not coherent with our federation Semantic Web,” http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/, Au- gust 2009. strategy unless the ontology is formulated in a language that [16] C. Civili and R. Rosati, “A broad class of first-order rewritable tuple- is first-order rewritable such that reasoning can be decoupled generating dependencies,” in Datalog in Academia and Industry, ser. from data access. In realistic cases like ours, one quickly Lecture Notes in Computer Science, P. Barceló and R. Pichler, Eds. transcends the expressive capabilities of familiar first-order Springer Berlin Heidelberg, 2012, vol. 7494, pp. 68–80. rewritable OWL fragments such as OWL2-QL. In our case we [17] M. Acosta, M.-E. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus, overcame this limitation by resorting to a decidable fragment “Anapsid: An adaptive query processing engine for sparql endpoints.” in of existential datalog. Proceedings of the 10th International Semantic Web Conference (ISWC 2011), 2011. [18] C. Basca and A. Bernstein, “Avalanche: Putting the spirit of the web ACKNOWLEDGMENTS back into semantic web querying,” in ISWC Posters&Demos, 2010. [19] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and The NATO systems participating in the experiments reported in J. Umbrich, “Data summaries for on-demand queries over linked data,” in Proceedings of the 19th international conference on World wide web, this paper was made available to us by the NATO C3 Agency. ser. Proceedings WWW ’10. New York, NY, USA: ACM, 2010, pp. 411–420. [20] Y. Li and J. Heflin, “Using reformulation trees to optimize queries over R EFERENCES distributed heterogeneous sources,” in Proceedings of 9th the Interna- tional Semantic Web Conference (ISWC), ser. Proceedings ISWC’10. [1] P. Bartolomasi, T. Buckman, A. Campell, J. Grainger, J. Mahaffey, Springer-Verlag, 2010. R. Marchand, O. Kruidhof, C. Shawcross, and K. Veum, “NATO Network Enabled Capability Feasibility Study, Version 2.0,” NATO C3 Agency, Tech. Rep., October 2005. STIDS 2013 Proceedings Page 148