=Paper=
{{Paper
|id=Vol-1362/paper4
|storemode=property
|title=On Requirements for Federated Data Integration as a Compilation Process
|pdfUrl=https://ceur-ws.org/Vol-1362/PROFILES2015_paper4.pdf
|volume=Vol-1362
|dblpUrl=https://dblp.org/rec/conf/esws/Adamoud15
}}
==On Requirements for Federated Data Integration as a Compilation Process==
Alessandro Adamou and Mathieu d’Aquin
Knowledge Media Institute, The Open University, United Kingdom
{alessandro.adamou, mathieu.daquin}@open.ac.uk
Abstract. Data integration problems are commonly viewed as interoperability issues, where the burden of reaching a common ground for exchanging data is distributed across the peers involved in the process. While apparently an effective approach towards standardization and interoperability, it poses a constraint on data providers who, for a variety of reasons, require backwards compatibility with proprietary or non-standard mechanisms. Publishing a holistic data API is one such use case, where a single peer performs most of the integration work in a many-to-one scenario. Incidentally, this is also the base setting of software compilers, whose operational model comprises phases that perform analysis, linkage and assembly of source code and generation of intermediate code. There are several analogies with a data integration process, more so with data that live in the Semantic Web, but what requirements would a data provider need to satisfy for an integrator to be able to query and transform its data effectively, with no further enforcements on the provider? With this paper, we inquire into what practices and essential prerequisites could turn this intuition into a concrete and exploitable vision, within Linked Data and beyond.
Keywords: Linked Data, Query federation, Compilers
1 Introduction
Open standards play an unquestionable role in the evolution of data interoperability, and an eminent example can undoubtedly be found in Linked Data. This set of principles and standards favors uniform federated querying across multiple data providers by applications. These applications, in turn, can serve many use cases, one being the exposure of an API that publishes aggregated data from multiple sources. One cannot expect such an API to conform to the same interoperability principles as the sources it draws from, due to possible backwards compatibility with legacy systems and other industrial constraints.
Implementing this process certainly benefits from standardized mechanisms for federated querying such as those offered by SPARQL; however, the translation of query results into the desired specifications relies upon the integrator itself. In Linked Data, the line is drawn on semantic interoperability with reuse of resources, be they terms of a vocabulary or data items, which leaves some loose ends, for instance as to how data URIs should be transformed if necessary.
Software compilers operate in analogy with use cases like the many-to-one scenario above, in that a single program analyses and links multiple files (the source code) into an output that is then transformed into an object that complies with the target specification (the machine-executable program). As the compiler literature is vast and its history long, we look into avenues for capitalizing on it.
With this paper, we intend to discuss the merits of these research questions:
RQ1. Is it possible to formulate a data integration problem based on federated
querying as a compilation process?
RQ2. If the answer to RQ1 is yes, what information should a data integration
environment expose, for us to treat it like software code to be compiled?
Being able to answer yes to RQ1 would open up a range of possibilities for the principles and practices of data federation. Most of all, it would allow us to bring the craftsmanship of compiler experts into the field of data integration to research the optimal answer to RQ2. This could help solve specific integration problems or optimize existing solutions, effectively allowing us to discuss 'data compilation' as a discipline in its own right.
In Section 2, we outline the above data API scenario in greater detail. On its basis, in Section 3 we reformulate the associated data integration process in terms of the classic analysis/synthesis model of compilers. Finally, in Section 4 we give an insight into the further research being carried out on this vision.
2 Scenario
Given a collection of known linked data providers (hereinafter, sites) that expose a hierarchy of RDF graphs (datasets) through an interface such as SPARQL endpoints and/or dereferenceable Cool URIs, the goal is to produce a data feed published on a single endpoint (integrator), which selectively reuses data from the sites and encodes them in a custom target language. Not uncommonly in industrial and traditional data management, this language must give the impression that its provider is 'in control'. To that end, it satisfies the following:
1. a single represented item appears as an attribute/value map;
2. attributes are named according to an in-house naming convention (i.e. no
ontology property names are reused);
3. values are represented as items per (1), up to a fixed level of recursion beyond
which they are identified by a reference. These references are URIs resolved
by the same API that produces the data feed (i.e. the API is self-contained).
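Requirement (3) can be sketched concretely. The following fragment is an illustrative Python sketch, not part of the paper: the depth limit, the API base URI and all names are assumptions. It inlines nested items up to a fixed recursion level and replaces deeper ones with references resolved by the same API.

```python
# Illustrative sketch of the target encoding (Section 2): items are
# attribute/value maps; nesting beyond MAX_DEPTH is replaced by references
# that the same API resolves (requirement 3, self-containment).
# MAX_DEPTH and API_BASE are hypothetical choices.

MAX_DEPTH = 1
API_BASE = "https://api.example.org/item/"  # hypothetical self-contained API

def encode_item(item, depth=0):
    """Encode an item as an attribute/value map, recursing up to MAX_DEPTH."""
    out = {}
    for attr, value in item["attributes"].items():
        if isinstance(value, dict):  # a nested item
            if depth < MAX_DEPTH:
                out[attr] = encode_item(value, depth + 1)
            else:
                # beyond the threshold: emit a reference, not inline data
                out[attr] = API_BASE + value["id"]
        else:
            out[attr] = value
    return out
```

In this sketch, attribute names are passed through unchanged; requirement (2), the in-house naming convention, would be a separate renaming step applied to the keys of the map.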
These requirements are in stark contrast with the principles of Linked Data,
which dictate that providers should be free to use their own vocabularies and
identifiers, and that both should be reused, rather than concealed, by others.
Finally, we assume that some sites publish meta-level descriptions of their
datasets as VoID or Data Cube manifests. These, combined with other meta-level
information computed by the integrator (cf. RQ2), form the site profile.
3 Data integration in the front/back end compiler model
A compilation problem can be formulated in terms of requirements of the target
machine code, e.g. that it has to be executable by processors of a certain family
with certain instruction sets and register layout. Data integration can also be
approached in terms of the requirements of the final data feed, i.e. the compiled
data in a target language that agents of a certain type, human or machine, must
be able to read and interpret. We aim to identify whether a similar parallel is possible in the operational model of the solutions to these problems.
Fig. 1. Compilation phases in the classical front/back end model.
A traditional model of compiler design is depicted in Figure 1. It has as its
pivotal phase the generation of code in an intermediate language for the program
at hand. This phase is preceded by an analysis part, which comprises lexical,
syntactic and semantic analysis, and is followed by a synthesis part, where the
code in the target language is generated and optimizations are performed [5].
Also, it is the compiler itself that has to fulfil the requirements of most phases,
especially synthesis ones, whereas source code is mostly required to be correct for
the analysis phases not to fail (few programmers will take compiler optimizations
into account when writing the code). Synthesis is also called the back end of the
compiler, and the other phases its front end. The following sections break down
these operational strands and seek a correspondent for each in the above scenario
through query federation, where the burden of performing most of the integration
lies on a single peer that we control, and that corresponds to the compiler.
3.1 Intermediate code generation
Intermediate code is generated in a language defined for and used by the compiler
alone, in order to satisfy certain optimality conditions. An intermediate language
is not necessary, but without one, a full native compiler would be required for
each target architecture, instead of only a re-implementation of the synthesis.
Porting this notion to our linked data integration scenario, without an intermediate language, all the code present in a compilation unit, i.e. an instance of the output of a site (in RDF or SPARQL results), would be rewritten directly into the target language, thus reducing the potential for detecting redundant references and collapsing them in the data feed (cf. Section 2, req. 3, self-containment).
We will assume RDF triples to be the formalism of choice, given their natural inclination towards several layers of interoperability, and adapt the analysis and synthesis parts accordingly¹. Also, there is an interesting parallel with the three-address intermediate code of compilers, which is a serialized form of decision trees on binary operators. The intermediate language itself is the combination of triples and a naming convention for their nodes, e.g. resources and literals, which is entirely up to the integrator. This naming convention is not required to make sense to the outside world; that is, we disregard inherently Linked Data features of RDF such as dereferenceable URIs². We require, however, the following:
1. globality. The naming convention should be able to rewrite URIs of aligned
resources (e.g. via owl:sameAs statements) into the same URI.
2. completeness. It must apply to every possible URI that appears in the data supplied by any site involved in the integration process.
A naming convention supports a URI pattern if it satisfies globality for all its
occurrences. Completeness can be satisfied even for URIs whose scheme is not
known a priori: a function that, for instance, prepends a prefix to the original
URI if its pattern is unsupported would be a sufficient naming convention.
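A naming convention satisfying both properties could be sketched as follows. This is an illustrative Python fragment under assumed names: the pattern table, the canonical templates and the internal URN scheme are hypothetical, standing in for whatever alignment information the integrator holds.

```python
# Sketch of a naming convention (Section 3.1): URIs matching known patterns of
# aligned resources (e.g. linked via owl:sameAs) are rewritten to the same
# canonical name (globality); any other URI is prefixed (completeness).
import re

INTERMEDIATE_PREFIX = "urn:x-integrator:"  # hypothetical internal scheme

# Hypothetical alignment table: two sites' URI patterns mapping to one
# canonical rewriting, so aligned resources collapse onto the same URI.
PATTERNS = [
    (re.compile(r"http://siteA\.example/resource/(\w+)"), "urn:x-integrator:entity/{0}"),
    (re.compile(r"http://siteB\.example/id/(\w+)"),       "urn:x-integrator:entity/{0}"),
]

def rename(uri: str) -> str:
    """Rewrite a URI into the intermediate naming convention."""
    for pattern, template in PATTERNS:
        m = pattern.fullmatch(uri)
        if m:  # supported pattern: globality applies
            return template.format(m.group(1))
    # unsupported pattern: prepending a prefix suffices for completeness
    return INTERMEDIATE_PREFIX + uri
```

The fallback branch is exactly the sufficiency argument above: even a URI whose scheme is unknown a priori receives a deterministic intermediate name.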
3.2 Front end: analysis and assembly
Software compilers perform lexical, syntactic and semantic analysis on the source code and derivative data structures, to check whether the code is an occurrence of the programming language and respects semantic requirements such as type matching and variable scopes. These phases are usually backed by symbol tables, i.e. data structures maintained by the compiler that keep track of the occurrences of entities such as variable names, function signatures and objects.
To begin with, we define the compilation unit to be an instance of the output of a site (in RDF or SPARQL results) given a query on it. The role of these analyses in data integration is ambivalent, depending on what elements we choose to be the symbols, syntax and semantics of the process. If we establish that the symbols are the constituents of RDF (URIs, literals, blank nodes, etc.), then the analysis part coincides with that of an RDF parser; there are no site-specific requirements other than delivering well-formed compilation units, at the price of not being able to perform per-site optimizations. If instead we apply the lexicon-syntax-semantics paradigm differently, then we can expect advantages in translating compilation units to the intermediate language. Here, we will assume that the patterns for constructing URIs in each dataset are part of the lexicon, and that their instances are tracked in the symbol table.
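Treating URI patterns as the lexicon, the symbol table could look like the following Python sketch (class and field names are assumptions for illustration): lexical analysis of a compilation unit classifies each URI against the known patterns and records its occurrences.

```python
# Sketch of a symbol table (Section 3.2) where URI patterns are the lexicon
# and their instances the tracked symbols. Names are illustrative assumptions.
import re
from collections import defaultdict

class SymbolTable:
    def __init__(self, patterns):
        # patterns: mapping of pattern name -> compiled regex for that pattern
        self.patterns = patterns
        self.occurrences = defaultdict(list)  # pattern name -> URIs seen
        self.unmatched = []                   # URIs outside the known lexicon

    def record(self, uri):
        """Lexical analysis of one URI from a compilation unit."""
        for name, regex in self.patterns.items():
            if regex.fullmatch(uri):
                self.occurrences[name].append(uri)
                return name
        self.unmatched.append(uri)
        return None

# Hypothetical usage with one known pattern from a site's dataset:
table = SymbolTable({"person": re.compile(r"http://siteA\.example/person/\d+")})
table.record("http://siteA.example/person/42")
```

The `unmatched` list is where the per-site optimization opportunity shows: a high proportion of unmatched URIs signals that the site's lexicon in the assembly plan is incomplete.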
¹ One could also opt for OWL as intermediate language, though we would have to be wary of the caveats of translating RDF triples into OWL axioms appropriately.
² The way RDF processors generate blank node IDs can be such an implementation.
Assuming the above in our compiler model, the semantic analysis phase can now include matching of RDF types with URI patterns [2] and heuristics for detecting and collapsing equivalent entities [6]. As part of a process called linking, where a single object is built out of multiple compilation units, the results of this analysis (which we can assume to reside in an assembly plan maintained by the integrator) can be applied to the generation of unified intermediate code. The question then arises as to what information about the sites and their datasets the assembly plan should contain in order to perform linking effectively. In the present scenario and compilation model, the goal is to avoid query broadcasting and its network and computational overhead: it should be possible to determine the eligibility of a site as a candidate for providing relevant data, therefore worth querying, and the shape of the data it can deliver, so as to determine what ad-hoc query to issue to it. We are currently investigating how intermediate code that transforms URIs to satisfy globality and completeness can be generated if:
1. all entities are typed, either explicitly or implicitly;
2. the relationship between a URI pattern in a dataset and the types of its instances, or their identifying property values, is explicit;
3. it is known which conventions are employed in the assertions that are materialised in the data, and which are left to inferencing: for instance, which property of an inverse property pair is used in asserted statements.³
Related to (1), explicit types can be found in VoID class partitions⁴ and Data Cube slicing⁵, or by sampling the dataset directly; implicit ones are obtainable through inferencing on the compilation units and the ontologies that describe their vocabularies. Requirement (2) is largely unsatisfied by the existing standards and literature and is mostly left to research. Finally, (3) finds partial fulfilment in VoID property partitioning combined with ontologies.
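Eligibility checking against such metadata can be sketched minimally in Python. The profile structure below is a deliberate simplification of VoID class partitions, and all site names and figures are invented for illustration; it only shows how an assembly plan avoids query broadcasting.

```python
# Sketch of site eligibility (Section 3.2): a site is worth querying for a
# given RDF type only if its profile declares entities of that type, e.g. via
# a VoID class partition. Profiles here are hypothetical, pre-computed data.

SITE_PROFILES = {
    "siteA": {  # invented site with VoID-derived metadata
        "class_partitions": {"foaf:Person": 12000, "foaf:Document": 300},
        "uri_patterns": {"foaf:Person": "http://siteA.example/person/{id}"},
    },
    "siteB": {
        "class_partitions": {"skos:Concept": 4500},
        "uri_patterns": {},
    },
}

def eligible_sites(rdf_type):
    """Select sites declaring entities of rdf_type, instead of broadcasting."""
    return [site for site, profile in SITE_PROFILES.items()
            if profile["class_partitions"].get(rdf_type, 0) > 0]
```

The `uri_patterns` field stands in for requirement (2), the pattern-to-type relationship that, as noted above, existing standards leave mostly unaddressed.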
3.3 Back end: optimization and target encoding
An optimizing compiler modifies generated intermediate and target code in order to improve certain efficiency measures that the compiler supports. What this translates to in a data integration scenario is largely under investigation, but we began by identifying certain tasks; as part of target-independent optimization:
– Consolidation of matching data items and elimination of redundant attributes, through ontology alignment and other means.
– Handling query expansion: identifying and constructing further queries to be issued to sites in order to perform just-in-time linking.
– Serial and parallel scheduling of the queries built through query expansion.
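The scheduling task could, for instance, run independent query chains in parallel while keeping dependent follow-up queries serial within each chain. A sketch with stand-in query functions (the chain structure and all names are assumptions, not the paper's design):

```python
# Sketch of scheduling expanded queries (Section 3.3): chains of queries to
# different sites are independent -> run in parallel; follow-up queries within
# a chain depend on earlier ones -> run serially.
from concurrent.futures import ThreadPoolExecutor

def run_query(site, query):
    # stand-in for an actual SPARQL request to the site's endpoint
    return f"results({site},{query})"

def schedule(expansions):
    """expansions: list of (site, [query, follow_up, ...]) chains."""
    def run_chain(site, queries):
        return [run_query(site, q) for q in queries]  # serial dependency
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_chain, site, qs) for site, qs in expansions]
        return [f.result() for f in futures]
```

Results come back in submission order, so the feed assembly downstream of scheduling remains deterministic regardless of which site answers first.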
As part of target-dependent optimization:
³ Of course some of these limitations can be overcome in SPARQL by adding optional triple patterns and unions to a query, but at a generally impracticable overhead.
⁴ Vocabulary of Interlinked Datasets, http://www.w3.org/TR/void/
⁵ RDF Data Cube vocabulary, http://www.w3.org/TR/vocab-data-cube/
– Determining the optimum threshold for including recursively-referenced items
in one feed, beyond which their data are replaced with references.
– Rewriting attribute names to avoid name clashing with attributes in other
data feeds resulting from a different query to the same sites.
While we do not expect target-dependent optimization to raise significant requirements for site profiles, we hypothesize that target-independent optimization tasks can partly rely on the symbol tables generated in the front end phases.
4 Conclusions
We have made a case for formulating typical legacy data integration problems using the paradigm of software compilers, prognosticating that, in so doing, compiler optimizations may contribute to this cause. Although we have not identified previous evidence of data integration problems formulated using compilers, there is recent literature on formulating models and challenges for data integration on the Web. Paton et al. have postulated a model for continuously improving integration in a purely Linked Data setting [4], which significantly inspired our work. Hoang et al. have collated scholarly literature on the practices of semantic mashups, with which our use case shares several contact points [3]. As for compiler-like approaches in the Semantic Web, we previously laid out some seminal work in the context of interpreting ontology networks [1]. We are currently elaborating on the position given by this paper as applied to a concrete instantiation of its scenario, now in the process of formalizing back end requirements.
References
1. Alessandro Adamou, Paolo Ciancarini, Aldo Gangemi, and Valentina Presutti. The foundations of virtual ontology networks. In Marta Sabou, Eva Blomqvist, Tommaso Di Noia, Harald Sack, and Tassilo Pellegrini, editors, I-SEMANTICS 2013 - 9th International Conference on Semantic Systems, Graz, Austria, pages 49–56. ACM, 2013.
2. Mathieu d’Aquin and Alessandro Adamou. Extracting URI patterns from SPARQL endpoints. Technical report, Knowledge Media Institute, 2014.
3. Hanh Huu Hoang, Tai Nguyen-Phuoc Cung, Duy Khanh Truong, Dosam Hwang, and Jason J. Jung. Semantic information integration with linked data mashups approaches. IJDSN, 2014, 2014.
4. Norman W. Paton, Klitos Christodoulou, Alvaro A. A. Fernandes, Bijan Parsia, and Cornelia Hedeler. Pay-as-you-go data integration for linked data: opportunities, challenges and architectures. In Roberto De Virgilio, Fausto Giunchiglia, and Letizia Tanca, editors, Proceedings of the 4th International Workshop on Semantic Web Information Management, SWIM 2012, Scottsdale, AZ, USA, page 3. ACM, 2012.
5. A.A. Puntambekar. Compiler Design. Technical Publications, 2010.
6. Ziqi Zhang, Anna Lisa Gentile, Isabelle Augenstein, Eva Blomqvist, and Fabio Ciravegna. Mining equivalent relations from linked data. In 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, Sofia, Bulgaria, Volume 2: Short Papers, pages 289–293. The Association for Computer Linguistics, 2013.