X3ML Framework: an Effective Suite for
Supporting Data Mappings
Nikos Minadakis1, Yannis Marketakis1, Haridimos Kondylakis1, Giorgos
Flouris1, Maria Theodoridou1, Gerald de Jong2, and Martin Doerr1
1 Institute of Computer Science, FORTH-ICS, Greece
2 Delving B.V., The Netherlands
{minadakn,marketak,kondylak,fgeo,maria,martin}@ics.forth.gr
gerald@delving.eu
Abstract. The aggregation of heterogeneous data from different institutions
in cultural heritage and e-science has the potential to create rich data
resources useful for a range of purposes, from research to education and
public interest. In this paper, we present the architecture and functionality
of the X3ML data exchange framework, which effectively and efficiently handles
the schema mapping, URI definition and generation, and data transformation
steps of the provision and aggregation process. The X3ML framework is based on
the X3ML mapping definition language, which offers the building blocks for
describing both schema mappings and URI generation policies, and on the X3ML
engine, which handles the URI generation and the data transformation. The X3ML
framework supports the cognitive process of mapping and has several advantages
over existing tools: the schema mappings are expressed declaratively and are
both human and machine readable, allowing domain experts to understand them,
and the schema matching and the URI generation policies form distinct steps in
the exchange workflow and follow different life cycles. Furthermore, X3ML is
symmetric and potentially invertible, allowing bidirectional interaction
between providers and aggregators and thus supporting not only a rich
aggregator repository but also corrections and improvements in the providers'
databases.
Keywords: Data Mappings, Data Aggregation, URI Generation
1 Introduction
Managing heterogeneous data is a challenge for cultural heritage institutions,
such as archives, libraries, and museums, but equally for research institutes of
descriptive sciences such as geology, biodiversity, clinical studies etc. These insti-
tutions host and develop various collections with heterogeneous material, often
described by different metadata schemas. In order to provide uniform access
to heterogeneous and autonomous data sources, complex query and integration
mechanisms have to be designed and implemented.
To allow data transformation and aggregation, mappings must be produced that
relate equivalent concepts or relationships from the source schemata to the
aggregation schema, i.e. the target schema, in such a way that facts described
in terms of the source schema can automatically be translated into
descriptions in terms of the target schema, or “enterprise model” as Calvanese
et al. [6] describe it. This is the mapping definition process, and its output
is the mapping, i.e., a collection of mapping rules.
In this paper we describe the X3ML framework, which supports the data
aggregation process by providing mechanisms for data transformation and URI
generation. Mappings are specified in the X3ML mapping definition language, a
declarative, human-readable language that supports the cognitive process of
mapping. Unlike XSLT, which can only be understood by IT technicians, X3ML can
be understood by non-technical people, so a domain expert is able to read and
validate the schema matching and to test its semantics. The model carefully
distinguishes between the mapping information provided by the domain experts,
who know and provide the data, and that created by the IT technicians, who
actually implement data translation and integration solutions, and it serves
as an interface between the two.
A common problem of a schema matching and transformation process is that the
IT experts do not fully understand the semantics of the schema matching and
the domain experts do not understand how to use the technical solutions. For
this reason, our approach separates the URI generation from the schema
matching, so that the schema matching can be performed entirely by the domain
expert and the URI generation by the IT expert. This removes the bottleneck of
requiring the IT expert to understand the mapping. Furthermore, it keeps the
schema mappings between different systems harmonized, since the schema mapping
definitions do not change, in contrast to the URIs, which may differ between
institutions and are independent of the semantics. XSLT and R2RML, by
contrast, tightly couple the URI generation with the schema matching.
Our approach completely separates the definition of the schema matching from
its actual execution. This is important because different processes may have
different life cycles; in particular, the schema matching definition has a
different life cycle from the URI generation process. The former changes less
frequently than the latter.
The remainder of this paper is organized as follows: Section 2 discusses
related work. Section 3 presents the background. Section 4 describes the
details of the X3ML framework. Section 5 enumerates different usages of the
framework and provides the evaluation results. Finally, Section 6 concludes
and discusses future directions of our work.
2 Related Work
Mapping relational databases (RDB) to RDF has become an active field in the
last few years, because the majority of data currently published on the web is
still stored in relational databases with local schemata and local
identifiers. Several solutions fall into this category. Direct Mapping [13]
automatically maps relational tables to RDF classes and attributes to RDF
properties. D2R MAP [5] is a declarative language for describing mappings
between relational databases and OWL/RDFS ontologies. Triplify [2] maps
HTTP-URI requests onto RDB queries and translates the resulting relations into
RDF statements. R2RML1 is a mapping language proposed by the W3C to
standardize RDB to RDF mappings.
In addition, there are tools that provide mappings from XML to RDF, leading to
mappings at the syntactic level rather than at the semantic level. This
category includes tools based on XSLT (e.g. Krextor [10], AstroGrid-D2 ),
tools based on XPath (e.g. Tripliser3 ) and tools based on XQuery (e.g.
XSPARQL [4]). Other approaches exploit mapping technologies to publish data as
linked data. For example, the Smithsonian American Art Museum used KARMA [17],
a tool that tries to automate the mapping process while allowing users to
adjust the generated mappings. However, there is still no clear distinction
between the work of the domain experts and that of the IT experts, which
complicates the whole workflow. KARMA uses the R2RML model, so it inherits the
tight coupling between the schema matching and the URI generation.
Finally, there are similar works that map CSV files to RDF. XLWrap’s mapping
language [11] provides conversions from CSV files and spreadsheets to the RDF
data model. Mapping Master’s M2 [14] converts data from spreadsheets into OWL
statements.
These different approaches show that there is no standard model for mapping
data sources other than relational ones, that the technologies used are too
complex for domain experts, and that the overall workflow is not well defined.
Compared to these works, our work (a) uses a simple model for defining the
mappings in a way that is comprehensible and readable by the domain experts,
(b) is generic, because the mapping definitions are not tied to the
implementation of the data transformation engine, (c) supports incremental
changes of the source and target schemata, (d) supports customized URI
generation policies and (e) promotes the collaborative work of experts with
different roles in the mapping process.
3 Background
The main pillar of our work is the Synergy Reference Model (SRM for short),
which is an initiative of the CIDOC CRM Special Interest Group4. It is a
reference model for better practice in data provisioning and aggregation
processes, primarily in the cultural heritage sector, but also for e-science.
It is based on experience and evaluation of national and international
information integration projects. It defines a consistent set of business
processes, user roles, generic software components and open interfaces that
form a harmonious whole. A draft version of the model is currently available
online5 and is still being evolved and enriched. The goal of SRM is to:
(a) describe the provision of data between providers and aggregators,
including associated data mapping components, (b) address the lack of
functionality in current models (e.g. OAIS [12]) and practice, (c) incorporate
the necessary knowledge and input needed from providers to create quality
sustainable aggregations, and (d) define a modular architecture that can be
developed and optimized by different developers with minimal
inter-dependencies and without hindering integrated UI development for the
different user roles involved.
1 http://www.w3.org/TR/r2rml/
2 http://www.gac-grid.de/project-products/Software/XML2RDF.html
3 http://daverog.github.io/tripliser/
4 http://www.cidoc-crm.org/who_we_are.html
5 http://www.cidoc-crm.org/docs/SRM_v1.4.pdf
SRM aims at identifying, supporting or managing the processes needed to
be executed or maintained between a provider (the source) and an aggregator
(the target) institution. It supports the management of data between source and
target models and the delivery of transformed data at defined times, including
updates. This includes a mapping definition, i.e., specification of the parameters
for the data transformation process, such that complete sets of data records can
automatically be transformed. A graphical representation of the data provision-
ing workflow is shown in Fig. 1.
Fig. 1. The data provisioning workflow
The main steps of the data provisioning workflow are:
– Schema matching: source and target schema experts (a.k.a. the domain
experts) define a schema matching, which is documented in a schema matching
definition file. This file should be human and machine readable, and it is
the ultimate means of communication on the semantic correctness of the
mapping.
– Instance generation specification: in this step the URI generation and
datatype conversion policies are defined for each instance of a target schema
class referred to in the matching. Only IT experts are involved in this step;
domain experts have no interest in or knowledge of it.
– Terminology mapping: the terminology mappings between source and
target data/terms are defined. Providers may use anything from intuitive
lists of uncontrolled terms up to highly structured third party thesauri.
– Transformation: once the mapping definition has been finalized (and all
syntax errors are resolved) the data needs to be transformed, producing a set
of valid target records. The transformation process itself may run completely
automatically. If any issues arise, the aggregator can resolve them on a
temporary or permanent basis, but it is also possible that the affected
records are sent back to the provider for further analysis and resolution.
– Ingestion: once records are transformed, an automated translation of source
terms using a terminology map follows. The transformed records are then
ingested into the target system.
– Change detection: after the ingestion of the records, all changes that may
affect the consistency of provider and aggregator data are monitored. SRM
describes 18-20 different updating and transformation reasons and is
currently the only framework that takes maintenance into account.
4 The X3ML framework
The X3ML framework comprises the X3ML Mapping Definition Language and
the X3ML Engine. Below we will describe them.
4.1 X3ML Mapping Definition Language
The X3ML mapping definition language is an XML-based language which describes
schema mappings in such a way that they can be collaboratively created and
discussed by experts. The X3ML language was designed on the basis of work that
started at FORTH in 2006 [9] and emphasizes establishing a standardized
mapping description which lends itself to collaboration and to the building of
a mapping memory to accumulate knowledge and experience. It was adapted
primarily to be compliant with the DRY principle (avoiding repetition) and to
be more explicit in its contract with the URI generation process. X3ML
separates schema mapping from the concern of generating proper URIs, so that
different expertise can be applied to these two very different
responsibilities.
Schema matching: Schema matching is performed by domain experts, who need to
be concerned only with the correct interpretation of the source schema. The
structure of X3ML is easy to understand and consists of: (a) a header that
contains basic information (title, description, contact persons), the source
and target schemata and a sample record, and (b) a series of mappings, each
containing a domain (the main entity that is being mapped) and a number of
links, each of which consists of a path and a range. Each link describes the
relation (path) of the domain entity to the corresponding range entity.
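The header/mapping/domain/link structure described above can be pictured with
a small data model. The following is only an illustrative sketch in Python,
not the X3ML XML syntax itself: all class and field names are hypothetical,
and the source element names (COIN, WEIGHT) are assumed for the sake of the
coin-weight example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """One end of a link: a source node and the target class it maps to."""
    source_node: str   # expression selecting the source element (hypothetical)
    target_type: str   # class of the target schema


@dataclass
class Link:
    """A path (relation) from the domain entity to a range entity."""
    source_relation: str
    target_relation: str
    range: Entity


@dataclass
class Mapping:
    """A domain (the main entity being mapped) plus any number of links."""
    domain: Entity
    links: List[Link] = field(default_factory=list)


@dataclass
class Header:
    title: str
    description: str
    source_schema: str
    target_schema: str


@dataclass
class MappingDefinition:
    header: Header
    mappings: List[Mapping] = field(default_factory=list)


def describe(definition: MappingDefinition) -> List[str]:
    """Render every link as a readable entity-relation-entity proposition."""
    lines = []
    for mapping in definition.mappings:
        for link in mapping.links:
            source = (f"{mapping.domain.source_node} --{link.source_relation}--> "
                      f"{link.range.source_node}")
            target = (f"{mapping.domain.target_type} --{link.target_relation}--> "
                      f"{link.range.target_type}")
            lines.append(f"{source}  =>  {target}")
    return lines


# Assumed example: a coin record whose weight is mapped to CIDOC CRM.
example = MappingDefinition(
    header=Header("Coin weights", "Illustrative only", "coins (XML)", "CIDOC CRM"),
    mappings=[Mapping(
        domain=Entity("//COIN", "crm:E22_Man-Made_Object"),
        links=[Link("WEIGHT", "crm:P43_has_dimension",
                    Entity("WEIGHT", "crm:E54_Dimension"))],
    )],
)

for line in describe(example):
    print(line)
```

Rendering each link as a single proposition mirrors the idea that every
entity-relation-entity mapping can be read on its own by a domain expert.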
The basic mapping scheme and the corresponding XML structure are shown in
Fig. 2. Each entity-relation-entity of the source schema is mapped
individually to the target schema and can be seen as a self-explanatory,
context-independent proposition. An X3ML structure consists of:
– the mapping between the source domain and the target domain
– the mapping between the source range and the target range
– the proper source path
– the proper target path
– the mapping between source path and target path
Fig. 2. The structure of an X3ML mapping
The X3ML mapping definition language supports 1:N mappings and uses the
following special constructs:
– intermediate nodes used to represent the mapping of a simple source path
to a complex target path (a sequence of path-{entity-path}).
– constant expression nodes used to assign constant attributes (e.g. a con-
stant type) to an entity.
– conditional statements within the target node and target relation support
checks for existence and equality of values and can be combined into boolean
expressions.
– “Same as” variable used to identify a specific node instance for a given
input record that is generated once but is used in a number of locations in
the mapping.
– Join operator (==) used in the source path to denote relational database
joins
– info and comment blocks throughout the mapping specification bridge
the gap between human author and machine executor.
The tools that are currently used to produce the X3ML mapping definition are
restricted to consuming XML input records6. As a result, XPath is used to
specify the source elements and paths, which are evaluated within the context
of the source domain. There is ongoing work on an extended version that will
also support RDF input (see Section 4.5).
URI generation policy: The definition of the URI generation policy follows
the schema matching and is usually performed by an IT expert, who must ensure
that the generated URIs meet certain criteria such as consistency and
uniqueness. A set of predefined URI generators (UUIDs, literals) and templates
is available, but any URI generating function can be implemented and
incorporated into the system. In the X3ML definition, the target domain and
range contain the functions that generate URIs or literals.
6 http://www.ics.forth.gr/isl/3M/
The result of the schema matching and URI generation policy steps is a
complete X3ML mapping definition file that will be fed to the X3ML engine for
the transformation of the data.
Fig. 3 shows how a simple relational database entry that specifies the weight
of a coin is mapped and expressed with respect to the CIDOC CRM schema [7].
The XML structure for the mappings of this example can be found online7.
Fig. 3. Mapping relational db data to CIDOC CRM
4.2 X3ML Engine
The X3ML engine realizes the transformation of the source records to the
target format. The engine takes as input the source data (currently in the
form of an XML document), the description of the mappings in the X3ML mapping
definition file and the URI generation policy file, and is responsible for
transforming the XML document into a valid RDF document which is equivalent to
the XML input with respect to the given mappings and policy. The engine was
originally implemented in the context of the CultureBrokers project, co-funded
by the Swedish Arts Council and the British Museum.
4.3 Design, Architecture and Implementation
The X3ML Engine has been designed with respect to the following design prin-
ciples:
– Simplicity. It is easier to create complicated things than it is to find the
simplicity in something that would otherwise be complex. One important
way to achieve simplicity and clarity is by carefully naming things so that
their meaning is as obvious as possible to the naked eye.
– Transparency. The most important feature of X3ML is its general application
to mapping creation and execution and hopefully its longevity. People must
be able to easily understand how it works. The cleaner the core design of
the engine and X3ML language, and the clearer its documentation, the more
readily it will get traction and become the basis for future mappings.
– Re-use of Standards and Technologies. The best way to build a new software
module is to choose its dependencies carefully and to keep them as few as
possible. Building on top of proven technologies is the quickest way to a
dependable result.
– Facilitating Instance Matching. This involves extracting semantic
information with the intent of generating correct instance URIs.
7 http://139.91.183.44/x3mlEditor/ViewPublished?type=Mapping&id=1
Fig. 4 depicts the main components of the engine. The Input Reader com-
ponent is responsible for reading the input data (currently we support XML
documents, however as we describe later in Section 4.5 more formats will be
supported using proper extensions). The X3ML Parser component is responsi-
ble for reading and manipulating the X3ML mapping definitions. The component
RDF Writer outputs the transformed data into RDF format. The Instance Gen-
erator component produces the URIs and the labels based on the descriptions
that exist in the mappings and finally the Controller component coordinates the
entire process.
[Figure: component model of the X3ML Engine, showing the Controller, Input
Reader, X3ML Parser, Instance Generator and RDF Writer components together
with the XStream, RFC6570 Processor and Jena libraries]
Fig. 4. The main components of X3ML Engine
The X3ML engine has been implemented in Java, producing a single artifact in
the form of a JAR file which contains the engine software. To support the
functionality of the main components we used a set of third-party software
libraries: XStream8 for parsing XML-based documents, Handy URI Templates9 to
support the generation of valid URIs and Jena10 for building the RDF output.
The source code of the X3ML engine framework is available under the Apache
license and can be found at https://github.com/delving/x3ml.
8 http://x-stream.github.io/
9 https://github.com/damnhandy/Handy-URI-Templates
10 https://jena.apache.org/
4.4 Functionality
The X3ML engine takes as input source XML records and generates RDF triples
consisting of subject, predicate, and object. The subject and the object are
“values”, generally consisting of URIs, but objects can also be labels or literal
values.
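The input/output behaviour described above can be sketched in a few lines.
This is a minimal illustration in Python rather than the Java engine itself:
the record layout (COIN, ID, WEIGHT), the dictionary-based mapping format, the
example.org URIs and the `transform` function are all assumptions made for the
example, and the standard library's ElementTree supports only a subset of
XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical source records.
SOURCE = """
<COINS>
  <COIN><ID>42</ID><WEIGHT>3.5</WEIGHT></COIN>
  <COIN><ID>43</ID><WEIGHT>4.1</WEIGHT></COIN>
</COINS>
"""

# Hypothetical, simplified stand-in for a mapping with one domain and one link.
MAPPING = {
    "domain": {
        "source": ".//COIN",
        "type": "crm:E22_Man-Made_Object",
        "uri": "http://example.org/coin/{ID}",
    },
    "links": [
        {
            "source_path": "WEIGHT",
            "relation": "crm:P43_has_dimension",
            "range_type": "crm:E54_Dimension",
            "uri": "http://example.org/coin/{ID}/weight",
        },
    ],
}


def transform(xml_text, mapping):
    """Turn source records into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    domain = mapping["domain"]
    triples = []
    for node in root.findall(domain["source"]):
        ident = node.findtext("ID")
        subject = domain["uri"].format(ID=ident)
        triples.append((subject, "rdf:type", domain["type"]))
        for link in mapping["links"]:
            for target in node.findall(link["source_path"]):
                obj = link["uri"].format(ID=ident)
                # The object of the link is a URI ...
                triples.append((subject, link["relation"], obj))
                triples.append((obj, "rdf:type", link["range_type"]))
                # ... while the measured value is kept as a literal.
                triples.append((obj, "crm:P90_has_value", target.text))
    return triples


for triple in transform(SOURCE, MAPPING):
    print(triple)
```

Each coin yields a typed subject, one link to a dimension node, and a literal
value attached to that node, which illustrates how subjects are URIs while
objects may be either URIs or literals.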
The generation of values (URIs or literals) is handled by the Instance
Generator component. The following block shows two configurations, for
generating (a) URIs and (b) label values.
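<instance_generator name="[generator-name]">
    <arg name="[arg-name]" type="[arg-type]">[arg-value]</arg>
    ...
</instance_generator>
<label_generator name="[generator-name]">
    <arg name="[arg-name]" type="[arg-type]">[arg-value]</arg>
    <arg name="language">[language-code]</arg>
    ...
</label_generator>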
For each entity there must exist one instance generator and any number of
subsequent label generator blocks. The argument type allows for choosing
between xpath and constant, and there is a special argument type called
position which gives the value generator access to the index position of the
source node within its context. The argument with the name language defines
the language tag of the generated value. If it is empty, the generated value
will not carry a language tag (e.g. in the case of number values). The engine
provides default implementations for producing: (a) URIs, (b) UUIDs, (c)
literal values and (d) constant values.
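How the three argument types and the language argument could be resolved is
sketched below. This is an assumption-laden illustration in Python, not the
engine's code: the `Arg` tuple and the `resolve_args` and `make_label`
functions are hypothetical names, and ElementTree's `findtext` accepts only a
limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, NamedTuple


class Arg(NamedTuple):
    """One generator argument (hypothetical representation)."""
    name: str
    type: str    # "xpath", "constant" or "position"
    value: str


def resolve_args(args: List[Arg], node: ET.Element, position: int) -> Dict[str, str]:
    """Evaluate each argument against the current source node."""
    resolved = {}
    for arg in args:
        if arg.type == "xpath":
            # Evaluated relative to the source node of the entity.
            resolved[arg.name] = node.findtext(arg.value, default="")
        elif arg.type == "constant":
            resolved[arg.name] = arg.value
        elif arg.type == "position":
            # Index of the source node within its context.
            resolved[arg.name] = str(position)
        else:
            raise ValueError(f"unknown argument type: {arg.type}")
    return resolved


def make_label(args: List[Arg], node: ET.Element, position: int) -> str:
    """Build a literal, adding a language tag only when one is given."""
    values = resolve_args(args, node, position)
    language = values.pop("language", "")
    text = values["text"]
    return f'"{text}"@{language}' if language else f'"{text}"'


coin = ET.fromstring("<COIN><NAME>Denarius</NAME><WEIGHT>3.5</WEIGHT></COIN>")

tagged = make_label(
    [Arg("text", "xpath", "NAME"), Arg("language", "constant", "en")], coin, 1)
untagged = make_label([Arg("text", "xpath", "WEIGHT")], coin, 1)
print(tagged, untagged)
```

The second call omits the language argument, so the numeric value is emitted
without a language tag, as described for number values above.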
The Instance Generator component is configured through an XML file, which is
given as input to the X3ML engine. When URIs are to be generated on the basis
of source record content, it is wise to leverage existing standards and re-use
the associated implementations. For template-based URI generation the RFC 6570
[8] standard is available, so the component uses an existing implementation
library, as described in Section 4.3. Whenever the required URIs or labels
cannot be generated by the default generators, the simple templates, or the
URI templates, it is always possible to insert a special generator in the form
of a class implementing the InstanceGenerator component interfaces.
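To make the idea of template-based generation and pluggable generators
concrete, the following Python sketch implements only the simplest case of
RFC 6570 (Level 1 simple string expansion) by hand and keeps generators in a
registry. It is a stand-in for illustration, not the Handy URI Templates
library the engine uses; the registry, the generator names and the
example.org template are assumptions, and variable names are restricted to
letters, digits and underscores.

```python
import re
import uuid
from urllib.parse import quote

_VARIABLE = re.compile(r"\{([A-Za-z0-9_]+)\}")


def expand_level1(template, values):
    """RFC 6570 Level 1: replace {var} by its percent-encoded value."""
    def substitute(match):
        value = values.get(match.group(1))
        # Undefined variables expand to the empty string; everything outside
        # the unreserved set (letters, digits, "-", ".", "_", "~") is encoded.
        return "" if value is None else quote(str(value), safe="")
    return _VARIABLE.sub(substitute, template)


# Registry of named generators (hypothetical); each takes a dict of arguments.
GENERATORS = {
    "UUID": lambda args: "urn:uuid:" + str(uuid.uuid4()),
    "Literal": lambda args: args["text"],
}


def register_template(name, template):
    """Add a template-based generator under the given name."""
    GENERATORS[name] = lambda args: expand_level1(template, args)


def generate(name, args):
    """Look up a generator by name and apply it to the resolved arguments."""
    return GENERATORS[name](args)


register_template("CoinURI", "http://example.org/coin/{id}")
print(generate("CoinURI", {"id": "RIC 207/a"}))
print(generate("UUID", {}))
```

A custom generator is then just another entry in the registry, which is the
role played in the engine by a class implementing the generator interface.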
4.5 Configuration/Extensibility
As already discussed, the current version of the X3ML engine takes as input
the source data in the form of an XML document. One extension, which is
currently under development, is to support other types of input. To this end
we have started working on supporting RDF input. This requires several
modifications in the design and implementation of the engine. Most
importantly, the basic construct that we use for reading the source data will
be an RDF model (e.g. Jena, Sesame), so instead of XPath we will be able to
use SPARQL [16]. Furthermore, we will enhance the Instance Generator
component, since we will be able to carry the URIs from the source data over
to the target data if needed.
One apparent advantage of this approach is that the framework will support
input and output of the same format. This prompted us to investigate another
direction: that of invertible X3ML mappings. In an invertible X3ML mapping,
one can uniquely identify (and consequently regenerate) the data in the source
dataset that led to the creation of each piece of data in the target dataset.
Based on this idea, we formalize below the notion of invertibility by
identifying how X3ML maps the source data to the target data.
In particular, we view an X3ML mapping as an association between a “pattern”
(say Ps) in the source dataset and a “pattern” (say Pt) in the target dataset.
This association essentially describes what to put in the target dataset (Pt)
whenever Ps is encountered in the source dataset. Formally, we model Ps and Pt
as SPARQL graph patterns [15, 1], so an X3ML mapping m is just a pair (Ps, Pt)
of SPARQL graph patterns.
Then, given a set of X3ML mappings (say M), we say that M is invertible if and
only if we can guarantee that whenever a pattern (say Pt) is found in the
target dataset, we can uniquely identify the pattern Ps that generated it
(i.e., caused its inclusion in the dataset). To determine this, we look at
each Pt in M (and its corresponding Ps) and identify those mappings that can
potentially cause the same triples to be generated from different source
triples.
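The check described above can be approximated syntactically. The sketch below,
in Python, is a conservative simplification and not the formal procedure: a
graph pattern is modelled as a frozenset of triple patterns, terms starting
with "?" are variables, and two mappings are flagged when they have different
source patterns but target patterns that could yield a common triple. All
names and the example mappings are hypothetical, and finding no overlap with
this test does not by itself establish invertibility.

```python
from itertools import combinations


def is_var(term):
    """Terms beginning with '?' are treated as variables."""
    return term.startswith("?")


def may_coincide(t1, t2):
    """Two triple patterns may yield the same triple if, at every position,
    the terms are equal or at least one of them is a variable."""
    return all(a == b or is_var(a) or is_var(b) for a, b in zip(t1, t2))


def overlapping_mappings(mappings):
    """Return index pairs of mappings (Ps, Pt) with different source patterns
    whose target patterns may generate a common triple."""
    conflicts = []
    for (i, (ps1, pt1)), (j, (ps2, pt2)) in combinations(enumerate(mappings), 2):
        if ps1 != ps2 and any(may_coincide(a, b) for a in pt1 for b in pt2):
            conflicts.append((i, j))
    return conflicts


# Assumed example: coins and medals both become E22 objects, so a target
# triple of that type cannot be traced back to a unique source pattern.
M = [
    (frozenset({("?c", "rdf:type", "src:Coin")}),
     frozenset({("?c", "rdf:type", "crm:E22_Man-Made_Object")})),
    (frozenset({("?m", "rdf:type", "src:Medal")}),
     frozenset({("?m", "rdf:type", "crm:E22_Man-Made_Object")})),
    (frozenset({("?c", "src:weight", "?w")}),
     frozenset({("?c", "crm:P43_has_dimension", "?w")})),
]

print(overlapping_mappings(M))
```

Because variables are treated as matching anything, the test may report
overlaps that cannot occur in practice; it errs on the side of flagging
mappings for closer inspection.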
5 Evaluation and Usage
The X3ML engine is being exploited by several European projects. Specifically,
the ARIADNE project11 initiated several mapping activities using the X3ML
engine to convert existing schemata of archaeological data to CIDOC CRM and
its extension suite. The partners of the ARIADNE project have used X3ML
extensively for the definition of mappings from various categories of
databases, including archaeological museums, buildings, ancient Roman coins,
and more. The ResearchSpace project12 is developing a collaborative
environment for humanities and cultural heritage research. The project has
been using X3ML for the mapping and transformation of data from the
Rijksmuseum, the British Museum and the Yale Center for British Art (YCBA). In
the case of the Rijksmuseum in particular, domain experts from both the
Rijksmuseum and the British Museum were able to successfully map and transform
their data without the assistance of any IT expert. The X3ML engine is also
being exploited by the transformation services of the Greek national
implementation of the European LifeWatch [3] infrastructure for biodiversity,
to transform biodiversity metadata/data such as Darwin Core formats to
semantic models of the CIDOC CRM family.
To evaluate13 the performance of the X3ML engine we used an XML input and an
X3ML mapping example coming from the ARIADNE project as a base to produce
synthetic data that was provided as input to the X3ML engine. Three X3ML
mapping files were created, containing 10, 100 and 1000 mappings, and four XML
input files, containing 10, 100, 1000 and 10000 records. Fig. 5 displays the
evaluation results. We can observe that the overall time depends on both the
number of mappings and the size of the input. For example, the time required
for data transformation is approximately one second when the size of the input
is small (10 records), even if the mappings are many (from 10 to 1000). As the
size of the input increases, however, the overall time required increases as
well. Note that the total number of output records is the total number of
input records multiplied by the number of mappings (e.g. 10 input records with
10 mappings will produce 100 output records). In conclusion, the execution
time is affected equally by the number of mappings and the number of records,
and it is related to the number of links that are created during the
transformation process.
11 http://www.ariadne-infrastructure.eu/
12 http://www.researchspace.org/
13 The experiments were carried out on a PC with an Intel i7 processor, 8GB
RAM, running Windows 7 32 bit.
Fig. 5. X3ML Engine Evaluation Results
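The relation between input size, number of mappings and output size stated in
the evaluation can be tabulated for the twelve test configurations. The short
Python sketch below only reproduces that multiplication; it contains no
timing data, and the function name is hypothetical.

```python
from itertools import product

MAPPING_COUNTS = [10, 100, 1000]          # mappings per X3ML file
RECORD_COUNTS = [10, 100, 1000, 10000]    # records per XML input file


def expected_outputs(mapping_counts, record_counts):
    """Output records per configuration: input records times mappings."""
    return {(m, r): m * r for m, r in product(mapping_counts, record_counts)}


grid = expected_outputs(MAPPING_COUNTS, RECORD_COUNTS)
for (mappings, records), outputs in sorted(grid.items()):
    print(f"{mappings:>5} mappings x {records:>6} records -> {outputs:>9} output records")
```

The largest configuration (1000 mappings, 10000 records) therefore
corresponds to ten million output records.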
6 Conclusion and Future Work
This paper presents a novel framework for the management of the core processes
needed to create, maintain and manage mapping relationships between different
data sources. We described the X3ML mapping definition language, which offers
the building blocks for describing both schema mappings and URI generation
policies, and the X3ML engine, a tool that supports the transformation process
and the generation of URIs and values and is characterized by its scalability
in terms of number of providers, consistent mappings and related end up
processes. We reported some of our experiences in using the framework and
discussed the evaluation results. In the future we plan to continue working on
the extended version of the framework, which will support different types of
input (e.g. RDF documents), and to investigate the invertible X3ML mappings
functionality.
Acknowledgement
This work was partially supported by the project PARTHENOS (H2020 Re-
search Infrastructures, 2015-2019), the project ARIADNE (FP7 Research In-
frastructures, 2013-2017), and the LifeWatch Greece project (National Strategic
Reference Framework, 2012-2015).
References
1. M. Arenas, C. Gutierrez, and J. Pérez. On the Semantics of SPARQL. Springer,
2010.
2. S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, and D. Aumueller. Triplify:
light-weight linked data publication from relational databases. In Proceedings
of the 18th International Conference on World Wide Web, pages 621–630. ACM,
2009.
3. A. Basset and W. Los. Biodiversity e-science: LifeWatch, the European
infrastructure on biodiversity and ecosystem research. Plant Biosystems - An
International Journal Dealing with all Aspects of Plant Biology,
146(4):780–782, 2012.
4. S. Bischof, S. Decker, T. Krennwallner, N. Lopes, and A. Polleres. Mapping
between RDF and XML with XSPARQL. Journal on Data Semantics, 1(3):147–185,
2012.
5. C. Bizer. D2R MAP - a database to RDF mapping language. 2003.
6. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati.
Description logic framework for information integration. In KR, pages 2–13,
1998.
7. M. Doerr. The CIDOC conceptual reference module: an ontological approach to
semantic interoperability of metadata. AI Magazine, 24(3):75, 2003.
8. J. Gregorio, R. Fielding, M. Hadley, M. Nottingham, and D. Orchard. RFC
6570: URI Template. Internet Engineering Task Force (IETF) Request for
Comments, 2012.
9. H. Kondylakis, M. Doerr, and D. Plexousakis. Mapping language for
information integration. Technical Report ICS-FORTH, 385, 2006.
10. C. Lange. Krextor - an extensible framework for contributing content math
to the web of data. In Intelligent Computer Mathematics, pages 304–306.
Springer, 2011.
11. A. Langegger and W. Wöß. XLWrap - querying and integrating arbitrary
spreadsheets with SPARQL. Springer, 2009.
12. B. Lavoie. Meeting the challenges of digital preservation: The OAIS
reference model. OCLC Newsletter, 243:26–30, 2000.
13. T. Berners-Lee. Relational databases on the semantic web. Design Issues
(published on the Web), 1998.
14. M. J. O’Connor, C. Halaschek-Wiener, and M. A. Musen. Mapping Master: A
flexible approach for mapping spreadsheets to OWL. In The Semantic Web - ISWC
2010, pages 194–208. Springer, 2010.
15. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL.
In International Semantic Web Conference, volume 4273, pages 30–43. Springer,
2006.
16. E. Prud’Hommeaux, A. Seaborne, et al. SPARQL query language for RDF. W3C
Recommendation, 15, 2008.
17. P. Szekely, C. A. Knoblock, F. Yang, X. Zhu, E. E. Fink, R. Allen, and
G. Goodlander. Connecting the Smithsonian American Art Museum to the linked
data cloud. In The Semantic Web: Semantics and Big Data, pages 593–607.
Springer, 2013.