=Paper=
{{Paper
|id=Vol-176/paper-6
|storemode=property
|title=Transforming Data from DataPile Structure into RDF
|pdfUrl=https://ceur-ws.org/Vol-176/paper8.pdf
|volume=Vol-176
|dblpUrl=https://dblp.org/rec/conf/dateso/Dokulil06
}}
==Transforming Data from DataPile Structure into RDF==
Jiří Dokulil
Faculty of Mathematics and Physics, Charles University, Prague
Malostranské nám. 25, 118 00 Praha 1, Czech Republic
dokulil@gmail.com
Abstract. A huge amount of interesting data has been gathered in the DataPile
structure since its creation. This data could be used in the development of RDF
databases. When limited to the basic information stored in the DataPile, the
transformation into RDF is straightforward, yet it still provides millions of RDF
triples with a complex structure and many irregularities.
1 Introduction
While it is easy to find huge relational or XML data rich in structure there is still not
much data available in the RDF format. Such data could be obtained by simple
conversion from a relational database but this data would be simple and with a regular
structure.
In this paper we propose a transformation of data stored in the DataPile structure
[1] into RDF [2]. We expect to obtain a huge number of RDF triples with a more
interesting and much less regular structure than data from relational databases. This
expectation is based on the way the DataPile structure is used in practice.
The DataPile system was developed to integrate data from a heterogeneous set of
databases. Among main design goals were storage of historical versions of data and
easy adaptation to global schema changes.
First of all we present the DataPile and RDF models, then describe the
transformation of metadata and data and finally give results of an experimental
implementation of the transformation.
2 The Data Models
In this section we present the data models that take part in the transformation.
2.1 The DataPile
The terminology used in DataPile systems differs from that used in relational
and RDF databases. An entity is a rough equivalent of a table schema. It has a name and
consists of attributes, which can be compared to column definitions. Each attribute
defines an attribute name and data type. The set of allowed data types had to be very
V. Snášel, K. Richta, J. Pokorný (Eds.): Dateso 2006, pp. 54–62, ISBN 80-248-1025-5.
limited because of implementation reasons. The only supported types are string,
number, timestamp, BLOB (Binary Large Object) and a typed reference (foreign
key). The DataPile system allows the definition of multiple entities, each having zero or
more attributes. An attribute cannot be a member of multiple entities. On the other
hand, multiple attributes with the same name can exist, as long as they are members
of different entities.
Entities and attributes are metadata. They define structure of the actual data that
can be stored in the system.
The data consist of attribute values. An attribute value is one data item together
with type information (identifier of an entity attribute), validity period, relevance and
source of the value. Attribute values describing one object are grouped together into
an entity instance. Each entity instance is assigned a unique eighteen-digit number
called entity instance identifier. Only attributes of one entity can be used as types of
attribute values forming one entity instance. This entity is called a type of that entity
instance.
All entity instances of one type can be viewed as a relational table with each row
containing one entity instance. Then one attribute value would be one item of this
table.
All this information (metadata and data) is stored in a relational database with a
special schema called the DataPile structure. This structure and a set of applications
and tools form the DataPile system.
To achieve the goals set for the DataPile, the data could not be stored using
one table for each entity type. Instead, a special DataPile structure was created. Its
center is a single table called PILE, capable of storing all attributes of all
entities along with their history. This table is supplemented by other tables
that store metadata, e.g. the list of entities and their attributes.
One row of the PILE table contains the entity instance identifier, attribute identifier,
attribute value, validity period, and other information used in the data integration
process. The attribute value is spread over several table columns, one column
for each data type. This is why only a fixed and very limited set of data types
is allowed in the DataPile structure.
Let us look at an example. Consider a system for storing basic information about
people. A relational schema could look like this:
PERSON(id, first_name, last_name, date_of_birth).
In a DataPile system this schema would require metadata containing one entity
called “PERSON” consisting of three attributes (first_name, last_name and
date_of_birth). The data type of first_name and last_name would be string in both
models. On the other hand, there is no exact equivalent of the “date” data type, which
would probably be the type of the date_of_birth column in a relational database. The
“timestamp” data type would have to be used instead.
Table 1. Example data to be stored in the DataPile
id | first_name | last_name | date_of_birth
1 | John | Smith | 5.8.1962
2 | Jane | Doe | 23.2.1971
We can now transform the relational data from Table 1 into the DataPile structure.
First of all, both records have to be assigned an entity instance identifier. Normally it
would be an eighteen-digit number, but for convenience we use 101 and 102 as
the identifiers.
Two instances of the entity “PERSON” with identifiers 101 and 102 have to be created.
Then the appropriate attribute values are created in the PILE table. The PILE table
containing these attribute values is shown in Table 2.
The table also contains an example of storing a historical version of data. On
5.7.2005 the first name of Jane Doe was changed to Joan.
Table 2. PILE table with example data (simplified, some columns omitted). ent_id stands for
entity instance identifier.
ent_id | attribute | string value | time value | valid from | valid to
101 | first_name | John | null | 28.5.2005 15:31:20 | null
101 | last_name | Smith | null | 28.5.2005 15:31:20 | null
101 | date_of_birth | null | 5.8.1962 0:00:00 | 28.5.2005 15:31:20 | null
102 | first_name | Jane | null | 27.5.2005 10:12:25 | 5.7.2005 9:25:05
102 | first_name | Joan | null | 5.7.2005 9:25:05 | null
102 | last_name | Doe | null | 27.5.2005 10:12:25 | null
102 | date_of_birth | null | 23.2.1971 0:00:00 | 27.5.2005 10:12:25 | null
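The step from the relational PERSON records to the PILE rows can be sketched in Python. This is only an illustration under stated assumptions: the `PileRow` class, the `TIME_ATTRIBUTES` set and both helper functions are hypothetical names, only the columns shown in the simplified PILE table are modelled, and timestamps are kept as plain strings.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PileRow:
    """One row of the simplified PILE table (hypothetical model)."""
    ent_id: int
    attribute: str
    string_value: Optional[str]
    time_value: Optional[str]
    valid_from: str
    valid_to: Optional[str] = None  # None means "currently valid"


# Assumption: attributes whose value goes into the "time value" column.
TIME_ATTRIBUTES = {"date_of_birth"}


def to_pile_rows(ent_id, record, valid_from):
    """Create one PILE row per attribute of a relational record."""
    rows = []
    for attribute, value in record.items():
        if attribute in TIME_ATTRIBUTES:
            rows.append(PileRow(ent_id, attribute, None, value, valid_from))
        else:
            rows.append(PileRow(ent_id, attribute, value, None, valid_from))
    return rows


def change_string_value(rows, ent_id, attribute, new_value, when):
    """Close the currently valid row and append the new version."""
    updated = []
    for row in rows:
        is_current = (row.ent_id, row.attribute, row.valid_to) == (ent_id, attribute, None)
        updated.append(replace(row, valid_to=when) if is_current else row)
    updated.append(PileRow(ent_id, attribute, new_value, None, when))
    return updated


pile = to_pile_rows(
    101,
    {"first_name": "John", "last_name": "Smith", "date_of_birth": "5.8.1962 0:00:00"},
    "28.5.2005 15:31:20",
)
pile += to_pile_rows(
    102,
    {"first_name": "Jane", "last_name": "Doe", "date_of_birth": "23.2.1971 0:00:00"},
    "27.5.2005 10:12:25",
)
pile = change_string_value(pile, 102, "first_name", "Joan", "5.7.2005 9:25:05")

for row in pile:
    print(row)
```

The old value is never overwritten: the change only closes its validity period, which is how the history of an attribute is preserved.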
2.2 The RDF
One of the goals of the RDF is integration of data gathered about resources on the
World Wide Web. Such data tend to be rich in structure and often incomplete.
The RDF is used to make statements about resources. An RDF statement is a triple
consisting of a subject, a predicate and an object. This states that the subject has a
property (predicate) with a certain value (object). The statement is modeled as a graph
with one node for the subject, one node for the object and an arc for the predicate,
directed from the subject node to the object node.
A typical example looks like this:
”John Smith”
This states that the book identified by URI was
created by John Smith. The book is the subject, “created” is the predicate and John
Smith is the object of the triple. In this example we represent John Smith by a literal
“John Smith”. A literal is a constant expression that can be typed or untyped (plain).
Literals are used to represent values like numbers and dates by their lexical
representation. It is always possible to use a URI instead of a literal, e.g.
. Then we could also make statements about
John Smith. Literals can only be used as objects, while URIs can take any place in a
triple.
URIs are represented by named nodes in the RDF graph. However we do not
always need direct access to every node in the graph. Some nodes are always accessed
using arcs from other nodes. These nodes do not need universal identifiers like URIs.
They can be created as blank nodes. These nodes can be used as subjects and objects.
Blank nodes are usually assigned a unique identifier when the graph is serialized to a
triples representation. A common way of writing such identifiers is _:identifier, e.g.
“_:blank123”. Such an identifier represents the same blank node throughout the
representation of the graph. Different identifiers represent different blank nodes.
3 The Transformation
The basic idea behind the transformation is that by making a projection of the PILE
table on the columns containing the entity instance identifier, attribute and attribute value
we obtain a set of triples representing statements very similar to RDF statements.
3.1 The Entity Instance Identifiers
All entity instances in the DataPile are assigned a unique eighteen digit number called
entity instance identifier.
We need a way to create nodes with unique names in the RDF graph that will
represent the objects we want to make statements about. The entity instance identifier
is ideal for this. It can be used either as a part of a URI represented by the node or as
an identifier of a blank node if we choose not to give a name to the node. In this paper
we describe the latter approach. We wanted to create data that would help in the
development of RDF databases, and queries containing or returning blank nodes are an
important feature of the database we want to test.
If naming of the nodes is required then the transformation process can easily be
modified to create nodes with URIs.
3.2 The Metadata
Processing of the data in the DataPile is controlled by metadata that is stored in
relational tables. In order for the transformation to work at least some part of the
metadata must be stored in the RDF as well.
The most important metadata to transform are the attributes of the entities.
They serve as predicates (arcs of the RDF graph). The very basic RDF representation
of a single attribute looks like this (in Turtle notation [3]):
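@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mt: <http://example.org/stoh/metadata/> .
mt:person__name rdf:type rdf:Property .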
This defines http://example.org/stoh/metadata/person__name to be an attribute.
The original version in the DataPile was an attribute called “name” belonging to an
entity called “person”. We can represent this information in the RDF as well.
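@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mt: <http://example.org/stoh/metadata/> .
mt:person rdf:type rdfs:Class .
mt:person__name rdf:type rdf:Property .
mt:person__name rdfs:domain mt:person .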
Alternatively we could name the attribute only by its name in the DataPile and
omit the name of the entity. This would allow us to make queries like “Give me the
name of all entity instances that have a name”. On the other hand it would complicate
type checking of the values. Because of this we chose the more specific names.
With the information about entities we can specify a type (entity) of an entity
instance.
_:568421369754123695 rdf:type mt:person .
The subject of the triple is a blank node with an eighteen digit identifier identical to
the entity instance identifier in the DataPile.
3.3 The Data Types
The DataPile uses a limited number of data types for the attributes. They are listed in
Table 3 together with their equivalents after the transformation.
Table 3. Data types in the DataPile and after the transformation
DataPile type | Type after the transformation
string | http://www.w3.org/2001/XMLSchema#string
number | http://www.w3.org/2001/XMLSchema#decimal
timestamp | http://www.w3.org/2001/XMLSchema#dateTime
entity reference | reference to a blank node
Using these data types we can extend the transformed metadata representation.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mt: <http://example.org/stoh/metadata/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
mt:person rdf:type rdfs:Class .
mt:person__name rdf:type rdf:Property .
mt:person__name rdfs:domain mt:person .
mt:person__name rdfs:range xsd:string .
The entity references in the DataPile are typed references. One attribute can only
be used to reference one specified entity. This is equivalent to specifying one class as
a range of a property.
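@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mt: <http://example.org/stoh/metadata/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
mt:person rdf:type rdfs:Class .
mt:address rdf:type rdfs:Class .
mt:person__address rdf:type rdf:Property .
mt:person__address rdfs:domain mt:person .
mt:person__address rdfs:range mt:address .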
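The metadata triples for an attribute follow a fixed pattern, so they could be generated mechanically. The following Python sketch illustrates this under assumptions: the function name `attribute_metadata` and its parameters are hypothetical, the `mt:`, `rdf:`, `rdfs:` and `xsd:` prefixes are taken to be declared elsewhere in the output, and only the data types listed in Table 3 are handled.

```python
# Mapping of DataPile data types to XML Schema types, as given in Table 3.
XSD_TYPES = {
    "string": "xsd:string",
    "number": "xsd:decimal",
    "timestamp": "xsd:dateTime",
}


def attribute_metadata(entity, attribute, data_type=None, target_entity=None):
    """Return the Turtle lines describing one attribute of one entity.

    For a typed reference pass target_entity; otherwise pass data_type.
    """
    prop = f"mt:{entity}__{attribute}"
    if target_entity is not None:
        range_term = f"mt:{target_entity}"  # typed reference: range is a class
    else:
        range_term = XSD_TYPES[data_type]   # plain value: range is a data type
    return [
        f"{prop} rdf:type rdf:Property .",
        f"{prop} rdfs:domain mt:{entity} .",
        f"{prop} rdfs:range {range_term} .",
    ]


name_lines = attribute_metadata("person", "name", data_type="string")
address_lines = attribute_metadata("person", "address", target_entity="address")

print("\n".join(name_lines + address_lines))
```

Qualifying the property name with the entity name keeps attributes with the same name in different entities apart, which matches the naming choice made in Section 3.2.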
3.4 Transforming the Data
After storing the necessary metadata in the RDF graph we can start transforming the
real data. Since every row of the PILE table contains entity instance identifier,
attribute identifier and typed value we can make a simple projection of the PILE table
on these columns and create one RDF triple from each row.
An example output could look like this:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix mt: <http://example.org/stoh/metadata/> .
_:568421369754123695 mt:person__name "John Smith" .
_:568421369754123695 mt:person__date_of_birth
"1980-08-14T00:00:00"^^xsd:dateTime .
_:568421369754123695 mt:person__height
"1.82"^^xsd:decimal .
_:568421369754123695 mt:person__father
_:684258941535789524 .
The last triple is a reference to another entity instance (a foreign key).
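The projection described in this section can be illustrated with a small Python sketch. It is not the experimental implementation: it uses an in-memory SQLite database instead of Oracle, and the table layout (`pile` with the columns `string_value`, `number_value`, `time_value` and `ref_value`), the function names and the assumption that the attribute column already holds the entity-qualified name are all hypothetical.

```python
import sqlite3


def rdf_object(string_value, number_value, time_value, ref_value):
    """Pick the one filled value column and format it as a Turtle object."""
    if ref_value is not None:
        return f"_:{ref_value}"  # typed reference becomes a blank node
    if time_value is not None:
        return f'"{time_value}"^^xsd:dateTime'
    if number_value is not None:
        return f'"{number_value}"^^xsd:decimal'
    escaped = string_value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def extract_triples(conn):
    """Project the PILE table on identifier, attribute and value columns."""
    cursor = conn.execute(
        "SELECT ent_id, attribute, string_value, number_value, time_value, ref_value "
        "FROM pile ORDER BY ent_id, attribute"
    )
    for ent_id, attribute, s, n, t, r in cursor:
        yield f"_:{ent_id} mt:{attribute} {rdf_object(s, n, t, r)} ."


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pile (ent_id INTEGER, attribute TEXT, string_value TEXT, "
    "number_value TEXT, time_value TEXT, ref_value INTEGER)"
)
conn.executemany(
    "INSERT INTO pile VALUES (?, ?, ?, ?, ?, ?)",
    [
        (568421369754123695, "person__name", "John Smith", None, None, None),
        (568421369754123695, "person__date_of_birth", None, None, "1980-08-14T00:00:00", None),
        (568421369754123695, "person__height", None, "1.82", None, None),
        (568421369754123695, "person__father", None, None, None, 684258941535789524),
    ],
)

triples = list(extract_triples(conn))
print("\n".join(triples))
```

Each row yields exactly one triple, and the only per-row decision is which value column is filled. The number is stored as text here so that its lexical form is preserved.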
3.5 Multilingual Attributes
Although the presented general transformation is capable of handling all types of data
stored in the DataPile, there is one case that could be handled in a better way. Practical
application of the DataPile showed that it is sometimes necessary to handle string
values that need to be expressed in different languages, for instance the name of a
department in Czech and English, or the same word in different grammatical cases.
In the DataPile this required creating two new entities, which made the feature
hard to use. The RDF offers an easier way of achieving the same result. The
standard allows a language tag to be specified for every string literal. The language
tags are defined by RFC 3066 [4], which is flexible enough to specify not only
the language but different grammatical cases as well.
_:469751359754692454 rdf:type mt:department .
_:469751359754692454 mt:department__name
"Katedra softwarového inženýrství"@cs .
_:469751359754692454 mt:department__name
"Department of Software Engineering"@en .
_:954783125769542934 rdf:type mt:place .
_:954783125769542934 mt:place__name
"Praha"@cs-CZ-singular-nominative .
_:954783125769542934 mt:place__name
"v Praze"@cs-CZ-singular-locative .
3.6 Reification
One of the important features of the RDF is the ability to make statements about
statements. This is called reification. It can be used e.g. to specify the author of a
statement.
There is no such universal feature in the DataPile. On the other hand the
supplementary columns of the PILE table can be viewed as a special case of
reification with a fixed set of predicates. The columns contain information about
source of the value, its validity period etc.
Unfortunately, expressing reification in RDF is not very compact. It requires using
a new blank node and making at least four statements. The identifier of the blank
node can be generated from the primary key of the PILE table, which is
a sequential numeric value.
_:568421369754123695 mt:person__name "John Smith" .
_:r65413 rdf:type rdf:Statement .
_:r65413 rdf:subject _:568421369754123695 .
_:r65413 rdf:predicate mt:person__name .
_:r65413 rdf:object "John Smith" .
_:r65413 mt:valid_from "20050703T15:21:49" .
_:r65413 mt:valid_to "20050821T09:35:12" .
The example shows a triple stating a name of a person together with triples that
give validity period of the statement.
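Since the experimental implementation does not support reification, the following Python function is only a sketch of how the triples above could be produced from one PILE row. The function name `reify` and its parameters are hypothetical; the blank node label is built from the primary key with an "r" prefix, as in the example, and the validity timestamps are written as plain literals exactly as they are passed in.

```python
def reify(primary_key, subject, predicate, obj, valid_from, valid_to=None):
    """Return the statement itself plus the triples that reify it."""
    node = f"_:r{primary_key}"  # blank node derived from the PILE primary key
    lines = [
        f"{subject} {predicate} {obj} .",
        f"{node} rdf:type rdf:Statement .",
        f"{node} rdf:subject {subject} .",
        f"{node} rdf:predicate {predicate} .",
        f"{node} rdf:object {obj} .",
        f'{node} mt:valid_from "{valid_from}" .',
    ]
    if valid_to is not None:  # a currently valid value has no end of validity
        lines.append(f'{node} mt:valid_to "{valid_to}" .')
    return lines


reified = reify(
    65413,
    "_:568421369754123695",
    "mt:person__name",
    '"John Smith"',
    "20050703T15:21:49",
    "20050821T09:35:12",
)
print("\n".join(reified))
```

One source row grows into six or seven triples, which shows why the representation is called not very compact.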
4 The Experimental Implementation
An experimental implementation of the presented transformation has been created and
tested on real data.
4.1 Limitations
The implementation does not include direct support for multilingual attributes nor
does it support reification.
4.2 The Data
The data for the experiment has been gathered into the DataPile from different
information systems at the Charles University in Prague. The variability of these systems
provided us with data that not only has a complex schema but also varies greatly in its
completeness.
Because the implementation does not support reification the data was limited only
to records that are considered to be currently valid. Working with historical versions
of data requires access to supplementary columns of the PILE table which requires
reification. If all of the data was extracted without the supplementary information it
would have created multiple attribute values for one attribute of one entity instance
without a way to distinguish the valid values from the historical ones.
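The effect of this restriction can be shown on the Jane/Joan example from Section 2.1. The Python sketch below is illustrative only; the tuple layout (identifier, attribute, value, end of validity) and the function name `values_by_attribute` are assumptions, and a missing end of validity is taken to mean that the value is currently valid.

```python
from collections import defaultdict

# (ent_id, attribute, value, valid_to); valid_to None means currently valid
rows = [
    (102, "first_name", "Jane", "5.7.2005 9:25:05"),
    (102, "first_name", "Joan", None),
    (102, "last_name", "Doe", None),
]


def values_by_attribute(rows, only_current):
    """Group values by (entity instance, attribute), optionally dropping history."""
    grouped = defaultdict(list)
    for ent_id, attribute, value, valid_to in rows:
        if only_current and valid_to is not None:
            continue  # historical version, skipped without reification
        grouped[(ent_id, attribute)].append(value)
    return dict(grouped)


everything = values_by_attribute(rows, only_current=False)
current = values_by_attribute(rows, only_current=True)

print(everything[(102, "first_name")])
print(current[(102, "first_name")])
```

Without the filter the first name has two values and nothing in the plain triples says which one is valid; with the filter only the current value remains.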
4.3 The Test Environment
The current implementation of the DataPile uses Oracle Database 10g for storage. The
database was running on a dual XEON P4 2.4 GHz with 2GB RAM and SCSI RAID.
The extractor itself was running on a separate machine with four XEON P4 2.5
GHz CPUs, 16GB RAM and SCSI RAID. It accessed the database directly using
the Oracle Call Interface with a thin abstraction layer on top of it.
The performance of the extraction process depends mostly on the performance of
the database. Processing of records returned from the database does not require much
memory or CPU time.
4.4 The Extraction
The extraction generated a TURTLE file with 26 813 044 RDF triples. We made two
runs of the extraction. In the first run the data was sorted by the entity instance
identifier and attribute. The sorting of the data was done by the database system that
contains the DataPile. Although it is not required for the transformation to work it can
improve performance of further processing of the data and help with debugging.
In the second run the data was not sorted at all.
The sorted version finished in 1738 seconds while the unsorted took 1073 seconds
to complete.
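The reported times can be turned into throughput figures with simple arithmetic. The short Python sketch below uses only the numbers given above; the variable and function names are arbitrary.

```python
TRIPLES = 26_813_044      # triples in the generated file
SORTED_SECONDS = 1738     # run with sorting by the database
UNSORTED_SECONDS = 1073   # run without sorting


def triples_per_second(seconds):
    """Average extraction throughput for one run."""
    return TRIPLES / seconds


sorted_rate = triples_per_second(SORTED_SECONDS)
unsorted_rate = triples_per_second(UNSORTED_SECONDS)
slowdown = SORTED_SECONDS / UNSORTED_SECONDS

print(f"sorted:   {sorted_rate:.0f} triples/s")
print(f"unsorted: {unsorted_rate:.0f} triples/s")
print(f"sorted run took {slowdown:.2f}x as long")
```

This gives roughly 15 400 triples per second for the sorted run and roughly 25 000 for the unsorted run, so sorting made the extraction take about 1.62 times as long.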
5 Conclusion
Even the very basic version of the extraction provided a large amount of interesting
data. Implementation of a version handling multilingual attributes is planned in the
near future.
We plan to use the extracted data in the development of an experimental RDF
database that uses the SPARQL query language [5]. It will help us test and tune the
performance of such a database. The data was gathered from systems that are used in
practice, so its schema, size and structure represent the real requirements of such
systems. The test results should tell us how the database would behave when
deployed as the basis of a large-scale information system or of a system integrating large
heterogeneous data.
References
1. Bednarek D., Obdrzalek D., Yaghob J., Zavoral F.: Data Integration Using DataPile
Structure. In proceedings of the 9th East-European Conference on Advances in Database
and Information Systems, Tallinn, Estonia, 2005
2. Carroll J. J., Klyne G.: Resource Description Framework (RDF): Concepts and Abstract
Syntax, W3C Recommendation, 10 February 2004
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
3. Beckett D.: Turtle - Terse RDF Triple Language
http://www.dajobe.org/2004/01/turtle/
4. Alvestrand H.: Tags for the Identification of Languages
http://www.ietf.org/rfc/rfc3066.txt
5. Prud'hommeaux E., Seaborne A.: SPARQL Query Language for RDF, W3C Working Draft,
23 November 2005
http://www.w3.org/TR/2005/WD-rdf-sparql-query-20051123/
Acknowledgement
This research was supported in part by the National programme of research
(Information society project 1ET100300419).