Transforming Data from DataPile Structure into RDF

Jiří Dokulil

Faculty of Mathematics and Physics, Charles University, Prague
Malostranské nám. 25, 118 00 Praha 1, Czech Republic
dokulil@gmail.com

Abstract. A huge amount of interesting data has been gathered in the DataPile structure since its creation. This data could be used in the development of RDF databases. When limited to the basic information stored in the DataPile, the transformation into RDF is straightforward. It still provides millions of RDF triples with complex structure and many irregularities.

1 Introduction

While it is easy to find huge relational or XML data sets rich in structure, there is still not much data available in the RDF format. Such data could be obtained by simple conversion from a relational database, but the result would be simple and regular in structure. In this paper we propose a transformation of data stored in the DataPile structure [1] into RDF [2]. We expect to obtain a huge number of RDF triples with a more interesting and much less regular structure than data from relational databases. This expectation is based on the way the DataPile structure is used in practice.

The DataPile system was developed to integrate data from a heterogeneous set of databases. Among the main design goals were storage of historical versions of data and easy adaptation to global schema changes.

First we present the DataPile and RDF models, then describe the transformation of metadata and data, and finally give the results of an experimental implementation of the transformation.

2 The Data Models

In this section we present the data models that take part in the transformation.

2.1 The DataPile

The terminology used in DataPile systems differs from that used in relational and RDF databases. An entity is a rough equivalent of a table schema.
It has a name and consists of attributes, which can be compared to column definitions. Each attribute defines an attribute name and a data type. The set of allowed data types had to be very limited for implementation reasons. The only supported types are string, number, timestamp, BLOB (Binary Large Object) and a typed reference (foreign key).

V. Snášel, K. Richta, J. Pokorný (Eds.): Dateso 2006, pp. 54–62, ISBN 80-248-1025-5.

The DataPile system allows the definition of multiple entities, each having zero or more attributes. One attribute cannot be a member of multiple entities. On the other hand, multiple attributes with the same name can exist, as long as they are members of different entities.

Entities and attributes are metadata. They define the structure of the actual data that can be stored in the system. The data consist of attribute values. An attribute value is one data item together with type information (the identifier of an entity attribute), validity period, relevance and source of the value.

Attribute values describing one object are grouped together into an entity instance. Each entity instance is assigned a unique eighteen-digit number called the entity instance identifier. Only attributes of one entity can be used as types of the attribute values forming one entity instance. This entity is called the type of that entity instance.

All entity instances of one type can be viewed as a relational table in which each row contains one entity instance. One attribute value is then one item of this table.

All this information (metadata and data) is stored in a relational database with a special schema called the DataPile structure. This structure, together with a set of applications and tools, forms the DataPile system.

In order to achieve the goals set for the DataPile, the data could not be stored using one table for each entity type. Instead, a special DataPile structure was created.
The center of the structure is one table called PILE, capable of storing all attributes of all entities along with their history. This table is supplemented by other tables that store metadata, e.g. the list of entities and their attributes. One row of the PILE table contains the entity instance identifier, attribute identifier, attribute value, validity period, and other information used in the data integration process. The attribute value is spread over several table columns, one column for each data type. This is the reason why only a fixed and very limited set of data types is allowed in the DataPile structure.

Let us look at an example. Consider a system for storing basic information about people. The relational schema could look like this: PERSON(id, first_name, last_name, date_of_birth). In a DataPile system this schema would require metadata containing one entity called “PERSON” consisting of three attributes (first_name, last_name and date_of_birth). The data type of first_name and last_name would be a string in both models. On the other hand, there is no exact equivalent of the “date” data type, which would probably be the type of the date_of_birth column in a relational database. The “timestamp” data type would have to be used instead.

Table 1. Example data to be stored in the DataPile

  id  first_name  last_name  date_of_birth
  1   John        Smith      5.8.1962
  2   Jane        Doe        23.2.1971

We can now transform the relational data from Table 1 into the DataPile structure. First of all, both records have to be assigned an entity instance identifier. Normally it would be an eighteen-digit number, but for convenience we use 101 and 102 as the identifiers. Two instances of the entity “PERSON” with identifiers 101 and 102 have to be created. Then the appropriate attribute values are created in the PILE table. The PILE table containing these attributes is displayed in Table 2.
The table also contains an example of storing a historical version of data. On 5.7.2005 the first name of Jane Doe was changed to Joan.

Table 2. PILE table with example data (simplified, some columns omitted). Ent_id stands for entity instance identifier.

  ent_id  attribute      string value  time value         valid from           valid to
  101     first_name     John          null               28.5.2005 15:31:20   null
  101     last_name      Smith         null               28.5.2005 15:31:20   null
  101     date_of_birth  null          5.8.1962 0:00:00   28.5.2005 15:31:20   null
  102     first_name     Jane          null               27.5.2005 10:12:25   5.7.2005 9:25:05
  102     first_name     Joan          null               5.7.2005 9:25:05     null
  102     last_name      Doe           null               27.5.2005 10:12:25   null
  102     date_of_birth  null          23.2.1971 0:00:00  27.5.2005 10:12:25   null

2.2 The RDF

One of the goals of RDF is the integration of data gathered about resources on the World Wide Web. Such data tend to be rich in structure and often incomplete.

RDF is used to make statements about resources. An RDF statement is a triple consisting of a subject, a predicate and an object. It states that the subject has a property (predicate) with a certain value (object). The statement is modeled as a graph with one node for the subject, one node for the object and an arc for the predicate, directed from the subject node to the object node. A typical example looks like this: "John Smith" This states that the book identified by the URI was created by John Smith. The book is the subject, “created” is the predicate and John Smith is the object of the triple.

In this example we represent John Smith by the literal “John Smith”. A literal is a constant expression that can be typed or untyped (plain). Literals are used to represent values like numbers and dates by their lexical representation. It is always possible to use a URI instead of a literal. Then we could also make statements about John Smith. Literals can only be used as objects, while URIs can take any place in a triple.
URIs are represented by named nodes in the RDF graph. However, we do not always need direct access to every node in the graph. Some nodes are always accessed using arcs from other nodes. These nodes do not need universal identifiers like URIs. They can be created as blank nodes. Blank nodes can be used as subjects and objects.

Blank nodes are usually assigned a unique identifier when the graph is serialized to a triples representation. A common way of writing such identifiers is _:identifier, e.g. “_:blank123”. One identifier represents the same blank node in the whole representation of the graph. Different identifiers represent different blank nodes.

3 The Transformation

The basic idea behind the transformation is that by projecting the PILE table onto the columns containing the entity instance identifier, attribute and attribute value, we obtain a set of triples representing statements very similar to RDF statements.

3.1 The Entity Instance Identifiers

All entity instances in the DataPile are assigned a unique eighteen-digit number called the entity instance identifier. We need a way to create nodes with unique names in the RDF graph that will represent the objects we want to make statements about. The entity instance identifier is ideal for this. It can be used either as a part of a URI represented by the node, or as the identifier of a blank node if we choose not to give a name to the node. In this paper we describe the latter approach, since we wanted to create data that would help in the development of RDF databases, and queries containing or returning blank nodes are an important feature of the database we want to test. If naming of the nodes is required, the transformation process can easily be modified to create nodes with URIs.

3.2 The Metadata

Processing of the data in the DataPile is controlled by metadata that is stored in relational tables. In order for the transformation to work, at least some part of the metadata must be stored in the RDF as well.
The most important pieces of metadata to transform are the attributes of the entities. They serve as predicates (arcs of the RDF graph). The very basic RDF representation of a single attribute looks like this (TURTLE notation [3]).

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix mt: <http://example.org/stoh/metadata/> .
  mt:person__name rdf:type rdf:Property .

This defines http://example.org/stoh/metadata/person__name to be an attribute. The original version in the DataPile was an attribute called “name” belonging to an entity called “person”. We can represent this information in the RDF as well.

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix mt: <http://example.org/stoh/metadata/> .
  mt:person rdf:type rdfs:Class .
  mt:person__name rdf:type rdf:Property .
  mt:person__name rdfs:domain mt:person .

Alternatively, we could name the attribute only by its name in the DataPile and omit the name of the entity. This would allow us to make queries like “Give me the name of all entity instances that have a name”. On the other hand, it would complicate type checking of the values. Because of this we chose the more specific names.

With the information about entities we can specify the type (entity) of an entity instance.

  _:568421369754123695 rdf:type mt:person .

The subject of the triple is a blank node with an eighteen-digit identifier identical to the entity instance identifier in the DataPile.

3.3 The Data Types

The DataPile uses a limited number of data types for the attributes. They are listed in Table 3 together with their equivalents after the transformation.

Table 3. Data types in the DataPile and after the transformation

  DataPile type     After the transformation
  string            http://www.w3.org/2001/XMLSchema#string
  number            http://www.w3.org/2001/XMLSchema#decimal
  timestamp         http://www.w3.org/2001/XMLSchema#dateTime
  entity reference  reference to a blank node

Using these data types we can extend the transformed metadata representation.

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix mt: <http://example.org/stoh/metadata/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  mt:person rdf:type rdfs:Class .
  mt:person__name rdf:type rdf:Property .
  mt:person__name rdfs:domain mt:person .
  mt:person__name rdfs:range xsd:string .

The entity references in the DataPile are typed references. One attribute can only be used to reference one specified entity. This is equivalent to specifying one class as the range of a property.

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix mt: <http://example.org/stoh/metadata/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  mt:person rdf:type rdfs:Class .
  mt:address rdf:type rdfs:Class .
  mt:person__address rdf:type rdf:Property .
  mt:person__address rdfs:domain mt:person .
  mt:person__address rdfs:range mt:address .

3.4 Transforming the Data

After storing the necessary metadata in the RDF graph we can start transforming the real data. Since every row of the PILE table contains an entity instance identifier, an attribute identifier and a typed value, we can make a simple projection of the PILE table onto these columns and create one RDF triple from each row. An example output could look like this:

  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  _:568421369754123695 mt:person__name "John Smith" .
  _:568421369754123695 mt:person__date_of_birth "1980-08-14T00:00:00"^^xsd:dateTime .
  _:568421369754123695 mt:person__height "1.82"^^xsd:decimal .
  _:568421369754123695 mt:person__father _:684258941535789524 .

The last triple is a reference to another entity instance (a foreign key).

3.5 Multilingual Attributes

Although the presented general transformation is capable of handling all types of data stored in the DataPile, there is one case that could be handled in a better way. Practical application of the DataPile showed that it is sometimes necessary to handle string values that need to be expressed in different languages, for instance the name of a department in Czech and English, or the same word in different grammatical cases. In the DataPile it was necessary to create two new entities, which made this feature hard to use.
The RDF offers an easier way of achieving the same results. The standard offers a way to specify a language tag for every string literal. The language tags are defined by RFC 3066 [4], which is flexible enough to specify not only the language but different cases as well.

  _:469751359754692454 rdf:type mt:department .
  _:469751359754692454 mt:department__name "Katedra softwarového inženýrství"@cs .
  _:469751359754692454 mt:department__name "Department of Software Engineering"@en .
  _:954783125769542934 rdf:type mt:place .
  _:954783125769542934 mt:place__name "Praha"@cs-CZ-singular-nominative .
  _:954783125769542934 mt:place__name "v Praze"@cs-CZ-singular-locative .

3.6 Reification

One of the important features of the RDF is the ability to make statements about statements. This is called reification. It can be used e.g. to specify the author of a statement. There is no such universal feature in the DataPile. On the other hand, the supplementary columns of the PILE table can be viewed as a special case of reification with a fixed set of predicates. The columns contain information about the source of the value, its validity period etc.

Unfortunately, expressing reification in RDF is not very compact. It requires a new blank node and at least four statements. The identifier of the blank node can be generated from the primary key of the PILE table, which contains a sequential numeric value.

  _:568421369754123695 mt:person__name "John Smith" .
  _:r65413 rdf:type rdf:Statement .
  _:r65413 rdf:subject _:568421369754123695 .
  _:r65413 rdf:predicate mt:person__name .
  _:r65413 rdf:object "John Smith" .
  _:r65413 mt:valid_from "20050703T15:21:49" .
  _:r65413 mt:valid_to "20050821T09:35:12" .

The example shows a triple stating the name of a person together with triples that give the validity period of the statement.
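The projection described in section 3.4 and the reification pattern of section 3.6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiment: the PileRow record, its field names and the helper functions literal and row_to_triples are hypothetical, the predicate is assumed to be already written with the mt: prefix, and entity references (which would become blank node objects) are left out for brevity.

```python
from typing import List, NamedTuple, Optional


class PileRow(NamedTuple):
    """Hypothetical, simplified view of one PILE row (field names are assumptions)."""
    row_id: int               # sequential primary key of the PILE table
    ent_id: int               # entity instance identifier
    predicate: str            # prefixed attribute name, e.g. "mt:person__name"
    value: str                # lexical form of the attribute value
    datatype: Optional[str]   # e.g. "xsd:dateTime"; None means a plain literal
    valid_from: str
    valid_to: Optional[str]   # None while the value is still valid


def literal(value: str, datatype: Optional[str] = None) -> str:
    """Format a Turtle literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    text = f'"{escaped}"'
    return f"{text}^^{datatype}" if datatype else text


def row_to_triples(row: PileRow, reify: bool = True) -> List[str]:
    """Create the data triple for one row and, optionally, its reification."""
    subject = f"_:{row.ent_id}"
    obj = literal(row.value, row.datatype)
    triples = [f"{subject} {row.predicate} {obj} ."]
    if reify:
        # Blank node identifier derived from the primary key of the row.
        node = f"_:r{row.row_id}"
        triples += [
            f"{node} rdf:type rdf:Statement .",
            f"{node} rdf:subject {subject} .",
            f"{node} rdf:predicate {row.predicate} .",
            f"{node} rdf:object {obj} .",
            f"{node} mt:valid_from {literal(row.valid_from)} .",
        ]
        if row.valid_to is not None:
            triples.append(f"{node} mt:valid_to {literal(row.valid_to)} .")
    return triples


if __name__ == "__main__":
    example = PileRow(65413, 568421369754123695, "mt:person__name",
                      "John Smith", None,
                      "20050703T15:21:49", "20050821T09:35:12")
    print("\n".join(row_to_triples(example)))
```

Called with reify=False the sketch reduces to the plain projection of section 3.4, one triple per row. With reification enabled, every row yields six or seven triples, which illustrates why the reified form is considerably less compact.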
4 The Experimental Implementation

An experimental implementation of the presented transformation has been created and tested on real data.

4.1 Limitations

The implementation does not include direct support for multilingual attributes, nor does it support reification.

4.2 The Data

The data for the experiment has been gathered into the DataPile from different information systems at the Charles University in Prague. The variability of these systems provided us with data that not only have a complex schema but also vary greatly in completeness.

Because the implementation does not support reification, the data was limited to records that are considered to be currently valid. Working with historical versions of data requires access to the supplementary columns of the PILE table, which requires reification. If all of the data had been extracted without the supplementary information, the result would contain multiple attribute values for one attribute of one entity instance, without a way to distinguish the valid values from the historical ones.

4.3 The Test Environment

The current implementation of the DataPile uses Oracle Database 10g for storage. The database was running on a dual XEON P4 2.4 GHz machine with 2 GB RAM and SCSI RAID.

The extractor itself was running on a separate machine with four XEON P4 2.5 GHz CPUs, 16 GB RAM and SCSI RAID. It accessed the database directly using the Oracle Call Interface with a thin abstraction layer on top of it.

The performance of the extraction process depends mostly on the performance of the database. Processing of the records returned from the database does not require much memory or CPU time.

4.4 The Extraction

The extraction generated a TURTLE file with 26 813 044 RDF triples.

We made two runs of the extraction. In the first run the data was sorted by the entity instance identifier and attribute. The sorting of the data was done by the database system that contains the DataPile.
Although sorting is not required for the transformation to work, it can improve the performance of further processing of the data and help with debugging. In the second run the data was not sorted at all.

The sorted version finished in 1738 seconds, while the unsorted one took 1073 seconds to complete.

5 Conclusion

Even the very basic version of the extraction provided a great amount of interesting data. Implementation of a version handling multilingual attributes is planned for the near future.

We plan to use the extracted data in the development of an experimental RDF database that uses the SPARQL language [5]. It will help us test and tune the performance of such a database. The data was gathered from systems that are used in practice, so their schema, size and structure represent the real requirements of such systems. The test results should tell us how the database would behave when deployed as a basis for a large-scale information system or a system integrating large heterogeneous data.

References

1. Bednarek D., Obdrzalek D., Yaghob J., Zavoral F.: Data Integration Using DataPile Structure. In: Proceedings of the 9th East-European Conference on Advances in Database and Information Systems, Tallinn, Estonia, 2005
2. Carroll J. J., Klyne G.: Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
3. Beckett D.: Turtle - Terse RDF Triple Language, http://www.dajobe.org/2004/01/turtle/
4. Alvestrand H.: Tags for the Identification of Languages, http://www.ietf.org/rfc/rfc3066.txt
5. Prud'hommeaux E., Seaborne A.: SPARQL Query Language for RDF, W3C Working Draft, 23 November 2005, http://www.w3.org/TR/2005/WD-rdf-sparql-query-20051123/

Acknowledgement

This research was supported in part by the National programme of research (Information society project 1ET100300419).