=Paper=
{{Paper
|id=Vol-1426/paper-08
|storemode=property
|title=KR2RML: An Alternative Interpretation of R2RML for Heterogeneous Sources
|pdfUrl=https://ceur-ws.org/Vol-1426/paper-08.pdf
|volume=Vol-1426
|dblpUrl=https://dblp.org/rec/conf/semweb/SlepickaYSK15
}}
==KR2RML: An Alternative Interpretation of R2RML for Heterogeneous Sources==
KR2RML: An Alternative Interpretation of R2RML for Heterogeneous Sources*

Jason Slepicka, Chengye Yin, Pedro Szekely, and Craig A. Knoblock
University of Southern California, Information Sciences Institute and Department of Computer Science, USA
{knoblock,pszekely,slepicka}@isi.edu, {chengyey}@usc.edu

* A KR2RML processor implementation is available for evaluation at https://github.com/usc-isi-i2/Web-Karma.

Abstract. Data sets are generated today at an ever-increasing rate in a host of new formats and vocabularies, each with its own data quality issues and limited, if any, semantic annotations. Without semantic annotation and cleanup, integrating across these data sets is difficult. Approaches exist for integration by semantically mapping such data into RDF using R2RML and its extension for heterogeneous sources, RML, but they are not easily extensible or scalable, nor do they provide facilities for cleaning. We present an alternative interpretation of R2RML paired with a source-agnostic R2RML processor that supports data cleaning and transformation. With this approach, it is easy to add new input and output formats without modifying the language or the processor, while supporting the efficient cleaning, transformation, and generation of billion-triple datasets.

1 Introduction

Over the past decade, many strides have been made towards enabling content creators to publish Linked Open Data by building on what they are already familiar with: RDFa, HTML5 Microdata, JSON-LD [11], and schema.org. Still, countless messy data sets, both legacy and novel, lack the semantic annotation necessary to be Linked Open Data. Numerous ETL suites attack the messy part of the problem, i.e. data cleaning, data transformation, and scale, but lack robust support for semantically mapping the data into RDF using ontologies. Other approaches provide support for semantically mapping data but can neither address messy data nor easily extend to support new formats. Our proposed interpretation of R2RML [12] and its processor, implemented in the open source tool Karma [6], address these concerns as captured by the following requirements.

Multiple Input and Output Formats A language for semantically mapping data should extend beyond just mapping relational databases to RDF. Solutions for relational databases can cleanly apply to tabular formats like CSV, but they break down or cannot fully capture the semantic relationships present in the data when applied to hierarchical sources like JSON and XML. A mapping should be able to capture these relationships. In terms of mapping output, there are many ways for a processor to serialize an RDF graph: RDF/XML, Turtle, N-Quads, RDF-HDT [4], and JSON-LD. A processor that can output nested JSON-LD is particularly valuable to developers and data scientists, because the same techniques required to serialize the graph as a collection of trees in JSON can be extended to any other hierarchical serialization format.

Extensibility As serialization formats propagate and evolve to fill new niches, it is important to be able to support new ones quickly. Consider JSON's offspring, BSON [9] and HOCON [13], or Interface Description Languages like Google Protocol Buffers [5] and Apache Avro [1]. Each format is JSON-like but requires its own libraries and query language, if one exists. A mapping language and its processor should reuse the commonality between hierarchical formats and require few, if any, changes to support them on ingestion and output.
Transformations Integrating structured data is difficult without considering cleaning and transformation, as languages (both natural and artificial), vocabularies, standards, schemas, and formats proliferate. As such, design decisions for semantic modeling should not be made independently of data transformation. R2RML templates are insufficient for overcoming the complexities and messiness of hierarchical data. Instead, R2RML users rely on custom SQL queries or views to present a clean dataset for mapping. A better solution would not require the user to know anything about the idiosyncrasies or domain-specific language of the source at all, e.g. DOM vs. SAX parsing and XSLT for XML.

Scalability As the academic and commercial world regularly works with terabytes and even petabytes of data, any language's processor must carefully weigh the implications of its algorithms and feature set with regard to scalability. While an extensive feature set may be attractive in terms of offering a comprehensive working environment, some features may be better supported by external solutions. Other features may not even be representative of those needed by the most common workloads and serve only as distractions.

2 Approach

To meet these requirements, we rely on the Nested Relational Model (NRM) [7] as an intermediate form to represent data. The NRM allows us to abstract away the idiosyncrasies of the formats we support, so that once we translate the input data into this intermediate form, every downstream step, e.g. cleaning, transformation, and RDF generation, is reusable. Adding a new source format is as simple as defining how to parse and translate it to this model. This frees us from having to include references to input-format-specific considerations like XPath or XSLT in the language. Once the data is in the NRM, we can support and implement a set of common data transformations in the processor once, instead of once per input format. These transformations, e.g. split and fold/unfold, are common in the literature, but we extend their definitions as found in the Potter's Wheel interactive data cleaning system [10] from a tabular data model to support the NRM. Our processor also provides an API for cleaning by example and for transformations using User Defined Functions (UDFs) written in Python. After translating data to the NRM, cleaning it, and transforming it, we extend the R2RML column references to map to the NRM. This model allows us to uniformly apply our new interpretation of R2RML mapping for hierarchical data sources to generate RDF. Finally, for scalability, our processor does not support joins across logical tables, which precludes using KR2RML as a virtual SPARQL endpoint for relational databases. Instead, we support materializing the RDF by processing in massively parallel frameworks, both in batch via Hadoop and incrementally in a streaming manner via Storm.

3 Nested Relational Model

We map data into the Nested Relational Model by translating it into tables and rows, where a column in a table can be either a scalar value or a nested table. Mapping tabular data like CSV, Excel, and relational databases is straightforward: the model has a one-to-one mapping of tables, rows, and columns, unless we perform a transformation like split on a column, which creates a new column that contains a nested table. To illustrate how we map hierarchical data, we will describe our approach to JSON first and introduce an example.¹

¹ The example worked in this paper is available at https://github.com/usc-isi-i2/iswc-2015-cold-example
The example describes an organization, its employees, and its locations.

{
  "companyName": "Information Sciences Institute",
  "tags": ["artificial intelligence", "nlp", "semantic web"],
  "employees": [
    {
      "name": "Knoblock, Craig",
      "title": "Director, Research Professor"
    },
    {
      "name": "Slepicka, Jason",
      "title": "Graduate Student, Research Assistant"
    }
  ],
  "locationTable": {
    "locationAddress": [
      "4676 Admiralty Way Suite 1001,Marina Del Rey, CA 90292",
      "3811 North Fairfax Drive Suite 200,Arlington, VA 22203"
    ],
    "locationName": ["ISI - West", "ISI - East"]
  }
}

This document contains an object, which maps to a single-row table in the NRM with four columns for its four fields: companyName, tags, employees, and locationTable. Each column is populated with the value of the appropriate field. Fields with scalar values, like companyName, need no translation, but fields like tags, employees, and locationTable, which have array values, do. The array values of tags, employees, and locationTable are mapped to their own nested tables. If the array contains scalar or object values, each array element becomes a row in the nested table. If the elements are scalar values, like the strings in the tags field, a default column name "values" is provided; otherwise, the objects in employees and locationTable and the arrays in locationAddress and locationName are interpreted recursively using the rules just described. If a JSON document contains a JSON array at the top level, each element is treated like a row in a database table.

Once the data has been mapped to the NRM, we can refer to nested elements by creating a path to the appropriate column from the top-level table. For example, the field referred to by the JSONPath $.employees.name in the source document is now the column ["employees", "name"] in the NRM. An illustration of the NRM as displayed by the Karma GUI can be seen in Figure 1. Other hierarchical formats follow from this approach, sometimes directly; we use existing tools for translating XML and Avro to JSON and then import the sources as JSON. XML elements are treated like JSON objects: their attributes are modeled as a single-row nested table where each attribute is a column, and repeated child elements become a nested table as well.

Fig. 1: Nested Relational Model example displayed in Karma GUI

4 KR2RML vs. R2RML

We propose the following modifications on how to interpret R2RML language elements to support large, hierarchical data sets.

rr:logicalTable The R2RML specification allows the processor to access the input database referenced by the rr:logicalTable through an actual database or some materialization. For legacy reasons, we support the former for relational databases with rr:tableName and rr:sqlQuery, but only the latter for non-relational-database sources.

rr:column To support mapping nested columns in the NRM, we no longer limit column-valued term maps to SQL identifiers. Instead, we allow a JSON array to capture the column names that make up the path to a nested column from the document root. As described previously, the $.employees.name field is mapped in the NRM to ["employees", "name"]. $.companyName is unchanged from how R2RML would map tabular data, "companyName", since it is not nested.
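To make these path references concrete, the following is a minimal, illustrative Python sketch, not Karma's actual implementation: it assumes an NRM table is represented as a list of row dictionaries, with a nested table stored as a column whose value is itself a list of rows, and the function name resolve_column is our own.

# Hypothetical sketch: resolve a KR2RML column path such as
# ["employees", "name"] against a Nested Relational Model table.
# A table is a list of rows (dicts); a nested table is a column
# whose value is itself a list of rows.

def resolve_column(table, path):
    """Collect every value reachable by following `path` through
    nested tables, descending one table level per path element."""
    values = []
    for row in table:
        cell = row.get(path[0])
        if cell is None:
            continue
        if len(path) == 1:
            values.append(cell)  # reached the referenced column
        elif isinstance(cell, list):
            # the column holds a nested table; recurse into it
            values.extend(resolve_column(cell, path[1:]))
    return values

# The example document from Section 3, translated to the NRM;
# scalar array elements land in the default "values" column.
nrm = [{
    "companyName": "Information Sciences Institute",
    "tags": [{"values": "artificial intelligence"},
             {"values": "nlp"},
             {"values": "semantic web"}],
    "employees": [{"name": "Knoblock, Craig"},
                  {"name": "Slepicka, Jason"}],
}]

print(resolve_column(nrm, ["employees", "name"]))
# ['Knoblock, Craig', 'Slepicka, Jason']
print(resolve_column(nrm, ["companyName"]))
# ['Information Sciences Institute']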
We chose to forgo languages like XPath or JSONPath for expressing complex paths to nested columns, because supporting their full feature sets complicates the processor implementation. A simple, format-agnostic path preserves the ability to apply a mapping to any format as long as the column names match.

rr:template Often, applications require URIs that cannot be generated by concatenating constants and column values in rr:template without preprocessing via rr:sqlQuery. Instead, we allow the template to include columns that do not exist in the original input but are the result of transformations applied by the processor. For example, in Figure 2a, ["employees", "employeeURI"] is derived from "companyName" and ["employees", "name"].

rr:joinCondition Joins are integral to how relational databases are structured, so they play an important role when mapping data across tables using R2RML. When dealing with either streaming data or data on the order of terabytes or greater, these join conditions across sources are impractical at best and require extensive planning and external support in the form of massively parallel processing engines like Spark or Impala to achieve reasonable performance. This is out of scope for our processor, so we do not support them. We have rarely found a need for them in practice.

km-dev:hasWorksheetHistory This is a tag introduced to capture the cleaning, transformation, and modeling steps taken by a Karma user to generate a KR2RML mapping. The steps are intended to be performed on the input before RDF generation, but their presence is optional. Without these steps, there is nothing in a KR2RML mapping that a normal R2RML processor would not understand, except for the nested field references.

5 Input Transformations

One additional benefit of using the Nested Relational Model is that we can use it to create a mathematical structure with transformation functions that are closed on the universe of nested relational tables. Here, we can reuse the transformations outlined in a system like Potter's Wheel but, instead of forcing operations like Split to create a Cartesian product of rows, a transformation function can create a new set of nested tables. Figure 2 shows the result of applying the transformations to the example data as shown in the Karma GUI. The blue columns are present in the source data, while the brown columns contain the transformation results.

Fig. 2: Karma Transformations: (a) URI template generated from a Python transformation, (b) Split transformation, (c) Glue transformation, (d) Python transformations

In Figure 2b, the split transformation breaks apart the comma-separated values in a column so that the job titles can be mapped into individual triples. The NRM allows us to insert a new column with a nested table where each row has a single column for the title. The other transformations, Unfold, Fold, and Group By, follow similarly. One unique challenge faced by the Nested Relational Model is transforming data from across nested tables. This can be accomplished in two ways. The first is demonstrated in Figure 2c. The locationAddress and locationName fields are mapped from the data as two nested tables, but each row should be joined to its pair with the same index in the other table. This is accomplished by a Glue Transformation. Alternatively, it could be accomplished by executing a Python Transformation, as sketched below.
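As an illustration only, a Glue-style index join could be expressed as a short Python function over the NRM representation sketched earlier; the representation and the function name glue are our own assumptions, not Karma's API.

# Hypothetical sketch of a Glue-style transformation: pair each row
# of two nested tables by index, producing one new nested table.

def glue(addresses, names):
    return [{"locationAddress": a["values"], "locationName": n["values"]}
            for a, n in zip(addresses, names)]

location_table = {
    "locationAddress": [
        {"values": "4676 Admiralty Way Suite 1001,Marina Del Rey, CA 90292"},
        {"values": "3811 North Fairfax Drive Suite 200,Arlington, VA 22203"}],
    "locationName": [{"values": "ISI - West"}, {"values": "ISI - East"}],
}

glued = glue(location_table["locationAddress"], location_table["locationName"])
# [{'locationAddress': '4676 Admiralty Way ...', 'locationName': 'ISI - West'},
#  {'locationAddress': '3811 North Fairfax ...', 'locationName': 'ISI - East'}]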
Python Transformations allow the user to write Python functions to reference, combine, and transform any data in the model and add the result as a new column in the NRM. This is akin to User Defined Functions (UDFs) in Cloudera Morphlines, Pig, and Hive, in the sense that the processor can take as input libraries of UDFs written in Python. In Figure 2d, Python Transformations are used to extract and format the components of the locationAddress field for mapping. Python Transformations are most commonly used to augment templates in R2RML because templates have limited utility, as illustrated in Figure 2a. Finally, the processor supports using these same Python APIs to select, or filter, data from the model by evaluating a UDF that returns a boolean for each column value indicating whether to ignore it during RDF generation.

6 RDF Generation

The final step in this process is to apply the R2RML mapping to the Nested Relational Model and generate RDF. Without careful consideration of how to apply the mapping, the process could be computationally very expensive or be artificially limited in the relationships it is allowed to express in the hierarchy, since the order in which the input data is traversed to generate the RDF is very important. We approach this in three steps: translate the TriplesMaps of an R2RML mapping into an execution plan, evaluate the TriplesMaps by traversing the NRM, and finally serialize the RDF. An example R2RML mapping for this source is illustrated in Figure 3 and available online. The JSON document has been semantically modeled according to the schema.org ontology. The dark gray bubbles in Figure 3 correspond to TriplesMaps. Their labels are their SubjectMap's class. The arcs between the gray bubbles are PredicateObjectMaps with RefObjectMaps, while the arcs from the gray bubbles to the worksheet are PredicateObjectMaps with ObjectMaps with column references. The light gray bubbles on the arcs are the predicates.

Fig. 3: KR2RML Mapping Graph mapped to NRM in Karma

R2RML Execution Planning To create an execution plan, we first consider the R2RML mapping as a directed, cyclic multigraph where the TriplesMaps are the vertices, V, and the PredicateObjectMaps with RefObjectMaps are the edges, E. From this directed, cyclic multigraph, we break cycles by flipping the direction of edges until we can create a topological ordering over the TriplesMaps. This ordering then becomes an execution plan, starting from the TriplesMaps with no outgoing edges. The algorithm is presented as follows.

Generate Topological Ordering GenerateTopologicalOrdering is applied to each connected subgraph in the multigraph of the R2RML mapping. It returns a topologically sorted list of TriplesMaps for processing, along with the edges that were flipped to break cycles or simplify processing. The resulting lists for a multigraph can be processed independently and in parallel. The root will be the last TriplesMap processed. Its selection has important implications for generating JSON-LD, which will be discussed later.

GenerateTopologicalOrdering(V, E, root)
    (spilled, flipped, V, E) := Spill(V, E, root, {}, {})
    if(|V| > 0)
        (flipped, visited) := DFSBreakCycles(V, E, root, {}, root, flipped)
        (spilled, flipped, V, E) := Spill(V, E, root, spilled, flipped)
    return spilled, flipped

Spill TriplesMaps Spill iteratively removes TriplesMaps for processing from the R2RML mapping until all remaining nodes have outgoing edges or a cycle is found. The earlier a TriplesMap is spilled, the earlier it is processed.
Spill(V, E, root, spilled, flipped)
    modifications := true
    while(|V| > 0 && modifications)
        modifications := false
        for(v in V)
            // v has only one edge or all of its edges are incoming
            if(deg(v) == 1 || outDeg(v) == 0)
                if(v == root && deg(v) > 0) // don't spill the root
                    continue
                // flip edges because spilled nodes are "lower" in the hierarchy
                for(e in edges(v))
                    if(source(e) == v && v != root)
                        flipped := flipped + e
                    E := E - e
                if(|edges(v)| == 0) // remove edgeless vertices
                    V := V - v
                    spilled := spilled + v
                    modifications := true
    return spilled, flipped, V, E

Break Cycles DFSBreakCycles arbitrarily breaks cycles in the graph by performing a Depth First Search of the TriplesMaps graph, starting at the root, and flipping incoming edges if the target node is visited before the source.

DFSBreakCycles(V, E, root, visited, current, flipped)
    toVisit := {}
    visited := visited + current
    // sort edges so that outgoing edges come first
    sortedEdges := sort(edges(current))
    for(e in sortedEdges)
        if(source(e) == current)
            toVisit := toVisit + target(e)
        else if(!visited(source(e)))
            flipped := flipped + e
            toVisit := toVisit + source(e)
    for(v in toVisit)
        if(!visited(v))
            (flipped, visited) := DFSBreakCycles(V, E, root, visited, v, flipped)
    return flipped, visited

If we apply GenerateTopologicalOrdering to the mapping illustrated in Figure 3, it takes the following steps.

– Spill removes PostalAddress, then Place, then stops because of a cycle
– DFSBreakCycles starts at Organization and sorts its edges, outgoing first
– The root is the source for schema:employee, so Person is added to the toVisit set
– The root is not the source for schema:worksFor, so that edge is added to the flipped set
– Person is the target for schema:worksFor, but it is already part of the toVisit set
– DFSBreakCycles visits the next node in the toVisit set, Person
– DFSBreakCycles halts since Person's edges point to the root, which is in visited
– The Spill algorithm is applied again and it removes Person
– This leaves the root, Organization, without any links, so Spill removes it

This results in a topological ordering of the TriplesMaps for the processor: PostalAddress, Place, Person, Organization.

R2RML TriplesMap Evaluation After the transformations, the NRM is immutable, so the topologically sorted TriplesMaps can be evaluated according to their partial order on massive data in parallel without interference. As the processor evaluates each TriplesMap, it first populates the SubjectMap templates. The processor populates templates by generating the n-tuple Cartesian product of their column values, which it derives by finding the table reachable by the shortest common subpath from the root between the column paths and then navigating the NRM to the columns from there. For a tabular format, the cardinality of the n-tuple set is always 1, just like R2RML applied to a SQL table. For our JSON example, we consider each JSON object in the employees array individually, so the cardinality for the schema:Person in the first Organization is 2. This example lacks templates that cover multiple nested tables but, for a mapping that does, the shortest common subpath between columns is captured by a concept called column affinity. Every column has an affinity with every other column. The closest affinity is with columns in the same row, then columns descending from the same parent table row, then columns descending from a common row in an ancestor table, then columns with no affinity.
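One plausible way to operationalize column affinity, offered here only as a sketch under our own assumptions rather than as Karma's internals, is to rank pairs of column paths by the length of their longest common prefix: the table at that shared prefix is the one the processor navigates from when populating a template.

# Hypothetical sketch: rank column affinity by the length of the
# common path prefix. A longer shared prefix means a tighter
# affinity (same row > same parent row > common ancestor > none).

def common_subpath(path_a, path_b):
    prefix = []
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        prefix.append(a)
    return prefix

def affinity(path_a, path_b):
    """Higher values mean a closer affinity between the two columns."""
    return len(common_subpath(path_a, path_b))

# name and title live in the same employees row, so their affinity is
# tighter than the affinity of either with the top-level companyName.
assert affinity(["employees", "name"], ["employees", "title"]) == 1
assert affinity(["employees", "name"], ["companyName"]) == 0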
The processor populates PredicateObjectMaps with ObjectMaps that have column and template values by navigating the NRM in relation to the elements that make up the SubjectMap templates, bound by the tightest affinity between the columns of the SubjectMap templates and the columns of the ObjectMap templates. Values common to multiple subjects can be cached. The processor evaluates PredicateObjectMaps with RefObjectMaps by populating the templates in a similar fashion to regular ObjectMaps. A PredicateMap with a template, however, requires calculating an affinity for the template population algorithm between the template and both the Subject and Object templates, to ensure that only the right Cartesian product is considered. Edges that were flipped while generating the topological ordering are evaluated at the TriplesMap indicated by the RefObjectMap. Populating the flipped Subject and Object templates is the same as if the RefObjectMap's TriplesMap were actually the subject; the subject and object are simply swapped on output.

R2RML RDF Generation The processor supports RDF output in N-Triples (or N-Quads) as a straightforward output of the TriplesMaps evaluation. What is more interesting is that this evaluation order enables specifying a hierarchical structure for the output. It is worth noting that JSON-LD is often serialized as flat objects and requires a special framing algorithm to create nested JSON-LD. Instead, Karma automatically generates a frame from the KR2RML mapping to output nested JSON-LD. It can also generate Avro schemas and data, and it is easily extendable to any other hierarchical format like Protocol Buffers or Turtle. Evaluating the PredicateObjectMaps for each TriplesMap results in a JSON object for each generated subject and a scalar or an array, depending on the cardinality of the template evaluation of the ObjectMaps. Because of the output order, when generating RDF for any TriplesMap with a RefObjectMap that has not been flipped, the objects' JSON objects will have already been output, so they can be nested and denormalized into the new JSON object. This allows nesting recursively all the way up to the root, as shown in the output below. In this example, the user indicates to the processor that the root is Organization. Any cycles are prevented by the flipped RefObjectMaps, so instead of nesting JSON objects, the URI is added. To finish formatting, a JSON-LD context can also be inferred from the R2RML mapping prefixes or can be provided.
{
  "@context": "http://ex.com/contexts/iswc2015_json-context.json",
  "location": [
    {"address": {
        "streetAddress": "4676 Admiralty Way Suite 1001",
        "addressLocality": " Marina Del Rey", "postalCode": "90292",
        "addressRegion": "CA", "a": "PostalAddress"
      },
      "name": "ISI - West", "a": "Place", "uri": "isi-location:ISI-West"},
    {"address": {
        "streetAddress": "3811 North Fairfax Drive Suite 200",
        "addressLocality": " Arlington", "postalCode": "22203",
        "addressRegion": "VA", "a": "PostalAddress"
      },
      "name": "ISI - East", "a": "Place", "uri": "isi-location:/ISI-East"}
  ],
  "name": "Information Sciences Institute", "a": "Organization",
  "employee": [
    {"name": "Knoblock, Craig", "a": "Person",
      "uri": "isi-employee:Knoblock/Craig",
      "jobTitle": ["Research Professor", "Director"],
      "worksFor": "isi:company/InformationSciencesInstitute"},
    {"name": "Slepicka, Jason", "a": "Person",
      "uri": "isi-employee:Slepicka/Jason",
      "jobTitle": ["Graduate Student", " Research Assistant"],
      "worksFor": "isi:company/InformationSciencesInstitute"}
  ],
  "uri": "isi:company/InformationSciencesInstitute"
}

7 Related Work

Approaches exist for semantically mapping relational databases, like D2R [2], but they cannot address messy data nor easily extend to support new formats. RML [3] was proposed as an extension to R2RML to support heterogeneous, and often hierarchical, sources, e.g. XML and JSON, but its support for additional data types requires a custom processor for extracting values from each data type. RML tries to deal with the ambiguity that surrounds identifying subjects in hierarchical sources by introducing an rml:iterator pattern. Iterators benefit from their precision but are an additional step that can often be inferred by using column affinities. When we cannot infer them, however, we either require a transformation or end up generating a subject for each tuple in the Cartesian product of the corresponding TriplesMap's PredicateObjectMaps that map to columns. XR2RML [8] is an extension of both RML and R2RML that supports NoSQL document stores. It agrees with our approach of leaving format-specific query language specification out of the language, but it still adopts the iterator pattern and also lacks support for transformations. Mashroom [14], on the other hand, takes a similar approach with the Nested Relational Model for integrating and transforming hierarchical data, but it has neither support for semantics nor a standard for the mapping.

8 Conclusions and Future Direction

Developing scalable solutions for data cleaning, transformation, and semantic modeling becomes more important every day as the amount of data available grows and its sources diversify. This alternative interpretation of R2RML and its processor are available under the Apache 2.0 license; they have been embedded in Apache Hadoop and Apache Storm to generate billions of triples and billions of JSON documents in both a batch and a streaming fashion, and they can be extended to consume any hierarchical format. The processor will soon be available for Spark and has well-defined interfaces for adding new input and output formats.

References

1. Apache Software Foundation. Avro. Version 1.7.6, 2014-01-22, https://avro.apache.org.
2. Bizer, C., and Cyganiak, R. D2R Server: Publishing relational databases on the Semantic Web. In Poster at the 5th International Semantic Web Conference (2006), pp. 294–309.
3. Dimou, A., Sande, M. V., Colpaert, P., Mannens, E., and de Walle, R. V.
Extending R2RML to a source-independent mapping language for RDF. In International Semantic Web Conference (Posters) (2013), vol. 1035, pp. 237–240.
4. Fernández, J. D., Martínez-Prieto, M. A., Gutiérrez, C., Polleres, A., and Arias, M. Binary RDF representation for publication and exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web 19 (2013), 22–41.
5. Google Inc. Protocol Buffers. Version 2.6.1, 2014-10-20, https://developers.google.com/protocol-buffers/.
6. Knoblock, C. A., Szekely, P., Ambite, J. L., Gupta, S., Goel, A., Muslea, M., Lerman, K., Taheriyan, M., and Mallick, P. Semi-automatically mapping structured sources into the Semantic Web. In Proceedings of the Extended Semantic Web Conference (Crete, Greece, 2012).
7. Makinouchi, A. A consideration on normal form of not-necessarily-normalized relation in the relational data model. In Proceedings of the 3rd VLDB Conference (1977), pp. 445–453.
8. Michel, F., Djimenou, L., Faron-Zucker, C., and Montagnat, J. Translation of relational and non-relational databases into RDF with xR2RML. In 11th Web Information Systems and Technologies (WEBIST) (2015).
9. MongoDB Inc. BSON. Version 1.0.0, 2014-11-18, http://bsonspec.org/.
10. Raman, V., and Hellerstein, J. M. Potter's Wheel: An interactive data cleaning system. In VLDB (2001), vol. 1, pp. 381–390.
11. Sporny, M., Kellogg, G., Lanthaler, M., and the W3C RDF Working Group. JSON-LD 1.0: A JSON-based serialization for Linked Data. W3C Working Draft (2014). http://www.w3.org/TR/json-ld/.
12. Sundara, S., Cyganiak, R., and Das, S. R2RML: RDB to RDF Mapping Language. W3C Recommendation, W3C, Sept. 2012.
13. Typesafe Inc. HOCON. Version 1.3.0, 2015-05-08, https://github.com/typesafehub/config/blob/v1.3.0/HOCON.md.
14. Wang, G., Yang, S., and Han, Y. Mashroom: End-user mashup programming using nested tables. In Proceedings of the 18th International Conference on World Wide Web (New York, NY, USA, 2009), WWW '09, ACM, pp. 861–870.