<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling the Web of Data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso-Plattner-Institute</institution>
          ,
          <addr-line>Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Web of Data contains a large number of openly available datasets covering a wide variety of topics. To benefit from this massive amount of open data, such external datasets must be analyzed and understood already at the basic level of data types, constraints, value patterns, etc. For Linked Datasets such meta information is currently very limited or not available at all. Data profiling techniques are needed to compute the respective statistics and meta information. However, current state-of-the-art approaches either cannot be applied to Linked Data or exhibit considerable performance problems. This paper presents my doctoral research, which tackles these problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Over the past years, an increasingly large number of data sources has been
published as part of the Web of Data1. At the time of writing, the Web of Data
already comprised roughly 1,000 datasets totaling more than 82 billion triples2,
including prominent examples such as DBpedia, YAGO, and DBLP.
Furthermore, more than 17 billion triples are available as RDFa, Microdata, and
Microformats in HTML pages3. This trend, together with the inherent heterogeneity
of Linked Datasets and their schemata, makes it increasingly time-consuming to
find and understand datasets that are relevant for integration. Metadata gives
consumers of the data clarity about the content and variety of a dataset and the
terms under which it can be reused, thus encouraging its reuse.</p>
      <p>A Linked Dataset is represented in the Resource Description Framework
(RDF). In comparison to other data models, e.g., the relational model, RDF
lacks explicit schema information that precisely defines the types of entities and
their attributes. Therefore, many datasets provide ontologies that categorize
entities and define the data types and semantics of properties. However, ontology
information is not always available or may be incomplete. Furthermore, Linked
Datasets are often inconsistent and lack even basic metadata. Algorithms and
tools are needed that profile the dataset, analyzing it in its entirety to retrieve
relevant and interesting metadata.
1 The Linked Open Data Cloud nicely visualizes this trend: http://lod-cloud.net
2 http://datahub.io/dataset?tags=lod
3 http://webdatacommons.org</p>
      <p>Data profiling is an umbrella term for methods that compute metadata for
describing datasets. Traditional data profiling tools for relational databases have
a wide range of features, ranging from the computation of cardinalities, such as
the number of distinct values in a column, to the calculation of inclusion dependencies;
they determine value patterns, gather information on the data types used, determine
unique column combinations, and find keys.</p>
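      <p>As an illustration of these classical profiling metrics, the following Python sketch (the toy table and all names are my own, not taken from any particular tool) computes column cardinalities and brute-forces unique column combinations, i.e., key candidates:</p>

```python
from itertools import combinations

# Hypothetical toy table: a list of rows (dicts) standing in for a relational table.
rows = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Alan", "country": "UK"},
    {"id": 3, "name": "Ada", "country": "US"},
]

def cardinality(rows, column):
    """Number of distinct values in a column."""
    return len({r[column] for r in rows})

def unique_column_combinations(rows, max_size=2):
    """Brute-force search for column sets whose value tuples are unique,
    i.e. key candidates. Exponential in the number of columns, so only
    feasible for small schemas."""
    columns = list(rows[0])
    uccs = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            values = [tuple(r[c] for c in combo) for r in rows]
            if len(set(values)) == len(values):
                uccs.append(combo)
    return uccs

print(cardinality(rows, "name"))   # 2 distinct names
print(unique_column_combinations(rows))
```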
      <p>
        Use cases for data profiling can be found in various areas concerned with
data processing and data management [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]:
Query optimization is concerned with finding optimal execution plans for
database queries. Cardinalities and value histograms can help to estimate the
costs of such execution plans. Such metadata can also be used in the area of
Linked Data, e.g., for optimizing SPARQL queries.
      </p>
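      <p>To illustrate how such cardinality metadata feeds into cost estimation, here is a minimal sketch of the classic uniform-distribution estimate that many optimizers use; the statistics are invented for illustration:</p>

```python
# Hypothetical column statistics collected by a profiling run.
total_rows = 1_000_000
distinct_countries = 200

def estimate_equality_matches(total_rows, distinct_values):
    """Classic uniform-distribution estimate used by query optimizers:
    an equality predicate on a column is expected to match
    total_rows / distinct_values rows."""
    return total_rows / distinct_values

# Expected number of rows matching e.g. country = 'DE'.
print(estimate_equality_matches(total_rows, distinct_countries))  # 5000.0
```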
      <p>Data cleansing can benefit from discovered value patterns. Violations of
detected patterns can reveal data errors, and the respective statistics help measure
and monitor the quality of a dataset. For Linked Data, data profiling techniques
help validate datasets against vocabularies and schema properties.
Data integration is often hindered by the lack of information on new datasets.
Data profiling metrics reveal information on, e.g., the size, schema, semantics, and
dependencies of unknown datasets. This is a highly relevant use case for Linked
Data, because for many openly available datasets only little information is
available.</p>
      <p>Schema induction: Raw data, e.g., data gathered during scientific experiments,
often does not have a known schema at first; data profiling techniques need to
determine adequate schemata, which are required before data can be inserted
into a traditional DBMS. For the field of Linked Data, this applies when working
with datasets that have no dereferenceable vocabulary. Data profiling can help
induce a schema from the data, which can then be used to find a matching
vocabulary or create a new one.</p>
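      <p>A minimal sketch of such schema induction over RDF-style triples could look as follows; the toy triples and function names are hypothetical illustrations, not an existing tool's API:</p>

```python
from collections import defaultdict

# Hypothetical toy dataset of (subject, predicate, object) triples.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
]

def induce_schema(triples):
    """Induce a simple per-class schema: for each rdf:type, collect the
    properties used by its instances and how often each occurs."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    schema = defaultdict(lambda: defaultdict(int))
    for s, p, o in triples:
        if p == "rdf:type":
            continue
        schema[types.get(s, "unknown")][p] += 1
    return {cls: dict(props) for cls, props in schema.items()}

print(induce_schema(triples))
# {'ex:Person': {'ex:name': 2, 'ex:birthDate': 1}}
```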
      <p>Data Mining: Finally, data profiling is an essential preprocessing step for almost
any statistical analysis or data mining task. While data profiling focuses on
gathering structural metadata about a dataset, data mining is usually more
concerned with gaining new insights about the data.</p>
    </sec>
    <sec id="sec-2">
      <title>Relevancy</title>
      <p>There are many commercial tools for profiling relational datasets, such as IBM's
Information Analyzer or Microsoft's SQL Server Integration Services (SSIS).
However, these tools were designed to profile relational data. Linked Data
has a very different nature and calls for specific profiling and mining techniques.</p>
      <p>Finding information about Linked Datasets is an open issue on the
constantly growing Web of Data, given the use cases mentioned above. While most
Linked Datasets are listed in registries, for instance at the Data Hub
(datahub.io), these registries are usually manually curated and thus incomplete
or outdated. Furthermore, existing means and standards for describing datasets
are often limited in their depth of information. VoID and Semantic Sitemaps cover
basic details of a dataset, but do not cover detailed information on the dataset's
content, such as its main classes or number of entities. More detailed
descriptions, e.g., information on a dataset's RDF graph structure, topics, etc., are usually
not available. Data profiling techniques can help fulfill the need for information
about, e.g., classes and property types, value distributions, or entity interlinking.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>While many general tools and algorithms already exist for data profiling, most of
them cannot be used for graph datasets, because they assume a relational data
structure or a well-defined schema, or simply cannot deal with very large datasets.
Nonetheless, some Linked Data profiling tools already exist. Most of them focus
on solving specific use cases instead of data profiling in general.</p>
      <p>
        One relevant use case is schema induction, because the lack of a fixed and
well-defined schema is a common problem with Linked Datasets. One example
in this field of research is the ExpLOD tool [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. ExpLOD creates summaries
for RDF graphs based on class and property usage, as well as statistics on the
interlinking between datasets based on owl:sameAs links.
      </p>
      <p>
        Li describes a tool that can induce the actual schema of an RDF dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
It gathers schema-relevant statistics like cardinalities for class and property
usage, and presents the induced schema in a UML-based visualization. Its
implementation is based on the execution of SPARQL queries against a local database.
Like ExpLOD, the approach is not parallelized. Both solutions take
approximately 10 hours to process a dataset of 10 million triples with 13 classes and 90
properties. These results illustrate that performance is a common problem with large
Linked Datasets.
      </p>
      <p>
        An example of the query optimization use case is presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
authors present RDFStats, which uses Jena's SPARQL processor to collect
statistics on Linked Datasets. These statistics include histograms for subjects (URIs,
blank nodes) and histograms for properties and associated ranges.
      </p>
      <p>
        Others have worked more generally on generating statistics that describe
datasets on the Web of Data and thereby help understanding them. LODStats
computes statistical information for datasets from the Data Hub [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It calculates
32 simple statistical criteria, e.g., cardinalities for different schema elements and
types of literal values (e.g., languages, value data types).
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] the authors automatically create VoID descriptions for large datasets
using MapReduce. They manage to profile the BTC2010 dataset in about an
hour on Amazon's EC2 cloud, showing that parallelization can be an effective
approach to improve runtime when profiling large amounts of data.
      </p>
      <p>
        Finally, the ProLOD++ tool allows navigating an RDF dataset via an
automatically computed hierarchical clustering [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and along its ontology class
tree [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Data profiling tasks are performed on each cluster or class dynamically
and independently to improve efficiency.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Challenges</title>
      <p>This section describes selected challenges that I identified as specific to profiling
Linked Data and web data, as opposed to profiling relational tables.</p>
      <sec id="sec-4-1">
        <title>Profiling along hierarchies</title>
        <p>Vocabularies define classes and their relationships. Ontology classes are usually
arranged in a taxonomic (subclass-superclass) hierarchy. While the Web of
Data spans a global distributed data graph, its ontology classes build a tree
with owl:Thing as its root. Analyzing datasets along the vocabulary-defined
taxonomic hierarchies yields further insights, such as the data distribution at
different hierarchy levels, or possible mappings between vocabularies or datasets.</p>
        <p>
          Keys are clearly of vital importance to many applications in order to uniquely
identify individuals of a given class by values of (a set of) key properties. In
OWL 2 a collection of properties can be assigned as a key to a class using the
owl:hasKey statement [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Nevertheless, owl:hasKey has not yet fully arrived on the Web of Data: only one Linked
Dataset uses it [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Thus, actually analyzing and profiling Linked
Datasets requires manual, time-consuming inspection or the help of tools.
        </p>
        <p>Many languages have a so-called "unique names" assumption. On the web,
such an assumption is not possible, as real-world entities can be referred to with
different URI references.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Heterogeneity</title>
        <p>A common practice in the Linked Data community is to reuse terms from
widely deployed vocabularies whenever possible, in order to increase the
homogeneity of descriptions and, consequently, to ease the understanding of these
descriptions. There are at least 416 different vocabularies to be found on the Web of
Data4. Some datasets, however, also exist without any defined or dereferenceable
vocabularies. And even if common vocabularies are used, there is no guarantee
that their specifications and constraints are followed correctly.</p>
        <p>
          Nearly all datasets on the Web of Data use terms from the W3C base
vocabularies RDF, RDF Schema, and OWL. In addition, 191 (64.75 %) of the 295
datasets in the Linked Open Data Cloud Catalogue use terms from other widely
deployed vocabularies [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>As Linked Datasets cover a wide variety of topics, widely deployed
vocabularies that cover all aspects of these topics may not exist yet. Thus, data providers
often define proprietary terms that are used in addition to terms from widely
deployed vocabularies, in order to cover the more specific aspects and to publish
the complete content of a dataset on the Web. Currently, 190 (64.41 %) of the
295 datasets use proprietary vocabulary terms, with 83.68 % making the term
URIs dereferenceable.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Topical profiling</title>
        <p>The Web of Data covers not only a wide range of topics, it also contains
a number of topically overlapping data sources. Since it provides for data
coexistence, everyone can publish data to it, express their view on things, and use
the vocabularies of their choice. Integrating topically relevant datasets requires
knowledge of the datasets' content and structure.
4 http://lov.okfn.org/</p>
        <p>The State of the LOD Cloud document [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] gives an overview of the Linked
Datasets for each topical domain, but there is no fine-grained topical clustering
for Linked Datasets. With 504 million inter-dataset links, the Web of Data is
highly interlinked; 1.6 % of all triples are links stating relationships between
real-world entities in different datasets. Thus there is a large topical overlap
among the datasets.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Large-scale profiling</title>
        <p>With more than 82 billion triples distributed among roughly 1,000 Linked
Datasets, and more than 17 billion triples available as RDFa, Microdata, and
Microformats, the need for efficient profiling methods and tools is apparent.</p>
        <p>
          The runtime of profiling tasks as presented in Sec. 7 can reach hours, e.g.,
for determining property co-occurrences [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Profiling tasks often share the same
preprocessing steps, e.g., filtering or grouping the dataset. Thus there is a large
incentive and potential to optimize the execution of multiple scripts.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Research Questions</title>
      <p>The main question in my doctoral research is:</p>
      <p>What are the challenges that are specific to profiling Linked Data and web
data, as opposed to profiling relational tables?</p>
      <p>After identifying four selected challenges, the following questions arise:
Profiling along hierarchies Does analyzing Linked Datasets along the
vocabulary-defined taxonomic hierarchies, such as the data distribution at
different hierarchy levels, yield further insights?</p>
      <p>Heterogeneity How does profiling help in analyzing the heterogeneity on the
Web of Data?</p>
      <p>Topical profiling How can topical clusterings for unknown datasets on the
constantly growing Web of Data be derived efficiently?</p>
      <p>Large-scale profiling How can these huge amounts of Linked Data be
profiled efficiently?</p>
    </sec>
    <sec id="sec-6">
      <title>Approach</title>
      <p>My approach to addressing the research questions is to tackle each of the identified
challenges. The main goal is to reuse existing profiling techniques and adapt
them to the Linked Data world.</p>
      <p>This section presents possible solutions to the presented challenges and,
where available, the solutions I have developed.</p>
      <sec id="sec-6-1">
        <title>Profiling along hierarchies</title>
        <p>One example of a profiling task along the class hierarchy is determining the
uniqueness of properties as well as unique property combinations, which
can bring insights into the property distribution inside the dataset. It allows
finding relevant (key-candidate) properties for each level in the class hierarchy
and seeing whether their relevance increases or decreases along the hierarchy.</p>
        <p>As I have found, due to the sparsity on the Web of Data, usually neither full
key-candidate properties nor unique property combinations can be retrieved
using traditional techniques. Thus I defined the concept of keyness as the
harmonic mean of the uniqueness and density of a property5, allowing one to find
potential key candidates.</p>
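        <p>A minimal Python sketch of this keyness measure, following the uniqueness and density definitions given in footnote 5 (the function name and toy data are my own illustration):</p>

```python
def keyness(values, num_entities):
    """Keyness of a property:
    uniqueness = distinct values / total values,
    density    = non-NULL values / number of entities,
    keyness    = harmonic mean of uniqueness and density.
    `values` holds the property's non-NULL values, one per entity that has it."""
    if not values or num_entities == 0:
        return 0.0
    uniqueness = len(set(values)) / len(values)
    density = len(values) / num_entities
    return 2 * uniqueness * density / (uniqueness + density)

# Toy example: 4 entities; 3 of them have the property, all values distinct.
print(keyness(["a", "b", "c"], 4))  # uniqueness 1.0, density 0.75 -> ~0.857
```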
      </sec>
      <sec id="sec-6-2">
        <title>Heterogeneity</title>
        <p>Data profiling can be used to provide metadata describing the
characteristics of a dataset, for instance its topic and more detailed statistics, like the
main classes and properties. Furthermore, data profiling can not only determine
the usage of vocabularies, but also help in understanding and reusing existing
vocabularies. Additionally, it can assist when mapping vocabulary terms.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Topical profiling</title>
        <p>The first profiling task is, of course, to discover (and possibly label) these
topical clusters. Discovering which topics an unknown dataset covers
is already a very helpful insight. Next, any profiling task can be executed on data
of a particular topic and compared against the metadata of other topics.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Large-scale profiling</title>
        <p>
          The runtime of the profiling tasks can reach hours already on 1 million
triples, e.g., for determining property co-occurrences [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. A number of different
approaches can be chosen when trying to optimize the execution time of
algorithms dealing with RDF data in general, and data profiling tasks in particular.
Algorithmic optimization: Profiling tasks that have high computational
complexity cannot be computed naively, e.g., it is infeasible to detect property
co-occurrence by considering all possible property combinations. Such metrics
require innovative algorithms for efficiently computing the targeted result. If such
an algorithm cannot be found, approximation techniques (e.g., sampling) may
be required. Because these algorithms are often highly specialized for a specific
profiling task, they usually do not benefit other tasks.
        </p>
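        <p>The co-occurrence point can be sketched as follows: instead of testing all possible property combinations, only the pairs actually present per subject are counted (the toy triples and names are hypothetical):</p>

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical triples: (subject, predicate, object).
triples = [
    ("s1", "p1", "x"), ("s1", "p2", "y"),
    ("s2", "p1", "x"), ("s2", "p3", "z"),
    ("s3", "p1", "x"), ("s3", "p2", "y"),
]

def property_cooccurrence(triples):
    """Count how often property pairs occur on the same subject.
    Only pairs that actually co-occur are enumerated, avoiding the
    combinatorial explosion of testing every possible combination."""
    props_by_subject = defaultdict(set)
    for s, p, o in triples:
        props_by_subject[s].add(p)
    counts = defaultdict(int)
    for props in props_by_subject.values():
        for pair in combinations(sorted(props), 2):
            counts[pair] += 1
    return dict(counts)

print(property_cooccurrence(triples))
# {('p1', 'p2'): 2, ('p1', 'p3'): 1}
```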
        <p>
          Parallelization: When dealing with large datasets, a good approach to improving
performance is to perform calculations in parallel when possible [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This can be
done on different levels: dataset, profiling run, profiling task, and triples.
Cluster-based parallelization based on MapReduce is a reasonable choice when working
with Linked Data.
        </p>
        <p>Multi-Query Optimization: A data profiling run usually consists of a number of
different tasks, which all have to be computed on the same dataset. Depending
on the set of data profiling tasks, different tasks may require the same
preprocessing steps, or perform similar computation steps. Overall execution time can
be reduced by avoiding duplicate computations. Similar computation steps may
be interleaved to reduce runtime and I/O costs. If different tasks require similar
intermediate results, these can be stored in materialized views.
5 We define the uniqueness of a property as the number of unique values per number
of total values for a given property; and the density of a property as the ratio of
non-NULL values to the number of entities.</p>
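        <p>The idea of sharing preprocessing across profiling tasks can be sketched in a few lines of Python; this is a deliberate simplification of real multi-query optimization, with invented data and names:</p>

```python
from collections import defaultdict

# Hypothetical triples shared by several profiling tasks.
triples = [
    ("s1", "p1", "a"), ("s1", "p2", "b"),
    ("s2", "p1", "a"), ("s2", "p1", "c"),
]

def group_by_property(triples):
    """Shared preprocessing step: group object values by property.
    Computing this once and reusing it mimics a materialized view."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[p].append(o)
    return groups

# Two profiling tasks reuse the same grouped intermediate result
# instead of each re-scanning the raw triples.
groups = group_by_property(triples)
property_frequencies = {p: len(vals) for p, vals in groups.items()}
distinct_values = {p: len(set(vals)) for p, vals in groups.items()}

print(property_frequencies)  # {'p1': 3, 'p2': 1}
print(distinct_values)       # {'p1': 2, 'p2': 1}
```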
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Preliminary Results</title>
      <p>Initially, I defined a set of 56 useful data profiling tasks along various
groupings to profile Linked Datasets. They have been implemented as Apache
Pig scripts and are available online6.</p>
      <p>
        Furthermore, I illustrated the Web of Data's diversity with the results for four
different Linked Datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <sec id="sec-7-1">
        <title>Profiling along hierarchies</title>
        <p>When analyzing uniqueness in the class hierarchy of DBpedia, I found
that there are properties that become more specific per class level; their
uniqueness thus increases for subclasses. For instance, dbpedia:team becomes more
unique for athletes than it is for all persons. I also found properties that are
generic; their uniqueness stays constant throughout the class hierarchy. For
instance, dbpedia:birthDate is not specific to persons or any of their subclasses.</p>
        <p>Furthermore, I have defined the concept of the keyness of a property to bridge
the sparsity on the Web of Data, and thus make it possible to find potential key
candidates where traditional approaches fail.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Large-scale profiling</title>
        <p>
          We have addressed the different approaches to improving Linked Data
profiling performance, and not only developed LODOP, a system for executing,
benchmarking, and optimizing Linked Data profiling scripts on Hadoop, but also
developed and evaluated three multi-query optimization rules [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We
experimentally demonstrated that they achieve their respective goals of minimizing the
number of MapReduce jobs or the amount of data materialized between jobs,
thus reducing the profiling task runtimes by 70%.
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Evaluation Plan</title>
      <p>For the evaluation, there are three main lines of interest.</p>
      <p>Metadata The main goal is to provide comprehensive dataset metadata that
helps in analyzing the datasets. The metadata can be evaluated on quantity and
quality with respect to existing metadata on the Data Hub, VoID, and Semantic Sitemaps.</p>
      <p>Usability Tools and techniques should have high usability, with
results presented in both human- and machine-readable ways, to enable
better decision making when working with datasets.</p>
      <p>Performance evaluation Various aspects of the developed tools should be
tested for performance, especially for the huge amounts of data present
on the Web of Data.
6 http://github.com/bforchhammer/lodop/</p>
    </sec>
    <sec id="sec-9">
      <title>Reflections and Conclusion</title>
      <p>The main difference between my approach and existing work on Linked Data
profiling is that I address the shortcomings mentioned in Sec. 3, in particular gathering
comprehensive metadata in an efficient way. Within my research I am building
on existing profiling techniques for relational data and adapting them according
to the different nature of Linked Datasets.</p>
      <p>This paper has presented the outline and preliminary results of my doctoral
research, in which I am focusing on profiling the Web of Data.</p>
      <p>So far, I have specified and implemented a comprehensive set of Linked Data
profiling tasks and illustrated the Web of Data's diversity with the results for
four different Linked Datasets. Furthermore, I introduced three common
techniques for improving the performance of Linked Data profiling and implemented
three multi-query optimization rules, reducing profiling task runtimes by 70%.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , T. Grutze,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Mining and profiling RDF data with ProLOD++</article-title>
          .
          <source>In Proceedings of the International Conference on Data Engineering (ICDE)</source>
          ,
          <year>2014</year>
          . Demo.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>LODStats - an extensible framework for high-performance dataset analytics</article-title>
          .
          <source>In Proceedings of the Int. Conf. on Knowledge Engineering and Knowledge Management (EKAW)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          .
          <source>State of the LOD Cloud</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. C. Bohm, J. Lorey, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Creating VoiD descriptions for web-scale data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>339</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. C. Bohm,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fenz</surname>
          </string-name>
          , T. Grutze,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hefenbrock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pohl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Sonnabend</surname>
          </string-name>
          .
          <article-title>Profiling Linked Open Data with ProLOD</article-title>
          .
          <source>In Proceedings of the International Workshop on New Trends in Information Integration (NTII)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>B.</given-names>
            <surname>Forchhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>LODOP - Multi-Query Optimization for Linked Data Profiling Queries</article-title>
          .
          <source>In ESWC Workshop on Profiling &amp; Federated Search for Linked Data (PROFILES)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Glimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          .
          <article-title>OWL: Yet to arrive on the Web of Data?</article-title>
          <source>In WWW Workshop on Linked Data on the Web (LDOW)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          , M. Krotzsch,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          , and S. Rudolph, editors.
          <source>OWL 2 Web Ontology Language: Primer. W3C Recommendation</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          . ExpLOD:
          <article-title>Summary-based exploration of interlinking and RDF usage in the linked open data cloud</article-title>
          .
          <source>In Proceedings of the Extended Semantic Web Conference (ESWC)</source>
          , Heraklion, Greece,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          and W. Wo .
          <article-title>RDFStats { an extensible RDF statistics generator and library</article-title>
          .
          <source>In Proceedings of the International Workshop on Database and Expert Systems Applications (DEXA)</source>
          , pages
          <fpage>79</fpage>
          -
          <lpage>83</lpage>
          , Los Alamitos, CA, USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Data Profiling for Semantic Web Data</article-title>
          .
          <source>In Proceedings of the International Conference on Web Information Systems and Mining (WISM)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Data profiling revisited</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>42</volume>
          (
          <issue>4</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>