
Profiling the Linked (Open) Data

Università degli Studi di Milano-Bicocca

The number of datasets published as Linked (Open) Data is constantly increasing, with roughly 1000 datasets as of April 2014. Despite this number of published datasets, their usage is still limited, as they lack comprehensive and up-to-date metadata. Metadata hold significant information not only for understanding the data at hand, but also for the cleansing and integration phases. Data profiling techniques can help generate metadata and statistics that describe the content of the datasets. However, existing techniques do not cover a wide range of statistics, and many challenges due to the heterogeneous nature of Linked Open Data are still to be overcome. This paper presents the doctoral research that tackles the problems related to Linked Open Data profiling. We present the proposed approach and report the initial results.

Linked Open Data Profiling, Data Quality, Topical Classification

1 Introduction

With 12 datasets in 2007, the Linked Open Data cloud has grown to more than 1000 datasets as of April 2014 [17], a number that is constantly increasing. Datasets to be published need to adopt a series of rules so that they are simple to search and query [3]. They should be published following W3C standards in RDF¹ format and made available for querying through SPARQL² endpoints. Adopting these rules allows different data sources to be connected by typed links, which are useful for extracting new knowledge, as linked datasets do not hold the same information. Even though Linked Open Data is considered a gold mine, its usage is still limited, as understanding a large and unfamiliar RDF dataset remains a key challenge. Because comprehensive descriptive information is lacking, the consumption of these datasets is still low. Data profiling techniques support data consumption and data integration with statistics and useful metadata about the content of the datasets. While traditional profiling techniques solve many issues, they cannot be applied to heterogeneous data such as Linked Open Data.

¹ http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
² http://www.w3.org/TR/rdf-sparql-query/

Data profiling techniques in the context of Linked Open Data are very important for different tasks:

Complex schema discovery. Schema complexity makes databases difficult to understand and access. Schema summaries provide users with a concise overview of the entire schema despite its complexity.

Ontology / schema integration. Ontologies published on the Web, even for datasets in similar domains, can have differences. Data profiling techniques can help in understanding the overlap between ontologies and support the process of ontology creation, maintenance and integration.

Big knowledge bases and provide a landscape view. Data profiling techniques can help identify some core knowledge patterns (KPs) that reveal a piece of knowledge in a domain of interest.

Inspect large datasets to find quality issues. Data profiling tools allow the inspection of large datasets to detect quality issues by identifying cases that do not follow business rules, outliers, residuals, etc.

Data integration. To perform a data integration process, one should consider schema mapping, the process of discovering relationships between schemas. Profiling techniques can reveal mappings between classes and properties, helping the integration process.

Entity summarization. Finding the features that best represent the topic(s) of a given dataset can help not only with the topical classification of the dataset but also with understanding the semantics of the information found in the data.

Data visualization for summarization. Profiling techniques can support data visualization tools in visualizing large multidimensional datasets by displaying only a small and concise summary of the most relevant and important features, enhancing users' comprehension by allowing them to dig into the data by zooming in or out of the provided summary.

In this proposal we focus on profiling techniques that summarize the content of a dataset and reveal data quality problems. Moreover, we will propose profiling techniques combined with data mining algorithms to find useful and relevant features for summarizing the content of datasets published as Linked Open Data, as well as techniques that reveal quality issues in the data. The dataset summarization can be used not only to determine whether a dataset is useful, but also to provide useful information to the cleansing and integration phases.

2 Related Works

Statistics and summaries can help to describe and understand large RDF data. Most of the existing profiling tools support traditional databases, which are homogeneous and have a well-defined schema. These techniques cannot be applied to Linked Open Data due to its heterogeneity and the lack of a well-defined schema. As discussed below, most of the existing techniques for profiling Linked Open Data are limited to a few statistics and summaries covering only one task.

Roomba [1] is a framework to automatically validate and generate descriptive dataset profiles. The extracted metadata are grouped into four categories (general, access, ownership or provenance) depending on the information they hold. After metadata extraction, some validation and enrichment steps are performed. The metadata validation process identifies missing information and automatically corrects it when possible. As an outcome of the validation process, a report is produced, which can be automatically sent to the dataset maintainer.

The ExpLOD [8] tool is used to summarize a dataset based on a mechanism that combines text labels and bisimulation contractions. It considers four RDF usages that describe interactions between data and metadata, such as class and predicate instantiation and class and predicate usage, on which it creates RDF graphs. It also uses owl:sameAs links to calculate statistics about the interlinking between datasets. The ExpLOD summaries are extracted using SPARQL queries or algorithms such as partition refinement.

RDFStats [9] generates statistics for datasets behind SPARQL endpoints and for RDF documents. It is built on the Jena Semantic Web Framework and can be executed as a stand-alone process; its statistics are important for optimizing SPARQL queries. These statistics include the number of anonymous subjects and different types of histograms: a URIHistogram for URI subjects, and histograms for each property and the associated range(s). It also provides methods to fetch the total number of instances for a given class or set of classes, and methods to obtain the URIs of instances.

LODStats [2] is a profiling tool which can be used to obtain 32 different statistical criteria for datasets from the Data Hub. These statistics describe the dataset and its schema, and include the number of triples, triples with blank nodes, labeled subjects, the number of owl:sameAs links, class and property usage, class hierarchy depth, cardinalities, etc. These statistics are then represented using the Vocabulary of Interlinked Datasets (VoID)³ and the Data Cube Vocabulary⁴.
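A few of the schema-level statistics listed above can be illustrated with a minimal sketch. It assumes the dataset is already available in memory as a list of (subject, predicate, object) string tuples, with blank nodes written with the `_:` prefix. The function name `basic_statistics` and the sample triples are hypothetical; the sketch does not reproduce the LODStats implementation.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def basic_statistics(triples):
    """Compute a handful of schema-level statistics over (s, p, o) tuples."""
    class_usage = Counter()      # how often each class is instantiated
    property_usage = Counter()   # how often each predicate is used
    blank_node_triples = 0
    same_as_links = 0

    for s, p, o in triples:
        property_usage[p] += 1
        if p == RDF_TYPE:
            class_usage[o] += 1
        if p == OWL_SAME_AS:
            same_as_links += 1
        # Assumption: blank nodes are serialized with the "_:" prefix.
        if s.startswith("_:") or o.startswith("_:"):
            blank_node_triples += 1

    return {
        "triples": len(triples),
        "blank_node_triples": blank_node_triples,
        "same_as_links": same_as_links,
        "class_usage": dict(class_usage),
        "property_usage": dict(property_usage),
    }


# Hypothetical sample data, for illustration only.
SAMPLE = [
    ("ex:milan", RDF_TYPE, "ex:City"),
    ("ex:rome", RDF_TYPE, "ex:City"),
    ("ex:milan", OWL_SAME_AS, "dbr:Milan"),
    ("_:b0", "ex:locatedIn", "ex:milan"),
]

stats = basic_statistics(SAMPLE)
```

A single pass over the triples is enough for these counters; statistics such as class hierarchy depth would additionally need the schema itself.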

ProLOD [5] is a web-based tool which analyzes the object values of RDF triples and generates statistics on them, such as datatype and pattern distributions. In ProLOD, type detection is performed using regular expression rules, and normalized patterns are used to visualize huge numbers of different patterns. ProLOD also generates statistics on literal values and external links. ProLOD++⁵, an extension of ProLOD, is also a browser-based tool; it implements several algorithms for different profiling, mining and cleansing tasks. The profiling tasks include processes to find the frequencies and distribution of distinct subjects, predicates and objects, the range of the predicates, etc. ProLOD++ can also identify predicate combinations that contain only unique values as key candidates to distinctly identify entities. The mining tasks cover processes such as synonym and inverse predicate discovery, association rules on subjects, predicates and objects, etc. It also performs some cleansing tasks, such as auto-completion of new facts for a given dataset, ontology alignment by identifying synonymous predicates, and identifying cases where the pattern usage is overspecified or underspecified.
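The idea of rule-based type detection and normalized patterns can be sketched as follows. The rules, type names and normalization scheme below are assumptions chosen for illustration; they are not the rules actually used by ProLOD.

```python
import re
from collections import Counter

# Hypothetical, ordered rules: the first matching expression decides the type.
TYPE_RULES = [
    ("integer", re.compile(r"^[+-]?\d+$")),
    ("decimal", re.compile(r"^[+-]?\d+\.\d+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("uri", re.compile(r"^https?://\S+$")),
]


def detect_type(value):
    """Return the name of the first rule that matches, or 'string'."""
    for name, rule in TYPE_RULES:
        if rule.match(value):
            return name
    return "string"


def normalize_pattern(value):
    """Map a value to a coarse pattern: digit runs become 9, letter runs become a."""
    pattern = re.sub(r"\d+", "9", value)
    return re.sub(r"[^\W\d_]+", "a", pattern)


def profile_objects(values):
    """Return the type distribution and the pattern distribution of the values."""
    types = Counter(detect_type(v) for v in values)
    patterns = Counter(normalize_pattern(v) for v in values)
    return types, patterns


# Hypothetical object values, for illustration only.
VALUES = ["42", "7", "3.14", "2014-04-01", "2015-05-31", "Milan",
          "http://dbpedia.org/resource/Milan"]

types, patterns = profile_objects(VALUES)
```

Collapsing runs of digits and letters keeps the number of distinct patterns small, which is what makes a pattern distribution practical to display for large datasets.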

Profiling, as the activity of providing insights into the data, is not only about providing statistics about value distributions, null values, etc., but also refers to the process of finding and extracting information patterns in the data.

³ http://www.w3.org/TR/void/
⁴ http://www.w3.org/TR/vocab-data-cube/
⁵ https://www.hpi.uni-potsdam.de/naumann/sites/prolod++/app.html

In the area of schema summarization, a Knowledge Pattern (KP) can be defined as a template to organise meaningful knowledge [6]. The approach in [15] identifies an abstraction named dataset knowledge architecture that highlights how a dataset is organized and which core knowledge patterns (KPs) can be retrieved from that dataset. These KPs summarise the key features of one or more datasets, revealing a piece of knowledge in a certain domain of interest.

Encyclopedic Knowledge Patterns (EKPs) [12] are knowledge patterns introduced to extract core knowledge for entities of a certain type from Wikipedia page links. EKPs are extracted from the most representative classes describing a concept and contain abstractions of properties. The use of EKPs to support exploratory search is shown in Aemoo⁶, which enriches query results with relevant knowledge coming from different data sources on the Web [13].

In order to understand complex datasets, [4] introduces Statistical Knowledge Patterns (SKPs) to summarize key information about an ontology class, considering synonymity between two properties of a given class. An SKP is stored as an OWL ontology and contains information about axioms derived or not expressed in a reference ontology but which can be promoted by applying some statistical measures.

As shown, current profiling tools provide schema-based statistics such as class/property usage, incoming/outgoing links, etc., but none of the existing works focuses on summarizing the content of a dataset while also applying techniques to profile its quality. The author in [7] proposes an approach to profile the Web of Data; in contrast, the approach proposed here profiles Linked Data in terms of its quality and summarizes datasets in terms of their topics.

3 Research Plan

The contribution of this PhD in the area of Linked Open Data profiling covers (i) the generation of new statistics that are not covered by state-of-the-art techniques, (ii) new algorithms to overcome the challenges of performing profiling in the LOD, and (iii) the development of a methodology on how to perform profiling tasks. In the following we give an overview of the methodology we want to follow in order to accomplish the contribution we want to make in the field.

New statistics for Linked Data Profiling

While much effort has been made, as described in the state of the art, the generated statistics are limited to some basic ones, such as the number of triples, the number of classes/properties used in a dataset, the datatypes or sameAs links used, etc. Datasets hold much more interesting information, which might be hidden but could be useful for the consumer of the dataset. As data profiling refers to the activity of providing useful descriptive information, new techniques to extract this hidden information should be developed. Our intent is to develop automatic approaches that generate new statistics and knowledge patterns to provide a dataset summary and inspect its quality. Different data mining techniques, such as association rule mining, can be used to discover and extract patterns and dependencies in the dataset. These patterns might provide useful information, especially for detecting errors and inconsistencies in spatial data (consistency quality dimension). Implementing different approaches for outlier detection, such as distance-, deviation- or depth-based ones, evolutionary techniques, etc., could provide insight into abnormalities in the underlying data. Other techniques, such as clustering, classification, aggregation, dimensionality reduction or spatial data summarization, might help to provide concise and accurate dataset summaries and to inspect the quality dimensions mentioned in [16].

We intend to further investigate the topical classification of Linked Open Data. The datasets published as LOD cover a wide range of topics, but they lack metadata describing the topical category, so users have difficulty deciding whether a dataset is relevant to their interests. For each of the datasets published as LOD, a label for the topical category was manually assigned [17]. The datasets have only one label for the topical category, while often two or more topics are needed to describe a dataset. The current topical classification of datasets in the LOD is limited to eight categories, while a more fine-grained topical classification might provide more useful information.

⁶ http://wit.istc.cnr.it/aemoo/
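Association rule mining, named earlier in this section as one way to discover dependencies in a dataset, can be sketched on predicate co-occurrence. The sketch below is an assumption about how such rules could be used, not the method developed in this research: it mines rules of the form "subjects that use predicate a also use predicate b" and reports subjects that break a rule as candidates for missing information. All names and sample triples are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def predicate_rules(triples, min_support=0.5, min_confidence=0.75):
    """Mine rules (a, b): subjects using predicate a also use predicate b."""
    predicates_of = defaultdict(set)
    for s, p, _ in triples:
        predicates_of[s].add(p)

    n_subjects = len(predicates_of)
    single = defaultdict(int)  # number of subjects using a predicate
    pair = defaultdict(int)    # number of subjects using both predicates
    for preds in predicates_of.values():
        for p in preds:
            single[p] += 1
        for a, b in permutations(sorted(preds), 2):
            pair[(a, b)] += 1

    rules = {}
    for (a, b), count in pair.items():
        support = count / n_subjects
        confidence = count / single[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules, predicates_of


def violations(rules, predicates_of):
    """Subjects that use the antecedent of a rule but lack its consequent."""
    return sorted(
        (s, a, b)
        for (a, b) in rules
        for s, preds in predicates_of.items()
        if a in preds and b not in preds
    )


# Hypothetical sample data: every city has a name, one lacks a population.
TRIPLES = [
    ("ex:a", "ex:name", "A"), ("ex:a", "ex:population", "1"),
    ("ex:b", "ex:name", "B"), ("ex:b", "ex:population", "2"),
    ("ex:c", "ex:name", "C"), ("ex:c", "ex:population", "3"),
    ("ex:d", "ex:name", "D"), ("ex:d", "ex:population", "4"),
    ("ex:e", "ex:name", "E"),
]

rules, predicates_of = predicate_rules(TRIPLES)
suspects = violations(rules, predicates_of)
```

Here the rule from `ex:name` to `ex:population` holds for four of the five subjects, so `ex:e` is flagged. Whether a violation is an actual error still has to be judged per dataset.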

Overcoming Profiling Challenges

As another contribution of this research, we want to tackle the profiling challenges described in [11]. Traditional profiling tasks cannot be applied to Linked Data due to its heterogeneity. Heterogeneity can appear in different forms. Differences in formats or query languages are called syntactic heterogeneity. Linked Open Data can be represented in different formats and stored in different storage architectures, and the data encoding schemes may also vary; this is referred to as schematic heterogeneity. Datasets published as LOD might use different vocabularies to describe synonymous terms; [11] relates semantic heterogeneity to the discovery of the semantic overlap of the data. Traditional data profiling tools cannot be used to profile Linked Open Data, as they assume the data to be homogeneous and stored in a single repository, while Linked Open Data are neither homogeneous nor stored in a single repository. Also, as the number of published datasets increases, so does the need to adapt and optimise profiling techniques to support huge amounts of data. A good approach when dealing with large datasets is to improve profiling performance by running the calculation of statistics and the extraction of patterns in parallel. We also plan to adapt some data mining techniques to deal with high-dimensional data such as Linked Open Data.
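Running the calculation of statistics in parallel can be sketched as a split, count and merge scheme. The sketch assumes the triples fit in memory and uses a thread pool from the Python standard library only to keep the example self-contained; a real deployment would more likely rely on separate processes or a cluster framework. The function names and chunk size are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_predicates(chunk):
    """Partial statistic: predicate usage within one chunk of triples."""
    return Counter(p for _, p, _ in chunk)


def parallel_predicate_usage(triples, chunk_size=2, workers=4):
    """Compute predicate usage per chunk in parallel, then merge the counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_predicates, chunked(triples, chunk_size)):
            total.update(partial)
    return total


# Hypothetical sample data, for illustration only.
TRIPLES = [
    ("ex:a", "ex:name", "A"),
    ("ex:a", "ex:population", "1"),
    ("ex:b", "ex:name", "B"),
    ("ex:b", "ex:locatedIn", "ex:c"),
    ("ex:c", "ex:name", "C"),
]

usage = parallel_predicate_usage(TRIPLES)
```

The scheme works because counting is additive: partial counters computed on disjoint chunks can be merged in any order. Statistics that are not additive, such as distinct counts, need a different merge step.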

Methodology to Profile Linked Open Data

As another contribution of this research, we intend to develop a methodology on how to perform profiling tasks. This methodology would classify profiling tasks depending on their purpose and also provide guidelines for appropriately selecting the tasks needed by the user.

4 Preliminary Results

This PhD work is now in its second year. As a first step, we measured the value of Linked Open Data by profiling the data published as Open Data by the Italian Public Administrations. In this work we profiled the adoption of Linked Open Data best practices and local laws by the Italian Public Administrations, calculating a compliance index that considers three quality dimensions for the published data: completeness, accuracy and timeliness [18].

As mentioned in Sec. 3, the main contribution of this research is to provide new techniques for dataset summarization and new statistics about the data. ABSTAT⁷ is a framework which can be used to summarise linked datasets and, at the same time, to provide statistics about them. The summary consists of Abstract Knowledge Patterns (AKPs) of the form <subjectType, predicate, objectType>, which represent the occurrence of triples <sub, pred, obj> in the data, such that subjectType is a minimal type of sub and objectType is a minimal type of obj. ABSTAT summaries can help users compare in which of two datasets a concept is described with richer and more diverse properties, and also help detect errors in the data, such as missing datatypes or datatype diversity, etc. [14]. ABSTAT can also be used to fix the domain and range information for properties. Either the domain or the range is unspecified for 585 properties in the DBpedia Ontology, and AKPs can help us determine at least one domain and one range for the unspecified properties. For example, for the property http://dbpedia.org/ontology/governmentType in DBpedia we do not have information about the domain. With our approach we can derive 7 different AKPs, meaning that we can derive 7 domains for this property.
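The definition of an AKP can be made concrete with a small sketch. It is a simplification under stated assumptions, not the ABSTAT implementation: the subclass hierarchy is given as a child-to-parents map and is acyclic, resources without an asserted type fall back to a hypothetical default `owl:Thing`, and literals are not treated separately. The sample triples are invented for illustration.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def ancestors(cls, parents):
    """All strict superclasses of cls, given a child -> set(parents) map."""
    seen, stack = set(), list(parents.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def minimal_types(types, parents):
    """Asserted types that are not a superclass of another asserted type."""
    covered = set()
    for t in types:
        covered |= ancestors(t, parents)
    return {t for t in types if t not in covered}


def abstract_knowledge_patterns(triples, parents, untyped="owl:Thing"):
    """Count <subjectType, predicate, objectType> patterns over minimal types."""
    types_of = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_of.setdefault(s, set()).add(o)

    def min_types(resource):
        # Assumption: untyped resources are assigned the default type.
        return minimal_types(types_of.get(resource, set()), parents) or {untyped}

    akps = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for subject_type in min_types(s):
            for object_type in min_types(o):
                akps[(subject_type, p, object_type)] += 1
    return akps


# Hypothetical hierarchy and data, for illustration only.
PARENTS = {"dbo:City": {"dbo:Place"}, "dbo:Country": {"dbo:Place"}}
TRIPLES = [
    ("dbr:Milan", RDF_TYPE, "dbo:City"), ("dbr:Milan", RDF_TYPE, "dbo:Place"),
    ("dbr:Rome", RDF_TYPE, "dbo:City"),
    ("dbr:Italy", RDF_TYPE, "dbo:Country"), ("dbr:Italy", RDF_TYPE, "dbo:Place"),
    ("dbr:Milan", "dbo:country", "dbr:Italy"),
    ("dbr:Rome", "dbo:country", "dbr:Italy"),
    ("dbr:Italy", "dbo:governmentType", "dbr:Republic"),
]

akps = abstract_knowledge_patterns(TRIPLES, PARENTS)
# Candidate domains of a property: subject types of the AKPs that use it.
domains = {st for (st, p, _) in akps if p == "dbo:governmentType"}
```

Reading candidate domains off the subject types of the patterns mirrors the governmentType example above: each distinct AKP for a property contributes one candidate domain.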

We further investigated one of the challenges still present in Linked Open Data datasets: topical classification. We built the first automatic approach to classify LOD datasets into the topical categories used by the LOD cloud diagram. For the classification we considered eight feature sets: vocabulary, class and property usage, local class/property names, text from rdfs:label, top-level domain, and in- and out-degree. Table 1 shows the results of training three classifiers (k-NN, Naive Bayes and Decision Tree) with three balancing approaches (no sampling, down-sampling and up-sampling) and two normalization techniques, considering the binary occurrence and the relative term occurrence of each term or vocabulary. Our approach achieves an accuracy of 81.62% [10].
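The two normalization techniques and the k-NN classifier can be illustrated with a minimal sketch. It assumes each dataset is described by the number of times it uses each vocabulary, and it uses cosine similarity as the distance measure, which is a choice made here for illustration. The training data, labels and function names are hypothetical and do not reproduce the experimental setup of [10].

```python
import math
from collections import Counter


def binary_occurrence(term_counts):
    """1.0 for every term that occurs at least once."""
    return {term: 1.0 for term, count in term_counts.items() if count > 0}


def relative_occurrence(term_counts):
    """Share of each term among all term occurrences of the dataset."""
    total = sum(term_counts.values())
    if not total:
        return {}
    return {term: count / total for term, count in term_counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(train, query, k=3):
    """train: list of (vector, label); majority label among the k most similar."""
    nearest = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical vocabulary usage counts per dataset, for illustration only.
RAW_TRAIN = [
    ({"foaf": 10, "sioc": 5}, "social networking"),
    ({"foaf": 8, "sioc": 2}, "social networking"),
    ({"geo": 12, "gn": 4}, "geographic"),
    ({"geo": 3, "gn": 9}, "geographic"),
]
RAW_QUERY = {"geo": 5, "foaf": 1}

train = [(relative_occurrence(counts), label) for counts, label in RAW_TRAIN]
predicted = knn_predict(train, relative_occurrence(RAW_QUERY))
```

With relative occurrence the query is dominated by its `geo` usage, so its nearest neighbours are mostly geographic datasets. Binary occurrence would weight `geo` and `foaf` equally for the same query, which is why the choice of normalization can change the result.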

An in-depth literature study of the tools used to profile LOD has been undertaken. We analyzed existing tools in terms of the goal they are used for, techniques, input, output, approach, automation, license, etc., with the aim of having a complete view of the existing approaches and techniques for profiling, which helps us in determining new statistics or new techniques. This in-depth study will also help with the third contribution: classifying profiling tasks and creating a general methodology for each task depending on the use case.

⁷ http://abstat.disco.unimib.it/

5 Lessons Learned, Open Issues and Future Work

The main contribution of this PhD work is to address the challenges mentioned in Sec. 3 in order to build a framework for profiling Linked Open Data that gives insights into the data despite their heterogeneous nature. Evaluating the validity of the proposed approach or of the results achieved is very difficult, as in the field of LOD profiling there is no gold standard, and it is therefore very difficult to compare with other approaches. To address this issue, we want to further explore how these new statistics or summaries can improve the performance of current profiling techniques and tools, e.g. how profiling tasks can improve full-text search. To evaluate the validity of the proposed profiling techniques for summarising datasets, as pattern discovery is not trivial, humans will evaluate the summaries in terms of relatedness and informativeness. We intend to provide users with a list of statistics and ask them which, in their opinion, are most important for supporting the profiling of Linked Open Data. The evaluation of the performance of profiling tasks is very difficult and remains an open issue on which I am currently working.

The ABSTAT framework provides some contributions in summarising Linked Open Data, and detecting quality issues. We are working to enrich this framework with other statistics and to apply it to unstructured data such as microdata.

Regarding the topical classification of LOD datasets, we will consider the problem as a multi-label classification task. As the datasets in the LOD cloud are unbalanced, a two-stage approach might help, while a classifier chain, which makes a prediction for one class after the other, could address the multi-label problem. Up to now, our experiments have not exploited RDF links beyond the datasets' in- and out-degree, so link-based classification techniques could be applied to further investigate the content of a dataset.

Acknowledgements

This research has been supported in part by FP7/2013-2015 COMSODE (under contract number FP7-ICT-611358). I would like to thank my supervisor Assoc. Prof Andrea Maurino, my supervisor during my visiting period Prof. Dr Christian Bizer, Asst. Prof Matteo Palmonari, Dr. Anisa Rula for their priceless suggestions and also the anonymous reviewers for their helpful comments.

[1] Assaf, Troncy, and Senart. Roomba: An extensible framework to validate and build dataset profiles. In The 2nd International Workshop on Dataset PROFIling and fEderated Search for Linked Data (PROFILES '15) co-located with ESWC 2015, Portoroz, Slovenia, May 31 - June 1, 2015, pages 32-46, 2015.

[2] Auer, Demter, Martin, and J. Lehmann. LODStats - an extensible framework for high-performance dataset analytics. In 18th International Conference, EKAW 2012, Galway City, Ireland, October 8-12, 2012.

[3] Bizer, Heath, and Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1-22, 2009.

[4] Blomqvist, Zhang, A. L. Gentile, I. Augenstein, and Ciravegna. Statistical knowledge patterns for characterising linked data. In Proceedings of the 4th Workshop on Ontology and Semantic Web Patterns co-located with ISWC 2013, Sydney, Australia, October 21, 2013.

[5] Böhm, Naumann, Abedjan, Fenz, T. Grütze, Hefenbrock, Pohl, and Sonnabend. Profiling linked open data with ProLOD. In Workshops Proceedings of the 26th ICDE 2010, March 1-6, 2010, Long Beach, California, USA.

[6] Gangemi and Presutti. Towards a pattern science for the semantic web. Semantic Web, 1(1-2):61-68, 2010.

[7] Jentzsch. Profiling the web of data. Proceedings of the 8th Ph.D. retreat of the HPI research school on service-oriented systems engineering, page 101, 2014.

[8] Khatchadourian and M. P. Consens. ExpLOD: Summary-based exploration of interlinking and RDF usage in the linked open data cloud. In ESWC 2010, Heraklion, Crete, Greece, May 30 - June 3, 2010, pages 272-287, 2010.

[9] Langegger and Wöß. RDFStats - an extensible RDF statistics generator and library. In Database and Expert Systems Applications, DEXA, International Workshops, Linz, Austria, August 31 - September 4, 2009, pages 79-83, 2009.

[10] Meusel, Spahiu, Bizer, and Paulheim. Towards automatic topical classification of LOD datasets. In Proceedings of the 24th International Conference on World Wide Web, LDOW Workshop, Florence, Italy, May 18-22, 2015.

[11] Naumann. Data profiling revisited. SIGMOD Record, 42(4):40-49, 2013.

[12] A. G. Nuzzolese, Gangemi, Presutti, and Ciancarini. Encyclopedic knowledge patterns from Wikipedia links. In The Semantic Web - ISWC 2011, Bonn, Germany, October 23-27, 2011, Proceedings, Part, pages 520-536, 2011.

[13] A. G. Nuzzolese, Presutti, Gangemi, Musetti, and Ciancarini. Aemoo: exploring knowledge on the web. In Web Science 2013 (co-located with ECRC), WebSci '13, Paris, France, May 2-4, 2013, pages 272-275, 2013.

[14] Palmonari, Rula, Porrini, Maurino, Spahiu, and Ferme. ABSTAT: Linked data summaries with abstraction and statistics. In European Semantic Web Conference 2015 (ESWC 2015), Portoroz, Slovenia, 31st May - 4th June 2015.

[15] Presutti, Aroyo, Adamou, B. A. C. Schopman, Gangemi, and Schreiber. Extracting core knowledge from linked data. In Proceedings of the COLD 2011, Bonn, Germany, October 23, 2011.

[16] Rula and Zaveri. Methodology for assessment of linked data quality. In Proceedings of the 1st Workshop on Linked Data Quality co-located with 10th International Conference on Semantic Systems, LDQ@SEMANTiCS 2014, Leipzig, Germany, September 2nd, 2014.

[17] Schmachtenberg, Bizer, and Paulheim. Adoption of the linked data best practices in different topical domains. In The Semantic Web - ISWC 2014, Riva del Garda, Italy, October 19-23, 2014, Proceedings, Part, pages 245-260, 2014.

[18] Viscusi, Spahiu, Maurino, and Batini. Compliance with open government data policies: An empirical assessment of Italian local public administrations. Information Polity, 19(3-4):263-275, 2014.