
Assessing Quantity and Quality of Links Between Linked Data Datasets

Ciro Baron Neto

cbaron@informatik.uni-leipzig.de
Leipzig University, AKSW/KILT

Source Dataset: eagle-i @ Ponce - School of Medicine; Radata na!; eagle-i @ Vanderbilt University; I-Choose; The Cancer Genome Atlas

The Linked Data Web is growing, and it is becoming increasingly necessary to analyze the relationships between datasets to exploit its full value. LOD datasets range from datasets with low cohesion, containing data from different Fully Qualified Domain Names (FQDNs) and namespaces, to highly cohesive datasets. This paper evaluates the quantity and quality of links between distributions, datasets and ontologies, categorizing and defining different types of links. We streamed and indexed 2.5 billion triples and extracted 0.5 billion links using probabilistic data structures. Our results analyze datasets with respect to valid links, dead links, and the number of namespaces described by distributions and datasets. They indicate that 7.9% of the links we indexed and verified are actually dead.

Keywords: Linked Open Data; Linksets; Dead Links; RDF

1. INTRODUCTION

In this paper, we present a thorough analysis of the links between the datasets participating in the 2014 LOD cloud [11] and Linked Open Vocabularies². The aim of this paper is to evaluate the quality of links between these knowledge bases. The analysis was conducted with the engine of LODVader³, a real-time LOD Visualisation, Analytics and DiscovEry tool. Our novel approach based on Bloom filters allows us to accurately measure the number of links between datasets and distributions, as well as to identify dead and unverified links (cf. Section 2) between datasets. The remainder of this work is structured as follows: we describe metadata vocabularies, link granularity and linksets in Section 2, followed by the details of our methodology in Section 3. Section 4 describes the results of our analysis, and Section 5 presents related work. Finally, Section 6 presents our conclusions and future work.

¹ http://lod-cloud.net/#history
² http://lov.okfn.org/dataset/lov/
³ For the interface see http://svn.aksw.org/papers/2016/WWW_LODVader_DEMO/public.pdf

[Figure 1: Overview of links at different levels of granularity: datasets ID1 ... IDn, their distributions D1,1 ... Dn,m, and the linksets Lreal and L1.]

2. BACKGROUND

2.1 Dataset Metadata Vocabularies

In order to identify which resources should be streamed and analyzed, this work relies on vocabularies such as DCAT [8], VoID⁴ and DataID [2]. These vocabularies are used to represent metadata descriptions of datasets. They provide information about multiple properties of a dataset, including its subsets and distributions. A subset is a distinct part of a dataset that can be differentiated for a number of reasons, such as differences in provenance, publication date, accessibility or language⁵. Distributions describe the specific files or resources by which the datasets might be accessed or acquired⁶.

2.2 Linkset Definition

Linksets are RDF descriptions of relations between datasets or distributions, represented by links. We adopted the DCAT and VoID vocabularies to describe the number of links, as well as the source and target datasets. To clarify the variables used in the definition of a linkset, a brief explanation is given.

$ID$: a dataset, described by void:Dataset or dcat:Dataset;
$S_{ID}$: the set of subsets, described by void:subset, of a given dataset $ID$;
$\langle s, p, o \rangle$: the RDF triple which represents the subject $s$, predicate $p$ and object $o$ of a given relation;
$d_n$: the $n$-th distribution, consisting of a set of RDF triples;
$D_{ID}$: the set of distributions, described by dcat:distribution, of the dataset $ID$;
$D_{S_{ID}}$: the set of distributions of subset $S$ of dataset $ID$;
$L_{d_s \to d_t}$: the set of existing links between two distributions, having $d_s$ as source distribution and $d_t$ as target distribution.

We define that a link occurs from a distribution $d_s$ to a distribution $d_t$ whenever $d_s$ contains $\langle s_s, p_s, o_s \rangle$ and $d_t$ contains $\langle s_t, p_t, o_t \rangle$ where $o_s = s_t$. We then call the triple $\langle s_s, p_s, o_s \rangle$ in the source distribution a link (regardless of the property used) and say that the distributions are linked with each other (cf. Section 2.4). From this definition it follows that linksets between distributions (subsets or datasets) can be aggregated in a straightforward manner. Consequently, a dataset $ID_s$ is linked to another dataset $ID_t$ if a non-empty linkset from any distribution in $D_{S_{ID_s}}$ to any distribution in $D_{S_{ID_t}}$ exists.

⁴ http://www.w3.org/TR/void/
⁵ http://www.w3.org/TR/void/#subset
⁶ http://www.w3.org/TR/vocab-dcat/#class-distribution
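The link definition above can be illustrated with a minimal sketch. It assumes that both distributions are small enough to be held in memory as lists of (s, p, o) tuples; the function name `linkset` and the example IRIs are hypothetical and are not taken from LODVader.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all given as IRI strings


def linkset(source: Iterable[Triple], target: Iterable[Triple]) -> List[Triple]:
    """Return the triples of `source` whose object occurs as a subject in `target`."""
    target_subjects = {s for s, _, _ in target}
    # The predicate is deliberately ignored: any property can form a link.
    return [t for t in source if t[2] in target_subjects]


# Hypothetical example distributions.
d_s = [
    ("http://a.example/r1", "http://www.w3.org/2002/07/owl#sameAs", "http://b.example/x"),
    ("http://a.example/r2", "http://www.w3.org/2000/01/rdf-schema#seeAlso", "http://b.example/missing"),
]
d_t = [
    ("http://b.example/x", "http://b.example/related", "http://b.example/y"),
]

links = linkset(d_s, d_t)  # only the first triple of d_s is a link into d_t
```

Note that the relation is directional: swapping the arguments yields the linkset from d_t to d_s, which is empty in this example.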

Furthermore, we define the following notions in order to describe dead or unverified links. A dead link on the WWW is generally associated with an HTTP 404 Not Found response message. Analogously, we define "Not Found" between a distribution and a dataset:

$NS_n(uri)$: the namespace of a URI, where $NS_0$ refers to the FQDN (incl. subdomain), $NS_n$ refers to the FQDN plus the URI path of length $n$, and the full namespace refers to the FQDN plus the path up to and including the last '/' or '#'. In this paper, we work with the full namespace only, written simply $NS$, although investigating other variants would be interesting.
$S_{NS_S}(D)$: the set of $NS(s_t)$ for all subjects $s_t$ in all distributions of dataset $D$.

A partial dead link $\langle s_1, p_1, o_1 \rangle$ between a distribution $d_1$ and a dataset $D$ exists if $NS(o_1) \in S_{NS_S}(D)$ and $\nexists$ triple $t \in D \mid o_1 = s_t$. Note that this definition is based on the assumption that namespaces are unique to datasets. Given that there are several datasets with applicable namespaces, a total dead link, or just dead link, means that the respective object is not found as a subject in any (already indexed) dataset with overlapping namespaces.

An unverified link $\langle s_1, p_1, o_1 \rangle$ exists if $NS(o_1)$ cannot be found in any indexed dataset, i.e. there are no overlapping namespaces. As we are not investigating HTTP resolution, we have to assume bona fide that we have simply not indexed the target dataset yet.
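The three notions (valid, dead and unverified links) can be sketched as follows. This is an illustration under simplifying assumptions, not the LODVader implementation: the index is a plain dictionary from a dataset name to the set of its subject IRIs, the namespace keeps the URI scheme in front of the FQDN, and the names `namespace` and `classify` are hypothetical.

```python
from typing import Dict, Set


def namespace(uri: str) -> str:
    """Return the URI up to and including the last '/' or '#'."""
    cut = max(uri.rfind("/"), uri.rfind("#"))
    return uri[: cut + 1] if cut >= 0 else uri


def classify(obj: str, subjects_by_dataset: Dict[str, Set[str]]) -> str:
    """Classify the object of a triple as 'valid', 'dead' or 'unverified'."""
    ns = namespace(obj)
    # Indexed datasets whose subject namespaces overlap with the object's namespace.
    candidates = [
        subjects
        for subjects in subjects_by_dataset.values()
        if any(namespace(s) == ns for s in subjects)
    ]
    if not candidates:
        return "unverified"  # no indexed dataset describes this namespace
    if any(obj in subjects for subjects in candidates):
        return "valid"  # the object is found as a subject
    return "dead"  # the namespace is known, but the resource is not described


# Hypothetical index containing a single dataset "B".
index = {"B": {"http://b.example/x", "http://b.example/y"}}
```

With this index, `http://b.example/x` is a valid target, `http://b.example/gone` is dead, and `http://c.example/z` is unverified because no indexed dataset covers its namespace.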

2.3 Link Granularity

The LOD cloud diagram [11] uses the Pay-Level Domain (PLD) [7] as the basis for its dataset definition. It consequently only depicts inter-dataset relations as links. LODVader also offers visualisation and analysis of intra-dataset relationships, for example between subsets and distributions, featuring a higher link granularity. Figure 1 shows an overview of links at different levels of granularity regarding a linkset representation. Datasets are represented by $ID_n$, subsets by $S_n$ and distributions by $D_n$. $L_{real}$ is a linkset containing links between two distributions, which are measured on the intersection of subjects and objects (cf. Section 2.2). The linksets $L_1$ to $L_4$ can be generated by calculating the union of the linksets between all distributions of the respective subsets and datasets.
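The aggregation by union can be sketched as below. The data layout is an assumption made for illustration: distribution-level linksets are keyed by (source distribution, target distribution), and a hypothetical mapping `dataset_of` assigns each distribution to its dataset.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Link = Tuple[str, str, str]  # a link is the source triple (s, p, o)
Pair = Tuple[str, str]       # (source identifier, target identifier)


def aggregate_linksets(
    dist_linksets: Dict[Pair, Set[Link]],
    dataset_of: Dict[str, str],
) -> Dict[Pair, Set[Link]]:
    """Union distribution-level linksets into dataset-level linksets."""
    aggregated: Dict[Pair, Set[Link]] = defaultdict(set)
    for (src_dist, tgt_dist), links in dist_linksets.items():
        key = (dataset_of[src_dist], dataset_of[tgt_dist])
        aggregated[key] |= links
    return dict(aggregated)


# Hypothetical example: distributions d1 and d2 belong to dataset A, d3 to dataset B.
dataset_of = {"d1": "A", "d2": "A", "d3": "B"}
dist_linksets = {
    ("d1", "d3"): {("http://a.example/r1", "http://a.example/p", "http://b.example/x")},
    ("d2", "d3"): {("http://a.example/r2", "http://a.example/p", "http://b.example/y")},
}
dataset_linksets = aggregate_linksets(dist_linksets, dataset_of)
```

The same function could be applied at the subset level by passing a mapping from distributions to subsets instead.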

2.4 Linking Predicates

Common approaches for linking analysis rely on the inspection of predicates. owl:sameAs has well-defined formal semantics and is the predicate which is closest to traditional deduplication. For record linkage or object reconciliation in the database area, counting owl:sameAs links exclusively provides a very limited view of the Web of Data and does not provide a reliable model [6].

Several other properties have been proposed, with rdfs:seeAlso and skos:{exact|close|broad|narrow|related}Match being the most common. In our work, we are tolerant and consider all predicates for linking. While link direction is important for crawling (although DBpedia is the largest authority [11], no backlinks are included), we argue that linking properties are often either symmetric (and highly unlikely to be asymmetric), or it is feasible to assume that an inverse property exists or could easily be created, i.e., following a $birthplace \leftrightarrow isBirthplaceOf$ pattern or simply $birthplace^{-1}$.

To the best of our knowledge, we have not yet encountered predicates expressing negative links (e.g. notLinkedTo).

Vocabulary Links. Another aspect of linking properties that is often neglected is links to vocabularies and links between vocabularies. In particular, linkage via rdf:type has not yet been visualized in a cloud diagram and is often not included in link analysis.

3. METHODOLOGY

We parsed description files from Linked Open Vocabularies⁷, DBpedia datasets and the LOD cloud, searching for instances of dcat:Distribution, henceforth called source distributions. The application then fetches the dcat:downloadURL or void:dataDump object. Before the download of a source distribution starts, the system checks whether the dataset has already been imported. If the dataset is known, the system reads the Last-Modified date and Content-Length in the HTTP header to check whether the dataset has changed. If there are modifications, the old data is moved to an archive so that it can be used for versioning. Once streaming starts, we detect the serialization type, decompress the stream if necessary and parse the RDF triples. It is important to emphasize that, since LODVader is publicly available, more and more datasets are being added and analyzed.
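The change check based on the Last-Modified and Content-Length headers can be sketched as follows. This is a simplified illustration using only the Python standard library; the function names and the idea of keeping the previously seen header values in a dictionary are assumptions, not details of the LODVader implementation.

```python
import urllib.request
from typing import Dict, Mapping, Optional

CHECKED_HEADERS = ("Last-Modified", "Content-Length")


def fetch_headers(url: str) -> Dict[str, Optional[str]]:
    """Issue an HTTP HEAD request and return the headers used for change detection."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return {name: response.headers.get(name) for name in CHECKED_HEADERS}


def has_changed(
    stored: Optional[Mapping[str, Optional[str]]],
    current: Mapping[str, Optional[str]],
) -> bool:
    """Return True if the distribution is unknown or one of the checked headers differs."""
    if stored is None:
        return True  # never imported before
    return any(stored.get(name) != current.get(name) for name in CHECKED_HEADERS)


# Hypothetical header values from a previous import and from the current check.
previous = {"Last-Modified": "Mon, 01 Feb 2016 10:00:00 GMT", "Content-Length": "1024"}
unchanged = dict(previous)
modified = {"Last-Modified": "Tue, 02 Feb 2016 09:30:00 GMT", "Content-Length": "2048"}
```

In use, `fetch_headers` would be called with the dcat:downloadURL or void:dataDump value of a distribution, and the download would be skipped when `has_changed` returns False.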

Link discovery is performed on the fly for each distribution streamed. For every triple, the Linking Analytics module discards the predicate and takes only the subject and the object as input ($\langle s, o \rangle$). If the object is a literal or a blank node, the tuple is discarded. As a final filtering step, we reject tuples with malformed IRIs. The tuples that pass the filtering step enter a processing pipeline:

1. Tuple splitting. Subjects and objects of each tuple are separated and saved in two queues. The queues contain resources which will be compared with Bloom filters (BFs).
2. BF fetching. We extract the namespace of each resource to compare and assign the resource to a respective BF which represents a target distribution. For every namespace we encounter, we fetch all the existing BFs that have been processed and stored in a cache memory.
3. Link extraction. Objects and subjects of the source distribution are compared with the in-memory BFs of the target distributions. If an object of the source distribution exists in the BF of the target distribution as a subject, we count one link from the source distribution to the target distribution. Conversely, if a subject of the source distribution exists in the BF of the target distribution as an object, we count one link from the target distribution to the source distribution. The non-existence of a link between a source distribution and a target dataset is counted as a dead link between the source distribution and the target dataset.

⁷ http://lov.okfn.org/dataset/lov/
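The filtering and link extraction steps can be sketched with a small, self-contained Bloom filter. Everything in this block is an illustrative assumption rather than the LODVader code: the filter size, the number of hash functions, the SHA-256 based hashing, the crude `is_iri` check, and the fact that only the forward direction (object of the source found as subject of the target) is counted.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def is_iri(term: str) -> bool:
    """Crude filter: keep absolute http(s) IRIs, drop literals and blank nodes."""
    return term.startswith(("http://", "https://")) and " " not in term


def count_links(source: Iterable[Tuple[str, str, str]], target_subjects: BloomFilter) -> int:
    """Count source triples whose object is (probably) a subject of the target."""
    links = 0
    for s, _predicate, o in source:
        if not (is_iri(s) and is_iri(o)):
            continue  # literal, blank node or malformed term
        if o in target_subjects:
            links += 1
    return links


# Hypothetical target distribution with a single subject.
target_bf = BloomFilter()
target_bf.add("http://b.example/x")

source_triples = [
    ("http://a.example/r1", "http://a.example/p", "http://b.example/x"),
    ("http://a.example/r2", "http://a.example/p", "a plain literal"),
    ("http://a.example/r3", "http://a.example/p", "_:b0"),
    ("http://a.example/r4", "http://a.example/p", "http://b.example/unknown"),
]
link_count = count_links(source_triples, target_bf)
```

Because a Bloom filter can report false positives, a count obtained this way is an upper bound on the number of true links; how close it is depends on the filter parameters.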

At the end of the pipeline, two sets of BFs are created: one containing all subjects and one containing all objects of the source distribution. These BFs represent the current distribution and may be used later when other source distributions are streamed.

It is important to stress that, although our model reads and retrieves RDF data, it does not store any RDF. Our implementation creates RDF on the fly by reading documents from MongoDB and using Apache Jena to create RDF models. All stored BFs have the same size (each BF describes 5000 resources), so that querying any resource from any distribution has quasi-linear time complexity. For big distributions with more than 5000 triples, multiple BFs are created. In addition, the BFs are not stored directly in the file system; instead, GridFS⁸ is used to manage the BF files. More detailed documentation of the implementation can be found in the LODVader GitHub⁹ repository.
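To make the fixed capacity of 5000 resources per filter concrete, the sketch below splits a stream of resources into groups of at most 5000 and computes the standard Bloom filter sizing for that capacity. The false-positive rate of 0.001 is an assumption chosen purely for illustration (the value used by LODVader is not stated here), and the function names are hypothetical.

```python
import math
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def bloom_parameters(capacity: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (number of bits, number of hash functions) for a standard Bloom filter."""
    bits = math.ceil(-capacity * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


def chunks(resources: Iterable[str], capacity: int = 5000) -> Iterator[List[str]]:
    """Split a stream of resources into groups of at most `capacity` items, one per filter."""
    iterator = iter(resources)
    while True:
        group = list(islice(iterator, capacity))
        if not group:
            return
        yield group


# Assumed false-positive rate of 0.1% for a filter holding 5000 resources.
bits, hashes = bloom_parameters(capacity=5000, false_positive_rate=0.001)

# A hypothetical distribution with 12000 resources needs three filters.
groups = list(chunks(f"http://a.example/r{i}" for i in range(12000)))
```

Keeping every filter at the same capacity means each lookup costs the same number of hash evaluations, and the total cost for a distribution grows with the number of filters it is split into.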

4. RESULTS

To analyze the quantity and quality of links between Linked Data datasets in general, we streamed all datasets found in the metadata description file of the Linking Open Data cloud diagram 2014¹⁰, the DBpedia Core¹¹ distributions and all vocabularies found on Linked Open Vocabularies¹². At the time of writing, we had discovered¹³ 185 million verified links (out of 0.5 billion links in total) among 1408 datasets and 395 vocabularies, totaling more than 2.5 billion triples. These numbers keep growing, as more users provide good metadata and users can submit their own datasets to our analysis.

⁸ https://docs.mongodb.org/manual/core/gridfs/
⁹ https://github.com/AKSW/LODVader
¹⁰ http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/
¹¹ http://downloads.dbpedia.org/current/core/
¹² http://lov.okfn.org/dataset/lov/
¹³ http://lodvader.aksw.org/#/stats

Table 1 (Name column): Educational programs - SISVU; statistics.data.gov.uk; Farmers Markets Geo. Data (U.S.); VIVO Weill Cornell Medical College; VIVO WUSTL; ...; eagle-i @ Dartmouth College; TaxonConcept Knowledge Base; eagle-i @ Montana State University; The Living LOD Cloud; Ontos News Portal

Our result analysis consists of three steps. First, in order to know whether a dataset is suitable for describing certain resources (e.g., subjects or objects), we extracted all namespaces with their respective proportions in the datasets. Next, we calculated the indegree and outdegree per dataset, and finally, we calculated the indegree and outdegree of dead links among datasets. Our metric for indegree and outdegree is the number of datasets which contain one or more links to or from the current dataset.

Several datasets describe a single namespace; however, more than 70% of the datasets describe two or more. Table 1 shows the datasets with the largest and smallest proportions of described namespaces. The column "# NS" contains the number of distinct namespaces for the dataset, and the last column shows the proportion of the predominant namespace. The top 5 rows show datasets with highly predominant namespaces, and the last 5 rows show datasets with completely mixed namespaces.

Table 4 and Table 5 show the top 5 datasets with dead indegree links and the top 5 datasets with dead outdegree links. Dead indegree means that external datasets link to non-existing resources of a dataset. Dead outdegree refers to datasets that contain dead links to external datasets. The in- and outdegree is aggregated at the dataset level, and "links" gives the total number of dead links.
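The two metrics described above (degree per dataset and proportion of the predominant namespace) can be sketched as follows. The input formats are assumptions made for illustration: dataset-level links are given as (source dataset, target dataset) pairs, intra-dataset pairs are skipped, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def degrees(dataset_links: Iterable[Tuple[str, str]]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (indegree, outdegree) counted in distinct neighbouring datasets."""
    incoming: Dict[str, Set[str]] = defaultdict(set)
    outgoing: Dict[str, Set[str]] = defaultdict(set)
    for source, target in dataset_links:
        if source == target:
            continue  # assumption: intra-dataset links do not count towards the degree
        outgoing[source].add(target)
        incoming[target].add(source)
    indegree = {dataset: len(others) for dataset, others in incoming.items()}
    outdegree = {dataset: len(others) for dataset, others in outgoing.items()}
    return indegree, outdegree


def namespace_profile(uris: Iterable[str]) -> Tuple[int, float]:
    """Return (number of distinct namespaces, share of the predominant namespace)."""
    counts = Counter(u[: max(u.rfind("/"), u.rfind("#")) + 1] for u in uris)
    if not counts:
        return 0, 0.0
    total = sum(counts.values())
    return len(counts), counts.most_common(1)[0][1] / total


# Hypothetical dataset-level links; repeated pairs count only once.
indegree, outdegree = degrees([("A", "B"), ("A", "B"), ("A", "C"), ("C", "B")])

# Hypothetical resources of one dataset: three in one namespace, one in another.
profile = namespace_profile([
    "http://a.example/r1",
    "http://a.example/r2",
    "http://a.example/r3",
    "http://b.example/onto#Thing",
])
```

The same `degrees` function could be applied to dead links only, which would give the dead indegree and dead outdegree per dataset.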

Source Dataset: DBpedia Core; eagle-i @ Dartmouth College; eagle-i @ Uni. Alaska; eagle-i @ Charles R. Drew Uni.; TaxonConcept Knowledge Base
Target Dataset: The Living LOD Cloud; TaxonConcept Knowledge Base; VIVO Cornell; eagle-i @ Jackson State University; Traditional Korean Medicine Ont.
Outdegree: 13; 13; 21; 41; 1
Target Dataset: The Media RDF Vocabulary; Document Availability Information Ont; VIVO Core Ontology; An Ontology for vCards; Conversion Ontology

Finally, Figure 2 provides an overview of the total number of valid links, dead links and unverified links. In total, we found 302,855,189 unverified links, 12,430,800 dead links and 172,254,731 valid links. The large number of unverified links is due to the fact that our coverage is still limited; it keeps widening as new datasets are added. It is worth noting, though, that 7.9% of the verified links are dead links.

5. RELATED WORK

Most link discovery (LD) frameworks can only determine links based on owl:sameAs or equivalent instances. RDF-AI [5], however, is a framework which takes two datasets as input and generates a new dataset containing a list of correspondences between equivalent resources of the input datasets. The system is composed of five modules, which allow RDF datasets to be pre-processed, matched, fused, interlinked and post-processed.

[Figure 2: Proportions of link types: dead links 3%, unverified links 62%, valid links 35%.]

Due to the strong growth of the LOD cloud, there is a clear demand for analytical frameworks for the LOD cloud. Some statistical information can be found together with the LOD cloud diagram [11] [3]. Unfortunately, this statistical information is static.

Another good example is Aether [9]. It provides the user with a wide range of statistical information about a dataset when given a SPARQL endpoint address. It is even possible to compare different SPARQL endpoints, which is useful when two endpoints need to be analyzed. Although this framework supplies the user with rich statistical information and pie charts, it was developed only for comparing the content of two SPARQL endpoints. LOD-Laundromat [1] provides a uniform way to publish and clean datasets. Various statistics are published, such as duplicated triples, number of triples and dataset size. LOD-Laundromat contains over 38 billion triples; however, it does not provide metadata regarding dataset labels, names or titles, which makes visualizing the whole graph a hard task.

6. CONCLUSIONS AND FUTURE WORK

This paper classified and evaluated links among more than 1,200 datasets with respect to dataset indegree and outdegree for different types of links. We discovered a total of 0.5 billion links, of which 12.5M were dead, and we could not verify 302M links. This suggests that around 7.9% of the verified LOD links we indexed are dead. This number is based on the current coverage of indexed datasets in our analysis. Indexing new datasets can raise this number (if more dead links are discovered) as well as lower it (if a dataset is indexed that contains link targets). However, we have already invested a lot of effort into discovering as many datasets as possible and assume that an average Linked Data consumer would not go to such lengths to retrieve data.

In order to expand the coverage of our analysis, we expect to work in collaboration with other approaches such as LOD-Laundromat [1]. We believe that at least the number of unverified links can be reduced as more datasets are added.

An area we would like to research is the identification of authoritative namespaces for datasets. This would make it easier to determine whether a resource is described in an authoritative dataset or whether a dataset hijacks a namespace. It could provide ways to further analyze the quality of links and would also help to define best practices based on de-facto linking.

Acknowledgement. This paper's research activities were funded by grants from the FP7 & H2020 EU projects LIDER (GA-610782), ALIGNED (GA 644055) and FREME (GA644771), Smart Data Web (GA-01MD15010B) and CAPES foundation - Ministry of Education of Brazil (13204/13-0).

REFERENCES

[1] Beek, Rietveld, Bazoobandi, Wielemaker, and Schlobach. LOD Laundromat: A uniform way of publishing other people's dirty data. In ISWC 2014, Lecture Notes in Computer Science, pages 213-228. Springer International Publishing, 2014.

[2] Brümmer, C. Baron, I. Ermilov, Freudenberg, Kontokostas, and S. Hellmann. DataID: Towards Semantically Rich Metadata for Complex Datasets. In Proceedings of the 10th International Conference on Semantic Systems, SEM '14, pages 84-91. ACM, 2014.

[3] A. J. Chris Bizer and Cyganiak. State of the LOD cloud, 2011.

[4] Fetterly, Manasse, Najork, and Wiener. A large-scale study of the evolution of web pages. In Proceedings of the 12th International Conference on World Wide Web, WWW '03, pages 669-678, New York, NY, USA, 2003. ACM.

[5] C. Z. Francois Scharffe, Yanbin Liu. RDF-AI: an architecture for RDF datasets matching, fusion and interlink. IJCAI, 2009.

[6] Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data. In ISWC, pages 305-320. Springer, 2010.

[7] Lehmberg, Meusel, and Bizer. Graph Structure in the Web: Aggregated by Pay-level Domain. In Proceedings of the 2014 ACM Conference on Web Science, WebSci '14, pages 119-128, New York, NY, USA, 2014. ACM.

[8] Maali and Erickson. Data Catalog Vocabulary (DCAT). W3C recommendation, W3C, Jan. 2014.

[9] Mäkelä. Aether - generating and viewing extended VoID statistical descriptions of RDF datasets. In Proceedings of the ESWC 2014 demo track, Springer-Verlag, 2014.

[10] SalahEldeen and M. L. Nelson. Losing my revolution: How many resources shared on social media have been lost? CoRR, abs/1209.3026, 2012.

[11] Schmachtenberg, Bizer, and Paulheim. Adoption of the Linked Data Best Practices in Different Topical Domains. In ISWC 2014, pages 245-260, 2014.