
Assessing Trust with PageRank in the Web of Data

José M. Giménez-García (2)
jose.gimenez.garcia@univ-st-etienne.fr

Harsh Thakkar (0)
hthakkar@uni-bonn.de

Antoine Zimmermann (1)
antoine.zimmermann@emse.fr

(0) Enterprise Information Systems Lab, University of Bonn, Germany
(1) Univ Lyon, MINES Saint-Etienne, CNRS, Laboratoire Hubert Curien UMR 5516, F-42023 Saint-Etienne, France
(2) Univ Lyon, UJM-Saint-Etienne, CNRS, Laboratoire Hubert Curien UMR 5516, F-42023 Saint-Etienne, France

Abstract. While a number of quality metrics have been successfully proposed for datasets in the Web of Data, there is a lack of trust metrics that can be computed for any given dataset. We argue that reuse of data can be seen as an act of trust. In the Semantic Web environment, datasets regularly include terms from other sources, and each of these connections expresses a degree of trust in that source. However, determining what a dataset is in this context is not straightforward. We study the concepts of dataset and dataset link, and finally use the concept of Pay-Level Domain to differentiate datasets, considering the usage of external terms as connections among them. Using these connections we compute the PageRank value for each dataset, and examine the influence of ignoring predicates in the computation. This process has been performed for more than 300 datasets, extracted from the LOD Laundromat. The results show that reuse of a dataset is not correlated with its size, and provide some insight into the limitations of the approach and ways to improve its efficacy.

Keywords: linked data, trust, reuse, interlinking, PageRank, metric, assessment

1 Introduction

The WDAqua project (footnote 1) aims to advance the state of the art in data-driven question answering, with a special focus on the Web of Data. The Web of Data comprises thousands of interrelated datasets about varied topics, which contain large quantities of data relevant to answering a question. Nonetheless, in an environment of information published independently by many different actors, data veracity is usually uncertain [17, 19], and there is always the risk of consuming misleading data. While some quality metrics have been proposed that can help to identify good datasets [5], there is a lack of trust metrics to provide confidence in the veracity of the data [23].

1 http://wdaqua.informatik.uni-bonn.de/

In this context, we argue that actual usage of data can be seen as an act of trust. In this paper we focus on reuse of resources by other datasets as a usage metric. We consider reuse of a resource of a dataset by any other given dataset as an outlink from the latter to the former. From this perspective, we can compute the PageRank [18] value of each dataset and rank the datasets according to their reuse. PageRank has been successfully used to obtain trust metrics on individual triples. In order to obtain a good measure of reuse, we perform the process on a large scale. We make use of the tools provided by the LOD Laundromat [20] to go beyond the LOD Cloud, and process more than 38 billion triples, distributed over more than 600 thousand documents. The LOD Laundromat provides data from data dumps collected from the Internet, so it is not limited to dereferenceable linked data. However, what is regarded as a dataset is an important issue when dealing with data dumps. We make use of the concept of Pay-Level Domain (or PLD, also known as Top-Private Domain) to draw a distinction between datasets, and consider the influence of ignoring predicates when extracting outlinks. We group the triples into datasets according to their PLD and compute their PageRank values as a first measure of trust. Finally, we discuss the results and limitations of the approach, suggesting improvements for future work.

This document is organized as follows: in Section 2, we first discuss what should be considered a dataset in our context in order to clarify the problem we address; in Section 3 we present the tools we are using, namely the LOD Laundromat and the PageRank algorithm; Section 4 describes the experiments and results, which we further discuss; Section 5 presents relevant related work; finally, we provide some conclusions and directions for future work in Section 6.

2 Ranking the Web of Data

We would like to assess trust in datasets by measuring their popularity based on the reuse of resources from one dataset in another dataset. To do this, we rely on the PageRank algorithm (presented in more detail in Section 3). To compute PageRank on a set of datasets, it is first necessary to define what is considered a dataset and what is a link between datasets. RDF graphs, although formally defined as sets of triples, can be seen as directed multigraphs in which predicates play the role of arcs. This view suggests that if a triple contains a resource of dataset A as subject, and a resource of dataset B as object, it can be seen as a link from dataset A to dataset B. However, the links formed by arcs in an RDF graph are irrelevant to the notion of dataset linking. In fact, the mere presence of a hyperlink suffices to indicate a link between a source and a destination, and therefore any HTTP IRI in an RDF graph can be seen as a link. So the question is what it means for a resource to belong to a dataset, and to what dataset a hyperlink "points to". A naive approach would be to consider that any IRI existing in a dataset belongs to the dataset and thus that links connect two datasets that share a resource. However, this would imply, for instance, that any dataset containing a triple that uses a DBpedia IRI is considered to be linked to from the DBpedia dataset. As a result, any dataset that reuses a DBpedia IRI would increase its own PageRank under this definition.

Alternatively, we could take advantage of the linked data principles, which stipulate that IRIs should be addresses pointing to a location on the Web. Again, one could naively assume that the location that the address points to is what defines the dataset, that is, the document retrieved when one gets the resource using the HTTP protocol. However, this would lead us, for instance, to define each DBpedia article as an individual dataset.

A second possibility would be to use the domain part of the URL, so that datasets are grouped by publisher. This approach is taken by Ding and Finin [6] to characterize data in the Semantic Web. This way, it would be easy to determine what dataset is being linked to. Such an approach would work well if all datasets were accessible from dereferenceable IRIs. However, large portions of the Web of Data provide access to data dumps only [9, 16]. In this case, the domain of the dump does not necessarily match the domain of the individual IRIs found in the dataset. As an example, the DBpedia dumps are found at http://downloads.dbpedia.org/ while all DBpedia IRIs start with http://dbpedia.org/.

The last approach is to use the concept of PLD, i.e., the subdomain component of a URL followed by a public suffix, to identify a dataset. Datasets are then grouped not necessarily by publisher, but by publishing authority. This approach has already been used by other works [15, 22]. As an example, if a file found at http://download.dbpedia.org/ contains the following triple:

    <http://dbpedia.org/wiki/Europe>
        <http://www.w3.org/2002/07/owl#sameAs>
        <http://sws.geonames.org/6255148/>

we consider that the dataset having the PLD dbpedia.org is linking to the dataset with PLD geonames.org. It is important to notice that the source of the link (dbpedia.org) is obtained from the URL of the document that contains the triple (http://download.dbpedia.org/), not from the subject of the example. This approach enables us to extract outlinks from datasets published as dumps, and therefore to cover the majority of accessible Semantic Web data.

Definition 1 (Dataset). A dataset is a non-empty collection of triples that can be retrieved from sources accessible at URLs having a common Pay-Level Domain. The PLD identifies the dataset.

In the previous example, we see that the predicate IRI is linking to the standard OWL vocabulary. It is very likely that predicates in general will be linking to vocabularies that are extensively reused. However, our intent is to evaluate trust on actual data that can be used to answer questions, and not vocabularies used to describe the data. We predict that extracting outlinks from predicates will lead to higher values for datasets containing only vocabularies. For this reason, we perform the same experiment with and without taking predicates into consideration.

Definition 2 (Dataset link). There exists a link from a dataset A to a dataset B if and only if there exists a triple, in a file at a location having the PLD that identifies A, in which the PLD of the subject, the object, or both matches the PLD that identifies B.

This definition is in line with the PageRank algorithm [18], where the number of links between the same two nodes is irrelevant. Note that since datasets must be non-empty, links to PLDs that do not host RDF have to be ignored. In the next section, we describe the tools that we used in our experiments.
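Definitions 1 and 2 can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, and the set PUBLIC_SUFFIXES is a deliberately tiny stand-in for a real public suffix list, so the PLDs it returns are only reliable for the domains used in the example. Self-links are skipped, and the predicate is only considered when explicitly requested.

```python
from urllib.parse import urlsplit

# ASSUMPTION: a tiny, hand-written stand-in for the real Public Suffix List.
PUBLIC_SUFFIXES = {"org", "com", "net", "eu", "me", "uk", "co.uk"}


def pay_level_domain(url):
    """Return the PLD of an HTTP(S) URL, or None if it has none."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    labels = parts.hostname.lower().split(".")
    # Try the longest candidate suffix first, then keep one more label.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


def dataset_links(document_url, triple, include_predicate=False):
    """Return the (source PLD, target PLD) links implied by one triple.

    The source is taken from the URL of the document that contains the
    triple, not from the subject of the triple.
    """
    source = pay_level_domain(document_url)
    subject, predicate, obj = triple
    terms = [subject, obj] + ([predicate] if include_predicate else [])
    links = set()
    for term in terms:
        target = pay_level_domain(term)  # None for literals and blank nodes
        if source and target and target != source:
            links.add((source, target))
    return links


example = (
    "http://dbpedia.org/wiki/Europe",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://sws.geonames.org/6255148/",
)
links_without_predicates = dataset_links("http://download.dbpedia.org/", example)
links_with_predicates = dataset_links(
    "http://download.dbpedia.org/", example, include_predicate=True
)
```

For the example triple, ignoring the predicate leaves only the link to geonames.org, while including it adds a link to the vocabulary host w3.org.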

3 Preliminaries

In order to provide a realistic assessment of reuse in Linked Open Data, we exploited a large number of datasets by way of the LOD Laundromat (described in Section 3.1), from which we compute the dataset links that form the input of the PageRank algorithm (described in Section 3.2).

3.1 The LOD Laundromat and Frank

The LOD cloud (footnote 2), and Linked Open Data in general, contains a wide variety of formats, publishing schemes, and errors that make it difficult to perform a large-scale evaluation. Yet, to be accurate, our study needs to be comprehensive. Fortunately, the LOD Laundromat [1, 21] makes this data available by gathering dataset dumps from the Web, including archived data. The LOD Laundromat cleans the data by fixing syntactic errors and removing duplicates, and then makes it available through download (either as gzipped N-Triples or N-Quads, or as HDT [10] files), a SPARQL endpoint, and Triple Pattern Fragments [24]. Using the LOD Laundromat is also a better solution than trying to use documents dereferenced from URIs, because most datasets available online are data dumps [9, 16] and are thus not accessible by dereferencing.

Frank [20] is a command-line tool that serves as an interface to the LOD Laundromat and makes it easy to run evaluations against very large numbers of datasets.

3.2 PageRank

2 http://lod-cloud.net/

PageRank [18] is the original algorithm developed by Page et al. that Google uses to rank its search results. It takes advantage of the graph structure of the Web, considering each link from one page to another as a "vote" from the source for the destination. Using the links, the importance of a page is propagated across the graph, dividing the value of a page among its outlinks. This process is repeated until convergence is reached. The final result of PageRank corresponds to a stationary distribution, where the value of each page is the probability that a random surfer is on that page at any given moment.
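The propagation described above can be sketched as a plain power iteration. This is an illustration only, not the implementation used in the experiments: the function name is hypothetical, the damping factor of 0.85 is the commonly used default and is an assumption here, and so is the choice to spread the rank of nodes without outlinks uniformly over all nodes.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration over a collection of (source, target) pairs.

    ASSUMPTIONS: damping=0.85 and uniform redistribution of the rank of
    nodes without outlinks; neither is taken from the paper.
    """
    nodes = sorted({node for link in links for node in link})
    out = {node: set() for node in nodes}
    for source, target in links:
        out[source].add(target)  # a set: repeated links count only once
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outlinks is shared by every node.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out[v]:
                # Each node divides its value equally among its outlinks.
                new[w] += damping * rank[v] / len(out[v])
        converged = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if converged:
            break
    return rank


toy_links = {("a", "c"), ("b", "c"), ("c", "a")}
toy_rank = pagerank(toy_links)
```

In the toy graph, "c" receives votes from both other nodes and ends up ranked first, while "b", which has no inlinks, keeps only the minimum value.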

4 Experiments and Results

The process to compute PageRank involves the following steps, detailed further below and illustrated in Figure 1. The code and results are provided online (footnote 3).

1. Extracting the document list from LOD Laundromat.
2. Parsing the content of each document to extract the outlinks.
3. Consolidating the results.
4. Computing PageRank.

[Figure 1: flowchart of the process. LOD Laundromat -> Extract List of Documents -> List of Documents -> Parse Documents (several instances in parallel) -> Outlinks -> Consolidate Results -> Outlinks -> Compute PageRank -> PageRank Values]

4.1 Extracting the document list from LOD Laundromat

We use the Frank command-line tool [20] to obtain a snapshot of the contents of the LOD Laundromat. While the output of Frank could be piped directly into our process, the next step is performed in parallel on several machines, so every machine needs to read exactly the same input; an update to the contents of the LOD Laundromat while that step is running could otherwise have affected the results. We therefore retrieve the list of documents in the LOD Laundromat once, with the following command.

$ frank documents > documents.dat

This command retrieves a list of pairs (downloadURL-resourceURL), where the first is the URL from which to download the gzipped document, and the second is the resource identifier in the LOD Laundromat ontology. At the time of the experiments, it retrieved 649,855 documents.

3 https://github.com/jm-gimenez-garcia/LODRank

4.2 Parsing the content of each document to extract the outlinks

A prototype tool (footnote 4) has been developed to stream the contents of the documents and extract the outlinks. This tool reads the list of pairs (downloadURL-resourceURL) from standard input, and accepts two optional parameters for partial processing: Step and Start. The first one tells how many lines the process reads in every iteration, processing only the last one, while the second denotes which line to use as the first input. For each line processed, the tool queries the SPARQL endpoint to retrieve the URL from which that document was crawled. This information can be found in the LOD Laundromat ontology, connected either to the resource, if the document was crawled as a single file, or to the archive that contains the document, if it was crawled inside a compressed file, possibly along with other documents. In the first case we retrieve the URL with Query 1, and in the second case with Query 2, where %s is substituted by the resourceURL. The Pay-Level Domain is then extracted from the URL and stored. It is considered the identifier of the dataset.

SELECT ?url WHERE {<%s> <http://lodlaundromat.org/ontology/url> ?url}

Query 1: Query to retrieve crawled URL of a non-archived document

SELECT ?url WHERE {
  ?archive <http://lodlaundromat.org/ontology/containsEntry> <%s> .
  ?archive <http://lodlaundromat.org/ontology/url> ?url
}

Query 2: Query to retrieve crawled URL of an archived document
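The partitioning by Step and Start and the substitution of %s can be sketched as follows. This is one reading of the description above, not the authors' tool: the helper names are hypothetical, Start is assumed to be a zero-based line index, and the resource identifier in the example is made up.

```python
from itertools import islice

QUERY_1 = "SELECT ?url WHERE {<%s> <http://lodlaundromat.org/ontology/url> ?url}"


def select_lines(lines, step=1, start=0):
    """Yield every `step`-th line, beginning at zero-based index `start`.

    ASSUMPTION: with 8 processes, process k would use step=8 and start=k.
    """
    return islice(lines, start, None, step)


def crawl_url_query(resource_url, template=QUERY_1):
    """Fill the %s placeholder of a query template with a resource URL."""
    return template % resource_url


# Hypothetical resource identifier, used only to show the substitution.
example_query = crawl_url_query("http://lodlaundromat.org/resource/abc")
picked = list(select_lines(["l0", "l1", "l2", "l3", "l4", "l5", "l6"], step=3, start=1))
```

With this reading, the processes cover disjoint subsets of the same document list as long as they share the same Step and use different Start values.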

Then, the gzipped file is streamed from the downloadURL and its triples are parsed. The Pay-Level Domain of the subject and of the object (if it is a URI) is extracted and compared against the PLD of the dataset. If the term has a valid PLD that is different from the Pay-Level Domain of its dataset, the pair (datasetPLD-resourcePLD) is stored as an outlink of the dataset. The output for each dataset is stored in a separate file, to which more pairs are appended if another document is identified as belonging to the same dataset.

4 https://github.com/jm-gimenez-garcia/LODRank/tree/master/src/com/chemi2g/lodrank/outlink_extractor

This process makes use of Apache Jena (footnote 5) v3.0.1 to query the SPARQL endpoint of the LOD Laundromat and Google Guava (footnote 6) v19.0 to extract the Pay-Level Domain of the datasets.
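The streaming step can be approximated in a few lines of Python. This is a stand-in for illustration, not the Jena-based tool described above: the function name is hypothetical, and the regular expression is a simplified N-Triples pattern that skips triples whose subject is a blank node and does not validate IRIs.

```python
import gzip
import io
import re

# Simplified N-Triples pattern: IRI subject, IRI predicate, then either an
# IRI object (captured) or anything else (literal or blank node, ignored).
TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|.+?)\s*\.\s*$')


def iter_iri_terms(gz_stream):
    """Yield (subject, predicate, object_or_None) from gzipped N-Triples.

    `gz_stream` is any binary file-like object; lines are decompressed and
    read one at a time, so the whole dump never has to fit in memory.
    """
    with gzip.open(gz_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            match = TRIPLE.match(line)
            if match:
                yield match.groups()


sample = gzip.compress(
    b'<http://dbpedia.org/wiki/Europe> <http://www.w3.org/2002/07/owl#sameAs> '
    b'<http://sws.geonames.org/6255148/> .\n'
    b'<http://dbpedia.org/wiki/Europe> '
    b'<http://www.w3.org/2000/01/rdf-schema#label> "Europe"@en .\n'
)
rows = list(iter_iri_terms(io.BytesIO(sample)))
```

The object of the second sample triple is a literal, so it is reported as None and contributes no outlink.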

In the experiments, the process was launched in parallel on 8 virtual machines using Google Cloud Platform (footnote 7) free trial resources, each one processing a different subset of the list downloaded in the previous step. A statistical description of the results of each process, with and without considering predicates, is given in Table 1. "Documents" corresponds to the number of dump files in the LOD Laundromat, while "Datasets" is the number of PLDs that the process deals with. The datasets of several processes can overlap, so the total number of datasets is not equal to the sum. We can see that the number of triples handled by each process is not proportional to the number of documents processed.

Process  Documents  Triples        Datasets (with predicates)  Datasets (without predicates)
1        81,220     3,994,446,393  135                         121
2        81,226     3,742,870,561  137                         118
3        83,422     4,146,249,367  140                         127
4        81,225     3,376,784,600  135                         120
5        81,225     3,623,413,245  142                         120
6        88,198     3,377,773,585  131                         116
7        81,226     4,132,960,522  137                         115
8        89,781     3,911,917,919  134                         123

Table 1: Data extracted from the LOD Laundromat by each process

4.3 Consolidating the results

Once the outlinks have been extracted, the different files have to be appended and duplicates removed using a simple tool (footnote 8). In the experiments, the data from each virtual machine was downloaded into a separate folder on a single machine. Then, files with the same name in each folder were concatenated and their duplicates removed. The total number of datasets after consolidating the results is 412 when considering predicates, and 319 when not. The result was then concatenated into a single file.

5 https://jena.apache.org/

6 https://github.com/google/guava

7 https://cloud.google.com/

8 https://github.com/jm-gimenez-garcia/LODRank/tree/master/src/com/chemi2g/lodrank/duplicate_remover
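The consolidation step described in this section amounts to a merge with duplicate removal. A minimal sketch follows, assuming one folder per virtual machine, one text file per dataset, and one outlink pair per line; the function name and the file layout are assumptions, not a description of the tool in footnote 8.

```python
from collections import defaultdict
from pathlib import Path


def consolidate(folders, output_dir):
    """Merge same-named files from several folders, dropping duplicate lines.

    Returns the number of distinct lines kept per file name.
    """
    merged = defaultdict(set)
    for folder in folders:
        for path in Path(folder).iterdir():
            if path.is_file():
                with path.open(encoding="utf-8") as handle:
                    # Blank lines are ignored; a set removes duplicates.
                    merged[path.name].update(
                        line.strip() for line in handle if line.strip()
                    )
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, lines in merged.items():
        (out / name).write_text("\n".join(sorted(lines)) + "\n", encoding="utf-8")
    return {name: len(lines) for name, lines in merged.items()}
```

Holding each dataset's pairs in a set is enough here because Definition 2 only asks whether a link exists, not how many times it occurs.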

4.4 Computing PageRank

For PageRank computation we make use of the igraph R package [4]. The ordered PageRank values for all datasets can be seen in Figure 2 and Figure 3, on a logarithmic scale. The complete list of results is published online (footnote 9). We can see that in both cases the values of the top-ranked datasets are much higher than the rest; the slope then becomes more regular until it reaches a plateau at the end, with a minimum value shared by several datasets that have no inlinks at all. Tables 2 and 3 show the 10 highest-ranked datasets.

Discussion. Here we provide additional information about the datasets, especially the top-ranked ones, in order to understand how ranking correlates with other statistical values, such as the number of triples and the number of documents. We also discuss how our own choices influenced the results.

The datasets appearing in the top 10 list are generally not surprising, with the only exception of holygoat.co.uk, the only domain in the top 10 owned by an individual person, Richard Newman, a computer scientist who wrote several ontologies in the early days of the Semantic Web. This is even more remarkable considering that the dataset has only 7 inlinks. The reason is that rdfs.org includes resources from holygoat.co.uk. Because rdfs.org has only 2 outlinks, half of its PageRank score is forwarded to holygoat.co.uk, which accounts for 96% of the PageRank value of holygoat.co.uk.
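The arithmetic behind this observation can be made explicit. As a sketch, assuming the usual damped PageRank recurrence with damping factor d, N datasets, and L(u) the number of outlinks of dataset u (the value of d used in the experiments is not stated here, so the "half" mentioned above holds up to that factor):

```latex
PR(v) = \frac{1-d}{N} + d \sum_{u \rightarrow v} \frac{PR(u)}{L(u)},
\qquad
L(\texttt{rdfs.org}) = 2
\;\Longrightarrow\;
\text{share received by } \texttt{holygoat.co.uk} \text{ from } \texttt{rdfs.org}
= d \cdot \frac{PR(\texttt{rdfs.org})}{2}.
```

Under this reading, a dataset with few inlinks can still rank highly when one of them comes from a highly ranked dataset that has very few outlinks.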

As predicted, when predicates are included the first positions contain more datasets about vocabularies. When the predicates are removed, w3.org, xmlns.com, schema.org, and ogp.me no longer appear in the top positions, and datasets with factual data move upwards. lodlaundromat.org seems to appear when considering predicates because the LOD Laundromat adds information about the cleaning process when processing the data. While not an optimal solution (considering that purl.org and rdfs.org are still in the top positions), ignoring the predicates proves to be a simple but useful technique.

We used two queries (Query 3 and Query 4) to obtain the number of documents and triples for each PLD from the LOD Laundromat.

PREFIX llo: <http://lodlaundromat.org/ontology/>
PREFIX ll: <http://lodlaundromat.org/resource/>
SELECT (COUNT(DISTINCT ?resource) AS ?count)
WHERE {
  {
    ?resource llo:url ?url
    FILTER regex(?url, "[^/\\.]*\\.?%s/", "")
  } UNION {
    ?archive llo:containsEntry ?resource ;
             llo:url ?url
    FILTER regex(?url, "[^/\\.]*\\.?%s/", "")
  }
}

Query 3: Query to retrieve the number of documents per dataset

The results of the queries are given in Table 4 for all the datasets that appear in the top 10 of both experiments.

As we can see, popularity is not at all correlated with the size of the datasets. Indeed, a number of the top ten datasets have fewer than 200 triples, while dbpedia.org and europa.eu both have billions of triples.

The enormously high PageRank of purl.org should be put in perspective by the fact that purl.org does not actually host any data. It is a redirecting service that many data publishers use. This result highlights a drawback in our heuristic for identifying datasets: a PLD does not always refer to a single dataset. To overcome this particular case, we could consider the PLD of the URL of the document obtained after dereferencing the IRI.

PREFIX llo: <http://lodlaundromat.org/ontology/>
PREFIX ll: <http://lodlaundromat.org/resource/>
SELECT (COUNT(DISTINCT ?resource) AS ?count) (SUM(?triples) as ?sum)
WHERE {
  {
    ?resource llo:url ?url ;
              llo:triples ?triples
    FILTER (?triples > 0)
    FILTER regex(?url, "[^/\\.]*\\.?%s/", "")
  } UNION {
    ?archive llo:containsEntry ?resource ;
             llo:url ?url .
    ?resource llo:triples ?triples
    FILTER (?triples > 0)
    FILTER regex(?url, "[^/\\.]*\\.?%s/", "")
  }
}

Query 4: Query to retrieve the number of documents with triples and number of triples

Another possible drawback of the approach is that triples with rdf:type in predicate position have their object pointing to a class in an ontology. This is in contradiction with our remark in Section 2, where we say that we want to rank instance data rather than terminological knowledge. This can have a major impact on the results, since purl.org is used to redirect to vocabularies more often than to datasets, and rdfs.org only hosts ontologies.

9 https://github.com/jm-gimenez-garcia/LODRank/tree/master/results

5 Related work

The authors of the Semantic Web Search Engine (SWSE) [15] strongly advocate that a ranking mechanism is crucial for prioritizing data elements in the search process. Their work is inspired by the Google PageRank algorithm, which treats hyperlinks to other pages as a positive score. The PageRank algorithm is targeted at hyperlinked documents, and its adaptation to LOD is non-trivial, as we have seen. They point out that the primary reason for this is that LOD datasets may not have direct hyperlinks to other datasets but rather, in most cases, make use of implicit links to other web pages via the reuse of dereferenceable URIs. Here the unit of search becomes the entity and not the document itself. The authors briefly reintroduce the concept of naming authority from their previous work [13] in order to rank structured data from an open distributed environment. They assume that the naming authority should match the Pay-Level Domain, so that PageRank is computed on a naming authority graph whose nodes are PLDs. Their intuition is therefore in accordance with our reasoning in Section 2. They discuss and contrast the interpretation of naming authorities at the document level (e.g. http://www.danbri.org/foaf.rdf) and at the PLD level (danbri.org). They also make use of a generalization of the method discussed in [8] for ranking entities, and carry out link analysis at the PLD abstraction layer.

The authors of Swoogle [7] developed the OntoRank algorithm in order to rank documents. OntoRank, a variation of Google PageRank, is an iterative algorithm that calculates ranks for documents based on references to terms (i.e., classes and properties) that are defined in other documents.

In [3], the authors calculate the rank of entities (or objects, as they call them) based on the logarithm of the number of documents in which that particular object is mentioned.

Guéret et al. [11] present LinkQA, an extensible framework for assessing the quality of linked data mappings using network measures. They assess the degree of interlinking of datasets using five network measures, of which two are specifically designed for Linked Data (namely, Open Same-As chains and Description Richness) and the other three are standard network measures (namely, degree, centrality, and the clustering coefficient), in order to assess variation in the quality of the overall linked data with respect to a certain set of links.

In [2], PageRank is used to compute a measure that is in turn associated with individual statements in datasets for the purpose of incorporating trust in reasoning. Therefore, as in our own approach, they consider PageRank to be an indication of trustworthiness. However, they only compute PageRank on a per-document basis, and report the PageRank values of the top 10 documents obtained from their web crawl.

6 Conclusion & Future work

Data-driven question answering, the aim of the WDAqua project mentioned in the introduction of this paper, requires quality data that one can trust. Our aim has been to provide insight into how a trust measure can be based on dataset interlinking. To that end, we consider Pay-Level Domains as identifiers of unique datasets and compute PageRank on them. Our results show that the design choices greatly affect the results. Whether or not predicates are taken into account for outlink extraction impacts how vocabularies are ranked, and the choice of PLD as the definition of a dataset seems questionable, as some PLDs group many data dumps. To improve this, we could associate well-known datasets with IRI patterns, such as it.dbpedia.org for the Italian version of DBpedia.

In addition, we also intend to explore further applications of PageRank that may be useful for question answering. User interaction that provides trust values for a number of datasets could be used to compute PageRank values with those datasets as a teleport set, as suggested by Gyöngyi et al. [12]. Also, Topic-Sensitive PageRank [14] could help a question-answering system to select different datasets when a question is identified as belonging to a specific topic.
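The teleport-set idea can be sketched as a small variation of the power iteration in which the random surfer only jumps to trusted datasets. This is a hypothetical illustration of the future work, not something evaluated in this paper: the function name, the binary notion of trust, the damping factor of 0.85, and the fixed number of iterations are all assumptions.

```python
def personalized_pagerank(links, trusted, damping=0.85, iterations=100):
    """PageRank whose teleport distribution is uniform over `trusted` only.

    ASSUMPTIONS: binary trust, damping=0.85, fixed iteration count.
    """
    seeds = set(trusted)
    if not seeds:
        raise ValueError("the teleport set must not be empty")
    nodes = sorted({node for link in links for node in link} | seeds)
    out = {node: set() for node in nodes}
    for source, target in links:
        out[source].add(target)
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank of nodes without outlinks goes back to the teleport set.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {
            v: (1.0 - damping + damping * dangling) * teleport[v] for v in nodes
        }
        for v in nodes:
            for w in out[v]:
                new[w] += damping * rank[v] / len(out[v])
        rank = new
    return rank


trust_links = {("a", "b"), ("b", "c"), ("x", "y")}
trust_rank = personalized_pagerank(trust_links, trusted={"a"})
```

In the toy graph, "x" and "y" cannot be reached from the trusted dataset "a", so they receive no score at all, whereas plain PageRank would still give them the minimum value.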

Finally, this work is part of a broader objective that we want to pursue: to ascertain the relationship between the perceived trust in a dataset and its objective quality. We will explore this area in future work, where other data reuse metrics will be considered and compared against different quality metrics.

Acknowledgement

This project is supported by funding received from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 642795. We would like to thank Elena Simperl, whose idea jumpstarted the project that led to this article, and also Elena Demidova, Kemele Endris, and Christoph Lange for the useful discussions related to it.

References

[1] Beek, W., Rietveld, L., Bazoobandi, H.R., Wielemaker, J., Schlobach, S.: LOD laundromat: A uniform way of publishing other people's dirty data. In: The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I. pp. 213-228 (2014)

[2] Bonatti, P.A., Hogan, A., Polleres, A., Sauro, L.: Robust and scalable Linked Data reasoning incorporating provenance and trust annotations. Journal of Web Semantics 9(2), 165-201 (2011)

[3] Cheng, G., Qu, Y.: Searching Linked Objects with Falcons: Approach, Implementation and Evaluation. International Journal of Semantic Web and Information Systems 5(3), 49-70 (2009)

[4] Csárdi, G., Nepusz, T.: The igraph software package for complex network research. InterJournal, Complex Systems 1695(5), 1-9 (2006)

[5] Debattista, J., Londoño, S., Lange, C., Auer, S.: Quality assessment of linked datasets using probabilistic approximation. In: Gandon, F., Sabou, M., Sack, H., d'Amato, C., Cudré-Mauroux, P., Zimmermann, A. (eds.) The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015. Proceedings. Lecture Notes in Computer Science, vol. 9088, pp. 221-236. Springer (2015), http://dx.doi.org/10.1007/978-3-319-18818-8_14

[6] Ding, L., Finin, T.: Characterizing the semantic web on the web. In: International Semantic Web Conference. Lecture Notes in Computer Science, vol. 4273, pp. 242-257. Springer (2006)

[7] Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V., Sachs, J.: Swoogle: a search and metadata engine for the semantic web. In: Proceedings of the thirteenth ACM international conference on Information and knowledge management. pp. 652-659. ACM (2004)

[8] Ding, L., Pan, R., Finin, T., Joshi, A., Peng, Y., Kolari, P.: Finding and ranking knowledge on the semantic web. In: The Semantic Web - ISWC 2005, pp. 156-170. Springer (2005)

[9] Ermilov, I., Martin, M., Lehmann, J., Auer, S.: Linked open data statistics: Collection and exploitation. In: Knowledge Engineering and the Semantic Web - 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings. pp. 242-249 (2013)

[10] Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Binary RDF Representation for Publication and Exchange (HDT). Journal of Web Semantics (2013), http://dataweb.infor.uva.es/wp-content/uploads/2013/01/jws2013.pdf

[11] Guéret, C., Groth, P.T., Stadler, C., Lehmann, J.: Assessing linked data mappings using network measures. In: The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27-31, 2012. Proceedings. pp. 87-102 (2012)

[12] Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.O.: Combating web spam with trustrank. In: (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004. pp. 576-587 (2004)

[13] Harth, A., Kinsella, S., Decker, S.: Using naming authority to rank data and ontologies for web search. In: The Semantic Web - ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25-29, 2009. Proceedings. pp. 277-292 (2009)

[14] Haveliwala, T.H.: Topic-sensitive pagerank. In: Proceedings of the Eleventh International World Wide Web Conference, WWW 2002, May 7-11, 2002, Honolulu, Hawaii. pp. 517-526 (2002)

[15] Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine. Journal of Web Semantics 9(4) (2011)

[16] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S.: An empirical survey of Linked Data conformance. Journal of Web Semantics 14, 14-44 (2012)

[17] Liu, S., d'Aquin, M., Motta, E.: Towards linked data fact validation through measuring consensus. In: Rula, A., Zaveri, A., Knuth, M., Kontokostas, D. (eds.) Proceedings of the 2nd Workshop on Linked Data Quality co-located with 12th Extended Semantic Web Conference (ESWC 2015), Portorož, Slovenia, June 1, 2015. CEUR Workshop Proceedings, vol. 1376. CEUR-WS.org (2015), http://ceur-ws.org/Vol-1376/LDQ2015_paper_04.pdf

[18] Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. (1999), http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

[19] Paulheim, H., Bizer, C.: Improving the quality of linked data using statistical distributions. Int. J. Semantic Web Inf. Syst. 10(2), 63-86 (2014)

[20] Rietveld, L., Beek, W., Schlobach, S.: LOD lab: Experiments at LOD scale. In: The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II. pp. 339-355 (2015)

[21] Rietveld, L., Verborgh, R., Beek, W., Sande, M.V., Schlobach, S.: Linked data-as-a-service: The semantic web redeployed. In: The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015. Proceedings. pp. 471-487 (2015)

[22] Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C.A., Vrandečić, D., Groth, P.T., Noy, N.F., Janowicz, K., Goble, C.A. (eds.) The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I. Lecture Notes in Computer Science, vol. 8796, pp. 245-260. Springer (2014), http://dx.doi.org/10.1007/978-3-319-11964-9_16

[23] Thakkar, H., Endris, K.M., Giménez-García, J.M., Debattista, J., Lange, C., Auer, S.: Are linked datasets fit for open-domain question answering? A quality assessment (2016)

[24] Verborgh, R., Sande, M.V., Colpaert, P., Coppens, S., Mannens, E., de Walle, R.V.: Web-scale querying through linked data fragments. In: Bizer, C., Heath, T., Auer, S., Berners-Lee, T. (eds.) Proceedings of the Workshop on Linked Data on the Web co-located with the 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, April 8, 2014. CEUR Workshop Proceedings, vol. 1184. CEUR-WS.org (2014), http://ceur-ws.org/Vol-1184/ldow2014_paper_04.pdf