<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards Linked Data Fact Validation through Measuring Consensus</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Shuangyan</forename><surname>Liu</surname></persName>
							<email>shuangyan.liu@open.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">The Open University</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Enrico</forename><surname>Mathieu D'aquin</surname></persName>
							<email>mathieu.daquin@open.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">The Open University</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Enrico</forename><surname>Motta</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">The Open University</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards Linked Data Fact Validation through Measuring Consensus</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C75C99D0E88A69487848E0FF3CE6D7DD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T23:43+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Linked Open Data</term>
					<term>Data Quality</term>
					<term>Fact Validation</term>
					<term>Semantic Similarity</term>
					<term>DBpedia</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the context of linked open data, different datasets can be interlinked, thereby providing rich background knowledge for a dataset under examination. We believe that knowledge from interlinked datasets can be used to validate the accuracy of a linked data fact. In this paper, we present a novel approach for linked data fact validation using linked open data published on the Web. The approach uses owl:sameAs links and a novel predicate similarity matching method to retrieve evidence triples, and computes the confidence score of an input fact as a weighted average over the retrieved evidence triples. We also demonstrate the feasibility of our approach using a sample of facts extracted from DBpedia.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Linked datasets created from unstructured sources are likely to contain factual errors <ref type="bibr" target="#b4">[5]</ref> (e.g. a wrong population number for a country). Measuring the semantic accuracy of linked sources is viewed as one of the challenging dimensions for data quality assessment <ref type="bibr" target="#b7">[8]</ref>. Zaveri et al. defined semantic accuracy as "the degree to which data values correctly represent the real world facts." <ref type="bibr" target="#b7">[8]</ref> A simple example to illustrate this would be: when our search engine returns the state where New York City is located as CA, this is viewed as semantically inaccurate since the state CA does not represent the real world state of NYC, i.e. NY.</p><p>Different approaches were discussed in previous studies <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b4">5]</ref> for linked data semantic accuracy measurement. The DeFacto approach <ref type="bibr" target="#b2">[3]</ref> validated facts by retrieving webpages that contain the actual statement phrased in natural language using search engines and fact confirmation method. Paulheim and Bizer presented in <ref type="bibr" target="#b4">[5]</ref> an algorithm for detecting type incompletion based on the statistical distributions of properties and types, and an algorithm for identifying wrong statements by finding large deviation between actual types of the subject and/or objects and apriori probabilities given by the distribution.</p><p>However, no studies have investigated how to validate linked data facts leveraging the very nature of linked data (via collecting matched evidence triples from other linked sources). This paper presents an approach for RDF facts validation by collecting consensus from other linked datasets. Owl:sameAs links are followed to collect triples describing same real-world entities in other datasets. A predicate matching method is described to collect "equivalent" facts and a consensus measure is presented to quantify the agreement among the sources.</p><p>The rest of the paper is structured as follows. Section 2 presents the details of our approach. The method and results of an experiment with sample facts from DBpedia are described in Section 3. Finally, we conclude in Section 4 and provide an outlook for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Approach</head><p>Subject Links Crawling and Cleaning. The first task addressed in this subsection deals with the process of automatically collecting the resource or subject links equivalent to the subject of the input fact(s). We approach the problem in two steps. Firstly, the values of the property owl:sameAs<ref type="foot" target="#foot_0">1</ref> of the subject of a fact are retrieved. It can be achieved by querying the underlying dataset of the input fact. Secondly, we fetch the equivalent subject links via querying the http://sameas.org service.</p><p>There may be duplicated and non-resolvable subject links in the results obtained via owl:sameAs and the http://sameas.org service. The duplication cases can happen since two separate services are used and the resources that they provide may overlap. It can also be due to the fact that the underlying dataset contains multilingual versions of the same resources and link them together via owl:sameAs. In addition, there are several reasons for non-resolvable subject links. The resources may have been deleted from the underlying dataset while the value of the relevant owl:sameAs property not being updated coordinately. The services of publishing the datasets may be down or have retired.</p><p>The erroneous subject links need to be cleaned before the next task can be performed effectively and efficiently. We follow the following steps for cleaning the errors. First, all subject links are verified by "pinging" the corresponding URIs. If a valid response is received within a given timeout, the subject links are considered as resolvable. Second, duplicated subject links are removed if they have the identical URIs. Finally, multilingual versions of the same resource are removed from the result set.</p><p>In our approach the reliability of the subject links are determined according to the provenance of the subject links, i.e., the methods or services used to retrieve the links, for example, the DBpedia owl:sameAs property and the http: //sameas.org service. Details of how to determine the reliability of the subject links are addressed later. The provenance information of the subject links are retained for calculating the confidence score of an input fact.</p><p>Predicate Links and Objects Retrieving. The next task of fact validation is collecting all triples that use the collected resources as the subject links. This problem cannot be tackled by simply dereferencing the URIs of the collected subject links. <ref type="foot" target="#foot_1">2</ref> There are three reasons. First, not all of the corresponding URIs can be dereferenced such as the URI of the mosquito Aedes vexans. <ref type="foot" target="#foot_2">3</ref> Second, some dereferenceable URIs may not return the real data of the resources since they were redirected to somewhere else, e.g. yago:Borough of Buckingham. <ref type="foot" target="#foot_3">4</ref>Finally, the content types of the representation of the information resources obtained via dereferencing can be different.</p><p>The non-dereferenceable URIs are removed from the set of subject links as a result of performing the subject links cleaning task. For those dereferenceable URIs, a combination of methods are applied to extract the desired predicates and objects, and convert them to a uniform format for performing the subsequent tasks.</p><p>The first method used in our approach is HTTP GET with the resource URI and content negotiation. 
<p>Predicate Links and Objects Retrieving. The next task of fact validation is collecting all triples that use the collected resources as subjects. This problem cannot be tackled by simply dereferencing the URIs of the collected subject links,<ref type="foot" target="#foot_1">2</ref> for three reasons. First, not all of the corresponding URIs can be dereferenced, such as the URI of the mosquito Aedes vexans.<ref type="foot" target="#foot_2">3</ref> Second, some dereferenceable URIs may not return the real data of the resources since they are redirected elsewhere, e.g. yago:Borough of Buckingham.<ref type="foot" target="#foot_3">4</ref> Finally, the content types of the representations obtained via dereferencing can differ.</p><p>The non-dereferenceable URIs are removed from the set of subject links as a result of performing the subject links cleaning task. For the dereferenceable URIs, a combination of methods is applied to extract the desired predicates and objects and convert them to a uniform format for the subsequent tasks.</p><p>The first method used in our approach is HTTP GET with the resource URI and content negotiation. It allows the RDF facts of an information resource to be obtained in most cases, and programming libraries such as the Jena API<ref type="foot" target="#foot_4">5</ref> can be used to extract the desired data from the RDF data. The second method is HTTP GET with a SPARQL query to a dataset endpoint. This method is adopted when the resource URIs cannot return the real data of the resources and there is a SPARQL endpoint associated with the knowledge base. Last but not least, when only dumps of the data are available from a knowledge base, e.g. Wikidata,<ref type="foot" target="#foot_5">6</ref> particular toolkits can be developed to extract the desired data from the dumps.</p>
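<p>As an illustration of the first retrieval method, the sketch below performs an HTTP GET with content negotiation and keeps only the triples describing the resource. It assumes Python with the rdflib library as a stand-in for the Jena API mentioned above; when a URI does not return the real data but the knowledge base exposes a SPARQL endpoint, the same pairs would instead be obtained by querying that endpoint.</p><p><code xml:space="preserve">
# Sketch: collect candidate evidence triples for one equivalent subject link.
from rdflib import Graph, URIRef

def fetch_predicate_objects(subject_uri):
    """HTTP GET with content negotiation (handled by rdflib), keeping
    only the (predicate, object) pairs whose subject is the resource."""
    g = Graph()
    g.parse(subject_uri)  # rdflib negotiates an RDF serialisation
    return list(g.predicate_objects(subject=URIRef(subject_uri)))
</code></p>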
<p>Predicate Similarity Measurement. After completing the aforementioned tasks, a large number of triples whose subjects are equivalent to the subject links of the input facts have been collected. The objective of the next task is selecting the evidence triples whose predicates match the predicates of the input facts.</p><p>We choose to measure predicate similarity based on the semantic similarity between the predicates of the input facts and those of the collected triples. String similarity measures such as the Trigram similarity metric <ref type="bibr" target="#b0">[1]</ref> are not used since they cannot effectively detect predicates that are composed of different words but have the same meaning. For example, the property dbpedia-owl:populationTotal and the property yago:hasNumberOfPeople should be identified as highly related.</p><p>There are a number of semantic relatedness measures available, including Jiang &amp; Conrath <ref type="bibr" target="#b1">[2]</ref>, Resnik <ref type="bibr" target="#b5">[6]</ref>, Lin <ref type="bibr" target="#b3">[4]</ref>, and Wu &amp; Palmer <ref type="bibr" target="#b6">[7]</ref>. They rely heavily on the enormous store of knowledge available in WordNet.<ref type="foot" target="#foot_6">7</ref> The principle of our approach for detecting highly related predicates is to apply a suitable semantic relatedness measure to the predicates of the evidence triples. Our method is based on WS4J,<ref type="foot" target="#foot_7">8</ref> which can generate a matrix of pairwise similarity scores for two input sentences according to a selected semantic relatedness measure; WS4J implements several of the semantic similarity algorithms listed above.</p><p>Many predicates use compound words, such as dbpedia-owl:populationTotal and yago:hasNumberOfPeople. Thus, our method should be able to handle predicates of compound words as well as predicates composed of single words. Our method consists of three parts. First, a compound word splitter is used to transform predicate names into space-separated words (i.e. sentences). Second, a matrix of pairwise similarity scores is generated for the two input sentences by means of WS4J. Finally, formulas are defined to measure the semantic similarity of the input sentences (i.e. the predicates) using the pairwise similarity matrix. Table <ref type="table" target="#tab_0">1</ref> provides an example of the pairwise similarity matrix for the sentences "population Total" and "has Number Of People" (as generated by WS4J). Let r be the number of rows of a similarity matrix and c the number of columns; the scores in the n-th row or column are represented by the sets S_row(n) and S_column(n), respectively. For each word in the shorter sentence (the rows if r ≤ c, the columns if r &gt; c), we choose the maximum score in the row or column where the word lies as the semantic similarity score of that word, denoted W(n). This leads to the following formula:</p><formula xml:id="formula_0">W(n) = \begin{cases} \max(S_{row}(n)) &amp; \text{if } r \leq c \\ \max(S_{column}(n)) &amp; \text{if } r &gt; c \end{cases}<label>(1)</label></formula><p>Moreover, let Φ(W) be the set of similarity scores of the words in the shorter sentence of a similarity matrix, and k the number of values in the set. If any word in the shorter sentence has a similarity value greater than the threshold θ, then the two input sentences may have similar meaning. Thus we define the average of the scores belonging to Φ(W), denoted P, as the semantic similarity score for the two input sentences (i.e. the predicates):</p><formula xml:id="formula_1">P = \frac{\sum_{W \in \Phi(W)} W}{k}, \quad \text{with } \exists W \in \Phi(W) \text{ such that } W &gt; \theta<label>(2)</label></formula><p>If no word in the shorter sentence has a similarity value greater than the threshold θ, then the two input sentences cannot have similar meaning. In this case, the similarity score for the two input sentences is set to zero.</p><p>To obtain the set of matched predicates for the predicate of the input facts, a threshold is applied, e.g., all predicates with P ≥ 0.5 are considered matched predicates.</p>
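<p>The following is a minimal sketch of this predicate matching method, assuming Python with NLTK's WordNet interface in place of WS4J and a simple camel-case splitter in place of the compound word splitter used in our experiment (the function names are illustrative).</p><p><code xml:space="preserve">
# Sketch of the predicate similarity measure (Formulas (1) and (2)).
# Requires: pip install nltk; then nltk.download('wordnet').
import re
from nltk.corpus import wordnet as wn

def split_predicate(name):
    """Naive camel-case splitter: 'populationTotal' -&gt; ['population', 'total']."""
    return [w.lower() for w in re.findall(r"[A-Za-z][a-z]*", name)]

def wup(word1, word2):
    """Best Wu-Palmer similarity over all synset pairs of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(word1) for s2 in wn.synsets(word2)]
    return max(scores, default=0.0)

def predicate_similarity(pred1, pred2, theta=0.8):
    """Formulas (1) and (2): average the per-word maxima taken over the
    shorter sentence; return 0 if no word score exceeds theta."""
    a, b = split_predicate(pred1), split_predicate(pred2)
    if len(a) &gt; len(b):          # make `a` the shorter sentence
        a, b = b, a
    matrix = [[wup(w1, w2) for w2 in b] for w1 in a]
    word_scores = [max(row) for row in matrix]       # Formula (1)
    if not any(w &gt; theta for w in word_scores):
        return 0.0
    return sum(word_scores) / len(word_scores)       # Formula (2)
</code></p><p>Applied to dbpedia-owl:populationTotal and yago:hasNumberOfPeople, the shorter sentence is "population Total"; taking the per-word maxima from Table <ref type="table" target="#tab_0">1</ref> gives P = (0.9091 + 1.0)/2 ≈ 0.95, well above the matching threshold of 0.5.</p>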
<p>Confidence Calculation. As mentioned in the first task above, the reliability of the collected subject links is determined according to their provenance (i.e., owl:sameAs and the http://sameas.org service). A weighting factor is assigned to the subject links of the evidence triples to represent their reliability. The value of a weighting factor ranges from 1 to 5; the greater the value, the more reliable the subject link.</p><p>We define a confidence score for the input fact to represent the degree to which the evidence triples agree with the input fact (or triple). The confidence of the input fact is based on the weighted average of the values of the objects of the evidence triples, represented as γ.</p><p>The values of the objects, denoted ν, are considered to be literal values (either numerical or string). If the objects are of type string, the string similarity scores between the objects of the input facts and those of the evidence triples are used as the values of ν; if the objects are numerical, the numerical values of the objects are used directly. The weight ω is the product of the reliability of the subject link and the similarity of the predicate link of an evidence triple. Additionally, let m be the number of evidence triples collected through the abovementioned tasks. Thus, γ is represented as:</p><formula xml:id="formula_2">\gamma = \frac{\sum_{i=1}^{m} \omega_i \cdot \nu_i}{\sum_{j=1}^{m} \omega_j}<label>(3)</label></formula><p>Formula (<ref type="formula" target="#formula_2">3</ref>) is applied to represent the confidence score of an input fact when the values of the objects of the evidence triples are of type string.</p><p>Furthermore, the following formula is applied to represent the confidence score of the input fact, denoted Γ, when the values of the objects are numerical. In Formula (<ref type="formula" target="#formula_3">4</ref>), x represents the numerical value of the object of the input fact, while γ is the weighted average calculated via Formula (<ref type="formula" target="#formula_2">3</ref>).</p><formula xml:id="formula_3">\Gamma = 1 - \frac{(x - \gamma)^2}{\gamma}<label>(4)</label></formula><p>Based on Formula (<ref type="formula" target="#formula_3">4</ref>), a smaller difference between the numerical value of the object of the input fact and the weighted average value leads to a higher confidence score.</p>
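<p>The confidence computation itself is compact. The following minimal sketch implements Formulas (3) and (4) for numerical object values, assuming Python and assuming each evidence triple has already been reduced to a tuple of its object value, subject link reliability, and predicate similarity (these names are illustrative, not part of our implementation).</p><p><code xml:space="preserve">
# Sketch of Formulas (3) and (4). Each evidence triple is represented as
# (nu, reliability, similarity): its object value, the reliability of its
# subject link (1-5), and its predicate similarity score P.
def weighted_average(evidence):
    """Formula (3): gamma, where each weight is reliability * similarity."""
    numerator = sum(rel * sim * nu for (nu, rel, sim) in evidence)
    denominator = sum(rel * sim for (_nu, rel, sim) in evidence)
    return numerator / denominator

def confidence(x, evidence):
    """Formula (4): confidence of an input fact with numerical object x."""
    gamma = weighted_average(evidence)
    return 1 - (x - gamma) ** 2 / gamma
</code></p>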
</div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiment</head><p>In order to test the feasibility of the approach described in the previous section, we conducted an experiment with a property from DBpedia (dbpedia-owl:populationTotal) and a sample of facts using this property as the predicate. This property was selected since its values are numerical.</p><p>We queried the DBpedia SPARQL endpoint for all towns in Milton Keynes that have a population of more than 10,000. The resulting 18 triples were used as the input facts, and their subjects were used as seeds to crawl equivalent subject links from other knowledge bases.</p><p>The number of subject links retrieved for a single fact ranges from dozens to several hundred. For example, 23 subject links were found for dbpedia:Stantonbury, while 232 were retrieved for dbpedia:Buckingham. After cleaning, the number of subject links is greatly reduced, ranging from a few to several tens.</p><p>We selected a representative resource, dbpedia:Buckingham, to examine the correctness of the subject links cleaning process. A total of 207 noisy subject links were found for this resource, consisting of 172 non-resolvable links and 35 duplicate links. We manually examined the causes of the non-resolvable links and reclassified 56 of the 172 as valid links (Figure <ref type="figure" target="#fig_0">1</ref>). These 56 links had initially been identified as invalid because of a small read timeout set in the tool used for the subject links cleaning process; this allowed us to adjust the timeout to a more suitable value. We also found that the knowledge bases from which the subject links originated provided different data access services. Accordingly, we needed to adopt different methods for retrieving the predicate links and objects from these knowledge bases.</p><p>In addition, the compound word splitter<ref type="foot" target="#foot_8">9</ref> was used in the predicate similarity measurement process to split compound predicate names into sentences. The Wu &amp; Palmer <ref type="bibr" target="#b6">[7]</ref> semantic similarity measure (WUP) was selected since its similarity scores are normalised between 0 and 1. We also tested other measures such as Lin <ref type="bibr" target="#b3">[4]</ref>; the WUP measure demonstrated the highest rate of correctness (threshold θ ≥ 0.8). The distribution of the predicate similarity scores generated is provided in Figure <ref type="figure" target="#fig_1">2</ref>. Furthermore, 45% of the sample facts (i.e. statements about the population of the 18 subjects) were assigned a confidence score and 55% were not (as no evidence triples were found). Figure <ref type="figure" target="#fig_2">3</ref> shows the distribution of the confidence scores generated for the sample facts. 22% of the facts were identified as highly reliable (Γ ≥ 0.9). Two facts were assigned very low confidence scores (0.04 and -68.58).</p><p>We manually examined the causes of the low confidence values and discovered that, for each of these facts, a matched triple had a very large or very small population number, which caused the weighted average of the object values of the evidence triples to be too large or too small. This was due to the subject links of the erroneous triples (retrieved from the sameas.org service) pointing to resources not identical to the subjects of the facts (wrong subject links). We corrected the errors by removing the erroneous triples from the set of evidence triples. This led to a much higher confidence (0.94) for the fact initially scored at 0.04, and to no confidence score being produced for the fact initially scored at -68.58, as no evidence triples remained. Based on this experiment, we plan to extend our approach to detect abnormal evidence triples with "fake" subject links in future work.</p>
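<p>To illustrate this failure mode concretely (with invented numbers, not data from the experiment), the snippet below reuses the confidence sketch from Section 2: a single evidence triple whose wrong subject link contributes a wildly different population value drags the weighted average γ far away from the input value x, and Formula (4) then yields a strongly negative score.</p><p><code xml:space="preserve">
# Hypothetical numbers for illustration only: two agreeing evidence
# triples, then one erroneous triple from a wrong subject link.
good = [(12000, 5, 1.0), (12100, 3, 0.9)]   # (nu, reliability, similarity)
bad = good + [(950000, 2, 0.8)]             # erroneous evidence triple
x = 12050                                   # object value of the input fact
print(confidence(x, good))  # ~0.98: the evidence agrees with the fact
print(confidence(x, bad))   # large negative: one outlier wrecks gamma
</code></p></div>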
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and Future Work</head><p>In this paper, we presented an approach for validating linked data facts using RDF triples retrieved from open knowledge bases. Our approach enables the assessment of the accuracy of facts using the vast interlinked RDF resources on the Web. This would become increasingly important due to the fast growth of LOD on the Web.</p><p>The presented work is still at its early stage, the experiment discussed in this paper focused on testing the feasibility of each component of the presented approach. This can help refine our approach before an evaluation of the approach as a whole is carried out. We are planning to demonstrate that the proposed approach can be applied proficiently to arbitrary predicates, and evaluate the predicate similarity matching method with standard evaluation measures (Precision/Recall) on well-known datasets. Moreover, we are also going to define a gold standard and apply the standard for evaluating our method for validating RDF facts.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Correctness of non-resolvable subject links cleaning for dbpedia:Buckingham with analysis of causes (Total=172)</figDesc><graphic coords="6,224.41,426.47,162.39,84.11" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Distribution of predicate similarity by applying the WUP semantic similarity measure and Formulas (1) and (2)</figDesc><graphic coords="7,220.88,122.68,172.84,103.29" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Confidence of the sample of facts collected from DBpedia</figDesc><graphic coords="7,226.98,363.08,173.81,100.06" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Pairwise semantic similarity matrix for two input sentences.</figDesc><table><row><cell cols="3">has Number of</cell><cell>People</cell></row><row><cell>population 0.0</cell><cell>0.4286</cell><cell cols="2">0.0 0.9091</cell></row><row><cell>Total 0.0</cell><cell>1.0</cell><cell cols="2">0.0 0.3636</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The following namespace conventions are used in this document: owl=http: //www.w3.org/2002/07/owl, dbpedia=http://dbpedia.org/resource/, dbpedia-owl=http://dbpedia.org/ontology/, dbpprop=http://dbpedia.org/ property/, yago=http://yago-knowledge.org/resource/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">According to the W3Cs note on dereferencing HTTP URIs, the act of retrieving a representation of a resource identified by a URI is known as dereferencing that URI, http://www.w3.org/2001/tag/doc/httpRange-14/2007-05-31/HttpRange-14</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://lod.geospecies.org/ses/4XSQO</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://tinyurl.com/mxdkv4s</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://jena.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://www.wikidata.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">http://wordnet.princeton.edu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">https://code.google.com/p/ws4j/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">http://www.lina.univ-nantes.fr/?Compound-Splitting-Tool.html</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Automatic spelling correction using a trigram similarity measure</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Angell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Freund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Willett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="255" to="261" />
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Semantic similarity based on corpus statistics and lexical taxonomy</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Conrath</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Research in Computational Linguistics</title>
				<meeting>International Conference on Research in Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Defacto-deep fact validation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gerber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Morsey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C N</forename><surname>Ngomo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web-ISWC 2012</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="312" to="327" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An information-theoretic definition of similarity</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICML</title>
		<imprint>
			<biblScope unit="volume">98</biblScope>
			<biblScope unit="page" from="296" to="304" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Improving the quality of linked data using statistical distributions</title>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Semantic Web and Information Systems (IJSWIS)</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="63" to="86" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Using information content to evaluate semantic similarity in a taxonomy</title>
		<author>
			<persName><forename type="first">P</forename><surname>Resnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International Joint Conference on Artificial Intelligence</title>
				<meeting>the 14th International Joint Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="1995">1995</date>
			<biblScope unit="page" from="448" to="453" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Verbs semantics and lexical selection</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Palmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd annual meeting on Association for Computational Linguistics</title>
				<meeting>the 32nd annual meeting on Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="133" to="138" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Quality assessment for linked data: A survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zaveri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Maurino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pietrobon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Semantic Web journal</title>
				<imprint/>
	</monogr>
	<note>to appear</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
