Evaluating Class Assignment Semantic Redundancy on Linked Datasets Leandro Mendoza Alicia Dı́az CONICET, Argentina LIFIA, Facultad de Informática, LIFIA, Facultad de Informática, UNLP, Argentina UNLP, Argentina Abstract dancy is related with the concept of extensional conciseness which has been defined in (Zaveri et In this work we address the concept of se- al., 2015) as “the case when the data does not mantic redundancy in linked datasets con- contain redundant objects at instance level”. In sidering class assignment assertions. We this scenario, redundancy means that a resource is discuss how redundancy can be evaluated specified as member of a class when it is not nec- as well as the relationship between redun- essary, either because the information is explicitly dancy and three class hierarchy aspects: duplicated or because it can be derived from infor- the number of instances a class has, num- mation that already exists. Current works that have ber of class descendants and class depth. dealt with semantic redundancy on linked datasets Finally, we performed an evaluation on the implement algorithms based on graph pattern dis- DBpedia dataset using SPARQL queries covering techniques. In contrast, our work pro- for data redundancy checks. poses a simplified approach based on SPARQL queries and considering class assignment asser- 1 Introduction tions. We discuss how redundancy can be eval- The amount of interlinked knowledge bases built uated and perform an evaluation over the DBpe- under Semantic Web technologies and following dia dataset (Lehmann et al., 2015) in order to un- the linked data (Heath and Bizer, 2011) princi- derstand the relationship between redundancy and ples has increased significantly last years. These three class hierarchy aspects: the number of in- knowledge bases (also known as linked datasets) stances a class has, the class depth and its number contain information that associates Web entities of descendants. This approach may be useful for (called resources) with well-defined semantics linked data users who need to measure semantic that specifies how these entities should be in- data redundancy in a practical way, understand its terpreted. In most linked datasets a substantial origin and detect when it may be useful (e.g. to amount of data corresponds to class assignment improve performance) or when it can affect nega- assertions, that is, information that specifies re- tively the knowledge base (e.g. misuse of classes sources (or individuals) as instances of certain when typifying resources). The following sections classes. In this sense, resources are typified us- are organized as follows: sections 2 and 3 give ing classes usually defined through ontologies and some background definitions and related work, re- organized into class hierarchies. Several differ- spectively. Section 4 introduces the redundancy ent ontologies can be combined to classify re- definition adopted and discusses some of the alter- sources within the same dataset giving rise to huge natives to address it on linked datasets. Section 5 and complex interlinked structures that can suffer shows the evaluation results. Finally, some con- from data quality problems (Hogan et al., 2010). clusions and further work are given in section 6. Thus, the use of practical mechanisms to handle knowledge conciseness becomes increasingly im- 2 Background portant to improve the overall dataset quality and In the linked data context, datasets are knowl- the study of redundancy on class assignments as- edge bases described using the RDF1 data model sertions aims to contribute in this way. From a data quality perspective, class assignment redun- 1 https://www.w3.org/RDF/ 103 and published following the linked data princi- analysis method to identify graph patterns that can ples2 . These datasets are collection of assertions be used to remove redundant triples and calculate about resources specified following the “subject the volume of semantic redundancy. In (Joshi et predicate object” pattern. Assertions are RDF al., 2013) authors employ frequent itemset (fre- triples and resources may be anything identifi- quent pattern) mining techniques to generate a able by an HTTP URI. Knowledge representation set of logical rules to compress RDF datasets and mechanisms like RDFS3 and OWL4 extend RDF then use these rules during decompression. Both and allow datasets to be augmented with more ex- works mention the idea of semantic compression pressive semantics. For example, it is possible by removing derivable knowledge. Regarding the to describe ontologies by specifying classes and use of SPARQL for quality assessment, (Fürber relationships between them (e.g. “SoccerPlayer and Hepp, 2010) and (Kontokostas et al., 2014) rdfs:subClassOf Atlhete”) and to specify re- use query templates to detect some quality prob- sources as member of those classes (e.g. “Li- lems but semantic redundancy is not included. onel Messi rdf:type SoccerPlayer”). From an Inspired on the ideas of these works, we use a overall perspective, information contained in these SPARQL query oriented approach to evaluate re- datasets can be split into two levels: schema level dundant class assignments and make this informa- and instance level. Schema level refers to ter- tion explicit to users. minological knowledge (known as TBox), for ex- ample, classes, properties and their relationships. 4 Redundant class assignments On the other hand, instance level refers to asser- tional knowledge (known as ABox), that is, propo- As we know, linked datasets are basically sets sitions about entities of a specific domain of in- of RDF triples and knowledge is specified using terest. An important type of assertional knowl- mechanisms provided by RDFS and OWL, each edge corresponds to class assignments, that is, one with its own well-defined semantics (Hitzler et RDF statements of the form “resource rdf:type al., 2009). In this way, schema and instance level class” used to specify resources as members of assertions can be considered as propositions to for- certain classes. The most common way to retrieve mally describe the notion of derivable knowledge. this information from linked datasets is through a For example, the notation {p1 , p2 } |= {p3 , p4 } SPARQL5 endpoint. These endpoints are web ser- (where |= is called entailment relation) states that vices that accept SPARQL queries and return in- propositions p3 and p4 (also p1 and p2 ) are log- formation that match with a given pattern. In this ical consequences of propositions p1 and p2 ob- work we will use this mechanism to detect redun- tained under a certain set of rules (logic). Con- dant class assignments. sidering this, the concept of data redundancy can be associated to what is known in mathematical 3 Related Work logic literature as independence, that is, the abil- ity to deduce a proposition from other proposi- In the linked data literature, redundancy is re- tions. Formally, given a logic L (semantics) and a lated with the data quality dimension of concise- set of propositions P , it is defined as independent ness (Zaveri et al., 2015) and has been studied if for all proposition pi 2 P does not hold that and categorized from syntactic to semantic and {P pi } |= pi . In this way, a non-independent from schema to instance levels (Pan et al., 2014). set of propositions can be considered redundant From a syntactic perspective most of the existing since it contains extra information that may not be compression techniques focus on RDF serializa- necessary because if it is removed from the initial tion. On the other hand, from a semantic perspec- set, it can be obtained from the remaining proposi- tive, just a few works addressed redundancy. In tions applying an inference mechanism and keep- (Wu et al., 2014) authors propose a graph based ing the same logical consequences. Similarly, an 2 independent set of propositions can be considered https://www.w3.org/DesignIssues/ LinkedData.html non-redundant. 3 https://www.w3.org/TR/rdf-schema/ As we mentioned in section 2, class assign- 4 https://www.w3.org/standards/techs/ ments are instance level assertions (or proposi- owl\#w3c\_all 5 https://www.w3.org/TR/ tions) that specify resources as members of cer- rdf-sparql-query/ tain classes. Thus, given a resource r, its class as- 104 signment set (CASr ) contains all the propositions get all the ancestors (or depth) of a given class (as that specify the classes to which r belongs. The shows example of listing 3). idea of the previous paragraph can be applied to class assignments to define semantic redundancy SELECT DISTINCT ?c since the non-redundant class assignment set of WHERE { a resource r (N RCASr ) is the independent set of ?c rdf:subClassOf* CASr . Then, the redundant class assignment set } (RCASr ) can be considered as the difference of those sets. In the following subsections, we dis- Listing 2: SPARQL query to get class descendants cuss some techniques that can be used to compute N RCASr on linked datasets. Then, we will use one of these techniques to perform our evaluation. SELECT DISTINCT ?c 4.1 Using SPARQL queries WHERE { rdf:subClassOf* ?c Using SPARQL, a simple query can be imple- } mented to get the N RCAS. For example, query in listing 1 can be used to get the non-redundant set of classes of a given resource (specified by re- Listing 3: SPARQL query to get class ancestors source URI). It is important to highlight that the performance SELECT DISTINCT ?c of SPARQL queries depends on its implemen- WHERE { tation and the dataset size. Although complex rdf:type ?c SPARQL queries can become unacceptably slow FILTER regex(str(?c),"ont_URI","i") when working with large amounts of data, it is FILTER NOT EXISTS { currently the most practical mechanism to access rdf:type ?sc . FILTER regex(str(?sc),"ont_URI","i") linked datasets. ?sc rdfs:subClassOf ?c } 4.2 Using graph based algorithm and } reasoners Given a class hierarchy, a resource r and its CASr , Listing 1: SPARQL query example to get non- a way to compute redundancy is by interpreting redundant class assignments the class hierarchy as a directed acyclic graph “G” in which each node is a class and each edge is Note that the mentioned query example consid- the relation rdfs:subClassOf. A node “A” ers only one ontology (filtered by ontology URI) of “G” can be considered a class if there exist a and does not implement any inference mechanism triple with the form “A rdfs:subClassOf x”, at instance or schema level. This means that the “x rdfs:subClassOf A” or “x rdf:type query will work while all class assignments and A”. Then, a class “B” is subclass of a class “A” relationships between the involved classes will be if node “A” is reachable from node “B” in “G”, specified explicitly on the dataset. If this is not the that is, if exists a path between B and A in the case and a transitive closure of sub/super classes graph. Considering this, given a proposition set is needed, it is necessary to implement an algo- Q that specifies the classes to which an instance i rithm that iterates recursively over these queries belongs, a naive algorithm can be implemented to until it gets the required classes. Using SPARQL compute a non-redundant proposition set R: first property paths (e.g. rdfs:subClassOf* or set R=Q, then for each element q in Q check if rdfs:subClassOf+) it is possible to check there is a path from some of the remaining propo- connectivity of two classes by an arbitrary length sitions in Q to q, if so, q is deleted from R. Finally, path (route through a graph between two graph the algorithm returns R which is then the non re- nodes)6 . For example, it can be used to get all dundant class assignments set of r. Removing the classes that are descendant (or subclasses) of redundancies can be associated with the problem a given class (as shows example of listing 2) or to known as transitive reduction (Aho et al., 1972) 6 https://www.w3.org/TR/sparql11-property-paths/ which has an unfortunate complexity class if it is 105 implemented naively. If the given graph is a fi- further analysis was done per class groups but nite directed acyclic graph current approaches that only considering the DBpedia class hierarchy in solve this problem are close to the upper bound order to keep the number of classes manageable. O(n2.3729 ) but if the graph has cycles the problem For each class group, we analyzed the relationship belongs to NP-hard class. between redundant class assignments (RCA) and Another alternative to compute redundancy is three class hierarchy characteristics: class depth, by using Semantic Web reasoners that have been class descendants and number of class assign- implemented based on decidable fragments of ments per class. RDFS and OWL semantics. These tools imple- ment inference mechanisms that can be used to 5.1 Overall redundancy evaluation deduce if a resource belongs to a given class (in- The first overall evaluation was made by retriev- stance checking). The main advantage of using ing all resources that belong to some class of reasoners is that the potential of the underling se- the DBpedia ontology (6,729,604 resources of mantics can be exploited (e.g. several ontologies 453 classes) and then we did the same with the can be combined to get implicit knowledge). On YAGO ontology (2,886,306 resources of 369,144 the other hand, the disadvantage of using these classes). For each resource we compute its CAS, tools in linked data scenario is its complexity: in- its N RCAS and its RCAS (see section 4) con- ference techniques work well for small examples sidering both ontologies separately. Information with limited knowledge but they turn unacceptably about resources and its CAS and N RCAS were slow for large-scale datasets. Besides, when multi- obtained through SPARQL queries (see section ples ontologies are combined, inconsistencies can 4.1) and RCAS was obtained by computing the arise affecting the inference process and hamper- difference CAS N RCAS. Results can be ing the detection of redundant propositions. viewed in table of figure 1. Each element of each CAS was counted as a different class assign- 5 Evaluation ment (nbCA column), each element of N RCA To perform our evaluation we selected the En- was counted as a non-redundant class assignment glish version of DBpedia7 and set up a local mir- (nbNRCA column) and each element of RCA was ror using a Virtuoso8 server (version 7.2) . The counted as a redundant class assignment (nbRCA mechanisms implemented to compute redundant column). As we can see in chart of figure 1, class assignment avoid the use of complex graph considering classes of the DBpedia ontology al- based algorithms or RDFS/OWL reasoners and most half class assignments are redundant. On the use a SPARQL query oriented approach (see sec- other hand, considering the YAGO ontology 80% tion 4.1). Although resources in DBpedia are clas- of class assignments are redundant. In the latter sified using several classes of different schema we case, the amount of class assignments is higher only considered the DBpedia9 core and YAGO10 and the amount of concepts in the class hierar- ontologies because information about the involved chy increases considerably. These results sug- class hierarchies (subclass relationships) can be gest a relationship between the number of classes, obtained directly from queries through the dataset class assignments and redundancy: as the number SPARQL endpoint. DBpedia ontology is a shal- of classes and class assignments increases, so the low cross-domain ontology that covers more than probability of redundancy. 600 classes and was created based on Wikipedia 5.2 Redundancy and class depth infoboxes. YAGO is a taxonomy used in the To analyze the relationship between redundant YAGO knowledge base that currently covers more class assignments and class depth we categorized than 350,000 classes. The evaluation is organized classes into groups from 0 to 6 according to their in the next subsections as follows: we first per- depth in the DBpedia class hierarchy (the distance formed an overall redundancy evaluation consid- from the root to that class) and then we count ering DBpedia and YAGO ontologies and then a how many class assignments refer to those classes. 7 http://wiki.dbpedia.org/Downloads2015-04 Classes with depth 0 are the most general and 6 8 http://virtuoso.openlinksw.com/ 9 is the max depth found in the class hierarchy. A http://wiki.dbpedia.org/services-resources/ontology 10 https://www.mpi-inf.mpg.de/departments/databases- class assignment refers to (or belongs to) a class C and-information-systems/research/yago-naga/yago/ if it is a triple of the form (resource rdf:type 106 column) of depth 0 (Depth column), 5,037,966 class assignments (nbCA column) that refer to those classes and 3,940,920 of them are redundant (nbRCA column). Examples of classes that belong to that group are Agent, Place, Work, etc. Figure 1: DBpedia and YAGO redundancy evalu- ations C). To compute the depth of a class we used a SPARQL query to count the number of ancestors (see section 4.1 listing 3). Results can be viewed in table 1, chart of figure 2 shows the relationship between the class depth and the percentage of re- Figure 3: RCA distribution considering the class dundant class assignments (%RCA) and chart of depth figure 3 shows how these redundant class assign- ments are distributed. Depth nbClasses nbCA nbRCA 0 32 5,037,966 3,940,920 5.3 Redundancy and class descendants 1 86 3,941,230 2,877,434 2 110 5,385,831 1,156,327 3 180 1,154,660 118,886 To analyze the relationship between redundant 4 39 118,891 5,777 class assignments and class descendants we cat- 5 5 5,777 0 6 1 889 0 egorized classes into 10 groups according to the number of descendants that they have. To compute Table 1: Redundancy and class depth evaluation. the descendants we used a SPARQL query to get the subclasses of a given class (see section 4.1 list- ing 2). Results are showed in table 2 and chart of As we can see on chart of figure 2, as the class figure 4 shows the relationship between the num- depth increases (more specific a class is), the num- ber of class descendants and the percentage of re- ber of redundant class assignments decreases. dundant class assignments (%RCA). Chart of fig- ure 5 shows how these redundant class assign- ments are distributed. Classes that do not have descendants (330 classes) are the most specific and class assignments that belong to that group are not redundant. As we can see in chart of fig- ure 4, when the number of descendant per class increases, the number of redundant class assign- ments also increases. For example, group named “1 to 5” refers to classes that have between 1 to 5 descendants (78 classes) and redundancy is rela- Figure 2: RCA vs class depth tively low. On the other hand, classes with several descendants (e.g. class Agent) has a high level of Chart of figure 3 shows that more than 80% of redundancy. As chart of figure 5 shows, only 3 redundant class assignments refer to more general classes have more than 100 descendants (Agent, classes (with less depth). For example, in table Person and Place) and they concentrates the 65% 1 we can see that there are 32 classes (nbClasses of redundant class assignments. 107 nbDesc nbClasses nbCA nbRCA 0 330 3,250,543 0 1 to 5 78 3,209,728 321,078 6 to 10 16 948,187 524,262 10 to 20 13 666,630 531,848 21 to 30 8 773,316 751,488 31 to 50 2 535,606 502,892 51 to 70 3 809,171 784,103 71 to 100 1 220,219 211,415 101 to 200 2 2,860,585 2,117,098 More than 200 1 5,231,844 4,472,258 Table 2: Redundancy and class descendants eval- uation. Figure 5: RCA distribution considering class de- scendants less than 10K class assignments. Besides, as this number increases the number of classes in- volved decreases but the percentage of redundant class assignments increases. For example, classes that have more than 500K class assignments (e.g. Figure 4: RCA vs class descendants Agent, Place, Person, etc.) concentrate most of them (9,292,013) and 48% are redundant. We also observe that the second group of most used classes 5.4 Redundancy and number of class (between 200K and 500K class assignments) con- assignments sists of about 8 classes with high levels of redun- To analyze the relationship between redundant dancy (between 70% and 95%). class assignments and the number of class as- signments per class we categorized classes into 10 groups according to the number of class as- signments that refers to a class. Table 3 shows the evaluation results and chart of figure 6 shows the relationship between redundancy and the num- ber of class assignments (or instances) per class. Columns nbCA-acc and nbRCA-acc show the total number of class assignments and redundant class assignments in each group. nbCA nbClasses nbCA-acc nbRCA-acc Figure 6: RCA vs number of class assignments 0 to 10K 315 599,261 101,861 10K to 20K 35 484,931 120,512 20K to 30K 34 588,845 267,460 30K to 40K 25 384,601 220,270 6 Conclusions and future work 40K to 30K 11 900,290 36,156 100K to 200K 7 906,730 365,085 In this work we addressed the concept of seman- 200K to 300K 5 1,274,010 1,222,844 tic redundancy considering class assignments as- 300K to 400K 1 396,046 395,804 sertions in linked datasets. Based on a formal 400K to 500K 2 934,843 681,445 More than 500K 6 9,292,013 4,472,258 definition we discussed how redundant (and non- redundant) class assignments sets can be detected. Table 3: Redundancy and number of class assign- Inspired in previous related work, we conducted ments evaluation. an evaluation over the English version of DBpedia based on SPARQL queries. We analyzed the re- As we can see, most of classes (315) have lationship between redundancy and three class hi- 108 erarchy characteristics: the number of instances a Christian Fürber and Martin Hepp. 2010. Using se- class has, its depth and its number of descendants. mantic web resources for data quality management. In Proceedings of the 17th International Conference Regarding the evaluation results, they suggest on Knowledge Engineering and Management by the that there is a relationship between redundancy Masses, EKAW’10, pages 211–225, Berlin, Heidel- and depth of a class in a hierarchy: as more gen- berg. Springer-Verlag. eral a class is (less depth), more redundant class Tom Heath and Christian Bizer. 2011. Linked Data: assignments can be found that refer to that class. Evolving the Web into a Global Data Space, vol- In a similar way, as the number of descendants ume 1 of Synthesis Lectures on the Semantic Web: per class increases so does the number of redun- Theory and Technology. Morgan Claypool, 1st ed., dant class assignments related to that class. Par- html version edition, February. ticularly, we noted that this also occurs in most Pascal Hitzler, Markus Krötzsch, and Sebastian populated classes (with more class assignments). Rudolph. 2009. Foundations of Semantic Web Tech- In this sense, datasets that use complex and large nologies. Chapman & Hall/CRC. class hierarchies to typify their resources in uncon- Aidan Hogan, Andreas Harth, Alexandre Passant, Ste- trolled environments (such as crowsourced gener- fan Decker, and Axel Polleres. 2010. Weaving the pedantic web. In Linked Data on the Web Workshop ated content) may be more prone to class assign- (LDOW2010) at WWW’2010, volume 628, pages ments redundancy. Considering this, we can make 30–34. CEUR Workshop Proceedings. the following observations: Amit Krishna Joshi, Pascal Hitzler, and Guozhu Dong, • SPARQL queries as a mechanism to evalu- 2013. The Semantic Web: Semantics and Big Data: 10th International Conference, ESWC 2013, ate redundancy offer practical ways to imple- Montpellier, France, May 26-30, 2013. Proceedings, ment quality checks and get some statistics chapter Logical Linked Data Compression, pages of linked datasets. However, SPARQL in- 170–184. Springer Berlin Heidelberg, Berlin, Hei- ference capabilities are limited: we can dis- delberg. cover just some graph patterns using queries Dimitris Kontokostas, Patrick Westphal, Sören Auer, but other implicit knowledge will be unreach- Sebastian Hellmann, Jens Lehmann, Roland Cor- able since we can not exploit all the semantic nelissen, and Amrapali Zaveri. 2014. Test-driven evaluation of linked data quality. In Proceedings of capabilities or expressiveness of the used lan- the 23rd International Conference on World Wide guages. Web, WWW ’14, pages 747–758, Republic and Canton of Geneva, Switzerland. International World • Redundancy analysis can be used to detect Wide Web Conferences Steering Committee. class assignments patterns and data publish- Jens Lehmann, Robert Isele, Max Jakob, Anja ers behaviors. For example, in our evaluation Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, we detected that when a resource is assigned Sebastian Hellmann, Mohamed Morsey, Patrick van to a very specific class, it is also assigned ex- Kleef, Sören Auer, and Christian Bizer. 2015. DB- plicitly to the ancestors of that class. Discov- pedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, ering these kinds of patterns may be useful to 6(2):167–195. improve the linked data generation process or even to understand how classes described in a Jeff Z Pan, JM Gómez-Pérez, Yuan Ren, Honghan Wu, and Man Zhu. 2014. Ssp: Compressing rdf given ontology are used on a specific dataset. data by summarisation, serialisation and predictive encoding. Technical report, Technical report, 07 Future work will be focused on the development 2014. Available as http://www. kdrive-project. eu/re- and assessment of semantic redundancy metrics sources. that support our results on other linked datasets. Honghan Wu, Boris Villazn-Terrazas, Jeff Z. Pan, and Besides, we plan to study how redundancy can Jos Manul Gmez-Prez. 2014. How redundant is affect other data quality dimensions particularly it? - an empirical analysis on linked datasets. In those related with semantic accuracy. Olaf Hartig, Aidan Hogan, and Juan Sequeda, edi- tors, COLD, volume 1264 of CEUR Workshop Pro- ceedings. CEUR-WS.org. References Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ri- cardo Pietrobon, Jens Lehmann, and Sören Auer. A. V. Aho, M. R. Garey, and J. D. Ullman. 1972. The 2015. Quality assessment for linked data: A survey. transitive reduction of a directed graph. SIAM Jour- Semantic Web Journal. nal on Computing, 1(2):131. 109