Automatic Weight Generation and Class Predicate Stability in RDF Summary Graphs

Mehmet Aydar, Serkan Ayvaz, and Austin Melton
Kent State University, Department of Computer Science, 241 Math and Computer Science Building, Kent, OH 44240, USA
{maydar,sayvaz1,amelton}@kent.edu
http://www.kent.edu/cs

Abstract. In this study, we use graph localities and neighborhood similarity to enhance our summary graph generation approach for building a summary graph structure for intelligent exploration of semantic data. The key improvements over our previous proposal include the addition of a string similarity measure for literal neighbors, the development of a stability measure to evaluate the accuracy of class relations, the addition of auto-generated property weights, and the detection of noise properties.

Keywords: Semantic Web, RDF, Graph Summarization, Automatic Property Weight

1 Introduction

In recent years, there has been significant progress in publishing semantic data on the Web as an ever-growing number of organizations adopt Semantic Web technologies. The Linked Open Data Initiative [4], along with several other Semantic Web projects, has promoted publishing various open datasets in the RDF model using a standard methodology, with links between data items from different data sources on the Web. While this has made available thousands of general-purpose datasets, including DBpedia [2], Freebase [5], and GeoNames [1], and many domain-specific data sources in the Resource Description Framework (RDF) data model, there is still a long way to go, as Linked Open Data remains a small portion of the information available on the Web. An RDF graph consists of a set of RDF triples. In the Semantic Web, RDF graphs can be very large, and processing an entire graph for each query can be costly in terms of time and resources.
A summary graph consists of type classes, the members of the type classes, and the relations between the type classes, with each type class representing a collection of RDF resources having the same type. A summary graph structure demonstrating the inferred class types and class relations can be beneficial for intelligent exploration of semantic data, as it helps in understanding the underlying structure and provides an intermediate index structure that lets semantic searches avoid unnecessary traversals of entire RDF graphs.

In our previous work, we proposed an efficient algorithm for auto-generating a summary graph structure from an RDF dataset; our main goal was faster computation. In this study, we focus on key improvements to the summary graph generation process for potentially more accurate results.

1.1 Contribution and Outline

The main contributions of the present study include the following:

– We auto-generate the importance weight of each property and each string word for each of the reference IRIs, and we apply these weights in the pairwise similarity calculation.
– We add a string similarity measure for when two graph vertices are of literal type.
– We generate the summary graph along with the classes and class relations, with a stability measure for each class relation, and we propose that these stability measures can also be utilized in semantic search algorithms to generate more accurate results.

The rest of the paper is organized as follows. We briefly review the graph summarization problem and discuss in detail the key improvements to our existing solution. Then, we present the results of the evaluations. Finally, we review the related work, followed by our conclusion and future work.

2 Augmenting Graph Summary Computations

Linked Data consists of a collection of RDF statements that intrinsically represents a labeled, directed multi-graph in which resources are expressed unambiguously.
RDF statements describe resources in the form of triples, consisting of subject-predicate-object expressions that describe a resource, the type of a resource (type triple), or a relationship between two resources [9]. The subject in an RDF triple is either an Internationalized Resource Identifier (IRI) or a blank node, the predicate is an IRI, and the object is an IRI, a literal, or a blank node. The subjects and objects of triples in an RDF graph form its RDF nodes. Each RDF node that corresponds to a unique RDF entity is represented by a unique IRI, and values such as strings, numbers, and dates are represented by literal nodes. A literal node can consist of two or three elements: a lexical form, a datatype IRI, and a language tag. The language tag of a literal node is included if and only if the datatype IRI of the literal node corresponds to rdf:langString [6]. A predicate in an RDF triple is also called a property of the RDF subject node. A predicate can be one of two types: a DatatypeProperty, where the subject of the triple is an IRI and the object is a literal, or an ObjectProperty, where both the subject and the object are IRIs. Each object of a subject node is called a neighbor of that subject node.

2.1 Summary Graph Generation

A summary graph of a data graph is a directed graph in which each node is a subset of the original graph nodes of the same type. Let G = (V, L, E) be a data graph such that V is the finite set of vertices, L is the finite set of edge labels, and E is the set of edges of the form l(u, v), with u ∈ V, v ∈ V, and l ∈ L. Note that an edge l(u, v) represents the RDF triple (u, l, v). We define a summary graph as G′ = (V′, L′, E′), such that V′ contains equivalence classes of V, and E′ and L′ are, respectively, the sets of edges and labels of the graph G′.
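To make the notation concrete, a data graph and its components can be held as plain sets. The sketch below is an illustrative representation only (the tuple layout is an assumption, not part of our implementation); the sample triples echo those shown in Table 1.

```python
# Illustrative sketch: a data graph G = (V, L, E) derived from a set of RDF
# triples (u, l, v). The plain-tuple representation is an assumption for
# illustration, not the authors' data structures.
triples = [
    ("SurgeryProcedure:236", "belongsToEvent", "Event:184"),
    ("SurgeryProcedure:236", "SurgeryProcedureClass", '"cardiac valve"'),
    ("Student49", "takesCourse", "Course32"),
]

V = {u for u, _, _ in triples} | {v for _, _, v in triples}  # vertices
L = {l for _, l, _ in triples}                               # edge labels
E = {(l, u, v) for u, l, v in triples}                       # edges l(u, v)
```

Here an edge l(u, v) is stored as the tuple (l, u, v), matching the convention that the edge represents the RDF triple (u, l, v).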
As we will see, L′ ⊂ L, and the elements of E′ are determined by the elements of the equivalence classes in V′ and the edges in E. There exist several methods to obtain a summary graph: (1) it can be obtained from the dataset ontology, if the dataset is already tied to an ontology; (2) it can be obtained by locating the type triples in the dataset and organizing the type classes and relations accordingly, if the dataset is published using a standard vocabulary [11]; or (3) it can be built automatically by inferring the class types based on the similarity of the RDF nodes. Our graph summarization approach is based on method (3). In our previous study, we proposed an algorithm for building a summary graph structure based on pairwise similarity matrices of the graph entities [3]. An efficient graph node pair similarity metric was introduced utilizing graph localities and neighborhood similarity within the Jaccard measure context [13] in conjunction with RoleSim similarity [15], without relying on the existence of a common vocabulary such as rdf:type or owl:sameAs. The intuition is that nodes that have similar predicates connected to similar neighbors tend themselves to be similar nodes; thus, they should be in the same class. The properties of the entities are treated as the dimensions of the entities when measuring entity similarity, and the direct similarities of the entities are taken into account along with the similarity of the neighbors with which they interact. Therefore, our summary graph generation algorithm first calculates the similarity of the IRI node pairs that share at least one common edge label. The algorithm then generates the distinct classes based on a given threshold, such that nodes u and v are put into the same class if their dissimilarity is less than the defined threshold. More details of our summary graph generation approach can be found in [3].
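The threshold-based class generation step described above can be sketched as follows. This is a simplified illustration under assumptions (a union-find merge and a stand-in `dissimilarity` function), not the algorithm of [3]:

```python
# Hedged sketch of the class-generation step: merge node pairs whose
# dissimilarity is below the threshold into the same equivalence class.
# `dissimilarity` stands in for the pairwise metric of [3]; the union-find
# structure is an implementation choice, not the authors' algorithm.
def build_classes(nodes, dissimilarity, threshold):
    parent = {n: n for n in nodes}

    def find(x):                    # class representative, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, u in enumerate(nodes):   # merge all sufficiently similar pairs
        for v in nodes[i + 1:]:
            if dissimilarity(u, v) < threshold:
                parent[find(u)] = find(v)

    classes = {}
    for n in nodes:                 # group nodes by their representative
        classes.setdefault(find(n), set()).add(n)
    return list(classes.values())
```

Lowering the threshold merges fewer pairs and therefore yields more granular classes, which matches the behavior discussed in the evaluations.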
In this study, we enhance our core summary graph generation approach [3] by incorporating literal node similarities, applying auto-generated importance weights of the IRI node descriptors, and developing a measure describing the degree of confidence in the summary graph class relations. In the following sections, we describe these key improvements in detail.

2.2 Literal Node Similarity

As stated in [3], our graph summarization approach is based on calculating the similarity of entities by utilizing the predicates of the IRI nodes. Our premise is that similar nodes tend to have similar properties and to interact with similar neighbor nodes, which are either IRIs or literals. It is challenging to infer the semantics of literal nodes, yet an effective literal node similarity metric is needed for calculating the similarity of pairs of IRI nodes when some of their neighbors are literal nodes. Incorporating literals when computing the similarity of pairs can be beneficial for identifying similar entities, particularly in datasets where entities are commonly described using literals. Thus, in the present study, we take the similarity of literal neighbor nodes into account in the similarity calculations. While incorporating literals in the computation of the similarity of IRI node pairs, we assume that all literals are in the same language, as the same literal may have entirely different meanings in different languages; thus, there is at most one value for the language tag component of the literal nodes. In this work, we disregard the language tag if present, which means we work only with the lexical form and the datatype IRI component of a literal node. Both the lexical form and the datatype IRI component clearly impact the similarity of a pair of literal nodes.
Thus, they indirectly impact the similarity of IRI nodes in the calculation of neighborhood similarity, and their impacts need to be weighted. Since comparing literal nodes with different datatypes is meaningless, we give weight to the datatype factor only when the two datatypes are equal. For the lexical form components of the literal nodes, we use a string similarity technique based on the common words within the two lexical forms, along with their auto-generated importance weights. More details about the importance weights are given in the following section.

2.3 Descriptor Importance and Automatic Detection of Noise Labels

An IRI node is described through its predicates and the collection of literal neighboring nodes in lexical form; we call these the descriptors of the IRI node. The similarity of two IRI nodes is calculated from their descriptor similarities, including the similarities of their neighbors. The accuracy of pairwise graph node similarity is often impacted by the weight of a property associated with the graph nodes, when the nodes are object nodes, or by the weight of a string literal word referenced by the graph nodes, when the nodes are of literal type. Each descriptor may have a different impact on an IRI node. Therefore, identifying appropriate metrics for generating weights for the IRI descriptors to be utilized in pairwise graph node similarities is a formidable yet significant task. In this paper, we investigate the factors that can impact the weight of a descriptor, and we propose an approach for generating the importance weights of the IRI node descriptors automatically. Our approach is based on two premises: (1) the weight of a descriptor may differ for each IRI for which it is a descriptor, and (2) the weight increases proportionally with the number of times a descriptor appears in the reference IRI but is offset by the frequency of the descriptor in the entire RDF dataset.
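The common-word string similarity for lexical forms described in Sect. 2.2 above can be sketched as a weighted word overlap. This is an illustrative formulation under assumptions (whitespace tokenization, and a `weight` function standing in for the auto-generated word importance weights), not the exact measure of our implementation:

```python
def weighted_word_overlap(lex1, lex2, weight):
    """Similarity of two lexical forms: the total importance weight of the
    words they share, normalized by the total weight of all their words.
    `weight(w)` stands in for the auto-generated importance weight of word w."""
    words1 = set(lex1.lower().split())
    words2 = set(lex2.lower().split())
    union = words1 | words2
    if not union:
        return 0.0
    shared = sum(weight(w) for w in words1 & words2)
    return shared / sum(weight(w) for w in union)
```

With uniform weights, "mitral valve repair" and "pulmonary valve repair" share two of four distinct words, giving a similarity of 0.5; non-uniform weights let distinctive words such as "mitral" dominate the score.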
This is a notion similar to term frequency-inverse document frequency (tf-idf) [16, 20], a commonly used technique in information retrieval, indicating that some words may be important in some documents but not as important in others. More precisely, the importance of a word in a document increases with its frequency in the document but decreases with its frequency in the corpus [18]. We apply the tf-idf concept to the properties and nodes in RDF graphs to compute the weight of properties. tf-idf is calculated as follows:

tf-idf(p, u, G) = tf(p, u) × idf(p, G),   (1)

where the term frequency (tf) [16] represents the frequency of a property p with respect to a graph subject node u. More precisely, when u ∈ V and p ∈ L, then

f(p, u) = |{v ∈ V : p(u, v) ∈ E}|.   (2)

Equivalently, f(p, u) is the number of RDF triples with subject u and property p. To define tf(p, u), it is helpful to have a notation for the set of all properties with subject u. Thus, for u ∈ V, let L(u) = {q ∈ L : ∃v ∈ V with q(u, v) ∈ E}. Then

tf(p, u) = f(p, u) / Σ_{q ∈ L(u)} f(q, u).   (3)

The inverse document frequency (idf) [20] represents the frequency of a property's usage across all graph nodes, and it is defined as

idf(p, G) = ln( |V| / |{u ∈ V : p ∈ L(u)}| ).   (4)

We apply a similar approach to calculate word importance weights in literal nodes, which can consist of a set of words. A string literal is a range for a DatatypeProperty. We assert that the weight of a word's importance depends upon the source subject node, the frequency of the word within the triple collection for each subject node, and the frequency of the word within the entire dataset. We calculate property importance and assign weights depending on the degree of distinctiveness of a property in describing an entity. By property distinctiveness, we mean the uniqueness of a property in describing the key characteristics of an entity type.
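Equations (1)-(4) translate directly into code. The following is an illustrative implementation under an assumed data layout (a list of (subject, predicate, object) triples), not the authors' code:

```python
import math
from collections import Counter, defaultdict

def property_weights(triples):
    """tf-idf importance weight of each property p for each subject u:
      tf(p, u)  = f(p, u) / sum of f(q, u) over q in L(u)       (Eq. 3)
      idf(p, G) = ln(|V| / |{u in V : p in L(u)}|)              (Eq. 4)
      tf-idf(p, u, G) = tf(p, u) * idf(p, G)                    (Eq. 1)
    """
    f = defaultdict(Counter)           # f[u][p] = |{v : p(u, v) in E}| (Eq. 2)
    subjects_using = defaultdict(set)  # p -> {u in V : p in L(u)}
    V = set()
    for u, p, v in triples:
        f[u][p] += 1
        subjects_using[p].add(u)
        V.update((u, v))
    weights = {}
    for u, counts in f.items():
        total = sum(counts.values())   # sum of f(q, u) over q in L(u)
        for p, n in counts.items():
            weights[(p, u)] = (n / total) * math.log(len(V) / len(subjects_using[p]))
    return weights
```

Properties used by many subjects receive a low idf and therefore a low weight, which is exactly the mechanism that demotes the noise labels discussed below.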
For instance, if a property is specific to an entity type, it is a character that distinguishes that type from other types; when a property exists in all entity types, its distinctiveness is low. Noise labels tend to be common to a majority of entities, if not all. By increasing the importance weights of properties with a higher degree of distinctiveness, we automatically reduce the importance of noise labels. As a result, noise labels have significantly less impact on the overall similarity measures.

2.4 Class Relation Stability Metric

The summary graph is built automatically from an RDF dataset, and in our approach the constructed summary graph is also represented in RDF. IRI nodes whose similarity is higher than a defined threshold are considered to be of the same type and are categorized in the same class in the summary graph. A class relation between a class c1 and a class c2 is generated as a predicate and represented as l(c1, c2) when there is at least one relation l(u, v) such that u and v are IRIs in the dataset G = (V, E, L), with u ∈ c1 and v ∈ c2, and with both c1 and c2 being type classes in the summary graph G′ = (V′, E′, L′). Then we have l ∈ L′ and l(c1, c2) ∈ E′. However, automatically generated summary graphs can be error prone. Therefore, a metric measuring the degree of confidence in a relation between classes in the summary graph would be beneficial. We call this metric Class Predicate Stability (CPS); it is similar to the stability concept introduced by Paige and Tarjan [17]. For a triple (c1, p, c2) in the summary graph G′, with c1 and c2 being type class IRI nodes and p a predicate between them, the CPS metric is calculated as the number of IRI nodes u in c1 having a triple of the form (u, p, v) with u ∈ c1 and v ∈ c2, divided by the total number of IRI nodes in c1.
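A hedged sketch of computing CPS, following the node-counting definition above (the data structures are assumptions, not the authors' implementation):

```python
def cps(c1_members, c2_members, p, data_edges):
    """Class Predicate Stability of (c1, p, c2): the fraction of IRI nodes
    u in c1 that have at least one triple (u, p, v) with v in c2.
    `data_edges` is assumed to be a set of (subject, predicate, object) triples."""
    c2 = set(c2_members)
    connected = sum(1 for u in c1_members
                    if any((u, p, v) in data_edges for v in c2))
    return connected / len(c1_members)   # |c1| > 0 by construction

def average_cps(summary_edges, classes, data_edges):
    """Average CPS over all class-level edges (c1, p, c2) of the summary
    graph, a measure of accuracy of the whole summary graph."""
    values = [cps(classes[c1], classes[c2], p, data_edges)
              for c1, p, c2 in summary_edges]
    return sum(values) / len(values)
```

Since a class relation is generated only when at least one supporting triple exists, every generated relation has CPS ≥ 1/|c1| > 0, and a CPS of 1 corresponds to the "full CPS" case.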
CPS(c1, p, c2) is formulated as

CPS(c1, p, c2) = |{(u, p, v) ∈ E : u ∈ c1, v ∈ c2}| / |c1|,   (5)

where |c1| is the number of IRI nodes in the class c1; note that |c1| > 0. We say there is full CPS when, for two classes in the summary graph, either every IRI node in c1 is connected by the predicate p to at least one IRI node in c2, or no IRI node in c1 is connected by p to an IRI node in c2. The CPS value of a triple (c1, p, c2) in the summary graph indicates how strongly connected and how coarsely partitioned the type classes c1 and c2 are with respect to the predicate p. Thus, the average of all the CPS values in the summary graph is a measure of accuracy for the generated summary graph. CPS(G′) is formulated as

CPS(G′) = ( Σ_{i=1}^{|E′|} CPS(c1_i, p_i, c2_i) ) / |E′|,   (6)

where G′ = (V′, E′, L′) is the summary graph and p_i(c1_i, c2_i) ∈ E′, and thus |E′| > 0. Another advantage of the CPS metric is that it can further be utilized in semantic search algorithms. In traditional semantic search algorithms, the relations between two different type classes are assumed to be tightly coupled [21]. In real situations this assumption may not always hold, especially when the summary graph is auto-generated, as in our study. We propose that the CPS metric can be used as an impact factor between two type classes and utilized in semantic search graph traversal for more accurate results.

3 Evaluations

In the evaluations, we assessed the effectiveness of the proposed improvements on three datasets: a subset of DBpedia [2]; a subset of SemanticDB [10], a Semantic Web content repository for clinical research and quality reporting; and a subset of the Lehigh University Benchmark (LUBM) [12], a benchmark for OWL knowledge base systems.
Fig. 1. A figure consisting of different types of entities and elements belonging to the class types.

We ran the datasets through summary graph generation with the core algorithm proposed in our previous work [3] and with the improvements suggested in this work. The improvements include taking the literal neighbor similarity and the dynamic property weight assignment into account in type generation. The goal of our evaluations was to investigate the impact of the improvements on real-world datasets.

The reason for selecting these three datasets is that they represent different aspects of real-world semantic data; thus, we tested the applicability of our approach on different types of datasets. SemanticDB is a domain-specific semantic data repository in healthcare. It provides structured type information for the entities, which we utilized as the ground truth for automatic verification of accuracy in the evaluations. The Lehigh University Benchmark (LUBM) is a structured and well-known benchmark dataset, which has type information available. However, its entities can have multiple types; unlike SemanticDB, LUBM data has hierarchical types. For instance, an entity can have both the Student type and the Graduate Student type. Therefore, we performed a manual verification process for the ground truth to ensure the accuracy of the evaluations. Lastly, DBpedia is a commonly used general-purpose dataset, which is a central source in the Linked Open Data Cloud [4]. Type information is not always present for entities in DBpedia. Moreover, some entities have several types, including hierarchical types, which makes automatic verification of accuracy results problematic. Therefore, we manually verified the accuracy of the ground truth in the evaluations. Table 1 demonstrates a sample of RDF triples from each dataset in the evaluations.

Table 1. A Sample of RDF Triples from Each Dataset

Dataset | Subject | Predicate | Object
SemanticDB | SurgeryProcedure:236 | hasCardiacValveAnatomyPathologyData | CardiacValveAnatomyPathologyData:70
SemanticDB | SurgeryProcedure:236 | hasCardiacValveRepairProcedureData | CardiacValveRepairProcedureData:16
SemanticDB | SurgeryProcedure:236 | SurgeryProcedureClass | "cardiac valve"
SemanticDB | SurgeryProcedure:236 | CardiacValveEtiology | "other"
SemanticDB | SurgeryProcedure:236 | CardiacValveEtiology | Event:184
SemanticDB | SurgeryProcedure:236 | belongsToEvent | Event:184
SemanticDB | SurgeryProcedure:236 | SurgeryProcedureDescription | "pulmonary valve repair"
SemanticDB | SurgeryProcedure:236 | CardiacValveStatusologyData | "native"
SemanticDB | SurgeryProcedure:104 | hasCardiacValveAnatomyPathologyData | CardiacValveAnatomyPathologyData:35
SemanticDB | SurgeryProcedure:104 | SurgeryProcedureClass | "cardiac valve"
SemanticDB | SurgeryProcedure:104 | CardiacValveEtiology | "rheumatic"
SemanticDB | SurgeryProcedure:104 | belongsToEvent | Event:81
SemanticDB | SurgeryProcedure:104 | SurgeryProcedureDescription | "mitral valve repair"
SemanticDB | SurgeryProcedure:104 | CardiacValveStatus | "native"
LUBM | Student49 | telephone | "xxx-xxx-xxxx"
LUBM | Student49 | memberOf | http://www.Department3.University0.edu
LUBM | Student49 | takesCourse | Course32
LUBM | Student49 | name | "UndergraduateStudent49"
LUBM | Student49 | emailAddress | "Student49@Department3.University0.edu"
LUBM | Student49 | type | UndergraduateStudent
LUBM | Student10 | telephone | "xxx-xxx-xxxx"
LUBM | Student10 | memberOf | http://www.Department3.University0.edu
LUBM | Student10 | takesCourse | Course20
LUBM | Student10 | name | "UndergraduateStudent10"
LUBM | Student10 | emailAddress | "Student10@Department3.University0.edu"
LUBM | Student10 | type | UndergraduateStudent
DBPedia | Allen Ginsberg | wikiPageUsesTemplate | Template:Infobox writer
DBPedia | Allen Ginsberg | influenced | John Lennon
DBPedia | Allen Ginsberg | occupation | "Writer, poet"@en
DBPedia | Allen Ginsberg | influences | Fyodor Dostoyevsky
DBPedia | Allen Ginsberg | deathPlace | "New York City, United States"@en
DBPedia | Allen Ginsberg | deathDate | "1997-04-05"
DBPedia | Allen Ginsberg | birthPlace | "Newark, New Jersey, United States"@en
DBPedia | Allen Ginsberg | birthDate | "1926-06-03"
DBPedia | Albert Camus | wikiPageUsesTemplate | Template:Infobox philosopher
DBPedia | Albert Camus | influenced | Orhan Pamuk
DBPedia | Albert Camus | influences | Friedrich Nietzsche
DBPedia | Albert Camus | schoolTradition | Absurdism
DBPedia | Albert Camus | deathPlace | "Villeblevin, Yonne, Burgundy, France"@en
DBPedia | Albert Camus | deathDate | "1960-01-04"
DBPedia | Albert Camus | birthPlace | "Drean, El Taref, Algeria"@en
DBPedia | Albert Camus | birthDate | "1913-11-07"

Table 2.
An Excerpt from Dynamically Assigned Weights of Descriptors

Dataset | Node Pair | Descriptor Type | Descriptor | Weight
LUBM | (Student49, Student10) | Property | memberOf | 14.7%
LUBM | (Student49, Student10) | Property | takesCourse | 44.1%
LUBM | (Student49, Student10) | Property | emailAddress | 14.0%
LUBM | (Student49, Student10) | Property | type | 5.7%
LUBM | (Student49, Student10) | Property | name | 7.5%
LUBM | (Student49, Student10) | Property | telephone | 14.0%
SemanticDB | (Procedure:236, Procedure:104) | Literal | "cardiac" | 13.6%
SemanticDB | (Procedure:236, Procedure:104) | Literal | "native" | 15.2%
SemanticDB | (Procedure:236, Procedure:104) | Literal | "other" | 14.3%
SemanticDB | (Procedure:236, Procedure:104) | Literal | "pulmonary" | 22.8%
SemanticDB | (Procedure:236, Procedure:104) | Literal | "repair" | 17.2%
SemanticDB | (Procedure:236, Procedure:104) | Literal | "valve" | 16.9%
DBPedia | (Allen Ginsberg, Albert Camus) | Property | wikiPageUsesTemplate | 2.2%
DBPedia | (Allen Ginsberg, Albert Camus) | Property | influences | 58.3%
DBPedia | (Allen Ginsberg, Albert Camus) | Property | deathDate | 2.2%
DBPedia | (Allen Ginsberg, Albert Camus) | Property | birthDate | 2.4%
DBPedia | (Allen Ginsberg, Albert Camus) | Property | birthPlace | 2.1%
DBPedia | (Allen Ginsberg, Albert Camus) | Property | deathPlace | 2.1%
DBPedia | (Allen Ginsberg, Albert Camus) | Property | influenced | 30.7%

We also evaluated the performance of the dynamic assignment of descriptor weights. Table 2 shows a sample of dynamically assigned descriptor weights from each dataset. As expected, the algorithm assigned higher weights to the properties with a higher degree of distinctiveness in describing the resource type. For instance, in the LUBM dataset, the takesCourse property is more descriptive of the Student type than the name property, which is common to all class types in the dataset. Thus, takesCourse was assigned a weight of 44.1%, compared to 7.5% for name.
We observed that a higher class dissimilarity threshold results in coarser classes, whereas the classes become more granular when the threshold is smaller. The beta factor and the class dissimilarity threshold can be tuned differently for different datasets; their optimal values depend on the characteristics of the datasets. For each dataset, we kept the beta factor and the class dissimilarity threshold the same in both evaluations, i.e., for the core algorithm and for the algorithm with the improvements. We found that a class dissimilarity threshold between 0.3 and 0.6, in combination with a beta factor of 0.15, appeared to work well in our evaluations. The evaluation with the suggested improvements generates a summary graph with better stability and, in most cases, better accuracy, as demonstrated in Table 3. We noticed that the literal similarity improves the class generation accuracy in datasets with frequently used terminology, as in the cases of SemanticDB and LUBM. On the other hand, it may have an adverse effect in datasets with a lengthy and diverse vocabulary of literals, as in the example of DBpedia.

Table 3. Evaluation Results

Dataset | Algorithm | #Triples | Class Threshold | #Iterations | Stability | Accuracy
SemanticDB | Core | 6,450 | 0.5 | 4 | 61.0% | 87.3%
SemanticDB | With Improvements | 6,450 | 0.5 | 4 | 68.2% | 94.1%
LUBM | Core | 6,484 | 0.3 | 3 | 67.8% | 90.7%
LUBM | With Improvements | 6,484 | 0.3 | 3 | 78.4% | 98.6%
DBPedia | Core | 10,000 | 0.6 | 3 | 82.4% | 92.8%
DBPedia | With Improvements | 10,000 | 0.6 | 3 | 89.1% | 92.2%

Figure 1 illustrates a small sample set of entities in the RDF graph from SemanticDB and their corresponding class types in the summary graph. As demonstrated in Figure 1, the classes C-E1 and C-E2 represent entities that are patient event types. They are classified into two different classes because, when compared with the original dataset, we observed that the entities in C-E1 are more specifically patient surgery-related event types, while the entities in C-E2 are patient-encounter-related event types.
Also, the classes C-SP1 and C-SP2 are surgical procedure types: more specifically, the entities in C-SP1 are coronary artery and vascular procedure-related procedures, while the entities in C-SP2 are cardiac valve-related procedures. The classes C-VP and C-CAG represent entities related to vascular procedures and coronary artery grafts, respectively. We implemented a basic algorithm to name the classes based on the class member IRIs; the classes C-E1, C-E2, C-SP1, C-SP2, C-VP, and C-CAG are named C-Event-1, C-Event-2, C-SurgicalProcedure-1, C-SurgicalProcedure-2, C-VascularProcedure, and C-CoronaryArteryGraft, respectively.

Fig. 2. An excerpt from the generated summary graph.

The summary graph is generated along with the classes and class relations, with a stability measure for each relation. Figure 2 shows an excerpt of the class relations in the summary graph generated from the SemanticDB dataset; the percentage values beside the predicates are the stability (CPS) measures.

4 Related Work

Many methods have been proposed for calculating graph node similarities in an RDF dataset, including our previous study [3]. While most similarity calculations do not take property weights into account, for example [3] and [14], some studies calculate property weights and apply them in similarity calculations. H-Match [7] detects property weights using distinct-value-based weight generation, assigning a higher weight to a property that references more distinct values. However, a training set of instances may not always be available. In [8], the authors suggest that properties with a maximum or an exact cardinality of 1 have a higher impact in instance matching and thus a higher property weight. This assumption does not work well in instance type discovery.
For instance, in a university-related dataset, a more specific property hasPresident should have more impact in type discovery than a more general property hasName, even though both properties have a cardinality of one. In this case, the assumption would misleadingly assign the same weight to both properties. On the other hand, [19] considers the ratio of the number of distinct values of a property to the number of instances in a dataset, in addition to the number of distinct values referenced by the property. However, they primarily focus on instance matching, where property weights naturally give precedence to properties that make instances more unique. Unlike the instance matching approach, we emphasize the properties that describe the entity types more distinctively. In [17], the authors defined the stability concept for use in a coarsest partitioning problem and utilized it on directed graphs. In our work, we leverage the stability concept for use in a summary graph in the RDF model.

5 Conclusion

In this paper, we described enhancements to our pairwise graph node similarity calculation through the addition of property and string word importance weights. We introduced the Class Predicate Stability metric, which allows evaluation of the degree of confidence in each class predicate in the summary graph. We experimented with the enhanced method applied to our previous core summary graph generation technique. The results show that our enhanced method can yield more accurate results than the pure summary graph generation technique. Future work will focus on improving the scalability of the proposed method. Furthermore, we plan to investigate obtaining the optimal value of the class dissimilarity threshold automatically and improving the class generation algorithm to discover hierarchies of class types.
Acknowledgments The authors would like to thank the Kent State University Semantic Web Research Group (SWRG) members for their helpful feedback. References 1. GeoNames, June 2015. http://www.geonames.org/. 2. Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. Springer, 2007. 3. Serkan Ayvaz, Mehmet Aydar, and Austin C Melton. Building summary graphs of RDF data in semantic web. In Computer Software and Applications Conference (COMPSAC), 2015 IEEE 39th International. IEEE, 2015. 4. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009. 5. Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008. 6. Dan Brickley and R. V. Guha. RDF Schema 1.1. W3C Recommendation, February 2014. 7. Silvana Castano, Alfio Ferrara, Stefano Montanelli, and C Quix. H-Match: an algorithm for dynamically matching ontologies in peer-based systems. In SWDB, pages 231–250. Citeseer, 2003. 8. Keith Cortis, Simon Scerri, Ismael Rivera, and Siegfried Handschuh. Discovering semantic equivalence of people behind online profiles. In Proceedings of the Resource Discovery (RED) Workshop, ser. ESWC, 2012. 9. Richard Cyganiak, David Wood, and Markus Lanthaler. RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation, February 2014. 10. Christopher D Pierce, David Booth, Chimezie Ogbuji, Chris Deaton, Eugene Blackstone, and Doug Lenat. SemanticDB: A semantic web infrastructure for clinical research and quality reporting. Current Bioinformatics, 7(3):267–277, 2012. 11.
Songyun Duan, Anastasios Kementsietsidis, Kavitha Srinivas, and Octavian Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 145–156. ACM, 2011. 12. Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2):158–182, 2005. 13. Anil K Jain and Richard C Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988. 14. Glen Jeh and Jennifer Widom. SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543. ACM, 2002. 15. Ruoming Jin, Victor E Lee, and Hui Hong. Axiomatic ranking of network role similarity. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 922–930. ACM, 2011. 16. Hans Peter Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309–317, 1957. 17. Robert Paige and Robert E Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973–989, 1987. 18. Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2011. 19. Md Hanif Seddiqui, Rudra Pratap Deb Nath, and Masaki Aono. An efficient metric of automatic weight generation for properties in instance matching technique. International Journal of Web & Semantic Technology, 6(1):1, 2015. 20. Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972. 21. Thanh Tran, Haofen Wang, Sebastian Rudolph, and Philipp Cimiano. Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data.
In Data Engineering (ICDE), 2009 IEEE 25th International Conference on, pages 405–416. IEEE, 2009.