An Empirical Study on Property Clustering in Linked Data? Saisai Gong, Haoxuan Li, Wei Hu, and Yuzhong Qu State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China ssgong.nju@gmail.com, hxli.nju@gmail.com, whu@nju.edu.cn, yzqu@nju.edu.cn Abstract. Properties are used to describe entities, a part of which are likely to be clustered together to constitute an aspect. However, existing automated approaches to property clustering remain far from satisfac- tory for an open domain like Linked Data. In this paper, we firstly inves- tigated the relatedness between properties using five different measures. Then, we employed three clustering algorithms and two combination methods for property clustering. We empirically studied the property clustering on a moderate-sized sample of Linked Data and found that a proper combination of different measures gave rise to the best result. 1 Introduction With the development of Linked Data, billions of RDF triples have been pub- lished to describe numerous entities. An entity usually involves multiple aspects and its property-values may focus on different aspects. For instance, graduate from and work at reveal the career information of a person, while parent, spouse and child deliver her family information. Therefore, it is natural to cluster prop- erties into meaningful groups based on the aspects that they intend to describe. Property clustering is useful for many applications such as entity browsing, on- tology editing, query completion, etc. It makes the presented information more formatted and understandable and significantly enhances the capability of users to consume the large-scale Linked Data. However, automated property cluster- ing for an open domain like Linked Data remains far from satisfactory due to the multi-sourced and heterogeneous vocabularies used. In this paper, we empirically studied the property clustering in Linked Data. We tried our best in this study to provide answers to the following questions: Q1. What is the most effective measure(s) for measuring property relatedness? Q2. What is the most effective algorithm(s) for clustering properties ? Q3. Can the combination method(s) improve the property clustering and how largely? Q4. Are there any general principles or guidelines for using the property cluster- ing in practice? ? Funded by the National Natural Science Foundation of China (No. 61370019). 2 Property Relatedness Measures To achieve property clustering, we measure the relatedness between properties from the following five perspectives. – Lexical similarity between property names, denoted by RI , is based on the common characters of property names. For example, both mouth position and mouth elevation describe the mouth information of a river. We calculated RI using the I-Sub string similarity [3]. – Semantic relatedness between property names, denoted by RW , leverages WordNet to measure the semantic relatedness between properties. We used the average Lin’s WordNet relatedness [2] of word pairs in property names to calculate RW . – Distributional relatedness between properties, referred to as RU , is based on the property co-occurrence in the context of an entity’s RDF description, i.e. both properties are used together to describe the entity. Symmetrical uncertainty coefficient was used to compute the distributional relatedness. To estimate the probabilities of co-occurrence, the Billion Triples Challenge (BTC) 2011 dataset1 was used, in which the descriptions of coreferent URIs were merged. – Range relatedness between properties, referred to as RT , is based on the class relatedness of property ranges. For example, if two properties have the ranges delicious food and handicraft respectively, both of them deliver the tourist information of a tourist city. The range relatedness is calculated using the maximum WordNet-based relatedness RW of class pairs in property ranges. – Overlap of property values, denoted by RO , leverages the common values of two properties to compute the relatedness. The text of each property value is firstly collected, e.g. local names of URIs and lexical forms of literals after normalization, and all the terms in the text are used to construct a term frequency vector. RO is then computed using the cosine similarity of the corresponding vectors. 3 Clustering Algorithms and Combination Methods We employed the following three well-known clustering algorithms: DBSCAN (denoted by CD ), Single linkage clustering (CL ) and Spectral clustering (CS ). Combining various relatedness measures helps obtain a better clustering. We employed two typical combination methods. The first one is to first compute property relatedness using a linear combination of different measures for each property pair and then carry out clustering. The second one is to first conduct clustering based on individual measures and then aggregate these individual results using ensemble clustering. We selected consensus clustering to realize ensemble clustering and calculated it using CC-Pivot [1]. 1 http://km.aifb.kit.edu/projects/btc-2011/ Table 1. Average performance w.r.t. relatedness measures and clustering algorithms (a) Precision (b) Recall (c) F-Score CD CL CS CD CL CS CD CL CS RI .235 .235 .184 RI .273 .273 .449 RI .253 .253 .261 RW .215 .215 .198 RW .266 .266 .337 RW .238 .238 .250 RU .242 .242 .177 RU .433 .433 .410 RU .310 .310 .248 RT .170 .170 .215 RT .381 .381 .329 RT .235 .235 .260 RO .247 .247 .188 RO .137 .138 .427 RO .176 .177 .261 (d) Rand Index (e) NMI CD CL CS CD CL CS RI .549 .549 .500 RI .387 .387 .229 RW .672 .672 .584 RW .441 .441 .231 RU .644 .644 .503 RU .507 .507 .224 RT .547 .547 .628 RT .364 .364 .255 RO .709 .708 .516 RO .520 .520 .216 4 Empirical Study We report our study of the relatedness measures, clustering algorithms and com- bination methods. Their clustering performance w.r.t. the golden standard was evaluated using the following five metrics: Precision, Recall, F-Score, Rand Index and Normalized Mutual Information (NMI). All the parameters were set as the ones achieving the highest harmonic mean of F-Score. We sampled 20 entities of different types in Linked Data, each of which was integrated from a DBpedia URI with its coreferent ones from 12 different sources2 . Every entity has at least 51 properties while the maximum number is 574. The golden standard was built based on Freebase. Freebase divides proper- ties describing similar aspects into types and groups similar types into domains. We invited three PhD candidates in the field of Linked Data to assign each property to the most relevant /domain/type. The properties that were assigned to the same /domain/type were clustered together to form the golden standard. The Fleiss’ κ inter-rater agreement score is 0.895, showing the strong agreement. Table 1 depicts the average performance achieved w.r.t. different measures using clustering algorithms. Overall, no measure achieves the highest values for every clustering algorithm on all the measures. RI and RU generally generate better clusterings in terms of F-Score. Besides, from the third column of each table, we saw that CD is similar to CL and CS is greatly different from them. CD and CL usually generate better clustering results in terms of Rand Index and NMI. Table 2 shows the harmonic means of Precision, Recall, F-Score, Rand Index and NMI achieved by using single measures, linear measure combinations 2 These sources are DBpedia, DBTune, Freebase, GeoNames, LinkedGeoData, Linked- MDB, New York Times, OpenCyc, Project Gutenberg, RDF Book Mashup, The World Factbook and YAGO Table 2. Comparison on single relatedness measures and two combination methods Clustering algorithm: CD Precision Recall F-Score Rand Index NMI RI .235 .273 .253 .549 .387 RW .215 .266 .238 .672 .441 RU .242 .433 .310 .644 .507 RT .170 .381 .235 .547 .364 RO .247 .137 .176 .709 .520 .3RI + .7RU .218 .757 .339 .471 .379 .5RI + .5RO .209 .619 .313 .411 .265 .6RU + .4RO .214 .716 .330 .477 .375 .3RI + .5RU + .2RO .211 .883 .341 .398 .318 .3RI + .5RU + .1RT + .1RO .205 .878 .333 .372 .277 .2RI + .1RW + .2RU + .5RO .216 .790 .339 .438 .344 .2RI + .1RW + .15RU + .1RT + .45RO .207 .899 .337 .364 .268 RI , RU .287 .148 .196 .732 .563 RI , RO .331 .051 .089 .744 .566 RU , RO .290 .066 .108 .755 .575 RI , RU , RO .273 .210 .237 .706 .513 RI , RU , RT , RO .292 .102 .151 .744 .560 RI , RW , RU , RO .290 .115 .165 .726 .548 RI , RW , RU , RT , RO .256 .213 .232 .677 .493 and ensemble clustering (the 13th to 19th rows). The results indicate that the linear combination of relatedness measures tends to generate a clustering that features a higher Recall compared to single measures, while ensemble clustering is recommended to use if a higher Precision is preferred. 5 Conclusion In this paper, we studied the property clustering in Linked Data and evaluated different property relatedness measures, clustering algorithms and combination methods. Our experimental results demonstrated the feasibility of the automated property clustering. In future work, we will improve the quality of property clustering by leveraging user feedback and active learning. References 1. Ailon, N., Charikar, M., Newman, A.: Aggregating Inconsistent Information: Rank- ing and Clustering. Journal of the ACM, 55(5):23 (2008) 2. Lin, D.: An Information-Theoretic Definition of Similarity. In: ICML 1998. pp. 296–304. Morgan Kaufmann, San Francisco (1998) 3. Stoilos, G., Stamou, G., Kollias, S.: A String Metric for Ontology Alignment. In: Gil, Y., et al. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 624–637. Springer, Heidelberg (2005)