=Paper=
{{Paper
|id=Vol-1741/jist2016pd_paper6
|storemode=property
|title=An Empirical Study on Property Clustering in Linked Data
|pdfUrl=https://ceur-ws.org/Vol-1741/jist2016pd_paper6.pdf
|volume=Vol-1741
|authors=Saisai Gong,Haoxuan Li,Wei Hu,Yuzhong Qu
|dblpUrl=https://dblp.org/rec/conf/jist/GongLHQ16a
}}
==An Empirical Study on Property Clustering in Linked Data==
<pdf width="1500px">https://ceur-ws.org/Vol-1741/jist2016pd_paper6.pdf</pdf>
<pre>
      An Empirical Study on Property Clustering
                   in Linked Data?

                Saisai Gong, Haoxuan Li, Wei Hu, and Yuzhong Qu

                  State Key Laboratory for Novel Software Technology,
                       Nanjing University, Nanjing 210023, China
                     ssgong.nju@gmail.com, hxli.nju@gmail.com,
                          whu@nju.edu.cn, yzqu@nju.edu.cn


         Abstract. Properties are used to describe entities, a part of which are
         likely to be clustered together to constitute an aspect. However, existing
         automated approaches to property clustering remain far from satisfac-
         tory for an open domain like Linked Data. In this paper, we firstly inves-
         tigated the relatedness between properties using five different measures.
         Then, we employed three clustering algorithms and two combination
         methods for property clustering. We empirically studied the property
         clustering on a moderate-sized sample of Linked Data and found that a
         proper combination of different measures gave rise to the best result.


1      Introduction
With the development of Linked Data, billions of RDF triples have been pub-
lished to describe numerous entities. An entity usually involves multiple aspects
and its property-values may focus on different aspects. For instance, graduate
from and work at reveal the career information of a person, while parent, spouse
and child deliver her family information. Therefore, it is natural to cluster prop-
erties into meaningful groups based on the aspects that they intend to describe.
Property clustering is useful for many applications such as entity browsing, on-
tology editing, query completion, etc. It makes the presented information more
formatted and understandable and significantly enhances the capability of users
to consume the large-scale Linked Data. However, automated property cluster-
ing for an open domain like Linked Data remains far from satisfactory due to
the multi-sourced and heterogeneous vocabularies used.
    In this paper, we empirically studied the property clustering in Linked Data.
We tried our best in this study to provide answers to the following questions:
Q1. What is the most effective measure(s) for measuring property relatedness?
Q2. What is the most effective algorithm(s) for clustering properties ?
Q3. Can the combination method(s) improve the property clustering and how
    largely?
Q4. Are there any general principles or guidelines for using the property cluster-
    ing in practice?
 ?
     Funded by the National Natural Science Foundation of China (No. 61370019).
2     Property Relatedness Measures

To achieve property clustering, we measure the relatedness between properties
from the following five perspectives.

 – Lexical similarity between property names, denoted by RI , is based on the
   common characters of property names. For example, both mouth position and
   mouth elevation describe the mouth information of a river. We calculated RI
   using the I-Sub string similarity [3].
 – Semantic relatedness between property names, denoted by RW , leverages
   WordNet to measure the semantic relatedness between properties. We used
   the average Lin’s WordNet relatedness [2] of word pairs in property names
   to calculate RW .
 – Distributional relatedness between properties, referred to as RU , is based on
   the property co-occurrence in the context of an entity’s RDF description,
   i.e. both properties are used together to describe the entity. Symmetrical
   uncertainty coefficient was used to compute the distributional relatedness.
   To estimate the probabilities of co-occurrence, the Billion Triples Challenge
   (BTC) 2011 dataset1 was used, in which the descriptions of coreferent URIs
   were merged.
 – Range relatedness between properties, referred to as RT , is based on the class
   relatedness of property ranges. For example, if two properties have the ranges
   delicious food and handicraft respectively, both of them deliver the tourist
   information of a tourist city. The range relatedness is calculated using the
   maximum WordNet-based relatedness RW of class pairs in property ranges.
 – Overlap of property values, denoted by RO , leverages the common values of
   two properties to compute the relatedness. The text of each property value
   is firstly collected, e.g. local names of URIs and lexical forms of literals after
   normalization, and all the terms in the text are used to construct a term
   frequency vector. RO is then computed using the cosine similarity of the
   corresponding vectors.


3     Clustering Algorithms and Combination Methods

We employed the following three well-known clustering algorithms: DBSCAN
(denoted by CD ), Single linkage clustering (CL ) and Spectral clustering (CS ).
Combining various relatedness measures helps obtain a better clustering. We
employed two typical combination methods. The first one is to first compute
property relatedness using a linear combination of different measures for each
property pair and then carry out clustering. The second one is to first conduct
clustering based on individual measures and then aggregate these individual
results using ensemble clustering. We selected consensus clustering to realize
ensemble clustering and calculated it using CC-Pivot [1].
1
    http://km.aifb.kit.edu/projects/btc-2011/
Table 1. Average performance w.r.t. relatedness measures and clustering algorithms

         (a) Precision                    (b) Recall               (c) F-Score
          CD CL CS                    CD CL CS                      CD CL CS
    RI    .235 .235 .184         RI   .273 .273 .449          RI    .253 .253 .261
    RW    .215 .215 .198         RW   .266 .266 .337          RW    .238 .238 .250
    RU    .242 .242 .177         RU   .433 .433 .410          RU    .310 .310 .248
    RT    .170 .170 .215         RT   .381 .381 .329          RT    .235 .235 .260
    RO    .247 .247 .188         RO   .137 .138 .427          RO    .176 .177 .261

                     (d) Rand Index                     (e) NMI
                         CD CL CS                      CD CL CS
                  RI     .549 .549 .500           RI   .387 .387 .229
                  RW     .672 .672 .584           RW   .441 .441 .231
                  RU     .644 .644 .503           RU   .507 .507 .224
                  RT     .547 .547 .628           RT   .364 .364 .255
                  RO     .709 .708 .516           RO   .520 .520 .216


4     Empirical Study
We report our study of the relatedness measures, clustering algorithms and com-
bination methods. Their clustering performance w.r.t. the golden standard was
evaluated using the following five metrics: Precision, Recall, F-Score, Rand Index
and Normalized Mutual Information (NMI). All the parameters were set as the
ones achieving the highest harmonic mean of F-Score.
    We sampled 20 entities of different types in Linked Data, each of which
was integrated from a DBpedia URI with its coreferent ones from 12 different
sources2 . Every entity has at least 51 properties while the maximum number is
574. The golden standard was built based on Freebase. Freebase divides proper-
ties describing similar aspects into types and groups similar types into domains.
We invited three PhD candidates in the field of Linked Data to assign each
property to the most relevant /domain/type. The properties that were assigned
to the same /domain/type were clustered together to form the golden standard.
The Fleiss’ κ inter-rater agreement score is 0.895, showing the strong agreement.
    Table 1 depicts the average performance achieved w.r.t. different measures
using clustering algorithms. Overall, no measure achieves the highest values for
every clustering algorithm on all the measures. RI and RU generally generate
better clusterings in terms of F-Score. Besides, from the third column of each
table, we saw that CD is similar to CL and CS is greatly different from them.
CD and CL usually generate better clustering results in terms of Rand Index
and NMI. Table 2 shows the harmonic means of Precision, Recall, F-Score, Rand
Index and NMI achieved by using single measures, linear measure combinations
2
    These sources are DBpedia, DBTune, Freebase, GeoNames, LinkedGeoData, Linked-
    MDB, New York Times, OpenCyc, Project Gutenberg, RDF Book Mashup, The
    World Factbook and YAGO
Table 2. Comparison on single relatedness measures and two combination methods

Clustering algorithm: CD           Precision Recall F-Score Rand Index NMI
RI                                   .235    .273    .253     .549     .387
RW                                   .215    .266    .238     .672     .441
RU                                   .242    .433    .310     .644     .507
RT                                   .170    .381    .235     .547     .364
RO                                   .247    .137    .176     .709     .520
.3RI + .7RU                          .218    .757    .339     .471     .379
.5RI + .5RO                          .209    .619    .313     .411     .265
.6RU + .4RO                          .214    .716    .330     .477     .375
.3RI + .5RU + .2RO                   .211    .883    .341     .398     .318
.3RI + .5RU + .1RT + .1RO            .205    .878    .333     .372     .277
.2RI + .1RW + .2RU + .5RO            .216    .790    .339     .438     .344
.2RI + .1RW + .15RU + .1RT + .45RO   .207    .899    .337     .364     .268
RI , RU                              .287    .148    .196     .732     .563
RI , RO                              .331    .051    .089     .744     .566
RU , RO                              .290    .066    .108     .755     .575
RI , RU , RO                         .273    .210    .237     .706     .513
RI , RU , RT , RO                    .292    .102    .151     .744     .560
RI , RW , RU , RO                    .290    .115    .165     .726     .548
RI , RW , RU , RT , RO               .256    .213    .232     .677     .493


and ensemble clustering (the 13th to 19th rows). The results indicate that the
linear combination of relatedness measures tends to generate a clustering that
features a higher Recall compared to single measures, while ensemble clustering
is recommended to use if a higher Precision is preferred.


5   Conclusion
In this paper, we studied the property clustering in Linked Data and evaluated
different property relatedness measures, clustering algorithms and combination
methods. Our experimental results demonstrated the feasibility of the automated
property clustering. In future work, we will improve the quality of property
clustering by leveraging user feedback and active learning.


References
 1. Ailon, N., Charikar, M., Newman, A.: Aggregating Inconsistent Information: Rank-
    ing and Clustering. Journal of the ACM, 55(5):23 (2008)
 2. Lin, D.: An Information-Theoretic Definition of Similarity. In: ICML 1998. pp.
    296–304. Morgan Kaufmann, San Francisco (1998)
 3. Stoilos, G., Stamou, G., Kollias, S.: A String Metric for Ontology Alignment. In:
    Gil, Y., et al. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 624–637. Springer, Heidelberg
    (2005)

</pre>