CROCUS: Cluster-based Ontology Data
                    Cleansing

     Didier Cherix2 , Ricardo Usbeck12 , Andreas Both2 , and Jens Lehmann1
                          1
                           University of Leipzig, Germany
                  {usbeck,lehmann}@informatik.uni-leipzig.de
                     2
                       R & D, Unister GmbH, Leipzig, Germany
                    {andreas.both,didier.cherix}@unister.de


       Abstract. Over the past years, a vast number of datasets have been
       published based on Semantic Web standards, which provides an oppor-
       tunity for creating novel industrial applications. However, industrial re-
       quirements on data quality are high while the time to market as well
       as the required costs for data preparation have to be kept low. Unfortu-
       nately, many Linked Data sources are error-prone which prevents their
       direct use in productive systems. Hence, (semi-)automatic quality as-
       surance processes are needed as manual ontology repair procedures by
       domain experts are expensive and time consuming. In this article, we
       present CROCUS – a pipeline for cluster-based ontology data cleansing.
       Our system provides a semi-automatic approach for instance-level error
       detection in ontologies which is agnostic of the underlying Linked Data
       knowledge base and works at very low costs. CROCUS was evaluated
       on two datasets. The experiments show that we are able to detect errors
       with high recall.


1     Introduction

The Semantic Web movement including the Linked Open Data (LOD) cloud1
represents a combustion point for commercial and free-to-use applications. The
Linked Open Data cloud hosts over 300 publicly available knowledge bases with
an extensive range of topics and DBpedia [1] as central and most important
dataset. While providing a short time-to-market of large and structured datasets,
Linked Data has yet not reached industrial requirements in terms of provenance,
interlinking and especially data quality. In general, LOD knowledge bases com-
prise only few logical constraints or are not well modelled.
    Industrial environments need to provide high quality data in a short amount
of time. A solution might be a significant number of domain experts that are
checking a given dataset and defining constraints, ensuring the demanded data
quality. However, depending on the size of the given dataset the manual evalu-
ation process by domain experts will be time consuming and expensive. Com-
monly, a dataset is integrated in iteration cycles repeatedly which leads to a
1
    http://lod-cloud.net/
generally good data quality. However, new or updated instances might be error-
prone. Hence, the data quality of the dataset might be contaminated after a
re-import.
    From this scenario, we derive the requirements for our data quality evalua-
tion process. (1) Our aim is to find singular faults, i.e., unique instance errors,
conflicting with large business relevant areas of a knowledge base. (2) The data
evaluation process has to be efficient. Due to the size of LOD datasets, reason-
ing is infeasible due to performance constraints, but graph-based statistics and
clustering methods can work efficiently. (3) This process has to be agnostic of
the underlying knowledge base, i.e., it should be independent of the evaluated
dataset.
    Often, mature ontologies, grown over years, edited by a large amount of
processes and people, created by a third party provide the basis for industrial
applications (e.g., DBpedia). Aiming at short time-to-market, industry needs
scalable algorithms to detect errors. Furthermore, the lack of costly domain ex-
perts requires non-experts or even layman to validate the data before influencing
a productive system. Resulting knowledge bases may still contain errors, how-
ever, they offer a fair trade-off in an iterative production cycle.
    In this article, we present CROCUS, a cluster-based ontology data cleansing
framework. CROCUS can be configured to find several types of errors in a semi-
automatic way, which are afterwards validated by non-expert users called quality
raters. By applying CROCUS’ methodology iteratively, resulting ontology data
can be safely used in industrial environments.
    Our contributions are as follows: we present (1) a pipeline for semi-automatic
instance-level error detection that is (2) capable of evaluating large datasets.
Moreover, it is (3) an approach agnostic to the analysed class of the instance as
well as the Linked Data knowledge base. Finally, (4) we provide an evaluation
on a synthetic and a real-world dataset.


2     Related Work

The research field of ontology data cleansing, especially instance data can be
regarded threefold: (1) development of statistical metrics to discover anomalies,
(2) manual, semi-automatic and full-automatic evaluation of data quality and
(3) rule- or logic-based approaches to prevent outliers in application data.
    In 2013, Zaveri et al. [2] evaluate the data quality of DBpedia. This manual
approach introduces a taxonomy of quality dimensions: (i) accuracy, which con-
cerns wrong triples, data type problems and implicit relations between attributes,
(ii) relevance, indicating significance of extracted information, (iii) representa-
tional consistency, measuring numerical stability and (iv) interlinking, which
looks for links to external resources. Moreover, the authors present a manual
error detection tool called TripleCheckMate 2 and a semi-automatic approach
supported by the description logic learner (DL-Learner) [3,4], which generates a
2
    http://github.com/AKSW/TripleCheckMate
schema extension for preventing already identified errors. Those methods mea-
sured an error rate of 11.93% in DBpedia which will be a starting point for our
evaluation.
    A rule-based framework is presented by Furber et al. [5] where the authors
define 9 rules of data quality. Following, the authors define an error by the num-
ber of instances not following a specific rule normalized by the overall number
of relevant instances. Afterwards, the framework is able to generate statistics
on which rules have been applied to the data. Several semi-automatic processes,
e.g., [6,7], have been developed to detect errors in instance data of ontologies.
Bohm et al. [6] profiled LOD knowledge bases, i.e., statistical metadata is gener-
ated to discover outliers. Therefore, the authors clustered the ontology to ensure
partitions contain only semantically correlated data and are able to detect out-
liers. Hogan et al. [7] only identified errors in RDF data without evaluating the
data properties itself.
    In 2013, Kontokostas et al. [8] present an automatic methodology to assess
data quality via a SPARQL-endpoint3 . The authors define 14 basic graph pat-
terns (BGP) to detect diverse error types. Each pattern leads to the construction
of several cases with meta variables bound to specific instances of resources and
literals, e.g., constructing a SPARQL query testing that a person is born before
the person dies. This approach is not able to work iteratively to refine its result
and is thus not usable in circular developement processes.
    A first classification of quality dimensions is presented by Wang et al. [9] with
respect to their importance to the user. This study reveals a classification of data
quality metrics in four categories. Recently, Zaveri et al. [10] presents a system-
atic literature review on different methodologies for data quality assessment.
The authors chose 21 articles, extracted 26 quality dimensions and categorized
them according to [9]. The resulting overview shows which error types exist and
whether they are repairable manually, semi-automatic or fully automatic. The
presented measures were used to classify CROCUS.
    To the best of our knowledge, our tool is the first tool tackling error accuracy
(intrinsic data quality), completeness (contextual data quality) and consistency
(data modelling) at once in a semi-automatic manner reaching high f1-measure
on real-world data.


3     Method

First, we need a standardized extraction of target data to be agnostic of the
underlying knowledge base. SPARQL [11] is a W3C standard to query instance
data from Linked Data knowledge bases. The DESCRIBE query command is a way
to retrieve descriptive data of certain instances. However, this query command
depends on the knowledge base vendor and its configuration. To circumvent
knowledge base dependence, we use Concise Bounded Descriptions (CBD) [12].
Given a resource r and a certain description depth d the CBD works as follows:
3
    http://www.w3.org/TR/rdf-sparql-query/
(1) extract all triples with r as subject and (2) resolve all blank nodes retrieved
so far, i.e., for each blank node add every triple containing a blank node with
the same identifier as a subject to the description. Finally, CBD repeats these
steps d times. CBD configured with d = 1 retrieves only triples with r as subject
although triples with r as object could contain useful information. Therefore, a
rule is added to CBD, i.e., (3) extract all triples with r as object, which is called
Symmetric Concise Bounded Description (SCDB) [12].
    Second, CROCUS needs to calculate a numeric representation of an instance
to facilitate further clustering steps. Metrics are split into three categories:
    (1) The simplest metric counts each property (count). For example, this
metric can be used if a person is expected to have only one telephone number.
    (2) For each instance, the range of the resource at a certain property is
counted (range count). In general, an undergraduate student should take un-
dergraduate courses. If there is an undergraduate student taking courses with
another type (e.g., graduate courses), this metric is able to detect it.
    (3) The most general metric transforms each instance into a numeric vector
and normalizes it (numeric). Since instances created by the SCDB consist of
properties with multiple ranges, CROCUS defines the following metrics: (a) nu-
meric properties are taken as is, (b) properties based on strings are converted to
a metric by using string length although more sophisticated measures could be
used (e.g., n-gram similarities) and (c) object properties are discarded for this
metric.
    As a third step, we apply the density-based spatial clustering of applications
with noise (DBSCAN) algorithm [13] since it is an efficient algorithm and the
order of instances has no influence on the clustering result. DBSCAN clusters
instances based on the size of a cluster and the distance between those instances.
Thus, DBSCAN has two parameters: , the distance between two instances, here
calculated by the metrics above and M inP ts, the minimum number of instances
needed to form a cluster. If a cluster has less than M inP ts instances, they are
regarded as outliers. We report the quality of CROCUS for different values of
M inP ts in Section 4.
    Finally, identified outliers are extracted and given to human quality judges.
Based on the revised set of outliers, the algorithm can be adjusted and con-
straints can be added to the Linked Data knowledge base to prevent repeating
discovered errors.


4     Evaluation
LUBM benchmark. First, we used the LUBM benchmark [14] to create a
perfectly modelled dataset. This benchmark allows to generate arbitrary know-
ledge bases themed as university ontology. Our dataset consists of exactly one
university and can be downloaded from our project homepage4 .
    The LUBM benchmark generates random but error free data. Thus, we add
different errors and error types manually for evaluation purposes:
4
    https://github.com/AKSW/CROCUS
                                     Fig. 1: Overview of CROCUS.


     – completeness of properties (count) has been tested with CROCUS by adding
       a second phone number to 20 of 1874 graduate students in the dataset. The
       edited instances are denoted as Icount .
     – semantic correctness of properties (range count) has been evaluated by add-
       ing for non-graduate students (Course) to 20 graduate students (Irangecount ).
     – numeric correctness of properties (numeric) was injected by defining that
       a graduate student has to be younger than a certain age. To test this, 20
       graduate students (Inumeric ) age was replaced with a value bigger than the
       arbitrary maximum age of any other graduate.
For each set of instances holds: |Icount | = |Irangecount | = |Inumeric | = 20 and
additionally |Icount ∩ Irangecount ∩ Inumeric | = 3. The second equation overcomes
a biased evaluation and introduces some realistic noise into the dataset. One of
those 3 instances is shown in the listing below:
 1    @prefix   r df : <h t t p : / /www. w3 . o r g /1999/02/22 − rdf−s y n t a x−ns#> .
 2    @prefix   r d f s : <h t t p : / /www. w3 . o r g / 2 0 0 0 / 0 1 / rdf−schema#> .
 3    @prefix   ns2 : <h t t p : / / example . o r g/#> .
 4    @prefix   ns3 : <h t t p : / /www. Department6 . U n i v e r s i t y 0 . edu/> .
 5
 6 ns3 : G r a d u a t e S t u d e n t 7 5 a ns2 : G r a d u a t e S t u d e n t ;
 7   ns2 : name ” G r a d u a t e S t u d e n t 7 5 ” ;
 8   ns2 : u n d e r g r a d u a t e D e g r e e F r o m <h t t p : / /www. U n i v e r s i t y 4 6 7 . edu> ;
 9   ns2 : e m a i l A d d r e s s ” GraduateStudent75@Department6 . U n i v e r s i t y 0 . edu ” ;
10   ns2 : t e l e p h o n e ”yyyy-yyyy-yyyy” , ” xxx−xxx−xxxx ” ;
11   ns2 : memberOf <h t t p : / /www. Department6 . U n i v e r s i t y 0 . edu> ;
12   ns2 : a g e ”63” ;
13   ns2 : t a k e s C o u r s e ns3 : Gr aduate Cours e21 , ns3:Course39 , ns3 :
            Gra duateC ourse 26 ;
14   ns2 : a d v i s o r ns3 : A s s o c i a t e P r o f e s s o r 8 .

 Listing 1.1: Example of an instance with manually added errors (in red ).
    DBpedia - German universities benchmark. Second, we used a subset
of the English DBpedia 3.8 to extract all German universities. The following
SPARQL query (Listing 1.2) presents already the difficulty to find a complete
list of universities using DBpedia.
 1 SELECT DISTINCT ? i n s t a n c e
 2     WHERE {
 3         {    ? i n s t a n c e a dbo : U n i v e r s i t y .
 4              ? i n s t a n c e dbo : c o u n t r y dbpedia : Germany .
 5              ? i n s t a n c e f o a f : homepage ?h .
 6         } UNION {
 7              ? i n s t a n c e a dbo : U n i v e r s i t y .
 8              ? i n s t a n c e dbp : : c o u n t r y dbpedia : Germany .
 9              ? i n s t a n c e f o a f : homepage ?h .
10         } UNION {
11              ? i n s t a n c e a dbo : U n i v e r s i t y .
12              ? i n s t a n c e dbp : : c o u n t r y ”Germany”@en .
13              ? i n s t a n c e f o a f : homepage ?h .
14         }}

        Listing 1.2: SPARQL query to extract all German universities.

     After applying CROCUS to the 208 universities and validating detected in-
stances manually, we found 39 incorrect instances. This list of incorrect instances,
i.e., CBD of URIs, as well as the overall dataset can be found on our project
homepage. For our evaluation, we used only properties existing in at least 50%
of the instances to reduce the exponential parameter space. Apart from an in-
creased performance of CROCUS we did not find any effective drawbacks on our
results.
     Results. To evaluate the performance of CROCUS, we used each error type
individually on the adjusted LUBM benchmark datasets as well as a combination
of all error types on LUBM5 and the real-world DBpedia subset.


                                              LUBM
                                  count        range count            numeric
                 M inP ts F1 P      R F1 P         R F1 P         R
                        2 — — — — — — — — —
                        4 — — — 0.49 1.00 0.33 — — —
                        8 — — — 0.67 1.00 0.5 — — —
                      10 0.52 1.00 0.35 1.00 1.00 1.00 — — —
                      20 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
                      30 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
                      50 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
                     100 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
           Table 1: Results of the LUBM benchmark for all three error types.


    Table 1 shows the f1-measure (F1), precision (P) and recall (R) for each
error type. For some values of M inP ts it is infeasible to calculate cluster since
DBSCAN generates only clusters but is unable to detect outlier. CROCUS is
able to detect the outliers with a 1.00 f1-measure as soon as the correct size of
M inP ts is found.
 5
     The datasets can also be found on our project homepage.
   Table 2 presents the results for the combined error types as well as for the
German universities DBpedia subset. Combining different error types yielding a
more realistic scenario influences the recall which results in a lower f1-measure
than on each individual error type. Finding the optimal M inP ts can efficiently
be done by iterating between [2, . . . , |I|]. However, CROCUS achieves a high
recall on the real-world data from DBpedia. Reaching a f1-measure of 0.84 for
LUBM and 0.91 for DBpedia highlights CROCUS detection abilities.


          LUBM            DBpedia
                                         Property        Errors
M inP ts F1 P R          F1 P R
                                         dbp:staff,      Values are typed as
       2 0.12 1.00 0.09 0.04 0.25 0.02
                                         dbp:estab-      xsd:string although
       4 0.58 1.00 0.41 0.04 0.25 0.02
                                         lished,         they contain numeric
       8 0.84 1.00 0.72 0.04 0.25 0.02
                                         dbp:internat-   types like integer or
      10 0.84 1.00 0.72 0.01 0.25 0.01
                                         ionalStudents   double.
      20 0.84 1.00 0.72 0.17 0.44 0.10
      30 0.84 1.00 0.72 0.91 0.86 0.97                   dbp:country
                                         dbo:country,
      50 0.84 1.00 0.72 0.85 0.80 0.97                   ”Germany”@en collides
                                         dbp:country
     100 0.84 1.00 0.72 0.82 0.72 0.97                   with dbo:Germany

Table 2: Evaluation of CROCUS against Table 3: Different error types discovered
a synthetic and a real-world dataset us- by quality raters using the German uni-
ing all metrics combined.                versities DBpedia subset.


    In general, CROCUS generated many candidates which were then manually
validated by human quality raters, who discovered a variety of errors. Table 3
lists the identified reasons of errors from the German universities DBpedia sub-
set detected as outlier. As mentioned before, some universities do not have a
property dbo:country. However, we found a new type of error. Some literals
are of type xsd:string although they represent a numeric value. Lists of wrong
instances can also be found on our project homepage.
    Overall, CROCUS has been shown to be able to detect outliers in synthetic
and real-world data and is able to work with different knowledge bases.


5   Conclusion
We presented CROCUS, a novel architecture for cluster-based, iterative ontology
data cleansing, agnostic of the underlying knowledge base. With this approach
we aim at the iterative integration of data into a productive environment which
is a typical task of industrial software life cycles.
    The experiments showed the applicability of our approach on a synthetic and,
more importantly, a real-world Linked Data set. Finally, CROCUS has already
been successfully used on a travel domain-specific productive environment com-
prising more than 630.000 instances (the dataset cannot be published due to its
license).
    In the future, we aim at a more extensive evaluation on domain specific
knowledge bases. Furthermore, CROCUS will be extended towards a pipeline
comprising a change management, an open API and semantic versioning of the
underlying data. Additionally, a guided constraint derivation for laymen will be
added.
    Acknowledgments This work has been partly
supported by the ESF and the Free State of Saxony
and by grants from the European Union’s 7th Framework Programme provided
for the project GeoKnow (GA no. 318159). Sincere thanks to Christiane Lemke.

References
 1. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N.,
    Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - a large-
    scale, multilingual knowledge base extracted from wikipedia. SWJ (2014)
 2. Zaveri, A., Kontokostas, D., Sherif, M.A., Bühmann, L., Morsey, M., Auer, S.,
    Lehmann, J.: User-driven quality evaluation of dbpedia. In Sabou, M., Blomqvist,
    E., Noia, T.D., Sack, H., Pellegrini, T., eds.: I-SEMANTICS, ACM (2013) 97–104
 3. Lehmann, J.: DL-learner: Learning concepts in description logics. Journal of
    Machine Learning Research 10 (2009) 2639–2642
 4. Buhmann, L., Lehmann, J.: Pattern based knowledge base enrichment. In: 12th
    ISWC, 21-25 October 2013, Sydney, Australia. (2013)
 5. Fürber, C., Hepp, M.: Swiqa - a semantic web information quality assessment
    framework. In Tuunainen, V.K., Rossi, M., Nandhakumar, J., eds.: ECIS. (2011)
 6. Böhm, C., Naumann, F., Abedjan, Z., Fenz, D., Grutze, T., Hefenbrock, D., Pohl,
    M., Sonnabend, D.: Profiling linked open data with ProLOD. Data Engineering
    Workshops ICDEW 2010 IEEE 26th International Conference on (2010) 175–178
 7. Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic
    web. In Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M., eds.: LDOW. Volume
    628 of CEUR Workshop Proceedings., CEUR-WS.org (2010)
 8. Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Lehmann, J., Cornelissen,
    R., Zaveri, A.J.: Test-driven evaluation of linked data quality. In: Proceedings of
    the 23rd international conference on World Wide Web. (2014) to appear.
 9. Wang, R.Y., Strong, D.M.: Beyond accuracy. what data quality means to data
    consumers. Journal of Management Information Systems (4) 5–33
10. Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S., Hitzler,
    P.: Quality assessment methodologies for linked open data. Submitted to SWJ
    (2013)
11. Quilitz, B., Leser, U.: Querying distributed rdf data sources with sparql. In Bech-
    hofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M., eds.: The Semantic Web:
    Research and Applications. Volume 5021 of Lecture Computer Science. Springer
    Berlin Heidelberg (2008) 524–538
12. Stickler, P.: Cbd-concise bounded description. W3C Member Submission 3 (2005)
13. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discov-
    ering clusters in large spatial databases with noise. In: KDD. Volume 96. (1996)
    226–231
14. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base
    systems. Web Semantics: Science, Services and Agents on the World Wide Web
    3(2–3) (2005) 158 – 182