Fixing the Domain and Range of Properties in Linked Data by Context Disambiguation

Alberto Tonon (eXascale Infolab, University of Fribourg, Switzerland), alberto@exascale.info
Michele Catasta (EPFL, Lausanne, Switzerland), michele.catasta@epfl.ch
Gianluca Demartini (Information School, University of Sheffield, United Kingdom), g.demartini@sheffield.ac.uk
Philippe Cudré-Mauroux (eXascale Infolab, University of Fribourg, Switzerland), phil@exascale.info

ABSTRACT
The amount of Linked Open Data available on the Web is rapidly growing. The quality of the provided data, however, is generally speaking not fundamentally improving, hampering its wide-scale deployment for many real-world applications. A key data quality aspect for Linked Open Data can be expressed in terms of its adherence to an underlying well-defined schema or ontology, which serves both as documentation for the end-users and as a fixed reference for automated processing over the data. In this paper, we first report on an analysis of the schema adherence of domains and ranges for Linked Open Data. We then propose new techniques to improve the correctness of domains and ranges by i) identifying the cases in which a property is used in the data with several different semantics, and ii) resolving them by updating the underlying schema and/or by modifying the data without compromising its retro-compatibility. We experimentally show the validity of our methods through an empirical evaluation over DBpedia by collecting expert judgements of the proposed fixes over a sample of the data.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous

General Terms
Experimentation, Algorithms

Keywords
Linked Open Data, Schema adherence, Data quality

Copyright is held by the author/owner(s).
WWW2015 Workshop: Linked Data on the Web (LDOW2015).

1. INTRODUCTION
Linked Open Data (LOD) is rapidly growing in terms of the number of available datasets, moving from 295 available datasets in 2011 to 1'014 datasets in 2014 (see http://lod-cloud.net). As we report in Section 2, LOD quality has already been analyzed from different angles; one key LOD quality issue is the fact that the data does not always adhere to its corresponding schema, as we discuss in more detail in Section 3. That is, factual statements (i.e., RDF triples) do not always follow the definitions given in the related RDF Schemas or ontologies. Having a schema to which the published data adheres allows for better parsing, automated processing, reasoning, and anomaly detection over the data. It also serves as de facto documentation for the end-users querying the LOD datasets, fostering an easier deployment of Linked Data in practice.

To mitigate the issues related to the non-conformity of the data, statistical methods for inducing the schema over the data have been proposed. Völker and Niepert [9], for example, extract OWL EL axioms from the data and use statistics to compute confidence values on the axioms. Similar statistics were also used to detect inconsistencies in the data [8].

In this work, we focus on one particular issue of LOD schema adherence: the proper definition of the properties' domains and ranges in LOD. More precisely, we propose (see Section 4) a new data-driven technique that amends both the schema and the instance data in order to assign better domains and ranges to properties; this goal is achieved by detecting the cases in which a property is used for different purposes (i.e., with different semantics) and by disambiguating its different uses by dynamically creating new sub-properties extending the original property. Thus, our approach modifies both the schema (new sub-properties are created) and the data (occurrences of the original property that were used with some given semantics are replaced with the newly created sub-property). One of the interesting properties of our approach is that the modified data is retro-compatible, that is, a query made over the original version of the data can be posed as is over the amended version.

We evaluate our methods in Section 5, first by comparing how much data they can fix when adjusting different parameters, and then by asking Semantic Web experts to judge the quality of the modifications suggested by our approach.
2. RELATED WORK
One of the most comprehensive pieces of work describing LOD is the article by Schmachtenberg et al. [7], in which the adoption of best practices in the 2014 LOD, from creation to publication, is analyzed.
Such practices, ultimately, are meant to preserve the quality of a large body of data such as LOD—a task that is even more daunting considering the inherently distributed nature of LOD.

Data quality is a thoroughly-studied area in the context of companies [6], because of its importance in economic terms. Recently, LOD has also undergone similar scrutiny: in [4], the authors show that the Web of Data is by no means a perfect world of consistent and valid facts. Linked Data has multiple dimensions of shortcomings, ranging from simple syntactical errors over logical inconsistencies to complex semantic errors and wrong facts. For instance, Töpper et al. [8] statistically infer the domain and range of properties in order to detect inconsistencies in DBpedia. Similarly, Paulheim and Bizer [5] propose a data-driven approach that exploits statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets, specifically for adding missing type statements and identifying faulty statements. Differently from us, they leverage the number of instances of a certain type appearing in the property's subject and object position in order to infer the type of an entity, while we use data as evidence to detect properties used with different semantics.

There is also a vast literature [9, 3, 2, 1] that introduces statistical schema induction and enrichment (based on association rule mining, logic programming, etc.) as a means to generate ontologies from RDF data. Such methods can, for example, extract OWL axioms and then use probabilities to come up with confidence scores, thus building what can be considered a "probabilistic ontology" that can emerge from the messiness and dynamicity of Linked Data. In this work, we focus on the analysis of property usage with the goal of fixing Linked Data and improving its quality.
3. MOTIVATION AND BASIC IDEAS
The motivation that led us to the research we are presenting is summarized in Table 1. Its upper part reports the top-5 properties in DBpedia (we used the English version of DBpedia 2014, http://dbpedia.org/Downloads2014) and Freebase (we used a dump downloaded on March 30th, 2014, from http://freebase.com). The table reports the number of times the properties appear with a wrong domain, together with their Wrong Domain Rate (WDR), that is, the ratio of the number of times the property is used with a wrong domain to its total number of uses. Analogously, the lower part of the table reports the top-5 properties by number of range violations and their Wrong Range Rate (WRR). When computing WDR and WRR we take the type hierarchy into account: if a property has 'Actor' as range and is used in an RDF triple where the object is an 'American Actor', we consider the triple correct, as 'American Actor' is a subtype of 'Actor'. We observe that the absolute number of occurrences of wrong domains/ranges in Freebase is two orders of magnitude greater than that of DBpedia. This cannot be explained only by the different number of entities contained in the two knowledge bases, since the number of topics covered by Freebase is only one order of magnitude greater than that of DBpedia (approximately 47.43 and 4.58 million topics, respectively, according to their web pages). We deduce that in Freebase the data adheres to the schema less than in DBpedia.
This is also suggested by the fact that the top-3 most frequent properties defined in the DBpedia ontology, namely dpo:birthPlace, dpo:birthYear, and dpo:birthDate, have WDR and WRR smaller than 0.01, while the top-3 most used properties in Freebase, namely fb:type.object.type, fb:type.type.instance, and fb:type.object.key, have an average WDR of 0.30 and an average WRR of 0.87. This disparity can in part be explained by the fact that the Freebase ontology is a forest of trees rather than a tree with a single root node (as in DBpedia). Thus, while one could expect that each entity in the dataset should descend from 'object', this is not the case when looking at the data. In addition, we noticed that in DBpedia, out of the 1'368 properties actually used in the data, 1'109 have a domain declaration in the ontology and 1'181 have a range declaration. Conversely, Freebase specifies the domain and range of 65'019 properties, but only 18'841 properties are used in the data.
Table 1: Top-5 properties by absolute number of domain violations (top) and range violations (bottom), with their domain/range violation rate (the truncated properties are fb:dataworld.gardening_hint.last_referenced_by and fb:common.topic.topic_equivalent_webpage).

DBpedia property              #Wrong Dom.  WDR   Freebase property                  #Wrong Dom.  WDR
dpo:years                         641'528  1.00  fb:type.object.type                 99'119'559  0.61
dpo:currentMember                 260'412  1.00  fb:type.object.name                 41'708'548  1.00
dpo:class                         255'280  0.95  fb:type.object.key                  35'276'872  0.29
dpo:managerClub                    47'324  1.00  fb:type.object.permission            7'816'632  1.00
dpo:address                        36'449  0.90  fb:[...].last_referenced_by          3'371'713  1.00

DBpedia property              #Wrong Rng.  WRR   Freebase property                  #Wrong Rng.  WRR
dpo:starring                      298'713  0.95  fb:type.type.instance               96'764'915  0.61
dpo:associatedMusicalArtist        70'307  0.64  fb:[...].topic_equivalent_webpage   53'338'833  1.00
dpo:instrument                     60'385  1.00  fb:type.permission.controls          7'816'632  1.00
dpo:city                           55'697  0.55  fb:common.document.source_uri        4'578'671  1.00
dpo:hometown                       47'165  0.52  fb:[...].last_referenced_by          3'342'789  0.99
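To make the WDR definition above concrete, here is a minimal, self-contained Python sketch of the violation-rate computation, taking the type hierarchy into account as just described. The data structures (triples, types_of, domain_of) and the property name are hypothetical stand-ins for an actual triple store, not the authors' code:

def wrong_domain_rate(prop, triples, types_of, domain_of):
    """WDR: the ratio of uses of `prop` whose subject is not an
    instance of the declared domain. `types_of` maps each entity to
    ALL of its types, ancestors included, so a subject typed with a
    subtype of the domain counts as a correct use."""
    domain = domain_of.get(prop)
    uses = [s for (s, p, o) in triples if p == prop]
    if domain is None or not uses:
        return 0.0
    wrong = sum(1 for s in uses if domain not in types_of.get(s, set()))
    return wrong / len(uses)

# Toy data: 'AmericanActor' is a subtype of 'Actor', so e2 is correct.
triples = [("e1", "hasRole", "r1"), ("e2", "hasRole", "r2")]
types_of = {"e1": {"Film"}, "e2": {"Actor", "AmericanActor"}}
domain_of = {"hasRole": "Actor"}
print(wrong_domain_rate("hasRole", triples, types_of, domain_of))  # 0.5

The Wrong Range Rate would be computed analogously on the object position, restricted to object properties.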


In this paper we argue that a number of occurrences of wrong domains or ranges are due to the fact that the same property is used in different contexts, thus with different semantics. The property dpo:gender, for example, whose domain is not specified in the DBpedia ontology, is used both to indicate the gender of a given person and the gender of a school (that is, whether it accepts only boys, only girls, or both). Hence, dpo:gender appears both in the context of dpo:GivenName and in that of dpo:School. While this can make sense in spoken language, we believe that the two cases should be kept distinct in a knowledge base. However, we cannot make a general rule out of this sole example as, for instance, we have that foaf:name (whose domain is not defined in the DBpedia ontology) is attached to 25 direct subtypes of owl:Thing out of 33; these types include dpo:Agent (the parent of dpo:Person and dpo:Organization), dpo:Event, and dpo:Place. In this case, it does not make sense to claim that all these occurrences represent different contexts in which the property appears, since the right domain here is indeed owl:Thing, as specified by the FOAF Vocabulary Specification (http://xmlns.com/foaf/spec/). Moreover, in this case creating a new property for each subtype would lead to an overcomplicated schema. Finally, the fact that foaf:name is not attached to all the subtypes of owl:Thing suggests that the property is optional.

What follows describes the intuition given by this example in terms of statistics computed on the knowledge base. In addition, we also present algorithms to identify the use of properties in different contexts.

4. DETECTING AND CORRECTING MULTI-CONTEXT PROPERTIES
In this section, we describe in detail the algorithm we propose, namely LeRiXt (LEft and RIght conteXT). For the sake of presentation, we first describe a simpler version of the method, called LeXt (LEft conteXT), that uses the types of the entities appearing as subjects of the property in order to identify properties that are used in different contexts (multi-context properties). We then present the full algorithm as an extension of this simpler version. For the description of the algorithm, we make use of the notation defined in Table 2.

Table 2: Notation used for describing LeRiXt.

Symbol    Meaning
KB        the knowledge base, composed of triples (s, p, o) s.t. s ∈ E ∪ T, p ∈ P, o ∈ E ∪ L ∪ T, with E the set of all entities, P the set of all properties, T the set of all entity types, and L the set of all literals.
⊤         the root of the type hierarchy.
e, t      an entity and an entity type, respectively.
e a t     (e, a, t) ∈ KB, that is, e is an instance of t.
p         a property.
tL        an entity type t on the left side of a property.
tR        an entity type t on the right side of a property.
Par(t)    the parent of a type t in the type hierarchy.
Ch(t)     the set of children of a type t in the type hierarchy.
Cov(p')   the coverage of a sub-property p' of a property p, that is, the rate of occurrences of p covered by p'.

4.1 Statistical Tools
LeXt makes use of two main statistics: Pr(tL | p), that is, the conditional probability of finding an entity of type t as the subject of a triple having p as predicate (i.e., finding t "to the left" of p), and Pr(p | tL), that is, the probability of seeing the property p in a triple whose subject is an instance of t. Equation 1 formally defines these two probabilities:

  Pr(tL | p) = |{ (s, p', o) ∈ KB | s a t, p' = p }| / |{ (s, p', o) ∈ KB | p' = p }|
                                                                                  (1)
  Pr(p | tL) = |{ (s, p', o) ∈ KB | s a t, p' = p }| / |{ (s, p', o) ∈ KB | s a t }|

As one can imagine, Pr(tL | p) = 1 indicates that t is a suitable domain for p; however, t can be very generic. In particular, Pr(⊤L | p) = 1 for every property p, where ⊤ is the root of the type hierarchy. Conversely, Pr(p | tL) measures how common a property is among the instances of a certain type: Pr(p | tL) = 1 suggests that the property is mandatory for t's instances. In addition, whenever we have strong indicators that a property is mandatory for many children ti of a given type t, that is, Pr(p | tL_i) is close to 1 for all the ti, we can deduce that t is a reasonable domain for p and that all the ti are using p as an inherited (possibly optional) property. For example, if in DBpedia we consider the property foaf:name and we analyze Pr(p | tL_i) for all ti ∈ Ch(owl:Thing), we see that the probability is greater than 0 in 25 cases out of 33 and greater than 0.50 in 18 cases, suggesting that the ti do not constitute uses of the property in other contexts, but rather that the property is used in the more general context identified by owl:Thing.
Computationally, we only need to maintain one value for each property p and each type t, that is, the number #(p ∧ tL) of triples having an instance of t as subject and p as predicate. In fact, if we assume that whenever there is a triple stating that (e, a, t) ∈ KB there is also a triple (e, a, t') ∈ KB for each ancestor t' of t in the type hierarchy, we have that

  ∀p ∈ P: |{ (s, p', o) ∈ KB | p' = p }| = #(p ∧ ⊤L),
  ∀t ∈ T: |{ (s, p, o) ∈ KB | s a t }| = Σ_{p ∈ P} #(p ∧ tL).

The computation of all the #(p ∧ tL) can be done with one map/reduce job similar to the well-known word-count example often used to show how the paradigm works; it can thus be performed efficiently in a distributed environment, allowing the algorithms we propose to scale to large amounts of data. Another interesting property implied by the type subsumptions of the underlying type hierarchy is that if t1 ∈ Ch(t0) then Pr(tL_1 | p) ≤ Pr(tL_0 | p). Under the same premises, however, nothing can be said about Pr(p | tL_0) and Pr(p | tL_1).
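The following minimal Python sketch mirrors this word-count-style computation in memory (a Counter plays the role of the reduce step). The toy data and all identifiers are our own illustration, under the stated assumption that each entity is typed with its entire ancestor chain up to the root Thing:

from collections import Counter

triples = [
    ("team1", "manager", "p1"),
    ("team2", "manager", "p2"),
    ("season1", "manager", "p3"),
]
# Each entity is listed with ALL of its types, up to the root Thing.
types_of = {
    "team1": {"Thing", "Organisation", "SportsTeam"},
    "team2": {"Thing", "Organisation", "SportsTeam"},
    "season1": {"Thing", "SportSeason"},
}

# "Map": emit one (property, type) pair per triple and subject type;
# "reduce": Counter aggregates them into the counts #(p ∧ tL).
count = Counter((p, t) for (s, p, _) in triples for t in types_of[s])

def pr_t_given_p(t, p):
    # Pr(tL | p); #(p ∧ ThingL) equals the total number of uses of p.
    return count[(p, t)] / count[(p, "Thing")]

def pr_p_given_t(p, t):
    # Pr(p | tL): the denominator sums #(p' ∧ tL) over all properties.
    total = sum(n for (_, t2), n in count.items() if t2 == t)
    return count[(p, t)] / total if total else 0.0

print(pr_t_given_p("SportsTeam", "manager"))   # 2/3
print(pr_p_given_t("manager", "SportSeason"))  # 1.0 (only property used)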
4.2 LeXt
As previously anticipated, LeXt detects multi-context properties by exploiting the types of the entities found on the left-hand side of the property under consideration. Specifically, given a property p, the algorithm performs a depth-first search of the type hierarchy, starting from the root, to find all the cases for which there is enough evidence that the property is used in a different context. In practice, at each step a type t—the current root of the tree—is analyzed and all the ti ∈ Ch(t) having Pr(tL_i | p) greater than a certain threshold λ are considered. If there is no such child, or if we are in a case similar to that of the foaf:name example described previously, a new sub-property t_p of p is created with t as its domain; otherwise, the method is called recursively on each ti. Cases analogous to the foaf:name example are detected by using the entropy of the probabilities Pr(p | tL_i) with ti ∈ Ch(t), which captures the intuition presented when we introduced the above statistics. Since, in general, Σ_{ti ∈ Ch(t)} Pr(p | tL_i) ≠ 1, we normalize each probability by dividing it by Z = Σ_{ti ∈ Ch(t)} Pr(p | tL_i), and we compute the entropy H using Equation 2:

  H(p | Ch(t)) = − Σ_{ti ∈ Ch(t)} (Pr(p | tL_i) / Z) · log2(Pr(p | tL_i) / Z)    (2)
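A direct transcription of Equation 2, again as an illustrative sketch rather than the authors' code: a high normalized entropy means that p is spread roughly uniformly over the children of t (the foaf:name situation), while a low entropy means it concentrates under few children (the dpo:gender situation).

from math import log2

def normalized_entropy(probs):
    """Entropy of the values Pr(p | tL_i) over ti in Ch(t), after
    normalizing them by Z so that they sum to one (Equation 2)."""
    z = sum(probs)
    if z == 0:
        return 0.0
    return -sum((q / z) * log2(q / z) for q in probs if q > 0)

print(normalized_entropy([0.90, 0.80, 0.85, 0.95]))  # ~2.0: one general context
print(normalized_entropy([0.90, 0.01]))              # ~0.09: distinct contexts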
Algorithm 1 formally describes the full process.

Algorithm 1 LeXt
Require: 0 ≤ λ ≤ 1 strictness threshold, η ≥ 0 entropy threshold.
Require: curr_root ∈ T the current root, p ∈ P.
Require: acc a list containing all the meanings found so far.
Ensure: acc updated with all the meanings of p.
 1: p_given_t ← [ Pr(p | tL_c) | tc ∈ Ch(curr_root) ]
 2: H ← Entropy(p_given_t)
 3: candidates ← { tc | tc ∈ Ch(curr_root) ∧ Pr(tL_c | p) ≥ λ }
 4: if H ≥ η ∨ candidates = ∅ then
 5:     if Pr(curr_rootL | p) = 1 then
 6:         acc ← (p, curr_root, 1) : acc
 7:     else
 8:         p' ← new_property(p, curr_root)
 9:         KB ← KB ∪ { (p', rdfs:subPropertyOf, p) }
10:         acc ← (p', curr_root, Pr(curr_rootL | p)) : acc
11:     end if
12: else
13:     for c ∈ candidates do
14:         LeXt(λ, η, c, p, acc)
15:     end for
16: end if

In the pseudo-code, a context of the input property is encoded as a triple (p', dom(p'), coverage), where p' is a property identifying the context, dom(p') is its domain, and coverage ≥ λ is the rate of the occurrences of p covered by the context, denoted by Cov(p'). If the coverage is one, p is used in just one context (see Line 5). In Line 8 a new property p' is created and its domain is set to curr_root, while in Line 9 p' is declared to be a sub-property of p: this makes the data retro-compatible, under the assumption that clients can resolve sub-properties. Ideally, after the execution of the algorithm, all the triples referring to the identified meanings should be updated. The algorithm can also be used to obtain hints on how to improve the knowledge base.
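The recursion is compact enough to restate as runnable Python. This is our own sketch of Algorithm 1, reusing normalized_entropy from above and taking the two probability estimates as injected functions (in a real run they would be backed by the #(p ∧ tL) counts of Section 4.1); the sub-property naming scheme is hypothetical.

def lext(lam, eta, curr_root, p, children, pr_t_given_p, pr_p_given_t, acc):
    """Append the contexts of p, as (property, domain, coverage) triples,
    to acc. `children` maps each type to its sub-types in the hierarchy."""
    kids = children.get(curr_root, [])
    h = normalized_entropy([pr_p_given_t(p, t) for t in kids])
    candidates = [t for t in kids if pr_t_given_p(t, p) >= lam]
    if h >= eta or not candidates:
        cov = pr_t_given_p(curr_root, p)
        if cov == 1.0:
            acc.append((p, curr_root, 1.0))  # single context: keep p as is
        else:
            # A new sub-property with curr_root as domain; in RDF terms we
            # would also assert (p_new, rdfs:subPropertyOf, p) in the KB,
            # which keeps queries over p retro-compatible.
            p_new = f"{curr_root}_{p}"       # hypothetical naming scheme
            acc.append((p_new, curr_root, cov))
    else:
        for t in candidates:
            lext(lam, eta, t, p, children, pr_t_given_p, pr_p_given_t, acc)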
[Figure 1: Execution of LeXt on dpo:manager. The DBpedia type hierarchy rooted at Thing is shown, with each type t annotated with Pr(tL | m); the entropy of the children is low at the root (H = 0.09) and high under SportsTeam (H = 1.96), where the descent stops.]

The execution steps of the algorithm on dpo:manager (m, for short) with λ = 0.4 and η = 1 are depicted in Figure 1. The entity types are organized according to the DBpedia type hierarchy, and each type t is subscripted with Pr(tL | m). As can be observed, during the first step the children of owl:Thing are analyzed: the entropy constraint is satisfied and two nodes satisfy the Pr(tL | m) constraint. The exploration of the dpo:SportSeason branch ends when dpo:SoccerClubSeason is reached, and the triple (SoccerClubSeason_manager, dpo:SoccerClubSeason, 0.55) is returned: the new property is a sub-property of dpo:manager that covers 55% of its occurrences. Finally, the algorithm goes down the other branch until the entropy constraint is violated, and returns the context (SportsTeam_manager, dpo:SportsTeam, 0.45).
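For concreteness, the sketch above can replay this walk-through on a toy slice of the hierarchy of Figure 1. The probability tables below are hand-set stand-ins chosen to mimic the figure, not numbers computed from DBpedia:

children = {
    "Thing": ["SportSeason", "Agent"],
    "SportSeason": ["SportsTeamSeason"],
    "SportsTeamSeason": ["SoccerClubSeason"],
    "Agent": ["Organisation"],
    "Organisation": ["SportsTeam"],
    "SportsTeam": ["Soccer", "Baseball", "Cricket", "Rugby"],
}
pr_t = {"Thing": 1.0, "SportSeason": 0.55, "SportsTeamSeason": 0.55,
        "SoccerClubSeason": 0.55, "Agent": 0.44, "Organisation": 0.44,
        "SportsTeam": 0.44, "Soccer": 0.2, "Baseball": 0.1,
        "Cricket": 0.1, "Rugby": 0.04}
pr_p = {"SportSeason": 0.50, "Agent": 0.02, "SportsTeamSeason": 0.60,
        "SoccerClubSeason": 0.70, "Organisation": 0.50, "SportsTeam": 0.50,
        "Soccer": 0.30, "Baseball": 0.25, "Cricket": 0.20, "Rugby": 0.25}

acc = []
lext(0.4, 1.0, "Thing", "manager", children,
     lambda t, p: pr_t.get(t, 0.0),  # stand-in for Pr(tL | p)
     lambda p, t: pr_p.get(t, 0.0),  # stand-in for Pr(p | tL)
     acc)
print(acc)
# [('SoccerClubSeason_manager', 'SoccerClubSeason', 0.55),
#  ('SportsTeam_manager', 'SportsTeam', 0.44)]

The SportsTeam branch stops because the entropy of its four children clears η = 1, exactly as in the figure.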

4.3 Discussion
The threshold λ sets a condition on the minimum amount of evidence we need to state that we have identified a new meaning of p, expressed in terms of the probability Pr(tL | p). This threshold is of key importance in practice. On the one hand, low thresholds require little evidence and thus foster the creation of new properties, possibly over-populating the schema. On the other hand, high thresholds almost never accept a new meaning of a property, thus inferring coarser domains. In particular, with λ = 1 the exact domain of p is inferred (which in several cases can turn out to be ⊤). In Section 5 we show how the algorithm behaves with varying levels of strictness.

The presented algorithm has a number of limitations. In particular, it does not explicitly cover the cases in which one type has more than one parent, thus multi-inheriting from several other types. In that case, an entity type can be processed several times (at most once per parent). We leave to future work studying whether simply making sure that each node is processed once is enough to cover that case.
4.4 ReXt and LeRiXt
It is straightforward to define a variant of LeXt that considers property ranges instead of property domains by using Pr(tR | p) and Pr(p | tR); we call this method ReXt. In our implementation we only consider object properties, that is, properties that connect an entity to another entity (rather than, for example, to a literal, since literal values are not entities and thus are not in the type hierarchy).

Generalizing LeXt to identify multi-context properties based on both domains and ranges is a more complicated task. The solution we propose is called LeRiXt and consists in using two copies of the type hierarchy, one for the domains and one for the ranges. At each step there is a "current domain" td and a "current range" tr whose children are analyzed (thus the algorithm takes one more parameter than LeXt). Instead of using the condition Pr(tL | p) ≥ λ to select the candidate types to explore, we use Pr(tL_i ∧ tR_j | p) ≥ λ for each ti ∈ Ch(td), tj ∈ Ch(tr), and we recursively call LeRiXt on each pair of types satisfying the constraint (see Line 14 of Algorithm 1).
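The candidate-selection step that LeRiXt adds can be sketched as follows; pr_pair is a hypothetical stand-in for an estimate of Pr(tL_i ∧ tR_j | p), e.g. backed by joint subject/object type counts analogous to #(p ∧ tL):

from itertools import product

def lerixt_candidates(curr_domain, curr_range, p, children, pr_pair, lam):
    """Pairs of (domain child, range child) worth exploring further."""
    return [(td, tr)
            for td, tr in product(children.get(curr_domain, []),
                                  children.get(curr_range, []))
            if pr_pair(td, tr, p) >= lam]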
5. EXPERIMENTS
We empirically evaluate the three methods described in Section 4, namely LeXt, ReXt, and LeRiXt, first by studying how they behave when varying the threshold λ, and then by measuring the precision of the modifications they suggest. The LOD dataset we selected for our evaluation is DBpedia 2014, since its entity types are organized in a well-specified tree, contrary to Freebase, whose type system is a forest. As anticipated in Section 4.4, we consider only object properties when the range is used to identify multi-context properties with ReXt and LeRiXt.
The numbers of properties we take into consideration when running LeXt and the other two algorithms are 1'368 and 643, respectively. Finally, during our experimentation we fix the η threshold to 1; this value was chosen based on the analysis of the entropy stopping criterion on a small subset of properties.

The impact of λ on the output of the algorithms is studied in terms of average property coverage and number of generated sub-properties. Recall that in Section 4.2 we defined the coverage of a sub-property; here we measure the property coverage, defined as the overall rate of occurrences of a certain property p that is covered by its sub-properties, that is, the sum of Cov(p') over all the sub-properties p' generated for p.
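Continuing the sketches above, the property coverage of a run is just the sum of the coverages of the sub-properties it generated; acc is the output list of the toy lext run shown earlier.

def property_coverage(p, acc):
    """Overall rate of occurrences of p covered by its generated
    sub-properties (entries that kept p itself do not count)."""
    return sum(cov for (p_new, _, cov) in acc if p_new != p)

print(property_coverage("manager", acc))  # 0.55 + 0.44 = 0.99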
[Figure 2: Average coverage (top) and number of new properties (bottom) for LeXt, ReXt, and LeRiXt, with varying values of the threshold λ.]

In the upper part of Figure 2, the average property coverage is shown for various λ. We notice that, as expected, lower values of λ lead to a high coverage, since many new properties covering small parts of the data are created. As the value of the threshold increases, fewer and fewer properties are created, reaching the minimum at λ = 1. Interestingly, we observe that the average coverage curve is M-shaped, with a local minimum at λ = 0.5. This is a consequence of the fact that with λ ≥ 0.5 the new properties are required to cover at least half of the occurrences of the original property, leaving no space for other contexts; thus, at most one new context can be identified for each property. Finally, at λ = 1 the average coverage drops to 0, since no sub-property can cover all the instances of the original property.
In order to evaluate the output produced by the methods, three authors and two external experts judged the output of the algorithms computed on a sample of fifty randomly selected DBpedia properties, using λ = 0.1 and η = 1. To decide whether the context separation proposed by the algorithm is correct or not, we built a web application showing the judges the clickable URI of the original property together with the types of the entities it appears with. The judges then had to express their opinion on every generated sub-property.
The judgments were aggregated by majority vote, and precision was computed by dividing the number of positive judgments by the total number of judgments. LeXt, ReXt, and LeRiXt achieved a precision of 96.50%, 91.40%, and 87.00%, respectively.

We note that this result was obtained with just one configuration of the parameters; we leave a deeper evaluation of the algorithm as future work. In practice, we envision our algorithms to be used as a decision-support tool for LOD curators rather than as a fully automatic system to fix LOD datasets.

6. CONCLUSIONS
In this paper, we tackled the problem of extracting and then amending domain and range information from LOD. The main idea behind our work stems from the observation that many properties are misused at the instance level or used in several, distinct contexts. The three algorithms we proposed, namely LeXt, ReXt, and LeRiXt, exploit statistics about the types of the entities appearing as subject and object in the triples involving the analyzed property in order to identify the various contexts in which a multi-context property is used. Once a particular context is identified, a new sub-property is derived such that occurrences of the original property can be substituted with the newly generated sub-property. Our methods can also be used to provide insight into the analyzed knowledge base and how it should be revised in subsequent iterations. We evaluated our methods by studying their behavior with different parameter settings and by asking Semantic Web experts to evaluate the generated sub-properties.

The algorithms we propose require the entities contained in the dataset to be typed with types organized in a tree-structured type hierarchy. As future work, we plan to run a deeper evaluation of our techniques, and to design a method that overcomes the limitation presented above by considering the case in which the entity types are organized in a Directed Acyclic Graph, thus supporting multiple inheritance.

Acknowledgments
This work was supported by the Haslerstiftung in the context of the Smart World 11005 (Mem0r1es) project and by the Swiss National Science Foundation under grant number PP00P2 128459.

7. REFERENCES
[1] L. Bühmann and J. Lehmann. Universal OWL axiom enrichment for large knowledge bases. LNCS 7603:57–71, 2012.
[2] C. d'Amato, N. Fanizzi, and F. Esposito. Inductive learning for the Semantic Web: What does it buy? Semantic Web, 1(1):53–59, 2010.
[3] G. A. Grimnes, P. Edwards, and A. Preece. Learning meta-descriptions of the FOAF network. In The Semantic Web – ISWC 2004, pages 152–165. Springer, 2004.
[4] M. Knuth and H. Sack. Data cleansing consolidation with PatchR. In ESWC, volume 8798 of LNCS, pages 231–235. Springer, 2014.
[5] H. Paulheim and C. Bizer. Improving the quality of Linked Data using statistical distributions. Int. J. Semantic Web Inf. Syst., 10(2):63–86, Jan. 2014.
[6] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data quality assessment. Communications of the ACM, 45(4):211, 2002.
[7] M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the Linked Data best practices in different topical domains. In ISWC, pages 245–260, 2014.
[8] G. Töpper, M. Knuth, and H. Sack. DBpedia ontology enrichment for inconsistency detection. In I-SEMANTICS, page 33, 2012.
[9] J. Völker and M. Niepert. Statistical schema induction. LNCS 6643:124–138, 2011.