<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Uniqueness, Density, and Keyness: Exploring Class Hierarchies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anja Jentzsch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hannes Muhleisen</string-name>
          <email>hannes@cwi.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Naumann</string-name>
          <email>felix.naumann@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centrum Wiskunde &amp; Informatica (CWI)</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hasso Plattner Institute (HPI)</institution>
          ,
          <addr-line>Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Profiling Linked Data</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Web of Data contains a large number of openly-available datasets covering a wide variety of topics. In order to benefit from this massive amount of open data, e.g., to add value to an organization's internal data, such external datasets must be analyzed and understood already at the basic level of data types, uniqueness, constraints, value patterns, etc. For Linked Datasets and other Web data such meta information is currently quite limited or not available at all. Data profiling techniques are needed to compute respective statistics and meta information. Analyzing datasets along the vocabulary-defined taxonomic hierarchies yields further insights, such as the data distribution at different hierarchy levels, or possible mappings between vocabularies or datasets. In particular, key candidates for entities are difficult to find in light of the sparsity of property values on the Web of Data. To this end we introduce the concept of keyness and perform a comprehensive analysis of its expressiveness on multiple datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        and their properties. Furthermore, Linked Datasets are often inconsistent and
lack even basic metadata. One of the main reasons for this problem is that
many of the data sources, such as DBpedia or YAGO, have been extracted from
unstructured datasets and their schemata usually evolve over time. Hence it is
vital to thoroughly examine and understand each dataset, its structure, and its
properties before usage. Algorithms and tools are needed that profile the dataset
to retrieve relevant and interesting metadata by analyzing the entire dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>We focus on a specific metadata aspect, namely the identification of objects
with keys. Keys are important in many aspects of data management, such as
guiding query formulation, query optimization, and indexing. Furthermore, for
Linked Datasets, key properties make it possible, e.g., to define interlinking specification
rules that establish the owl:sameAs links that make the Web of Data
a linked one.</p>
      <p>
        In OWL 2 a collection of properties can be assigned as a key to a class using
the owl:hasKey statement [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This means that each named instance of the class
is uniquely identified by the set of values of these properties. While in OWL 2 key properties are
not required to be functional or total properties, it is always possible to
separately state that a key property is functional, if desired. Although OWL allows
the definition of key properties, this feature has not yet fully arrived on the Web of Data.
Glimm et al. found that in 2012 only one dataset used owl:hasKey, while
properties like owl:sameAs were already widely used on the Web of Data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Thus,
analyzing and profiling Linked Datasets to find key candidates requires
manual, time-consuming inspection or the help of tools.
      </p>
      <p>Specifying or finding keys in RDF data has another dimension that
distinguishes it from relational data: Ontology classes usually are arranged in a
taxonomic (subclass/superclass) hierarchy. The class owl:Thing is a superclass
of all OWL classes and thus, classes without an explicit superclass are direct
subclasses of it. While the Web of Data spans a global distributed data graph,
its ontology classes build a tree with owl:Thing as its root. Analyzing datasets
along the vocabulary-defined taxonomic hierarchies yields further insights, such
as the data distribution at different hierarchy levels, or possible mappings
between vocabularies or datasets.</p>
      <p>Given a Linked Dataset with a set of properties, a unique property
combination is a set of one or more properties whose projection has only unique entities.
Our initial approach to efficiently detect keys in a given dataset considers all
property combinations to find those that together uniquely (and reliably) identify an
object. Unfortunately, due to poor specification and data sparsity, such unique
property combinations are rare. This insight leads us to the definition of three
relaxed dimensions of an RDF property:
        <list list-type="bullet">
          <list-item><p>Uniqueness is the degree to which the values of a property are unique.</p></list-item>
          <list-item><p>Density is the degree to which a property is filled with values.</p></list-item>
          <list-item><p>Keyness combines the two former dimensions: Properties that are highly
unique and highly dense are good key candidates.</p></list-item>
        </list>
      </p>
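      <p>The notion of a unique property combination can be illustrated with a small sketch. It is not the detection algorithm used in this work but a brute-force check in Python under stated assumptions: entities are given as a hypothetical dictionary that maps entity names to property value sets, a missing property is treated as an empty value set, and the helper names <monospace>is_unique_combination</monospace> and <monospace>minimal_unique_combinations</monospace> are invented for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations


def is_unique_combination(entities, props):
    """Return True if no two entities share the same projection onto props."""
    seen = set()
    for values in entities.values():
        # Project the entity onto the chosen properties. A missing property
        # becomes an empty value set (an assumption of this sketch).
        projection = tuple(frozenset(values.get(p, ())) for p in props)
        if projection in seen:
            return False
        seen.add(projection)
    return True


def minimal_unique_combinations(entities, properties, max_size=2):
    """Enumerate minimal unique property combinations up to max_size."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(properties), size):
            # Skip supersets of combinations that are already unique.
            if any(set(smaller) <= set(combo) for smaller in found):
                continue
            if is_unique_combination(entities, combo):
                found.append(combo)
    return found


# Hypothetical toy data: neither property is unique on its own.
entities = {
    "e1": {"name": {"Ann"}, "city": {"Ulm"}},
    "e2": {"name": {"Ann"}, "city": {"Bonn"}},
    "e3": {"name": {"Bob"}, "city": {"Ulm"}},
}
print(minimal_unique_combinations(entities, {"name", "city"}))
```
]]></preformat>
      <p>In this toy dataset only the pair of both properties is unique. The enumeration grows exponentially with the number of properties, which is one reason why the relaxed, per-property dimensions are attractive on sparse data.</p>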
      <p>Each of these dimensions is determined at all levels of a class hierarchy, leading
to insights about which properties best distinguish objects of a class and its
subclasses. We evaluate the usefulness of our approach by showing various insights
we gained analyzing the DBpedia and LinkedGeoData datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Although a plethora of tools exists for profiling Linked Datasets and gathering comprehensive
statistics, most tools focus on a specific profiling task. One example for the area
of schema induction is ExpLOD [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which creates summaries for RDF graphs
based on class and property usage as well as statistics on the interlinking
between datasets based on owl:sameAs links. Li describes a tool that can deduce
the actual schema of an RDF dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It gathers schema-relevant statistics
like cardinalities for class and property usage, and presents the induced schema
in a UML-based visualization. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the authors present RDFStats, which uses
a SPARQL query processor to collect statistics on Linked Datasets and thus
optimize queries. These statistics include histograms for subjects (URIs, blank
nodes) and histograms for properties and associated ranges. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the authors
calculate certain statistical information for the purpose of observing the dynamic
changes in datasets.
      </p>
      <p>
        Others have worked more generally on generating statistics that describe
datasets on the Web of Data and thereby help understanding them. LODStats
computes statistical information for datasets from the Data Hub [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It
calculates 32 simple statistical criteria, e.g., cardinalities for different schema
elements and types of literal values (e.g., languages, value data types). In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] the
authors automatically create VoID descriptions for large datasets using
MapReduce. Aether [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] generates VoID statistical descriptions of RDF datasets. It
also provides a Web interface to view and compare them. Roomba [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
generates and validates descriptive Linked Dataset profiles. Finally, ProLOD++ is a
web-based tool for profiling and mining Linked Datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It comprises various
traditional data profiling tasks, adapted to the RDF data model. In addition, it
features many specific profiling results for open data, such as schema discovery
for user-generated attributes, or association rule mining to uncover synonymous
properties.
      </p>
      <p>
        While there are some RDF profiling tools already available, few tackle key
discovery. KD2R allows the automatic discovery of composite key constraints in
RDF datasets by deriving maximal non-keys and from these minimal keys [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
Symeonidou et al. introduce SAKey, which extends KD2R to find "almost keys" [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
i.e., sets of properties that are not quite a key due to a few exceptions. The set of
almost keys is derived from the set of non-keys found in the data. Both approaches
take into account that Linked Data can be erroneous or contain duplicate data,
but they do not consider the low density of property values. ROCKER [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] uses a refinement operator that is
based on key monotonicity, finds candidate sets of key properties, and assigns
them a discriminability score.
      </p>
      <p>None of these existing approaches delivers key candidates for each and every
dataset. Our approach differs in that it calculates the keyness for
all properties, so that candidates can be ranked even where no exact key exists.</p>
    </sec>
    <sec id="sec-3">
      <title>Uniqueness, Density, and Keyness of Data</title>
      <p>As Linked Datasets are usually sparsely populated, minimal unique property
combinations (key candidates) often either consist of multiple low-density
properties or cannot be found at all. Novel property attributes, such as the
uniqueness, density, and keyness of a property, are needed to discover the set of
properties that likely identifies an entity, i.e., the key candidates. Furthermore, since
entities are topically clustered by their underlying ontologies, these attributes
can be determined per cluster and give detailed insights into the properties
that serve as key candidates per topic.</p>
      <sec id="sec-3-1">
        <title>Statistics for class hierarchies</title>
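        <p>A Linked Dataset's class hierarchy is the taxonomy defined by its ontology and
therein the rdfs:subClassOf relations between the classes. A cluster <inline-formula><tex-math>C_c</tex-math></inline-formula> for a class
<italic>c</italic> consists of all the entities <italic>e</italic> that are of rdf:type <italic>c</italic>, which includes the entities of all subclasses
of <italic>c</italic>.</p>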
        <p><disp-formula><tex-math>C_c = \{\, e \mid e \xrightarrow{\text{rdf:type}} c \,\}</tex-math></disp-formula>
Clusters can contain entities <italic>e</italic> that are not in any of their subclusters (the clusters of subclasses <italic>d</italic>). We cluster
these entities separately and call the resulting clusters unspecialized clusters,
denoted as <inline-formula><tex-math>C'_c</tex-math></inline-formula>.</p>
        <p><disp-formula><tex-math>C'_c = C_c \setminus \{\, e \mid e \xrightarrow{\text{rdf:type}} d,\; d \xrightarrow{\text{rdfs:subClassOf}} c \,\}</tex-math></disp-formula>
We omit the <italic>c</italic> subscript where it is irrelevant in the context. As an additional
complication, properties on the Web of Data can have multiple property values.
E.g., in the DBpedia dataset we find the following four values for the property
dbpedia:birthPlace for the entity of Albert Einstein:
<preformat>dbpedia:Albert_Einstein dbpedia:birthPlace dbpedia:Ulm ,
    dbpedia:Kingdom_of_Wuerttemberg ,
    dbpedia:German_Empire ,
    dbpedia:Baden-Wuerttemberg .</preformat></p>
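        <p>The two cluster definitions can be sketched in Python as follows. This is a minimal illustration, not the ProLOD++ implementation. It assumes that the hierarchy is a tree given as a hypothetical <monospace>subclass_of</monospace> dictionary (child class to parent class), that <monospace>types</monospace> maps each entity to its explicitly asserted classes, and that subclass membership is followed transitively. All names and the toy data are invented for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def descendants(c, subclass_of):
    """All classes with a direct or transitive rdfs:subClassOf path to c."""
    result, stack = set(), [c]
    while stack:
        current = stack.pop()
        for child, parent in subclass_of.items():
            if parent == current and child not in result:
                result.add(child)
                stack.append(child)
    return result


def cluster(c, types, subclass_of):
    """C_c: entities typed with c or with any subclass of c."""
    classes = descendants(c, subclass_of) | {c}
    return {e for e, asserted in types.items() if asserted & classes}


def unspecialized_cluster(c, types, subclass_of):
    """C'_c: entities of C_c that belong to no subclass of c."""
    subclasses = descendants(c, subclass_of)
    members = cluster(c, types, subclass_of)
    return {e for e in members if not (types[e] & subclasses)}


# Hypothetical toy hierarchy and typing, loosely modeled on DBpedia classes.
subclass_of = {
    "Athlete": "Person",
    "SoccerPlayer": "Athlete",
    "Scientist": "Person",
}
types = {
    "einstein": {"Scientist"},
    "pele": {"SoccerPlayer"},
    "bedle": {"Athlete"},
    "doe": {"Person"},
}
print(sorted(cluster("Athlete", types, subclass_of)))
print(sorted(unspecialized_cluster("Person", types, subclass_of)))
```
]]></preformat>
        <p>In this toy example the Athlete cluster contains both athletes, whereas the unspecialized Person cluster contains only the entity that carries no more specific type.</p>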
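        <p>We denote the set of property values of an entity <italic>e</italic> and property <italic>p</italic> as <inline-formula><tex-math>V(e, p)</tex-math></inline-formula>.
To count the number of entities in a cluster <italic>C</italic> that have at least one value for
<italic>p</italic>, we define <inline-formula><tex-math>V(C, p) = \{\, e \mid |V(e, p)| &gt; 0,\; e \in C \,\}</tex-math></inline-formula>. The property values of a property <italic>p</italic>
for two entities <inline-formula><tex-math>e_1</tex-math></inline-formula> and <inline-formula><tex-math>e_2</tex-math></inline-formula> are equal if <inline-formula><tex-math>V(e_1, p) = V(e_2, p)</tex-math></inline-formula>, i.e., if the two sets
are identical. With this definition we further define the set of unique value sets
as <inline-formula><tex-math>V_{uq}(C, p) = \{\, V(e, p) \mid e \in C \,\}</tex-math></inline-formula>.</p>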
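        <p>We are now ready to define the three attributes, uniqueness, density, and
keyness, of a property. The uniqueness <italic>uq</italic> of a property <italic>p</italic> for a cluster <italic>C</italic> is the
number of unique value sets <inline-formula><tex-math>V_{uq}(C, p)</tex-math></inline-formula> per number of total value sets <inline-formula><tex-math>V(C, p)</tex-math></inline-formula> for
the given property.</p>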
      </sec>
      <sec id="sec-3-2">
        <title>Uniqueness:</title>
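        <p><disp-formula><label>(1)</label><tex-math>uq(C, p) = \frac{|V_{uq}(C, p)|}{|V(C, p)|}</tex-math></disp-formula>
The density <italic>d</italic> of a property <italic>p</italic> for a cluster <italic>C</italic> is the ratio of entities in <italic>C</italic> that
have <italic>p</italic> to the overall number of entities in <italic>C</italic>.</p>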
      </sec>
      <sec id="sec-3-3">
        <title>Density:</title>
        <p><disp-formula><label>(2)</label><tex-math>d(C, p) = \frac{|V(C, p)|}{|C|}</tex-math></disp-formula></p>
        <p>We call a property a full key candidate if its density and uniqueness are both 1.
For cases where they are not both 1, we define its keyness as a useful attribute.
The keyness <italic>k</italic> of a property <italic>p</italic> for a cluster <italic>C</italic> is the harmonic mean of its
uniqueness and density. The harmonic mean emphasizes that both parameters
must be high to achieve an overall high keyness:</p>
      </sec>
      <sec id="sec-3-4">
        <title>Keyness:</title>
        <p><disp-formula><label>(3)</label><tex-math>k(C, p) = \frac{2 \cdot uq(C, p) \cdot d(C, p)}{uq(C, p) + d(C, p)}</tex-math></disp-formula>
We call a property a key candidate if its keyness is above some threshold.</p>
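        <p>The three attributes can be computed directly from their definitions. The following Python sketch is illustrative only and makes several assumptions: property values are held in a hypothetical dictionary keyed by (entity, property) pairs, only non-empty value sets are counted when determining uniqueness (so that the ratio never exceeds 1), and the default threshold of 0.8 in <monospace>key_candidates</monospace> is an arbitrary choice rather than a value prescribed here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def value_sets(cluster, values, p):
    """Map every entity in the cluster that has a value for p to V(e, p)."""
    return {
        e: frozenset(values[(e, p)])
        for e in cluster
        if values.get((e, p))
    }


def uniqueness(cluster, values, p):
    """Number of distinct value sets per number of entities that have p."""
    sets = value_sets(cluster, values, p)
    return len(set(sets.values())) / len(sets) if sets else 0.0


def density(cluster, values, p):
    """Share of entities in the cluster that have at least one value for p."""
    return len(value_sets(cluster, values, p)) / len(cluster) if cluster else 0.0


def keyness(cluster, values, p):
    """Harmonic mean of uniqueness and density."""
    uq, d = uniqueness(cluster, values, p), density(cluster, values, p)
    return 2 * uq * d / (uq + d) if uq + d > 0 else 0.0


def key_candidates(cluster, values, properties, threshold=0.8):
    """Properties whose keyness lies above the (assumed) threshold."""
    return [p for p in properties if keyness(cluster, values, p) > threshold]


# Hypothetical toy cluster with four entities and two properties.
cluster = {"a", "b", "c", "d"}
values = {
    ("a", "team"): {"X"},
    ("b", "team"): {"X"},
    ("c", "team"): {"Y", "Z"},
    ("a", "label"): {"A"},
    ("b", "label"): {"B"},
    ("c", "label"): {"C"},
    ("d", "label"): {"D"},
}
for prop in ("team", "label"):
    print(prop, uniqueness(cluster, values, prop),
          density(cluster, values, prop), keyness(cluster, values, prop))
print(key_candidates(cluster, values, ["team", "label"]))
```
]]></preformat>
        <p>In the toy data, the hypothetical <monospace>team</monospace> property has two distinct value sets among three entities with a value (uniqueness 2/3) and covers three of four entities (density 3/4), which yields a keyness of 12/17. The <monospace>label</monospace> property is fully unique and fully dense and is therefore a full key candidate.</p>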
        <p>We investigate the three attributes of an RDF property, uniqueness, density,
and keyness, for the two cluster types <inline-formula><tex-math>C</tex-math></inline-formula> and <inline-formula><tex-math>C'</tex-math></inline-formula>. Determining uniqueness,
density, and keyness for a property <italic>p</italic> in a cluster <inline-formula><tex-math>C_c</tex-math></inline-formula> requires analyzing all property
value sets for all entities in the given cluster. We observe a variety of
property characteristics across clusters and their subclusters that allow for a fine-grained,
cluster-based retrieval of key candidates.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        Our approach has been implemented in ProLOD++, a web-based tool for
profiling and mining Linked Datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]<sup>5</sup>. It comprises various traditional data
profiling tasks, adapted to the RDF data model. In addition, it features many
specific profiling results for Linked Datasets, such as schema discovery for
user-generated attributes, association rule discovery to uncover synonymous
properties, and, in particular, key discovery along ontology hierarchies. It allows users to
navigate a Linked Dataset via an automatically computed topical and
hierarchical clustering as well as along its ontology class tree. The latter allows the
user to observe the evolution of key features along hierarchies, thus determining
class-specific properties and key candidates.
      </p>
      <p>In combination, having these property attributes at hand, the user can better
decide which properties serve as keys, especially at the class level, and gain further
insight into the completeness and structure of the dataset at hand.</p>
      <sec id="sec-4-1">
        <title>Datasets</title>
        <p>
          We evaluated our approach using two datasets, DBpedia v.3.9 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]<sup>6</sup> and
LinkedGeoData [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. DBpedia is a Linked Data version of Wikipedia and thus a
cross-domain dataset. It has evolved into a hub on the Web of Data with many other
datasets linking to it. LinkedGeoData is a spatial dataset that publishes
OpenStreetMap data as Linked Data. It is one of the bigger and constantly evolving
datasets on the Web of Data.
          <fn><label>5</label><p>A ProLOD++ demo is available at http://prolod.org</p></fn>
          <fn><label>6</label><p>Our evaluation uses the English DBpedia, excluding the raw infobox properties.</p></fn>
        </p>
        <p>For DBpedia we analyzed the Person cluster and several subclusters including
Athlete and its subclusters, and Scientist (see Table 1). We deliberately omit the
artificial class Agent, a subclass of owl:Thing and superclass of dbpedia:Person, because
its main function is to define properties that are also needed for the Organization
and Family classes on the same class hierarchy level as Person.</p>
        <p>Table 1 lists some subclasses of the DBpedia Person class along with the
number of entities and number of properties in the respective cluster. 35% of
the entities in DBpedia are persons due to Wikipedia mainly covering persons,
places, and sports topics. The Person cluster is a diverse one with 255 properties
being used, half of them (127) also occurring on the Athlete subclass. The Athlete
class has several subclasses from which we chose representative ones.
Furthermore, we included the Scientist class as a comparison in our evaluation, which
has no further subclasses and only 40 properties.</p>
        <p>This analysis shows that having some basic profiling results on property usage
at hand can already be useful. While there are 2,333 properties defined in the
DBpedia 3.9 ontology, only 1,376 are used by at least one entity. When browsing
through Linked Dataset class hierarchies, the information on which properties
are actually being used, compared to the defined ones, can help to narrow down
the properties of interest for tasks like key discovery.</p>
        <p>For LinkedGeoData we analyzed the 2013 version and therein the Amenity
cluster and the subclusters shown in Table 2. Due to its automatic conversion
from OpenStreetMap, which is publicly curated by a large number of people,
the number of properties to describe amenities is very high.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Density, uniqueness, and keyness</title>
        <p>Figure 1 plots the uniqueness and density for all properties in the DBpedia
Person, Athlete, and SoccerPlayer clusters, highlighting selected properties. Overall,
the property densities are noticeably low, while the uniqueness is distributed over
the entire x-axis. The DBpedia Person cluster has only two properties (foaf:name
and rdfs:label) with a high uniqueness (above 0.9) and density (roughly 0.7).
For the subclasses there are a few more properties with a high density and
uniqueness. We can already see that property uniqueness and density behave
differently along the class hierarchy. While the uniqueness for
dbpedia:birthPlace stays approximately equal for Person, Athlete, and SoccerPlayer,
dbpedia:team becomes more unique for Athlete than for Person and even more
unique for SoccerPlayer. The density of both properties increases.</p>
        <p>Table 3 shows some of the 255 properties in the Person cluster along with
their uniqueness, density, and keyness, ordered by keyness. This selection
already reveals some typical characteristics of Linked
Datasets. They often contain properties with only few property values but a high
uniqueness (nearly 1), e.g., dbpedia:pseudonym. Many properties have a high
uniqueness but their density is not 1, e.g., foaf:name and rdfs:label. The density
rarely reaches 0.5 (for only eight properties) and reaches 1 for only one of the 255 properties
(rdf:type). For 86% of the properties, at most 5% of the entities have a value.</p>
        <p>For example, out of the 185,081 athletes in DBpedia, only 36 have a
dbpedia:espnId value, yet all of these values are unique. This would identify
dbpedia:espnId as a key candidate for athletes using traditional key discovery
approaches. This observation emphasizes the need to take into account further
details like the keyness of a property when choosing key candidates.</p>
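        <p>As a worked illustration of this point, the definitions of uniqueness, density, and keyness can be applied to dbpedia:espnId. This is a sketch under the assumption that the 36 values form 36 distinct value sets and that the Athlete cluster contains exactly the 185,081 entities mentioned above:</p>
        <disp-formula><tex-math>uq = \frac{36}{36} = 1, \qquad d = \frac{36}{185{,}081} \approx 1.9 \times 10^{-4}, \qquad k = \frac{2 \cdot 1 \cdot d}{1 + d} \approx 3.9 \times 10^{-4}</tex-math></disp-formula>
        <p>The perfect uniqueness is thus almost entirely offset by the very low density, so the keyness of this property stays close to zero.</p>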
        <p>Figure 2 shows the uniqueness and density for all properties in the
LinkedGeoData Amenity, Shop, and Bakery clusters, again highlighting selected
properties. wgs84:lat is the latitude of an amenity's position. Uniqueness and density
stay approximately equal for the classes along the hierarchy. The same can be
observed for the label rdfs:label of entities in LinkedGeoData. Positions and
labels are the two most used properties in LinkedGeoData. The overall property
density is again low, which is due to the enormous, mostly manual effort required to add
metadata for all the 49,355,161 entities in OpenStreetMap/LinkedGeoData.</p>
        <p>Even more than DBpedia, LinkedGeoData is sparsely populated with
property values. Table 4 shows some of the 12,371 properties in the Amenity cluster
along with their uniqueness, density, and keyness, ordered by keyness. Only five
properties have a keyness above 0.8 and, as we have already observed in
Figure 2, the overall property density is low. Only ten properties have a density
above 0.5, and two of them, dcterms:modified and lgd:changeset, are metadata
properties. The average property uniqueness is 0.64, while the average property
density and keyness are both 0.00. These numbers reflect the enormous number of places
in LinkedGeoData and the corresponding effort to add metadata for all of them.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Class hierarchies</title>
        <p>Furthermore, we evaluated the uniqueness, density, and keyness of selected
properties along the class hierarchy. Table 5 shows the uniqueness, density, and
keyness for the properties dbpedia:birthDate and dbpedia:team along the Person class
hierarchy. As already observed in Figure 1, the keyness for dbpedia:team increases
for Athlete and subclasses of Athlete as it is specific to the sports theme but not
existent for the Scientist cluster at all. The dbpedia:birthDate property has a high
density for all persons but its uniqueness is naturally not very high. The
coverage of athletes' birth dates starts only in the 17th century: the oldest athlete on
DBpedia is a cricketer called William Bedle, born 1679.</p>
        <p>In the unspecialized Person' and Athlete' clusters only few properties are used, which
explains the missing dbpedia:team property. Generally, these clusters contain
entities that could not be further classified. When identifying key candidates, we
observe that the keyness for dbpedia:birthDate in these clusters is an outlier compared with the
main clusters. Thus the keyness values for Person and Athlete are of higher confidence.
Table 6 in the appendix shows the analogous analysis for selected
LinkedGeoData properties.</p>
        <p>To summarize our findings, we identified three types of property keyness
along the class hierarchy:
          <list list-type="bullet">
            <list-item><p>Less specific, i.e., keyness decreases per class level in the class hierarchy. An
example is dbpedia:deathPlace, whose keyness for Person is 0.18, for Athlete
0.14, and for SoccerPlayer 0.08.</p></list-item>
            <list-item><p>Generic, i.e., keyness stays approximately equal throughout the class
hierarchy. An example is dbpedia:birthPlace, whose keyness for Person is 0.28, for
Athlete 0.29, and for SoccerPlayer 0.28.</p></list-item>
            <list-item><p>More specific, i.e., keyness increases per class level in the class hierarchy. An
example is dbpedia:team, whose keyness for Person is 0.34, for Athlete 0.61,
and for SoccerPlayer 0.93.</p></list-item>
          </list>
        </p>
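        <p>The three types can be told apart mechanically once the keyness values along a path in the class hierarchy are known. The following Python sketch is one possible reading, not a procedure defined in this work: it compares only the first and the last level of the path, and the function name <monospace>classify_keyness_trend</monospace> as well as the tolerance of 0.02 for "approximately equal" are assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def classify_keyness_trend(keyness_by_level, tolerance=0.02):
    """Classify a property by how its keyness changes down the hierarchy.

    keyness_by_level lists the keyness from the most general class to the
    most specific one. Only the endpoints are compared (a simplification),
    and tolerance is an assumed bound for "approximately equal".
    """
    change = keyness_by_level[-1] - keyness_by_level[0]
    if change > tolerance:
        return "more specific"
    if change < -tolerance:
        return "less specific"
    return "generic"


# Keyness values for Person, Athlete, and SoccerPlayer as reported above.
reported = {
    "dbpedia:deathPlace": [0.18, 0.14, 0.08],
    "dbpedia:birthPlace": [0.28, 0.29, 0.28],
    "dbpedia:team": [0.34, 0.61, 0.93],
}
for prop, levels in reported.items():
    print(prop, classify_keyness_trend(levels))
```
]]></preformat>
        <p>With the reported values, the sketch assigns dbpedia:deathPlace, dbpedia:birthPlace, and dbpedia:team to the less specific, generic, and more specific types, respectively.</p>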
        <p>Figures 1 and 2 already depict these types of property keyness. In DBpedia we
can find all three types of property keyness: the keyness for dbpedia:birthPlace
stays approximately equal for Person, Athlete, and SoccerPlayer (generic). For
dbpedia:team the keyness increases significantly from Person to Athlete and finally
to SoccerPlayer (more specific). The keyness for dbpedia:deathPlace decreases
along the class hierarchy to SoccerPlayer (less specific). This can be explained
by the fact that DBpedia is an ever-evolving knowledge base and at the time
of writing most death places covered for soccer players were in a certain town
in England, namely Stoke-on-Trent. In LinkedGeoData the properties' keyness
is mostly generic, as for rdfs:label and wgs84:lat, but sometimes also becomes more
specific on the lower class hierarchy levels.</p>
        <p>Finally, Figure 3 shows the keyness distribution for all properties along the
DBpedia class hierarchy. Class level 1 contains all properties of owl:Thing, level 2
all properties of subclasses of owl:Thing and so forth. Overall, the keyness is
increasing per class level. For the root class level 1 out of 1,376 properties only
one has a keyness higher than 0.8, for class level 7 this increases to 7 out of 22
properties. This observation leads to the conclusion that class level-based keys
are a better choice than high-level keys. Key features for athletes might be too
high-level for soccer players and should be redefined on that level.</p>
        <p>Our evaluation shows that the property keyness can help to discover key
candidates for Linked Datasets. It also highlights the advantages of analyzing
the class hierarchy in order to observe property behaviour for classes along it
and make better choices when identifying key candidates for specific classes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>We have introduced the concept of keyness (and therein uniqueness and density)
of a property to address the sparsity on the Web of Data and thus create the
possibility to find key candidates where traditional approaches fail.</p>
      <p>Our approach has been implemented in ProLOD++ and provides users with
the uniqueness, density, and keyness for all properties. Having these profiling
results at hand helps users in finding key candidates and analyzing the relevance
of properties along class hierarchies in Linked Datasets.</p>
      <p>While the keyness of a property is already useful for discovering key
candidates, the keyness values rarely reach 0.8 due to dataset sparsity. Thus we plan
to extend the keyness concept to sets of properties. Because minimal unique
property combinations rely only on a high combined density, they are not the
ideal approach here. It seems reasonable, for instance, to consider the amount
of overlap of properties in combination with the properties' keyness.</p>
    </sec>
    <sec id="sec-6">
      <title>Keyness analysis for selected LinkedGeoData properties</title>
      <p>In analogy to Table 5 for persons, Table 6 shows the uniqueness, density, and
keyness for the selected LinkedGeoData properties wgs84:lat and rdfs:label along
the Amenity class hierarchy. For the position properties it is noticeable that the
keyness stays high (above 0.79) for all the classes along the hierarchy. The fewer
the entities in the class cluster (butchers and ice cream shops), the higher the
property keyness. For the label the keyness also is quite stable along the class
hierarchy (around 0.6) but is especially high in bakeries due to the uniqueness
in labels and a high density.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , T. Grutze,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Mining and Profiling RDF Data with ProLOD++</article-title>
          .
          <source>In Proceedings of the International Conference on Data Engineering (ICDE)</source>
          , pages
          <fpage>1198</fpage>
          –
          <fpage>1201</fpage>
          ,
          <year>2014</year>
          . Demo.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Assaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Senart</surname>
          </string-name>
          .
          <article-title>Roomba: An extensible framework to validate and build dataset profiles</article-title>
          .
          <source>In ESWC International Workshop on Dataset Profiling &amp; Federated Search for Linked Data (PROFILES)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>LODStats – an extensible framework for high-performance dataset analytics</article-title>
          .
          <source>In Proceedings of the Int. Conf. on Knowledge Engineering and Knowledge Management (EKAW)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. C. Bohm, J. Lorey, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Creating VoiD descriptions for web-scale data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>339</fpage>
          –
          <fpage>345</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Glimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          .
          <article-title>OWL: Yet to arrive on the web of data?</article-title>
          <source>In WWW Workshop on Linked Data on the Web (LDOW)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          , M. Krotzsch,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          , and S. Rudolph, editors.
          <source>OWL 2 Web Ontology Language: Primer. W3C Recommendation</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. T. Kafer,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>O'Byrne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          .
          <article-title>Observing linked data dynamics</article-title>
          .
          <source>In Proceedings of the Extended Semantic Web Conference (ESWC)</source>
          , volume
          <volume>7882</volume>
          of
          <series>LNCS</series>
          , pages
          <fpage>213</fpage>
          –
          <lpage>227</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          .
          <article-title>ExpLOD: Summary-based exploration of interlinking and RDF usage in the linked open data cloud</article-title>
          .
          <source>In Proceedings of the Extended Semantic Web Conference (ESWC)</source>
          , pages
          <fpage>272</fpage>
          –
          <lpage>287</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          and W. Wo .
          <article-title>RDFStats { an extensible RDF statistics generator and library</article-title>
          .
          <source>In Proceedings of the International Workshop on Database and Expert Systems Applications (DEXA)</source>
          , pages
          <fpage>79</fpage>
          –
          <lpage>83</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>van Kleef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>167</fpage>
          –
          <lpage>195</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Data Profiling for Semantic Web Data</article-title>
          .
          <source>In Proceedings of the International Conference on Web Information Systems and Mining (WISM)</source>
          , pages
          <fpage>472</fpage>
          –
          <lpage>479</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. E. Makela.
          <article-title>Aether { generating and viewing extended VoID statistical descriptions of RDF datasets</article-title>
          .
          <source>In ESWC (Satellite Events)</source>
          , pages
          <fpage>429</fpage>
          –
          <lpage>433</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Petrovski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>The WebDataCommons Microdata, RDFa and Microformat dataset series</article-title>
          .
          <source>In Proceedings of the International Semantic Web Conference (ISWC)</source>
          , pages
          <fpage>277</fpage>
          –
          <lpage>292</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Data profiling revisited</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>42</volume>
          (
          <issue>4</issue>
          ):
          <fpage>40</fpage>
          –
          <lpage>49</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>N.</given-names>
            <surname>Pernelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Saïs</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Symeonidou</surname>
          </string-name>
          .
          <article-title>An automatic key discovery approach for data linking</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>23</volume>
          :
          <fpage>16</fpage>
          –
          <lpage>30</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
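          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marx</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>ROCKER – a refinement operator for key discovery</article-title>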
          .
          <source>In Proceedings of the International World Wide Web Conference (WWW)</source>
          , pages
          <fpage>1025</fpage>
          –
          <lpage>1033</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Höffner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>LinkedGeoData: A core for a web of spatial open data</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <fpage>333</fpage>
          –
          <lpage>354</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>D.</given-names>
            <surname>Symeonidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Armant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pernelle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Saïs</surname>
          </string-name>
          .
          <article-title>SAKey: Scalable almost key discovery in RDF data</article-title>
          .
          <source>In Proceedings of the International Semantic Web Conference (ISWC)</source>
          , pages
          <fpage>33</fpage>
          –
          <lpage>49</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
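    </ref-list>
    <sec>
      <table-wrap>
        <label>Table 6</label>
        <caption>
          <p>Uniqueness, density, and keyness for LinkedGeoData properties wgs84:lat and rdfs:label along exemplary classes of the Amenity class hierarchy.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Ontology Class</th>
              <th rowspan="2">Entities</th>
              <th colspan="3">wgs84:lat</th>
              <th colspan="3">rdfs:label</th>
            </tr>
            <tr>
              <th>Uniq.</th>
              <th>Dens.</th>
              <th>Keyn.</th>
              <th>Uniq.</th>
              <th>Dens.</th>
              <th>Keyn.</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>owl:Thing</td>
              <td>49,355,161</td>
              <td>0.93</td>
              <td>0.16</td>
              <td>0.27</td>
              <td>0.53</td>
              <td>0.14</td>
              <td>0.22</td>
            </tr>
            <tr>
              <td>Amenity</td>
              <td>6,824,892</td>
              <td>0.95</td>
              <td>0.70</td>
              <td>0.81</td>
              <td>0.53</td>
              <td>0.55</td>
              <td>0.54</td>
            </tr>
            <tr>
              <td>Amenity'</td>
              <td>5,543,014</td>
              <td>0.95</td>
              <td>0.68</td>
              <td>0.79</td>
              <td>0.55</td>
              <td>0.51</td>
              <td>0.53</td>
            </tr>
            <tr>
              <td>Shop</td>
              <td>1,130,204</td>
              <td>0.99</td>
              <td>0.79</td>
              <td>0.88</td>
              <td>0.49</td>
              <td>0.77</td>
              <td>0.60</td>
            </tr>
            <tr>
              <td>Shop'</td>
              <td>1,008,344</td>
              <td>0.99</td>
              <td>0.78</td>
              <td>0.87</td>
              <td />
              <td>0.79</td>
              <td>0.61</td>
            </tr>
            <tr>
              <td>Butcher</td>
              <td>20,432</td>
              <td>1.00</td>
              <td>0.89</td>
              <td>0.94</td>
              <td>0.71</td>
              <td>0.73</td>
              <td>0.72</td>
            </tr>
            <tr>
              <td>Bakery</td>
              <td>57,204</td>
              <td>0.54</td>
              <td>0.74</td>
              <td>0.62</td>
              <td>1.00</td>
              <td>0.90</td>
              <td>0.95</td>
            </tr>
            <tr>
              <td>IceCream</td>
              <td>2,643</td>
              <td>1.00</td>
              <td>0.89</td>
              <td>0.94</td>
              <td>0.70</td>
              <td>0.78</td>
              <td>0.74</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>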
  </back>
</article>