<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Profiling the Web of Data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso-Plattner-Institute</institution>
          ,
          <addr-line>Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Web of Data contains a large number of openly available datasets covering a wide variety of topics. To benefit from this massive amount of open data, such external datasets must be analyzed and understood already at the basic level of data types, constraints, value patterns, etc. For Linked Datasets such meta information is currently very limited or not available at all. Data profiling techniques are needed to compute the respective statistics and meta information. However, current state-of-the-art approaches either cannot be applied to Linked Data or exhibit considerable performance problems. This paper presents my doctoral research, which tackles these problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Over the past years, an increasingly large number of data sources has been
published as part of the Web of Data1. At the time of writing, the Web of Data
already comprised roughly 1,000 datasets totaling more than 82 billion triples2,
including prominent examples such as DBpedia, YAGO, and DBLP.
Furthermore, more than 17 billion triples are available as RDFa, Microdata, and
Microformats in HTML pages3. This trend, together with the inherent heterogeneity
of Linked Datasets and their schemata, makes it increasingly time-consuming to
find and understand datasets that are relevant for integration. Metadata gives
consumers of the data clarity about the content and variety of a dataset and the
terms under which it can be reused, thus encouraging its reuse.</p>
      <p>A Linked Dataset is represented in the Resource Description Framework
(RDF). In comparison to other data models, e.g., the relational model, RDF
lacks explicit schema information that precisely defines the types of entities and
their attributes. Therefore, many datasets provide ontologies that categorize
entities and define the data types and semantics of properties. However, ontology
information is not always available or may be incomplete. Furthermore, Linked
Datasets are often inconsistent and lack even basic metadata. Algorithms and
tools are needed that profile the dataset, analyzing it in its entirety to retrieve
relevant and interesting metadata.
1 The Linked Open Data Cloud nicely visualizes this trend: http://lod-cloud.net
2 http://datahub.io/dataset?tags=lod
3 http://webdatacommons.org</p>
      <p>Data profiling is an umbrella term for methods that compute metadata for
describing datasets. Traditional data profiling tools for relational databases have
a wide range of features, ranging from the computation of cardinalities, such as
the number of distinct values in a column, to the calculation of inclusion dependencies;
they determine value patterns, gather information on the data types used, determine
unique column combinations, and find keys.</p>
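      <p>As an illustration of these classical profiling metrics, the following Python sketch (the toy table and all names are my own, not taken from any particular tool) computes column cardinalities and brute-forces unique column combinations, i.e., key candidates:</p>

```python
from itertools import combinations

# Hypothetical toy table: a list of rows (dicts) standing in for a relational table.
rows = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Alan", "country": "UK"},
    {"id": 3, "name": "Ada", "country": "US"},
]

def cardinality(rows, column):
    """Number of distinct values in a column."""
    return len({r[column] for r in rows})

def unique_column_combinations(rows, max_size=2):
    """Brute-force search for column sets whose value tuples are unique,
    i.e. key candidates. Exponential in the number of columns, so only
    feasible for small schemas."""
    columns = list(rows[0])
    uccs = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            values = [tuple(r[c] for c in combo) for r in rows]
            if len(set(values)) == len(values):
                uccs.append(combo)
    return uccs

print(cardinality(rows, "name"))   # 2 distinct names
print(unique_column_combinations(rows))
```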
      <p>
        Use cases for data profiling can be found in various areas concerned with
data processing and data management [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]:
Query optimization is concerned with finding optimal execution plans for
database queries. Cardinalities and value histograms can help to estimate the
costs of such execution plans. Such metadata can also be used in the area of
Linked Data, e.g., for optimizing SPARQL queries.
      </p>
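      <p>To illustrate how such cardinality metadata feeds into cost estimation, here is a minimal sketch of the classic uniform-distribution estimate that many optimizers use; the statistics are invented for illustration:</p>

```python
# Hypothetical column statistics collected by a profiling run.
total_rows = 1_000_000
distinct_countries = 200

def estimate_equality_matches(total_rows, distinct_values):
    """Classic uniform-distribution estimate used by query optimizers:
    an equality predicate on a column is expected to match
    total_rows / distinct_values rows."""
    return total_rows / distinct_values

# Expected number of rows matching e.g. country = 'DE'.
print(estimate_equality_matches(total_rows, distinct_countries))  # 5000.0
```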
      <p>Data cleansing can benefit from discovered value patterns. Violations of
detected patterns can reveal data errors, and the respective statistics help measure
and monitor the quality of a dataset. For Linked Data, data profiling techniques
help validate datasets against vocabularies and schema properties.
Data integration is often hindered by the lack of information on new datasets.
Data profiling metrics reveal information on, e.g., the size, schema, semantics, and
dependencies of unknown datasets. This is a highly relevant use case for Linked
Data, because for many openly available datasets only little information is
available.</p>
      <p>Schema induction: Raw data, e.g., data gathered during scientific experiments,
often does not have a known schema at first; data profiling techniques need to
determine adequate schemata, which are required before data can be inserted
into a traditional DBMS. For the field of Linked Data, this applies when working
with datasets that have no dereferenceable vocabulary. Data profiling can help
induce a schema from the data, which can then be used to find a matching
vocabulary or create a new one.</p>
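      <p>A minimal sketch of such schema induction over RDF-style triples could look as follows; the toy triples and function names are hypothetical illustrations, not an existing tool's API:</p>

```python
from collections import defaultdict

# Hypothetical toy dataset of (subject, predicate, object) triples.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
]

def induce_schema(triples):
    """Induce a simple per-class schema: for each rdf:type, collect the
    properties used by its instances and how often each occurs."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    schema = defaultdict(lambda: defaultdict(int))
    for s, p, o in triples:
        if p == "rdf:type":
            continue
        schema[types.get(s, "unknown")][p] += 1
    return {cls: dict(props) for cls, props in schema.items()}

print(induce_schema(triples))
# {'ex:Person': {'ex:name': 2, 'ex:birthDate': 1}}
```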
      <p>Data Mining: Finally, data profiling is an essential preprocessing step for almost
any statistical analysis or data mining task. While data profiling focuses on
gathering structural metadata about a dataset, data mining is usually more
concerned with gaining new insights about the data.</p>
    </sec>
    <sec id="sec-2">
      <title>Relevancy</title>
      <p>There are many commercial tools for profiling relational datasets, such as IBM's
Information Analyzer or Microsoft's SQL Server Integration Services (SSIS).
However, these tools were designed to profile relational data. Linked Data
has a very different nature and calls for specific profiling and mining techniques.</p>
      <p>Finding information about Linked Datasets is an open issue on the
constantly growing Web of Data, given the use cases mentioned above. While most
Linked Datasets are listed in registries, for instance at the Data Hub
(datahub.io), these registries are usually manually curated and thus incomplete
or outdated. Furthermore, existing means and standards for describing datasets
are often limited in their depth of information. VoID and Semantic Sitemaps cover
basic details of a dataset, but do not cover detailed information on the dataset's
content, such as its main classes or number of entities. More detailed
descriptions, e.g., information on a dataset's RDF graph structure, topics, etc., are usually
not available. Data profiling techniques can help fulfill the need for information
about, e.g., classes and property types, value distributions, or entity interlinking.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>While many general tools and algorithms already exist for data profiling, most of
them cannot be used for graph datasets, because they assume a relational data
structure or a well-defined schema, or simply cannot deal with very large datasets.
Nonetheless, some Linked Data profiling tools already exist. Most of them focus
on solving specific use cases instead of data profiling in general.</p>
      <p>
        One relevant use case is schema induction, because the lack of a fixed and
well-defined schema is a common problem with Linked Datasets. One example
in this field of research is the ExpLOD tool [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. ExpLOD creates summaries
for RDF graphs based on class and property usage, as well as statistics on the
interlinking between datasets based on owl:sameAs links.
      </p>
      <p>
        Li describes a tool that can induce the actual schema of an RDF dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
It gathers schema-relevant statistics like cardinalities for class and property
usage, and presents the induced schema in a UML-based visualization. Its
implementation is based on the execution of SPARQL queries against a local database.
Like ExpLOD, the approach is not parallelized. Both solutions take
approximately 10 hours to process a dataset of 10 million triples with 13 classes and 90
properties. These results illustrate that performance is a common problem with large
Linked Datasets.
      </p>
      <p>
        An example of the query optimization use case is presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
authors present RDFStats, which uses Jena's SPARQL processor to collect
statistics on Linked Datasets. These statistics include histograms for subjects (URIs,
blank nodes) and histograms for properties and associated ranges.
      </p>
      <p>
        Others have worked more generally on generating statistics that describe
datasets on the Web of Data and thereby help understanding them. LODStats
computes statistical information for datasets from the Data Hub [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It calculates
32 simple statistical criteria, e.g., cardinalities for different schema elements and
types of literal values (e.g., languages, value data types).
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] the authors automatically create VoID descriptions for large datasets
using MapReduce. They manage to profile the BTC2010 dataset in about an
hour on Amazon's EC2 cloud, showing that parallelization can be an effective
approach to improve runtime when profiling large amounts of data.
      </p>
      <p>
        Finally, the ProLOD++ tool allows navigating an RDF dataset via an
automatically computed hierarchical clustering [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and along its ontology class
tree [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Data profiling tasks are performed on each cluster or class dynamically
and independently to improve efficiency.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Challenges</title>
      <p>This section describes selected challenges that I identified as specific to profiling
Linked Data and web data, as opposed to profiling relational tables.</p>
      <sec id="sec-4-1">
        <title>Profiling along hierarchies</title>
        <p>Vocabularies define classes and their relationships. Ontology classes are usually
arranged in a taxonomic (subclass-superclass) hierarchy. While the Web of
Data spans a global distributed data graph, its ontology classes build a tree
with owl:Thing as its root. Analyzing datasets along the vocabulary-defined
taxonomic hierarchies yields further insights, such as the data distribution at
different hierarchy levels, or possible mappings between vocabularies or datasets.</p>
        <p>
          Keys are clearly of vital importance to many applications in order to uniquely
identify individuals of a given class by values of (a set of) key properties. In
OWL 2 a collection of properties can be assigned as a key to a class using the
owl:hasKey statement [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Nevertheless, owl:hasKey has not yet fully arrived on the Web of Data: only one Linked
Dataset uses it [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Thus, actually analyzing and profiling Linked
Datasets requires manual, time-consuming inspection or the help of tools.
        </p>
        <p>Many languages have a so-called "unique names" assumption. On the web,
such an assumption is not possible, as real-world entities can be referred to with
different URI references.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Heterogeneity</title>
        <p>A common practice in the Linked Data community is to reuse terms from
widely deployed vocabularies whenever possible, in order to increase the
homogeneity of descriptions and, consequently, to ease the understanding of these
descriptions. There are at least 416 different vocabularies to be found on the Web of
Data4. Some datasets, however, also exist without any defined or dereferenceable
vocabularies. And even if common vocabularies are used, there is no guarantee
that their specifications and constraints are followed correctly.</p>
        <p>
          Nearly all datasets on the Web of Data use terms from the W3C base
vocabularies RDF, RDF Schema, and OWL. In addition, 191 (64.75 %) of the 295
datasets in the Linked Open Data Cloud Catalogue use terms from other widely
deployed vocabularies [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>As Linked Datasets cover a wide variety of topics, widely deployed
vocabularies that cover all aspects of these topics may not exist yet. Thus, data providers
often define proprietary terms that are used in addition to terms from widely
deployed vocabularies, in order to cover the more specific aspects and to publish
the complete content of a dataset on the Web. Currently, 190 (64.41 %) of the
295 datasets use proprietary vocabulary terms, with 83.68 % making the term
URIs dereferenceable.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Topical profiling</title>
        <p>The Web of Data covers not only a wide range of topics, it also contains
a number of topically overlapping data sources. Since it provides for data
coexistence, everyone can publish data to it, express their view on things, and use
the vocabularies of their choice. Integrating topically relevant datasets requires
knowledge of the datasets' content and structure.
4 http://lov.okfn.org/</p>
        <p>The State of the LOD Cloud document [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] gives an overview of the Linked
Datasets for each topical domain, but there is no fine-grained topical clustering
for Linked Datasets. With 504 million inter-dataset links, the Web of Data is
highly interlinked; 1.6 % of all triples are links stating relationships between
real-world entities in different datasets. Thus there is a large topical overlap
among the datasets.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Large-scale profiling</title>
        <p>With more than 82 billion triples distributed among roughly 1,000 Linked
Datasets, and more than 17 billion triples available as RDFa, Microdata, and
Microformats, the need for efficient profiling methods and tools is apparent.</p>
        <p>
          The runtime of profiling tasks as presented in Sec. 7 can reach hours, e.g.,
for determining property co-occurrences [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Profiling tasks often share the same
preprocessing steps, e.g., filtering or grouping the dataset. Thus there is a large
incentive and potential to optimize the execution of multiple scripts.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Research Questions</title>
      <p>The main question in my doctoral research is:</p>
      <p>What are the challenges that are specific to profiling Linked Data and web
data, as opposed to profiling relational tables?</p>
      <p>After identifying four selected challenges, the following questions arise:
Profiling along hierarchies Does analyzing Linked Datasets along the
vocabulary-defined taxonomic hierarchies, such as the data distribution at
different hierarchy levels, yield further insights?</p>
      <p>Heterogeneity How does profiling help in analyzing the heterogeneity on the
Web of Data?</p>
      <p>Topical profiling How can topical clusterings for unknown datasets on the
constantly growing Web of Data be derived efficiently?</p>
      <p>Large-scale profiling How can these huge amounts of Linked Data be
profiled efficiently?</p>
    </sec>
    <sec id="sec-6">
      <title>Approach</title>
      <p>My approach to addressing the research questions is to tackle each of the identified
challenges. The main goal is to reuse existing profiling techniques and adapt
them to the Linked Data world.</p>
      <p>This section presents possible solutions to the presented challenges and,
where available, the solutions I have developed.</p>
      <sec id="sec-6-1">
        <title>Profiling along hierarchies</title>
        <p>One example of a profiling task along the class hierarchy is determining the
uniqueness of properties as well as unique property combinations, which
can bring insights into the property distribution inside the dataset. It allows
finding relevant (key-candidate) properties for each level in the class hierarchy
and seeing whether their relevance increases or decreases along the hierarchy.</p>
        <p>As I have found, due to the sparsity on the Web of Data, usually neither full
key-candidate properties nor unique property combinations can be retrieved
using traditional techniques. Thus I defined the concept of keyness as the
harmonic mean of the uniqueness and density of a property5, allowing one to find
potential key candidates.</p>
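        <p>A minimal Python sketch of this keyness measure, following the uniqueness and density definitions given in footnote 5 (the function name and toy data are my own illustration):</p>

```python
def keyness(values, num_entities):
    """Keyness of a property:
    uniqueness = distinct values / total values,
    density    = non-NULL values / number of entities,
    keyness    = harmonic mean of uniqueness and density.
    `values` holds the property's non-NULL values, one per entity that has it."""
    if not values or num_entities == 0:
        return 0.0
    uniqueness = len(set(values)) / len(values)
    density = len(values) / num_entities
    return 2 * uniqueness * density / (uniqueness + density)

# Toy example: 4 entities; 3 of them have the property, all values distinct.
print(keyness(["a", "b", "c"], 4))  # uniqueness 1.0, density 0.75 -> ~0.857
```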
      </sec>
      <sec id="sec-6-2">
        <title>Heterogeneity</title>
        <p>Data profiling can be used to provide metadata describing the
characteristics of a dataset, for instance its topic and more detailed statistics, like the
main classes and properties. Furthermore, data profiling can not only determine
the usage of vocabularies, but also help in understanding and reusing existing
vocabularies. Additionally, it can assist when mapping vocabulary terms.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Topical profiling</title>
        <p>The first profiling task is, of course, to discover (and possibly label) these
topical clusters. Discovering which topics an unknown dataset covers
is already a very helpful insight. Next, any profiling task can be executed on data
of a particular topic and compared against the metadata of other topics.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Large-scale profiling</title>
        <p>
          The runtime of the profiling tasks can reach hours already on 1 million
triples, e.g., for determining property co-occurrences [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. A number of different
approaches can be chosen when trying to optimize the execution time of
algorithms dealing with RDF data in general, and data profiling tasks in particular.
Algorithmic optimization: Profiling tasks that have high computational
complexity cannot be computed naively, e.g., it is infeasible to detect property
co-occurrence by considering all possible property combinations. Such metrics
require innovative algorithms for efficiently computing the targeted result. If such
an algorithm cannot be found, approximation techniques (e.g., sampling) may
be required. Because these algorithms are often highly specialized for a specific
profiling task, they usually do not benefit other tasks.
        </p>
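        <p>The co-occurrence point can be sketched as follows: instead of testing all possible property combinations, only the pairs actually present per subject are counted (the toy triples and names are hypothetical):</p>

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical triples: (subject, predicate, object).
triples = [
    ("s1", "p1", "x"), ("s1", "p2", "y"),
    ("s2", "p1", "x"), ("s2", "p3", "z"),
    ("s3", "p1", "x"), ("s3", "p2", "y"),
]

def property_cooccurrence(triples):
    """Count how often property pairs occur on the same subject.
    Only pairs that actually co-occur are enumerated, avoiding the
    combinatorial explosion of testing every possible combination."""
    props_by_subject = defaultdict(set)
    for s, p, o in triples:
        props_by_subject[s].add(p)
    counts = defaultdict(int)
    for props in props_by_subject.values():
        for pair in combinations(sorted(props), 2):
            counts[pair] += 1
    return dict(counts)

print(property_cooccurrence(triples))
# {('p1', 'p2'): 2, ('p1', 'p3'): 1}
```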
        <p>
          Parallelization: When dealing with large datasets, a good approach to improving
performance is to perform calculations in parallel when possible [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This can be
done on different levels: dataset, profiling run, profiling task, and triples.
Cluster-based parallelization based on MapReduce is a reasonable choice when working
with Linked Data.
        </p>
        <p>Multi-Query Optimization: A data profiling run usually consists of a number of
different tasks, which all have to be computed on the same dataset. Depending
on the set of data profiling tasks, different tasks may require the same
preprocessing steps, or perform similar computation steps. Overall execution time can
be reduced by avoiding duplicate computations. Similar computation steps may
be interleaved to reduce runtime and I/O costs. If different tasks require similar
intermediate results, these can be stored in materialized views.
5 We define the uniqueness of a property as the number of unique values per number
of total values for a given property; and the density of a property as the ratio of
non-NULL values to the number of entities.</p>
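        <p>The idea of sharing preprocessing across profiling tasks can be sketched in a few lines of Python; this is a deliberate simplification of real multi-query optimization, with invented data and names:</p>

```python
from collections import defaultdict

# Hypothetical triples shared by several profiling tasks.
triples = [
    ("s1", "p1", "a"), ("s1", "p2", "b"),
    ("s2", "p1", "a"), ("s2", "p1", "c"),
]

def group_by_property(triples):
    """Shared preprocessing step: group object values by property.
    Computing this once and reusing it mimics a materialized view."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[p].append(o)
    return groups

# Two profiling tasks reuse the same grouped intermediate result
# instead of each re-scanning the raw triples.
groups = group_by_property(triples)
property_frequencies = {p: len(vals) for p, vals in groups.items()}
distinct_values = {p: len(set(vals)) for p, vals in groups.items()}

print(property_frequencies)  # {'p1': 3, 'p2': 1}
print(distinct_values)       # {'p1': 2, 'p2': 1}
```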
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Preliminary Results</title>
      <p>Initially, I defined a set of 56 useful data profiling tasks along various
groupings to profile Linked Datasets. They have been implemented as Apache
Pig scripts and are available online6.</p>
      <p>
        Furthermore, I illustrated the Web of Data's diversity with the results for four
different Linked Datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <sec id="sec-7-1">
        <title>Profiling along hierarchies</title>
        <p>When analyzing uniqueness in the class hierarchy of DBpedia, I found
that there are properties that become more specific per class level; their
uniqueness thus increases for subclasses. For instance, dbpedia:team becomes more
unique for athletes than it is for all persons. I also found properties that are
generic; their uniqueness stays constant throughout the class hierarchy. For
instance, dbpedia:birthDate is not specific to persons or any of their subclasses.</p>
        <p>Furthermore, I have defined the concept of the keyness of a property to bridge
the sparsity on the Web of Data, and thus make it possible to find potential key
candidates where traditional approaches fail.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Large-scale profiling</title>
        <p>
          We have addressed the different approaches to improving Linked Data
profiling performance, and not only developed LODOP, a system for executing,
benchmarking, and optimizing Linked Data profiling scripts on Hadoop, but also
developed and evaluated three multi-query optimization rules [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We
experimentally demonstrated that they achieve their respective goals of minimizing the
number of MapReduce jobs or the amount of data materialized between jobs,
thus reducing the profiling task runtimes by 70%.
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Evaluation Plan</title>
      <p>For the evaluation, there are three main lines of interest.</p>
      <p>Metadata The main goal is to provide comprehensive dataset metadata that
helps in analyzing the datasets. The metadata can be evaluated on quantity and
quality with respect to existing metadata on the Data Hub, VoID, and Semantic Sitemaps.</p>
      <p>Usability Tools and techniques should have high usability, with
results presented in both human- and machine-readable ways, to enable
better decision making when working with datasets.</p>
      <p>Performance evaluation Various aspects of the developed tools should be
tested for performance, especially for the huge amounts of data present
on the Web of Data.
6 http://github.com/bforchhammer/lodop/</p>
    </sec>
    <sec id="sec-9">
      <title>Reflections and Conclusion</title>
      <p>The main difference between my approach and existing work on Linked Data
profiling is that I address the shortcomings mentioned in Sec. 3, in particular gathering
comprehensive metadata in an efficient way. Within my research I am building
on existing profiling techniques for relational data and adapting them according
to the different nature of Linked Datasets.</p>
      <p>This paper has presented the outline and preliminary results of my doctoral
research, in which I am focusing on profiling the Web of Data.</p>
      <p>So far, I have specified and implemented a comprehensive set of Linked Data
profiling tasks and illustrated the Web of Data's diversity with the results for
four different Linked Datasets. Furthermore, I introduced three common
techniques for improving the performance of Linked Data profiling and implemented
three multi-query optimization rules, reducing profiling task runtimes by 70%.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , T. Grutze,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Mining and profiling RDF data with ProLOD++</article-title>
          .
          <source>In Proceedings of the International Conference on Data Engineering (ICDE)</source>
          ,
          <year>2014</year>
          . Demo.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>LODStats - an extensible framework for high-performance dataset analytics</article-title>
          .
          <source>In Proceedings of the Int. Conf. on Knowledge Engineering and Knowledge Management (EKAW)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          .
          <source>State of the LOD Cloud</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. C. Bohm, J. Lorey, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Creating VoiD descriptions for web-scale data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>339</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. C. Bohm,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fenz</surname>
          </string-name>
          , T. Grutze,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hefenbrock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pohl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Sonnabend</surname>
          </string-name>
          .
          <article-title>Profiling Linked Open Data with ProLOD</article-title>
          .
          <source>In Proceedings of the International Workshop on New Trends in Information Integration (NTII)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>B.</given-names>
            <surname>Forchhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>LODOP - Multi-Query Optimization for Linked Data Profiling Queries</article-title>
          .
          <source>In ESWC Workshop on Profiling &amp; Federated Search for Linked Data (PROFILES)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Glimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          .
          <article-title>OWL: Yet to arrive on the Web of Data?</article-title>
          <source>In WWW Workshop on Linked Data on the Web (LDOW)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          , M. Krotzsch,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          , and S. Rudolph, editors.
          <source>OWL 2 Web Ontology Language: Primer. W3C Recommendation</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          . ExpLOD:
          <article-title>Summary-based exploration of interlinking and RDF usage in the linked open data cloud</article-title>
          .
          <source>In Proceedings of the Extended Semantic Web Conference (ESWC)</source>
          , Heraklion, Greece,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          and W. Wo .
          <article-title>RDFStats { an extensible RDF statistics generator and library</article-title>
          .
          <source>In Proceedings of the International Workshop on Database and Expert Systems Applications (DEXA)</source>
          , pages
          <fpage>79</fpage>
          -
          <lpage>83</lpage>
          , Los Alamitos, CA, USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Data Profiling for Semantic Web Data</article-title>
          .
          <source>In Proceedings of the International Conference on Web Information Systems and Mining (WISM)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Data profiling revisited</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>42</volume>
          (
          <issue>4</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>