<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Linked Data Profiling Service for Quality Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nandana Mihindukulasooriya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raúl García-Castro</string-name>
          <email>rgarcia@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Freddy Priyatna</string-name>
          <email>fpriyatna@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edna Ruckhaus</string-name>
          <email>eruckhaus@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nelson Saturno</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Engineering Group, Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Simón Bolívar</institution>
          ,
          <addr-line>Caracas</addr-line>
          ,
          <country country="VE">Venezuela</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Linked (Open) Data cloud has been growing at a rapid rate in recent years. However, the large variance in dataset quality is a key obstacle that hinders their use, so quality assessment has become an important aspect. Data profiling is one of the most widely used techniques for data quality assessment in domains such as relational data; nevertheless, it is not so widely used in Linked Data. We argue that one reason for this is the lack of Linked Data profiling tools that are configurable in a declarative manner and that produce comprehensive profiling information with the level of detail required by quality assessment techniques. To this end, this demo paper presents the Loupe API, a RESTful web service that profiles Linked Data based on user requirements and produces comprehensive profiling information on explicit RDF data, class, property, and vocabulary usage, and on implicit data patterns such as cardinalities, instance ratios, value distributions, and multilingualism. Profiling results can be used to assess quality either by manual inspection or automatically using data validation languages such as SHACL, ShEx, or SPIN.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Quality</kwd>
        <kwd>Data Profiling</kwd>
        <kwd>Services</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Linked (Open) Data cloud has been growing at a rapid rate in recent years.
Some portions of it come from crowd-sourced knowledge bases such as Wikipedia,
while others come from government administrations, research publishers, and
other organizations. These datasets have different levels of quality [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] such that
for most practical use cases, they need to be assessed to get an indication of
their quality.
      </p>
      <p>
        Juran and Godfrey describe quality using multiple views [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. On the one hand,
quality can be seen as "fit for intended use in operations, decision-making, and
planning", i.e., relevance, recency, completeness, and precision. On the other
hand, quality is also viewed as "freedom from deficiencies", i.e., correctness and
consistency. In either case, quality assessment is needed before using the data
for a given task to ensure that the data has an adequate quality level. Further,
the results of the assessment can be used to assist the process of improving
quality by cleaning and repairing deficiencies in the data. The objective of the
work presented in this paper is to provide a data profiling service with fine-grained
information that can be used as input for many quality assessment tasks related
to both of these views of data quality. (This research is partially supported by the 4V (TIN2013-46238-C4-2-R) and
MobileAge (H2020/693319) projects and the FPI grant (BES-2014-068449).)
      </p>
      <p>
        Detailed data analysis is one common preliminary task in quality assessment,
and data profiling is one of the most widely used techniques for such analysis
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Data profiling is defined as the process of examining data to collect statistics
and provide relevant metadata about the data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Even though data profiling is
widely used in quality assessment in domains such as relational data, we see a
lack of usage of data profiling in Linked Data.
      </p>
      <p>
        In this paper, we describe a Linked Data profiling service, the Loupe API,
which provides access to the Loupe tool via a RESTful interface. The Loupe API
may be configured to specify the source data as well as the profiling activities it
should perform. As a consequence, it can be used for different purposes. In
recent years, Loupe has been used to assess the quality of datasets in several
projects such as DBpedia [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and 3Cixty [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A RESTful interface facilitates the
integration of the Loupe profiling services into other systems. The Loupe API has
been integrated with one of our ongoing projects, MappingPedia1, a collaborative
environment for R2RML mappings, in order to gather statistics and perform quality
assessment, since R2RML mappings are themselves RDF/Linked Data datasets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Loupe API</title>
      <p>The Loupe API2 is a configurable Linked Data profiling service. The three main
phases in Linked Data profiling are (1) specification of input, (2) execution of
data profiling, and (3) representation of profiling results. Listing 1a shows an
example input of a Loupe API profile request. Users can specify their requirements
(i.e., which profiling tasks to execute) and other configuration details such as how
to access the data source (and which data to profile), or whether to persist the
profiling results in the Loupe public repository (i.e., making them available via
search). The profiling tasks are grouped into four categories:
- summary: provides generic statistics on an RDF data source related to
its size and the type of content it has, for example, typed entity count or
distinct IRI object count.
- vocabUsage: provides information on the implicit schema of the data by
analyzing how vocabulary terms such as classes and properties are used,
including their domains and ranges, cardinalities, and uniqueness, among others.
- languagePartitions: provides information on multilingual content by
analyzing the frequency of each language in language-tagged strings.
- valueDistributions: provides information on the value distribution of a
given property.
1 http://demo.mappingpedia.linkeddata.es/
2 http://api.loupe.linkeddata.es/</p>
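      <p>To make the four task categories concrete, the following toy profiler computes a few of the statistics named above over an in-memory list of triples. This is an illustrative sketch only: the triple representation and statistic names are hypothetical simplifications, not Loupe's actual implementation or output format.</p>
      <preformat>
```python
from collections import Counter

# Illustrative mini-profiler (hypothetical, not Loupe's implementation).
# Triples are (subject, predicate, object, lang) tuples; lang is None for
# IRIs and untagged literals.
RDF_TYPE = "rdf:type"

def profile(triples):
    # summary: typed entity count and distinct IRI object count
    typed_entities = {s for s, p, o, _ in triples if p == RDF_TYPE}
    iri_objects = {o for _, p, o, lang in triples
                   if lang is None and o.startswith(("http://", "https://"))}
    # languagePartitions: frequency of each language tag
    languages = Counter(lang for *_, lang in triples if lang is not None)
    # vocabUsage: how often each property is used
    property_usage = Counter(p for _, p, _, _ in triples)
    return {
        "typedEntityCount": len(typed_entities),
        "distinctIriObjectCount": len(iri_objects),
        "languagePartitions": dict(languages),
        "propertyUsage": dict(property_usage),
    }

triples = [
    ("ex:madrid", RDF_TYPE, "http://dbpedia.org/ontology/City", None),
    ("ex:madrid", "rdfs:label", "Madrid", "es"),
    ("ex:madrid", "rdfs:label", "Madrid", "en"),
    ("ex:caracas", RDF_TYPE, "http://dbpedia.org/ontology/City", None),
    ("ex:caracas", "rdfs:label", "Caracas", "es"),
]
stats = profile(triples)
```
      </preformat>
      <p>A real profiler would instead issue SPARQL queries against the configured data source; the sketch only shows what the resulting numbers mean.</p>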
      <p>The results of profiling are represented in RDF using the Loupe ontology3.
The main elements of the profiling results are illustrated in Figure 1b; the
complete results in RDF are available4; we also provide a set of cURL examples5 for
invoking the service.</p>
      <p>(a) Input Configuration
(b) Output Elements</p>
      <p>
        These profiling results can be used for validating the quality of a dataset
either by manual inspection or by specifying validation rules in a language such
as SHACL6, ShEx7, or SPIN8. Data profiling facilitates manual inspection
by providing a high-level summary so that an evaluator can adapt techniques
such as exploratory testing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to identify strange occurrences in data.
      </p>
      <p>
        Nevertheless, automatic validation is needed when a large amount of data is
present, and it is feasible in most situations. For example, data model constraints
such as uniqueness of values, expected cardinalities, domains and ranges, and
inconsistent use of duplicate properties [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] can easily be validated by expressing them
in a constraint language and checking them automatically against the profiling information.
      </p>
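      <p>As an illustration of this kind of automatic check, the sketch below validates hypothetical maximum-cardinality constraints against per-subject property counts gathered during profiling. It is not SHACL, ShEx, or SPIN; it only mimics, in plain Python, what a constraint engine would derive from the profiling information.</p>
      <preformat>
```python
from collections import defaultdict

# Illustrative sketch (not SHACL/ShEx/SPIN): check hypothetical maximum
# cardinality constraints against per-subject property counts.

def observed_cardinalities(triples):
    """Maximum number of values each property takes for any single subject."""
    per_subject = defaultdict(int)
    for s, p, o in triples:
        per_subject[(s, p)] += 1
    max_card = defaultdict(int)
    for (s, p), n in per_subject.items():
        max_card[p] = max(max_card[p], n)
    return dict(max_card)

def validate(triples, max_cardinality):
    """max_cardinality: {property: allowed maximum}; returns the observed
    cardinality of every property that exceeds its declared maximum."""
    observed = observed_cardinalities(triples)
    return {p: observed[p] for p, limit in max_cardinality.items()
            if observed.get(p, 0) > limit}

triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:name", "Alicia"),  # violates max cardinality 1
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
]
violations = validate(triples, {"foaf:name": 1, "foaf:mbox": 1})
```
      </preformat>
      <p>In practice the declared constraints would come from a vocabulary or a shapes file, and the observed cardinalities from the vocabUsage profiling results.</p>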
      <p>Further, profiling results enable the analysis of a dataset over a period of time
by periodically profiling data and performing the analysis on multiple profiling
results. For instance, unexpected deletions of data or undesired changes can be
detected by analysing the changes in the dataset profiles.</p>
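      <p>A minimal sketch of such a temporal analysis, assuming two profile snapshots represented as plain metric dictionaries (a simplification of the RDF profiles Loupe actually produces):</p>
      <preformat>
```python
# Illustrative sketch: flag metrics that dropped by more than a threshold
# between two profiling runs (possible unexpected deletions). The snapshot
# structure and metric names are hypothetical simplifications.

def profile_drift(old, new, threshold=0.1):
    """Return {metric: (old, new)} for metrics that dropped by more than
    `threshold` as a fraction of their old value."""
    alerts = {}
    for metric, old_value in old.items():
        new_value = new.get(metric, 0)
        if old_value > 0 and (old_value - new_value) / old_value > threshold:
            alerts[metric] = (old_value, new_value)
    return alerts

snapshot_jan = {"tripleCount": 1000, "typedEntityCount": 120}
snapshot_feb = {"tripleCount": 1005, "typedEntityCount": 80}
alerts = profile_drift(snapshot_jan, snapshot_feb)
```
      </preformat>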
      <p>
        The Loupe API is implemented as a RESTful service and currently three
operations are available as illustrated in Figure 2.
3 http://ont-loupe.linkeddata.es/def/core#
4 https://git.io/vy1tO
5 https://github.com/nandana/loupe-api/wiki/examples
6 https://www.w3.org/TR/shacl/
7 https://shexspec.github.io/spec/
8 http://spinrdf.org/
</p>
      <p>
        Zaveri et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] present a comprehensive review of data quality assessment
techniques and tools in the literature, and propose a conceptual framework with
quality metrics grouped into four dimensions: accessibility, intrinsic, contextual,
and representational; in particular, it mentions the use of profiling by the
ProLOD tool for semantic accuracy. The ProLOD tool [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] has a pre-processing
clustering and labeling phase and a real-time profiling phase that gathers
statistics on a specific cluster in order to detect misused properties and discordant
values.
      </p>
      <p>
        Tools that provide statistics on the Linked Open Data Cloud include Aether
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which provides extended VoID statistical descriptions of RDF content and
interlinking, and ExpLOD [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ABSTAT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which provide summaries of RDF
usage and interlinking.
      </p>
      <p>Unlike the other tools mentioned, the Loupe API is available as
a RESTful web service where users can configure and generate Linked Data
profiles in RDF using the Loupe ontology. Further, Loupe provides summarized
information not only on explicit vocabulary, class, and property usage, as the other
tools do, but also facilitates the analysis of implicit data patterns by providing
a finer-grained set of metrics compared to existing tools, such as instance ratio
(the ratio of instances of a given class to all entities) and property cardinalities. These
fine-grained metrics and other capabilities of Loupe have been applied to the
analysis of redundant information, consistency with respect to the axioms in the
ontology, syntactic validity, and detection of outliers.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and future work</title>
      <p>This paper presents the Loupe API, a configurable RESTful service for profiling
Linked Data, whose results can be used for quality assessment purposes. The
paper illustrated its use, and motivated it with a discussion of how it can be
integrated into the quality assessment process.</p>
      <p>Nevertheless, there are several challenges in profiling large datasets using a
service compared to a standalone tool. Thus, the Loupe API is mostly suitable for
profiling small datasets. Large datasets (e.g., DBpedia) could take a long time to
profile and the requests might time out. In the future, we plan to provide support
for asynchronous executions for such cases.</p>
      <p>Another challenge is to detect the capabilities and limitations of the SPARQL
endpoint and to adapt to those capabilities. The Loupe API uses SPARQL 1.1
features, and some metrics are omitted if an endpoint only supports SPARQL 1.0.</p>
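      <p>One possible (hypothetical) way to implement such adaptation is to annotate each metric with the minimum SPARQL version its query requires and filter accordingly; the metric names and version requirements below are illustrative only, not Loupe's actual mapping.</p>
      <preformat>
```python
# Illustrative sketch: skip metrics whose profiling queries need SPARQL 1.1
# features when the endpoint only supports SPARQL 1.0. The requirements
# below are hypothetical examples.
METRIC_REQUIREMENTS = {
    "tripleCount": "1.0",             # assumed computable without 1.1 features
    "distinctIriObjectCount": "1.1",  # e.g., needs COUNT(DISTINCT ...)
    "languagePartitions": "1.1",      # e.g., needs GROUP BY on LANG(?o)
}

def supported_metrics(endpoint_version):
    """Return the metrics whose required version the endpoint satisfies."""
    return [metric for metric, required in METRIC_REQUIREMENTS.items()
            if endpoint_version >= required]
```
      </preformat>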
      <p>In the future, we also plan to extend the profiling service to other Linked
Data sources such as RDF dumps, SPARQL CONSTRUCT queries, and LDF
endpoints. Further, we plan to allow users to specify their quality requirements in
a declarative manner using formal languages such as SHACL, ShEx, and SPIN, or
using an editor with common validation rules. This will allow the Loupe API to
generate quality assessment reports based on those requirements.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietrobon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Quality Assessment for Linked Data: A Survey</article-title>
          .
          <source>Semantic Web</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          ) (
          <year>2016</year>
          )
          <fpage>63</fpage>
          -
          <lpage>93</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Defeo</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juran</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>Juran's Quality Handbook: The Complete Guide to Performance Excellence</article-title>
          . 6th edn. McGraw-Hill Education (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Olson</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          :
          <article-title>Data Quality: The Accuracy Dimension</article-title>
          .
          1st edn. Morgan Kaufmann (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Do</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          :
          <article-title>Data Cleaning: Problems and Current Approaches</article-title>
          .
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>23</volume>
          (
          <issue>4</issue>
          ) (
          <year>2000</year>
          )
          <fpage>3</fpage>
          -
          <lpage>13</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mihindukulasooriya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Castro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Perez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An analysis of the quality issues of the properties available in the Spanish DBpedia</article-title>
          .
          <source>In: Conference of the Spanish Association for AI</source>
          , Springer (
          <year>2015</year>
          )
          <fpage>198</fpage>
          -
          <lpage>209</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mihindukulasooriya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rizzo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Castro</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>A Two-Fold Quality Assurance Approach for Dynamic Knowledge Bases: The 3cixty Use Case</article-title>
          .
          <source>In: Proceedings of the 1st International Workshop on Completing and Debugging the Semantic Web</source>
          . (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Bohm,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Fenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            , Grutze, T.,
            <surname>Hefenbrock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Pohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Sonnabend</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          :
          <article-title>Profiling Linked Open Data with ProLOD</article-title>
          . In Haas, L., ed.
          <source>: Proceedings of the 2nd International Workshop on New Trends in Information Integration</source>
          , IEEE (
          <year>2010</year>
          )
          <fpage>175</fpage>
          -
          <lpage>178</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Makela, E.:
          <article-title>Aether{generating and viewing extended void statistical descriptions of rdf datasets</article-title>
          .
          <source>In: European Semantic Web Conference</source>
          , Springer (
          <year>2014</year>
          )
          <fpage>429</fpage>
          -
          <lpage>433</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Khatchadourian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Consens</surname>
            ,
            <given-names>M.P.:</given-names>
          </string-name>
          <article-title>ExpLOD: Summary-based exploration of interlinking and RDF usage in the Linked Open Data cloud</article-title>
          . In: ESWC. (
          <year>2010</year>
          )
          <fpage>272</fpage>
          -
          <lpage>287</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Spahiu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porrini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmonari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>ABSTAT: Ontology-Driven Linked Data Summaries with Pattern Minimalization</article-title>
          .
          <source>In: ESWC (Satellite Events) 2016</source>
          , Springer (
          <year>2016</year>
          )
          <fpage>381</fpage>
          -
          <lpage>395</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>