<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Loupe - An Online Tool for Inspecting Datasets in the Linked Data Cloud</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nandana Mihindukulasooriya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María Poveda-Villalón</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raúl García-Castro</string-name>
          <email>rgarcia@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asunción Gómez-Pérez</string-name>
          <email>asun@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Engineering Group, Universidad Politécnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Linked Data initiative continues to grow, making more datasets available; however, discovering the type of data contained in a dataset, its structure, and the vocabularies used remains a challenge that hinders querying and reuse. VoID descriptions provide a starting point, but a more detailed analysis is required to unveil implicit vocabulary usage such as common data patterns. Such analysis helps the selection of datasets, the formulation of effective queries, and the identification of quality issues. Loupe is an online tool for inspecting datasets by looking at both implicit data patterns and explicit vocabulary definitions in data. This demo paper presents the dataset inspection capabilities of Loupe.</p>
      </abstract>
      <kwd-group>
        <kwd>linked data</kwd>
        <kwd>vocabulary</kwd>
        <kwd>inspection</kwd>
        <kwd>exploration</kwd>
        <kwd>ontology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, we have seen great enthusiasm for publishing data in a way that
can be easily understood by machines following the Linked Data principles,
leading to a large Web of Data. According to the State of the LOD Cloud 2014<sup>1</sup>, there
were more than 1,000 publicly available datasets describing more than 8 million
resources. However, in order to know which of these datasets can be reused for
a given task or to perform queries to find some specific information, people still
have to perform multiple data discovery tasks beforehand to find out which type
of data each of the datasets contains, their structure, the vocabularies<sup>2</sup> used,
and other characteristics.</p>
      <p>One way of providing such information about a dataset is to provide a VoID
description. However, VoID descriptions are not present in every dataset, and
even when they are, the information provided is not enough to understand the
structure of the data and formulate proper queries. For instance, the English
DBpedia, a popular dataset, has instances of 321,506 distinct classes
and 58,059 distinct properties associated with those instances. Thus, in order
to use such a dataset and to understand its content, one needs to know which
classes and properties are most commonly used, which properties are generally
associated with an instance of a given class, the potential ranges of a given
property, etc.
<fn><label>*</label><p>This research is partially supported by the 4V Spanish national project
(TIN2013-46238-C4-2-R) and the FPI grant (BES-2014-068449).</p></fn>
<fn><label>1</label><p>See http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/</p></fn>
<fn><label>2</label><p>Throughout this work we use the terms "vocabulary" and "ontology" interchangeably.</p></fn></p>
      <p>In this paper we present Loupe, an online tool<sup>3</sup> for inspecting the structure
of datasets by analyzing both the implicit data patterns (i.e., how the different
vocabularies are being used in the data themselves) and the explicit vocabulary
definitions (i.e., any RDFS and OWL ontological axioms found in data).</p>
    </sec>
    <sec id="sec-2">
      <title>Exploring a dataset with Loupe</title>
      <p>There are several ways in which datasets can be linked to the ontologies used to
annotate their content. Loupe makes use of a series of parametrized queries in
order to unveil such links between datasets and ontologies<sup>4</sup>.</p>
      <p>The data annotations themselves reveal the vocabularies used in a
dataset; they constitute an implicit reference to the ontologies behind those
vocabularies. For inspecting these implicit vocabularies,
Loupe provides the following capabilities:
<list list-type="bullet">
<list-item><p>Class inspector: reports the classes used in the dataset together with the
number of instances for each class and groups those classes by their
namespaces. In addition, the properties used by individuals belonging to a given
class are also provided. Finally, information about multiclassification is
provided, that is, the other classes to which the individuals of a given class belong.</p></list-item>
<list-item><p>Property inspector: shows the properties used together with the number
of triples in which each property appears as predicate. For each property, a
subject and an object analysis are performed. For the subject analysis, the
system distinguishes whether the elements found are URI resources
or blank nodes. For the object analysis, the system shows whether they are
URI resources, blank nodes, or literals (data types).</p></list-item>
<list-item><p>Triples inspector: provides the common patterns that appear in the dataset
by inspecting all the triples. For each triple, one or more patterns of the form
&lt;subjectType, predicate, objectType&gt; are extracted. The subject field
represents the different classes to which the individual in the subject position
of a triple belongs. The predicate field represents the predicate appearing in
the given triple. The object field contains the different classes to which an
individual in the object position of a triple belongs, or a literal
(or the defined xsd:datatypes).</p></list-item>
<list-item><p>Namegraph inspector: lists the graphs appearing in the dataset together
with the number of triples that each graph contains.</p></list-item>
</list>
<fn><label>3</label><p>A demo is available at http://loupe.linkeddata.es/loupe/demo.html</p></fn>
<fn><label>4</label><p>Detailed information about the methods and SPARQL queries used for inspecting
the datasets is provided in http://loupe.linkeddata.es/loupe/methods.html</p></fn></p>
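      <p>As an illustration only, the following Python sketch mimics the class, property, and triples inspectors described above on a handful of in-memory triples. It is not Loupe's implementation, which relies on parametrized SPARQL queries; the sample data, the function names, the "(untyped)" placeholder, and the decision to skip rdf:type triples when extracting patterns are all assumptions made for this example.</p>
      <preformat><![CDATA[
```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Tiny made-up dataset. URIs are plain strings, blank nodes start with "_:",
# and literals are modelled as (value, datatype) tuples.
TRIPLES = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", RDF_TYPE, "ex:Author"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:book1", RDF_TYPE, "ex:Book"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:book1", "dc:creator", "ex:alice"),
    ("ex:alice", "foaf:name", ("Alice", "xsd:string")),
    ("_:b0", "foaf:name", ("Anonymous", "xsd:string")),
]


def types_by_resource(triples):
    """Map each resource to the set of classes it is an instance of."""
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    return types


def class_counts(triples):
    """Class inspector: number of instances per class."""
    counts = Counter()
    for classes in types_by_resource(triples).values():
        counts.update(classes)
    return counts


def node_kind(term):
    """Classify a term as a literal, a blank node, or a URI resource."""
    if isinstance(term, tuple):
        return "literal"
    return "blank" if term.startswith("_:") else "uri"


def property_profile(triples):
    """Property inspector: triple count and subject/object kinds per property."""
    profile = {}
    for s, p, o in triples:
        entry = profile.setdefault(
            p, {"triples": 0, "subjects": Counter(), "objects": Counter()}
        )
        entry["triples"] += 1
        entry["subjects"][node_kind(s)] += 1
        entry["objects"][node_kind(o)] += 1
    return profile


def triple_patterns(triples):
    """Triples inspector: count (subjectType, predicate, objectType) patterns."""
    types = types_by_resource(triples)
    patterns = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # assumption: typing triples only supply the classes
        subject_types = types.get(s, {"(untyped)"})
        if isinstance(o, tuple):
            object_types = {o[1]}  # datatype of the literal
        else:
            object_types = types.get(o, {"(untyped)"})
        # A multi-typed individual contributes one pattern per class.
        for st in subject_types:
            for ot in object_types:
                patterns[(st, p, ot)] += 1
    return patterns
```
]]></preformat>
      <p>Because ex:alice is typed as both foaf:Person and ex:Author in the sample data, each triple involving her yields one pattern per class, which is how a single triple can give rise to "one or more patterns".</p>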
      <p>In other cases, vocabularies might have been included as part of the dataset
itself; in this case, the vocabularies could be defined as instances of the class
owl:Ontology. For this explicit ontological knowledge present in the datasets,
Loupe's ontology inspector provides information about the ontologies, classes,
object properties, and datatype properties declared in the dataset. The graphs
where this information is stored are also listed. Finally, some provenance
information for the inspectors created for each dataset is provided.</p>
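      <p>A minimal sketch of what looking for such explicit declarations could involve is given below, assuming the dataset is available as (graph, subject, predicate, object) quads. The list of declaration types (including rdfs:Class), the sample quads, and the function name are assumptions for illustration and are not taken from Loupe itself.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Declaration types the sketch looks for, mapped to a reporting category.
# This particular list is an assumption made for the example.
DECLARATION_TYPES = {
    "owl:Ontology": "ontologies",
    "owl:Class": "classes",
    "rdfs:Class": "classes",
    "owl:ObjectProperty": "object_properties",
    "owl:DatatypeProperty": "datatype_properties",
}

# Made-up quads: (graph, subject, predicate, object).
QUADS = [
    ("ex:graph/schema", "ex:onto", "rdf:type", "owl:Ontology"),
    ("ex:graph/schema", "ex:Book", "rdf:type", "owl:Class"),
    ("ex:graph/schema", "ex:writtenBy", "rdf:type", "owl:ObjectProperty"),
    ("ex:graph/schema", "ex:title", "rdf:type", "owl:DatatypeProperty"),
    ("ex:graph/data", "ex:book1", "rdf:type", "ex:Book"),
]


def explicit_declarations(quads):
    """Return declared terms per category and the graphs that hold them."""
    declared = defaultdict(set)
    graphs = set()
    for graph, s, p, o in quads:
        if p == "rdf:type" and o in DECLARATION_TYPES:
            category = DECLARATION_TYPES[o]
            declared[category].add(s)
            graphs.add(graph)
    return dict(declared), graphs
```
]]></preformat>
      <p>In the sample quads only the schema graph contains declarations, so the instance data in the second graph is ignored by this step.</p>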
    </sec>
    <sec id="sec-3">
      <title>High-level architecture</title>
      <p>
        Given a new dataset, Loupe Core creates an index in the ElasticSearch<sup>5</sup> server
and indexes the information about the dataset (see Fig. 1) using the methods
discussed in Section 2. Once the indexes are created, they are picked up by the
Loupe UI, which allows users to browse through the indexed data. In addition,
the most important data are displayed in tabular form, and search and autocomplete
features are provided to make it easier for users to discover information about
the classes and properties used.
<fn><label>5</label><p>https://www.elastic.co/products/elasticsearch</p></fn>
<fn><label>6</label><p>https://github.com/openlink/virtuoso-opensource</p></fn>
      </p>
      <p>
        make-void<sup>7</sup> is a tool for generating VoID statistics for RDF files, while
RDFStats [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] generates statistics of RDF sources including histograms. Loupe goes
beyond the properties defined in VoID descriptions and extracts more detailed
characteristics of classes and properties used in a dataset as well as common
triple patterns.
      </p>
      <p>
        LODStats [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a framework for dataset analytics and its portal<sup>8</sup> provides
statistics for approximately 10K datasets. However, LODStats does not include
information such as the properties associated with instances of a given class or
a detailed analysis of subject and object values of a property, as those require
complex graph pattern queries that do not fit within its statement-stream-based
approach. ABSTAT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provides a summary of the most commonly used abstract
knowledge patterns similar to the triple-patterns shown by Loupe.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and future work</title>
      <p>This paper presents Loupe, an online tool for inspecting Linked Datasets. It
provides a summary of implicit vocabulary information about the dataset such
as the classes and properties used in the dataset and a detailed analysis of them.
In addition, it inspects the dataset for explicit ontological axiom definitions.</p>
      <p>One of the challenges for Loupe is the generation of indexes from public
SPARQL endpoints due to endpoint unreliability, query limitations, and
interoperability issues. To address this challenge, (a) queries are decomposed into
simpler ones, with some aggregations and computations performed in
Loupe Core, and (b) local highly available endpoints are created if RDF dumps are
available. One limitation of Loupe is that it is most suitable when the datasets
are updated periodically in batch mode. If the dataset is frequently updated,
the index has to be recreated to stay synchronized with the dataset. As future
work, we will look into ways of detecting and keeping track of changes to the
dataset and updating only the relevant parts of the index.
<fn><label>7</label><p>https://github.com/cygri/make-void</p></fn>
<fn><label>8</label><p>http://stats.lod2.eu/</p></fn></p>
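      <p>To make the idea of query decomposition concrete, the sketch below replaces one aggregate query (instances per class) with a paged listing of classes followed by one simple count query per class, aggregating the results on the client side. It is a hedged illustration rather than the queries Loupe actually issues: the SPARQL templates, the StubEndpoint class that stands in for a public endpoint with a result-size cap, and all names are hypothetical.</p>
      <preformat><![CDATA[
```python
# Illustrative SPARQL templates for the two simple queries. The stub below
# answers them from memory without parsing SPARQL.
LIST_CLASSES = (
    "SELECT DISTINCT ?c WHERE { ?s a ?c } ORDER BY ?c LIMIT %d OFFSET %d"
)
COUNT_INSTANCES = "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s a <%s> }"


class StubEndpoint:
    """Stands in for a public SPARQL endpoint that caps result sizes."""

    def __init__(self, typed_resources, max_rows=2):
        self.typed = typed_resources  # list of (resource, class) pairs
        self.max_rows = max_rows

    def list_classes(self, limit, offset):
        """Answer LIST_CLASSES, never returning more than max_rows rows."""
        classes = sorted({c for _, c in self.typed})
        return classes[offset:offset + min(limit, self.max_rows)]

    def count_instances(self, cls):
        """Answer COUNT_INSTANCES for a single class."""
        return len({s for s, c in self.typed if c == cls})


def class_instance_counts(endpoint, page_size=2):
    """Page through the classes, then count each one with a simple query."""
    counts = {}
    offset = 0
    while True:
        page = endpoint.list_classes(page_size, offset)
        if not page:
            break
        for cls in page:
            counts[cls] = endpoint.count_instances(cls)
        offset += len(page)
    return counts


endpoint = StubEndpoint([
    ("ex:a", "ex:Person"),
    ("ex:b", "ex:Person"),
    ("ex:c", "ex:Book"),
    ("ex:d", "ex:Place"),
])
counts = class_instance_counts(endpoint)
```
]]></preformat>
      <p>Each request stays small and can be retried on its own, at the cost of more round trips; this trade-off is an assumption of the sketch rather than a measured property of Loupe.</p>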
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Langegger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Wo , W.:
          <article-title>Rdfstats-an extensible rdf statistics generator and library</article-title>
          .
          <source>In: 20th International Workshop on Database and Expert Systems Application</source>
          ,
          <year>2009</year>
          . DEXA'09,
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2009</year>
          )
          <volume>79</volume>
          {
          <fpage>83</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LODStats - an extensible framework for high-performance dataset analytics</article-title>
          .
          <source>In: Knowledge Engineering and Knowledge Management</source>
          . Springer (
          <year>2012</year>
          )
          <fpage>353</fpage>
          –
          <lpage>362</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Palmonari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porrini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spahiu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferme</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>ABSTAT: Linked Data Summaries with ABstraction and STATistics</article-title>
          .
          <source>In: Demo at the 12th Extended Semantic Web Conference (ESWC2015)</source>
          , Portorož, Slovenia (
          <year>June 2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>