<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Linked Data Graph Structures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anja Jentzsch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Dullweber</string-name>
          <email>christian.dullweber@student.hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Troiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Naumann</string-name>
          <email>felix.naumann@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DII, University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hasso Plattner Institute</institution>
          ,
          <addr-line>Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The true value of Linked Data becomes apparent when datasets are analyzed and understood already at the basic level of data types, constraints, value patterns, etc. Such data profiling is especially challenging for RDF data, the underlying data model of the Web of Data. In particular, graph analysis can be used to gain more insight into the data, induce schemas, or build indices. We present ProLod++, a tool for various profiling and mining tasks, and in particular its recent extension GraphLod, which offers RDF graph analysis features. ProLod++ features many interactive profiling results specific to open data, such as schema discovery for user-generated attributes, association rule discovery to uncover synonymous predicates, and key discovery along ontology hierarchies. GraphLod enhances it with subgraph pattern mining, node degree distribution, component visualization and analysis, and more.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        ProLod++ features many interactive profiling results specific to open data, such as
schema discovery for user-generated attributes, association rule discovery to uncover
synonymous predicates, and key discovery along ontology hierarchies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. ProLod++ is now a
Play application and can easily be extended with further techniques. It is
available at http://prolod.org. We implemented and added the GraphLod library,
which provides the following new functionality:
      </p>
      <list list-type="bullet">
        <list-item><p>Basic graph statistics, such as the number of connected components and
strongly connected components, their corresponding diameter, chromatic
number, and node degree distribution.</p></list-item>
        <list-item><p>Visualization of connected components, which are grouped if isomorphic.</p></list-item>
        <list-item><p>Three graph pattern mining algorithms.</p></list-item>
        <list-item><p>Visualization of mined patterns with class coloring.</p></list-item>
        <list-item><p>Interactive graph structure exploration in a faceted fashion.</p></list-item>
      </list>
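      <p>Two of the basic statistics named above, connected components and the node degree distribution, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GraphLod implementation: the function names and the toy edge list are hypothetical, and the graph is assumed to be given as a list of (source, target) pairs.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict, deque


def connected_components(edges):
    """Return the weakly connected components of a directed edge list."""
    neighbors = defaultdict(set)
    for source, target in edges:
        # Ignore direction: weak connectivity treats every link as undirected.
        neighbors[source].add(target)
        neighbors[target].add(source)
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:  # breadth-first search from the unvisited start node
            node = queue.popleft()
            component.add(node)
            for other in neighbors[node] - seen:
                seen.add(other)
                queue.append(other)
        components.append(component)
    return components


def degree_distribution(edges):
    """Map each total degree (in plus out) to the number of nodes having it."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return Counter(degree.values())


# Hypothetical toy data: entities as nodes, object properties as links.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
components = connected_components(edges)
distribution = degree_distribution(edges)
```
]]></preformat>
      <p>On the toy data, the sketch finds one component with three nodes and one with two. Strongly connected components, diameter, and chromatic number would need additional algorithms that are not shown here.</p>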
    </sec>
    <sec id="sec-2">
      <title>Profiling and Mining Features</title>
      <p>The features of ProLod++ can be categorized into profiling and mining tasks,
as illustrated in Table 1.</p>
      <p>
        Basic Analysis. Imported data is clustered by hierarchical topic clustering if no
underlying schema is available; otherwise, it is grouped according to the underlying
taxonomic hierarchy. The profiling and mining tasks are executed on import, and the
results are stored in a relational database. These results include statistics on the
frequencies and distributions of distinct subjects, properties, and objects. Pattern analysis
provides the user with statistics on the data types and value pattern distributions
of particular properties. ProLod++ discovers positive and negative association
rules, e.g., to find synonymous or inverse properties. To cope
with the sparsity of property values on the Web of Data when discovering key
candidates, ProLod++ calculates the keyness measure for each property along
the ontology class hierarchy. These features were already demonstrated in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ];
the main contributions of this demonstration are described next.
      </p>
      <p>Graph Feature Analysis. ProLod++ allows exploring the graph
structures of Linked Datasets by visualizing the connected components and the graph
patterns mined from them. Given the underlying graph of a Linked Dataset,
which contains all entities as nodes and the object properties between them as links,
we detect graph patterns in both its directed and its undirected version. The
latter allows for pattern mining on a more general level. Larger graph components
(&gt; 1000 nodes) are mined for subgraph patterns using three different approaches:
gSpan, GRAMI, and a new approach that mines for predefined patterns. Our
goal is to define a set of graph patterns that can be considered the core of most
Linked Datasets. We identify graph patterns such as paths, cycles, stars, siamese
stars, antennas, caterpillars, and lobsters. Figure 1 is a screenshot of ProLod++
showing all occurrences of a selected pattern and their class distribution along
with some statistical information.
      </p>
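      <p>To make the idea of predefined patterns concrete, the following Python sketch labels a small component as a path, cycle, or star from its edge and degree counts. The text does not describe how the predefined-pattern approach works internally, so this heuristic, the function name, and the example data are assumptions made purely for illustration. The input is assumed to be a connected, undirected graph without self-loops.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def classify_component(edges):
    """Label a connected, simple, undirected graph as path, cycle, star, or other."""
    # Treat each link as undirected and drop duplicate links.
    unique_edges = {frozenset(edge) for edge in edges}
    degree = Counter()
    for edge in unique_edges:
        for node in edge:
            degree[node] += 1
    n, m = len(degree), len(unique_edges)
    degrees = sorted(degree.values())
    if m == n and all(d == 2 for d in degrees):
        return "cycle"
    if m == n - 1:  # a connected graph with n - 1 edges is a tree
        if n >= 4 and degrees[-1] == n - 1:
            return "star"  # one center linked to every other node
        if degrees[-1] <= 2:
            return "path"  # a tree without branching
    return "other"


# Hypothetical examples; node names are placeholders.
star = [("astronomer", "o1"), ("astronomer", "o2"), ("astronomer", "o3")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
cycle = [("a", "b"), ("b", "c"), ("c", "a")]
branching_tree = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]
```
]]></preformat>
      <p>Siamese stars, antennas, caterpillars, and lobsters would require further structural checks and are reported as "other" by this sketch.</p>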
      <p>ProLod++ allows faceted browsing through the graph patterns. Patterns
are grouped when isomorphic, first based on their underlying structure and
then based on class membership (color). This allows for finding not only
common, recurring patterns but also patterns that are dominant for certain
class combinations. For example, astronomers in DBpedia often appear at the center of star
patterns, surrounded by the astronomical objects they discovered.</p>
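      <p>The two-level grouping described above can be illustrated with a brute-force canonical form: two small patterns receive the same key exactly when they are isomorphic, and adding the class labels to the key refines each structural group by color. This is a sketch under assumptions, not the grouping used in ProLod++; the function names, pattern names, and class names are hypothetical, and trying all node orderings is only feasible for very small patterns.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from itertools import permutations


def canonical_form(nodes, edges, color=None):
    """Return a key that is equal for two small undirected graphs
    exactly when they are isomorphic (respecting classes if given)."""
    nodes = list(nodes)
    best = None
    for order in permutations(nodes):
        position = {node: i for i, node in enumerate(order)}
        labels = tuple(color[node] for node in order) if color else ()
        links = tuple(sorted(
            tuple(sorted((position[u], position[v]))) for u, v in edges
        ))
        key = (labels, links)
        if best is None or key < best:
            best = key  # keep the smallest relabeling as the canonical one
    return (len(nodes), best)


def group_patterns(patterns):
    """Group patterns by structure first, then by class coloring."""
    groups = defaultdict(lambda: defaultdict(list))
    for name, (nodes, edges, color) in patterns.items():
        structure = canonical_form(nodes, edges)
        colored = canonical_form(nodes, edges, color)
        groups[structure][colored].append(name)
    return groups


# Hypothetical patterns: three small stars with different class colorings.
patterns = {
    "p1": ("axy", [("a", "x"), ("a", "y")],
           {"a": "Astronomer", "x": "Planet", "y": "Planet"}),
    "p2": ("buv", [("u", "b"), ("b", "v")],
           {"b": "Astronomer", "u": "Planet", "v": "Planet"}),
    "p3": ("cst", [("c", "s"), ("c", "t")],
           {"c": "Person", "s": "Place", "t": "Place"}),
}
groups = group_patterns(patterns)
```
]]></preformat>
      <p>All three example patterns fall into one structural group, which splits into two color groups: p1 and p2 share a coloring, while p3 stands alone.</p>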
      <p>Based on the graph features provided by ProLod++ and its underlying
GraphLod library, an overall model for Linked Datasets can be given: We
observe that most Linked Datasets consist of a number of small satellite
graphs and a giant component that contains more than 80% of the nodes. They
thus resemble scale-free networks as they occur in social networks.</p>
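      <p>The share of nodes in the giant component can be checked with a short union-find routine, sketched below in Python. The function name and the toy edge list are assumptions for illustration only; nodes without any link do not appear in an edge list and are therefore not counted, and the edge list is assumed to be non-empty.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def giant_component_share(edges):
    """Return the fraction of nodes that lie in the largest connected component."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        # Merge the two components joined by this link.
        parent[find(source)] = find(target)

    sizes = Counter(find(node) for node in list(parent))
    return max(sizes.values()) / len(parent)


# Hypothetical toy graph: one large component and one small satellite graph.
edges = [("hub", f"n{i}") for i in range(8)] + [("s1", "s2")]
share = giant_component_share(edges)
```
]]></preformat>
      <p>In the toy graph, 9 of 11 nodes belong to the largest component, which gives a share of roughly 0.82.</p>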
      <p>When jointly profiling multiple datasets, ProLod++ highlights the
connectivity of connected components across them based on inter-dataset links. This,
for instance, identifies the potential for dataset integration.</p>
    </sec>
    <sec id="sec-3">
      <title>ProLOD++ Demonstration</title>
      <p>ProLod++ is a web-based tool that can either be distributed for local execution or
hosted as a service at http://prolod.org. Some of the described features are
still under development, but at the time of submission ProLod++ is already a
useful tool to explore RDF datasets and their graph structure. During the demo,
users can bring along their own RDF dataset, import it into ProLod++, and
begin the analysis. Several interesting datasets from various domains
have already been imported, including DBpedia, Diseasome, and LinkedMDB.</p>
      <p>After the initial analysis phase, users can select datasets and clusters in
a tree model and browse the profiling results across several tabs. The graph
feature analysis shows graph statistics, such as the number of nodes and edges, and
the diameter for the connected and strongly connected components. A node
degree distribution chart is displayed to analyze the underlying graph model.
Besides statistical information, ProLod++ allows faceted browsing through the
graph patterns, from general patterns to class-colored patterns down to concrete
pattern examples. The class distribution is visualized at each facet level.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , T. Grutze,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Mining and Profiling RDF Data with ProLOD++</article-title>
          .
          <source>In Proceedings of the International Conference on Data Engineering (ICDE)</source>
          , pages
          <fpage>1198</fpage>
          –
          <lpage>1201</lpage>
          ,
          <year>2014</year>
          . Demo.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Assaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Senart</surname>
          </string-name>
          .
          <article-title>Roomba: An extensible framework to validate and build dataset profiles</article-title>
          .
          <source>In ESWC International Workshop on Dataset Profiling &amp; Federated Search for Linked Data (PROFILES)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>LODStats – an extensible framework for high-performance dataset analytics</article-title>
          .
          <source>In Proceedings of the International Conference on Knowledge Acquisition, Modeling and Management (EKAW)</source>
          , volume
          <volume>7603</volume>
          , pages
          <fpage>353</fpage>
          –
          <lpage>362</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>F.</given-names>
            <surname>Benedetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Po</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          .
          <article-title>A visual summary for linked open data sources (demo)</article-title>
          .
          <source>In Proceedings of the International Semantic Web Conference (ISWC)</source>
          , pages
          <fpage>173</fpage>
          –
          <lpage>176</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Camarda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mazzini</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Antonuccio</surname>
          </string-name>
          .
          <article-title>LodLive, exploring the web of data</article-title>
          .
          <source>In Proceedings of the International Conference on Semantic Systems, ISEMANTICS</source>
          , pages
          <fpage>197</fpage>
          –
          <lpage>200</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Elseidy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Abdelhamid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kalnis</surname>
          </string-name>
          .
          <article-title>GRAMI: frequent subgraph and pattern mining in a single large graph</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>7</volume>
          (
          <issue>7</issue>
          ):
          <fpage>517</fpage>
          –
          <lpage>528</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. T. Kafer,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>O'Byrne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          .
          <article-title>Observing linked data dynamics</article-title>
          .
          <source>In Proceedings of the Extended Semantic Web Conference (ESWC)</source>
          , volume
          <volume>7882</volume>
          of LNCS
          , pages
          <fpage>213</fpage>
          –
          <lpage>227</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Khatchadourian</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          .
          <article-title>ExpLOD: Summary-based exploration of interlinking and RDF usage in the linked open data cloud</article-title>
          .
          <source>In Proceedings of the Extended Semantic Web Conference (ESWC)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          and W. Wo .
          <article-title>RDFStats { an extensible RDF statistics generator and library</article-title>
          .
          <source>In Proceedings of the International Workshop on Database and Expert Systems Applications (DEXA)</source>
          , pages
          <fpage>79</fpage>
          {
          <fpage>83</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Data Profiling for Semantic Web Data</article-title>
          .
          <source>In Proceedings of the International Conference on Web Information Systems and Mining (WISM)</source>
          , pages
          <fpage>472</fpage>
          –
          <lpage>479</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. E. Makela.
          <article-title>Aether { generating and viewing extended VoID statistical descriptions of RDF datasets</article-title>
          .
          <source>In ESWC (Satellite Events)</source>
          , pages
          <fpage>429</fpage>
          {
          <fpage>433</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          .
          <article-title>gSpan: Graph-based substructure pattern mining</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Data Mining (ICDM)</source>
          , pages
          <fpage>721</fpage>
          –
          <lpage>724</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>