<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection of Related Semantic Datasets Based on Frequent Subgraph Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikel Emaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Corcho</string-name>
          <email>ocorcho@fi.upm.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Diego López-de-Ipiña</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deusto Institute of Technology - DeustoTech, University of Deusto</institution>
          ,
          <addr-line>Bilbao</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Ontology Engineering Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We describe an approach to find similarities between RDF datasets, which may be applicable to tasks such as link discovery, dataset summarization or dataset understanding. Our approach builds on the assumption that similar datasets should have a similar structure and include semantically similar resources and relationships. It is based on the combination of Frequent Subgraph Mining (FSM) techniques, used to synthesize the datasets and find similarities among them. The result of this work can be applied to ease the task of data interlinking and to promote data reuse in the Semantic Web.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Since the creation of the Linked Open Data Cloud3 initiative in 2007 with 12 datasets,
up to its last update in 2014 with 570 datasets, the number of Linked Datasets has
grown enormously. This growth suggests that in a few years, selecting appropriate
datasets to link our datasets to is going to become harder and harder.
The same applies to the task of finding the dataset that may contain useful
information for us, according to our needs. The work presented in this paper
focuses on providing some steps forward on some of the aforementioned
limitations: finding datasets to link to, finding datasets that support our
needs, and understanding or summarizing datasets.</p>
      <p>
Our main contribution is an approach to find similarities among RDF datasets
based on their graph structure, which can be used to address the aforementioned
problems. The main challenge that we have to deal with stems from the fact that,
due to the size of many of the graphs derived from these RDF datasets, a direct
comparison of their complete structures is not feasible. Therefore, a
Frequent Subgraph Mining (FSM) based approach is proposed. FSM techniques,
widely used in the domains of chemistry and biology to find similarities and
correlations among different chemical compounds and molecules [
        <xref ref-type="bibr" rid="ref11 ref2 ref4">2, 4, 11</xref>
        ], allow
extracting the most frequent subgraphs from a single graph or a set of graphs.
Given that RDF datasets are graphs, we think that the combination of
techniques based on the summarization of RDF graphs and the identification of the
most frequent subgraphs can provide good results in finding related datasets.
3 http://lod-cloud.net/
      </p>
<p>An example of an application of the work proposed in this paper is related
to dataset interlinking. As stated in Section 2, when using existing dataset
interlinking tools, the user has to select the input datasets whose links are going to
be searched. Nowadays, there are two main approaches to selecting these datasets:
applying brute force, submitting all possible pairs of datasets to the
interlinking tool; or asking the user to select the most suitable datasets
according to her/his beliefs, a task that is becoming harder and harder because of the
growth of the Linked Open Data Cloud. The proposed solution can ease this task
by suggesting a subset of related datasets, with the consequent reduction of the
search space.</p>
<p>In summary, in this work a new approach for synthesizing and finding
similarities among RDF datasets is presented. Specifically, this approach proposes
the use of FSM techniques to synthesize these datasets. The approach proposed
in this work can be used to ease the task of interlinking new datasets, and to
improve data reuse by finding similar datasets.</p>
<p>The rest of the paper is organized as follows. In Section 2 previous work
on semantic dataset browsing, interlinking and summarization, and Linked Data
source discovery is presented. In Section 3 some definitions and concepts about
graph mining are explained. Section 4 describes our new approach based on
FSM. In Section 5 the proposed approach is evaluated against a set of datasets from
the Linked Open Data Cloud. Finally, in Section 6, conclusions and future research
challenges are explained.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
<p>There are four research fields related to possible usages of the work presented
in this paper: semantic dataset browsing, interlinking and summarization, and
Linked Data source searching.</p>
      <p>
Semantic Dataset Browsing. Works in this field provide search
capabilities over linked datasets. Given a set of terms, these browsers find resources in
which these terms appear. Most of these works use techniques from the
information retrieval field, like TF-IDF (term frequency-inverse document frequency),
and some offer more complex techniques for refining the results. However,
these works do not apply a previous filter on the datasets against which they search
the terms given by the user. The proposed work can be useful in this area when a term
is found in a dataset, for prioritizing related datasets when searching for more
results. In this field, works like Swoogle [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Falcons [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Sindice [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] or Sig.ma
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] can be categorized.
      </p>
      <p>
Semantic Dataset Interlinking. The aim of works in this category is,
given a pair of datasets, to establish links between them, based on a set
of rules defined by the user. Most of these works use different properties of
resources within a dataset for establishing owl:sameAs links among them. One
of the most important shortcomings of works in this area is that the user has to select
the pair of datasets between which new links are to be established. The solution proposed
in this paper can be used to select these input datasets. The most remarkable works
in this field are Silk [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and LIMES [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
Semantic Dataset Summarization. Although dataset summarization is not
the final goal of this work, we have considered it interesting to analyse the most
remarkable works in this field, although they are oriented towards creating human-readable
data summaries, instead of the machine-readable summaries used in the
proposed work. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
], after detecting patterns in a graph, the authors extract labels
from the vertices and edges to elaborate a summary. [
        <xref ref-type="bibr" rid="ref6">6</xref>
] applies NER (Named
Entity Recognition) techniques over the literals of graphs to find them in
DBpedia. Once the corresponding resources from DBpedia are found, their
categories are extracted to elaborate a summary.
      </p>
      <p>
Linked Data Source Searching. These works try to find candidate datasets
for interlinking. Works in this category are the most closely related to the work
described in this paper. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
], the authors extract literals from rdfs:label, foaf:name
or dc:title properties. They search these literals in Sig.ma and group the results
by source dataset. They consider that the more instances a source has, the more chances
it has to be linked with the original dataset. The major weakness of this approach is
that Sig.ma is no longer harvesting new data, so it is not a suitable solution
for recently published datasets. [
        <xref ref-type="bibr" rid="ref13">13</xref>
] uses naive Bayes classifiers to establish
a ranking of related datasets based on correlations among them. Through this
ranking the search space can be reduced. Finally, in [
        <xref ref-type="bibr" rid="ref14">14</xref>
] already existing
links are used to establish new links among datasets. As previously mentioned, one of
the objectives of our work is to solve the cold-start problem when searching for
related datasets.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <p>
The main objective of FSM is to extract all the frequent subgraphs from a single
graph or a set of graphs. We assume the definitions from [
        <xref ref-type="bibr" rid="ref10">10</xref>
]:
– Labeled graph: a labeled graph can be represented as G(V, E, L_V, L_E, φ),
where V is a set of vertices; E ⊆ V × V is a set of edges; L_V and L_E are sets
of vertex and edge labels, respectively; and φ is a label function that defines
the mappings V → L_V and E → L_E. G is a directed graph if ∀e ∈ E, e is
an ordered pair of vertices.
– Subgraph: given two graphs G1(V1, E1, L_V1, L_E1, φ1) and G2(V2, E2, L_V2,
L_E2, φ2), G1 is a subgraph of G2 if G1 satisfies: i) V1 ⊆ V2 and ∀v ∈
V1, φ1(v) = φ2(v); and ii) E1 ⊆ E2 and ∀(u, v) ∈ E1, φ1(u, v) = φ2(u, v).
      </p>
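To make the definitions above concrete, the subgraph condition can be checked directly on small labeled graphs. The sketch below is illustrative only (the tuple-based representation is our assumption, not the paper's implementation):

```python
# Minimal sketch of the labeled-graph and subgraph definitions above.
# A graph is (vertex_labels, edge_labels): vertex_labels maps v to its label,
# edge_labels maps the ordered pair (u, v) to its label (directed edges).

def is_subgraph(g1, g2):
    """True if g1 is a subgraph of g2 per the definition:
    i) every vertex of g1 exists in g2 with the same label;
    ii) every edge of g1 exists in g2 with the same label."""
    v1, e1 = g1
    v2, e2 = g2
    vertices_ok = all(v in v2 and v2[v] == lbl for v, lbl in v1.items())
    edges_ok = all(e in e2 and e2[e] == lbl for e, lbl in e1.items())
    return vertices_ok and edges_ok

# Hypothetical example: a "publication by a person" pattern in a larger graph.
big = ({1: "foaf:Person", 2: "aktors:Article-Reference", 3: "foaf:Person"},
       {(2, 1): "dcterms:creator", (2, 3): "dcterms:creator"})
pattern = ({1: "foaf:Person", 2: "aktors:Article-Reference"},
           {(2, 1): "dcterms:creator"})
print(is_subgraph(pattern, big))  # True
print(is_subgraph(big, pattern))  # False
```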
      <p>
Multiple state-of-the-art tools implement FSM. Table 1 shows a summary of the
most relevant features of each solution. These features are the following:
– Single graph/Transactions: according to [
        <xref ref-type="bibr" rid="ref10">10</xref>
] there are two different FSM
problem formulations. In the first one, single graph based FSM, only a single
very large graph is analyzed. In graph transaction based FSM the common
substructures are extracted from a set of medium-size graphs (named
transactions).
– Directed graphs: applying directionality to graphs increases the
computational cost considerably. For this reason, many of the solutions do not
implement this feature.
– Labeled vertices: whether the solution allows labeled vertices in input graphs.
– Labeled edges: whether the solution allows labeled edges in input graphs.
      </p>
<p>As shown in Table 1, only SUBDUE and DPMine cover all the features needed
to deal with the characteristics of RDF graphs. As in this approach
we want to extract the most common subgraph from each dataset, the solution
that supports single graphs, i.e. SUBDUE, has been selected.</p>
      <p>
        Given a single, directed and labeled graph, SUBDUE [
        <xref ref-type="bibr" rid="ref9">9</xref>
] extracts the most
frequent substructures. SUBDUE defines the most frequent subgraph as the
subgraph that, once replaced by a single node, compresses the original
graph the most. Assuming that G is the original graph, S is the candidate subgraph to
be evaluated, size(G) and size(S) are the sizes of G and S respectively, and
size(G|S) is the size of G compressed by S, the total compression rate can be
calculated as:
value(S, G) = size(G) / (size(S) + size(G|S))  (1)
      </p>
      <p>size(G) = |vertices(G)| + |edges(G)|  (2)</p>
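Equations 1 and 2 can be written out as a short sketch. The sizes below are invented for illustration; in SUBDUE, size(G|S) would come from actually compressing G by the candidate substructure S:

```python
# Sketch of SUBDUE's compression-based evaluation (Equations 1 and 2).

def size(num_vertices, num_edges):
    # Equation 2: size(G) = |vertices(G)| + |edges(G)|
    return num_vertices + num_edges

def value(size_g, size_s, size_g_given_s):
    # Equation 1: value(S, G) = size(G) / (size(S) + size(G|S))
    return size_g / (size_s + size_g_given_s)

# Hypothetical example: a graph of 100 vertices and 120 edges, a candidate
# substructure of size 5 whose replacement compresses the graph to size 80.
print(value(size(100, 120), 5, 80))  # higher value means better compression
```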
<p>SUBDUE can be parameterized to adapt its behavior to the different input
graphs. To facilitate the understanding of the application of SUBDUE in Section
4, some of these parameters are explained:
– inc: this parameter enables the incremental analysis of large graphs, avoiding
the consumption of all the memory of the system by large graphs and
allowing the preview of partial results. To perform the incremental analysis, the
input graph has to be split into different, numbered files. SUBDUE
analyses these files in order, aggregating the results of the previously analyzed
files into the current file.
– limit: limits the number of candidate substructures that SUBDUE takes into
consideration in each iteration. The default value is |edges|/2.
– prune: prunes the graph, discarding useless substructures.</p>
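The splitting required by the inc parameter can be sketched as follows. The file-naming scheme and chunking strategy are our assumptions for illustration, not a requirement stated by SUBDUE:

```python
# Sketch of preparing input for SUBDUE's incremental (inc) mode: the graph
# file is split into several numbered chunks that SUBDUE analyses in order,
# aggregating results as it goes.

def split_for_incremental(lines, chunk_size, prefix="graph"):
    """Group graph-file lines into numbered chunks of at most chunk_size."""
    chunks = []
    for i in range(0, len(lines), chunk_size):
        name = f"{prefix}_{i // chunk_size + 1}.g"  # assumed naming scheme
        chunks.append((name, lines[i:i + chunk_size]))
    return chunks

lines = ["v 1 foaf:Person", "v 2 aktors:Article-Reference",
         "d 2 1 dcterms:creator"]
print(split_for_incremental(lines, 2))
```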
    </sec>
    <sec id="sec-4">
      <title>Frequent Subgraph Mining Approach</title>
      <sec id="sec-4-1">
        <title>RDF Graph Synthesis Model</title>
<p>In order to apply the RDF graph synthesis model presented in this paper, some
modifications have been made to the original RDF graph model. It is important to
note that these transformations do not preserve the meaning of the RDF model,
but we do not consider that property important for our approach. However, after
applying these transformations, a proper interpretation of the graph can still be done.
As can be seen in this section, the aim of these transformations is to simplify the
graph to ease the task of extracting the most common substructures. Starting from the
triples shown in Listing 1, represented graphically in Figure 1, the transformations
are applied to ensure the correct understanding of the presented model.</p>
<p>The first transformation applied to RDF graphs consists in replacing the URIs
of the subjects of the resources in the datasets. Since they are unique identifiers
of resources, URIs in subjects generate a large number of unique nodes
which do not belong to any candidate substructure, increasing the difficulty of
finding frequent subgraphs. To avoid this, these URIs have been replaced by the
ontological class (or classes, if the resource is represented by more than one class),
given by the rdf:type property, if any, as can be seen in Figure 2.
If a resource has no rdf:type predicate associated, the resource is discarded.</p>
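This transformation can be sketched on plain (subject, predicate, object) tuples. This is an illustrative reconstruction of the step described above, not the paper's actual code:

```python
# Sketch of the first transformation: replace each subject URI by its
# rdf:type class(es); resources without any rdf:type triple are discarded.

RDF_TYPE = "rdf:type"

def replace_subjects(triples):
    # Collect the ontological class(es) of each subject.
    classes = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes.setdefault(s, []).append(o)
    transformed = []
    for s, p, o in triples:
        if p == RDF_TYPE or s not in classes:
            continue  # type triples are consumed; untyped subjects discarded
        for cls in classes[s]:
            transformed.append((cls, p, o))
    return transformed

triples = [(":pub1", RDF_TYPE, "aktors:Article-Reference"),
           (":pub1", "aktors:has-title", "The Semantic Web"),
           (":untyped", "foaf:name", "Anonymous")]
print(replace_subjects(triples))
# [('aktors:Article-Reference', 'aktors:has-title', 'The Semantic Web')]
```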
<p>The next transformation deals with interlinked resources.
Establishing links among resources from different datasets is one of the most important
features of Linked Data publication. For this reason, a large number of
internal and external links can be found in linked datasets. Managing external links
adds, to the computational cost generated by the analysis of each triple, the
delay generated by retrieving the information they point to through the Web.
Furthermore, one of the challenges of this work was to solve the cold-start
problem when looking for related datasets. For these reasons, external links have been
@prefix : &lt;http://example.org/resource/&gt; .
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix dcterms: &lt;http://purl.org/dc/terms/&gt; .
@prefix aktors: &lt;http://www.aktors.org/ontology/portal/&gt; .

:Tim_Berners-Lee rdf:type foaf:Person ;
    foaf:name "Tim Berners-Lee" ;
    foaf:mbox "timbl@w3.org" ;
    foaf:homepage &lt;http://www.w3.org/People/Berners-Lee&gt; .

:pub1 rdf:type aktors:Article-Reference ;
    aktors:has-title "The Semantic Web" ;
    dcterms:creator :Tim_Berners-Lee ;
    aktors:published-by "Scientific American" .</p>
<p>Listing 1: RDF triples used in the example model.
removed. Despite this, in Section 6 the influence of existing links is briefly
analyzed. In Figure 3, the resulting model after the elimination of external links can be
seen. In this model a structure representing a publication and its author can be
seen, as a synthesis of the triples presented in Listing 1.</p>
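The removal of external links can be sketched under a simplifying assumption of ours: a link counts as external when its object is a URI whose host differs from the dataset's base host (the paper does not spell out its exact rule):

```python
# Sketch of external-link removal. Triples are (s, p, o) tuples; the
# "different host" heuristic below is our assumption for illustration.
from urllib.parse import urlparse

def drop_external_links(triples, base_host):
    kept = []
    for s, p, o in triples:
        if o.startswith("http") and urlparse(o).netloc != base_host:
            continue  # external link: the object lives in another dataset
        kept.append((s, p, o))
    return kept

triples = [("ex:timbl", "foaf:homepage", "http://www.w3.org/People/Berners-Lee"),
           ("ex:timbl", "foaf:name", "Tim Berners-Lee")]
print(drop_external_links(triples, "example.org"))
# [('ex:timbl', 'foaf:name', 'Tim Berners-Lee')]
```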
<p>Regarding literals, although they have been maintained in the proposed
model, there are no literals in any of the most frequent structures extracted during
the evaluation. The explanation for this situation is similar to the explanation
given for the URIs in subjects: with such a variety of different literals, the
probability of forming part of a candidate substructure is minimal.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Extraction and Comparison of Most Frequent Subgraphs</title>
        <p>
          We apply SUBDUE to the graph obtained as a result of the previous
transformations. According to [
          <xref ref-type="bibr" rid="ref10">10</xref>
], SUBDUE's runtime and resource consumption do
not grow linearly with the size of the input graphs, making it hard to
estimate the total runtime or to know whether the process is going to
finish in a reasonable amount of time. The ideal parameters for achieving a balance
between an affordable runtime and an appropriate number of candidate
substructures are still subject to experimentation, but limiting the number of
candidate substructures to 5, applying the incremental analysis capabilities and
pruning the input graph seem to be appropriate parameters to start finding this
balance.
        </p>
<p>Once the most frequent substructures of the different datasets have been extracted,
the comparison among them is done through SUBDUE's gm (Graph
Matcher). Given a pair of graphs, this utility computes the cost of transforming
the larger graph into the smaller one, returning the number of transformations
done. In this case, all transformations (addition, removal or replacement of a node)
have the same cost. As this number of transformations is not normalized by
default (it depends on the size of the input graphs), the normalization shown in
Equation 3 has been applied to the result. Finally, the similarity between both
substructures is calculated as can be seen in Equation 4.</p>
<p>Cost_normalized = Cost / (|vertices(largestGraph)| + |edges(largestGraph)|)  (3)</p>
        <p>Similarity = 1 - Cost_normalized  (4)</p>
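Equations 3 and 4 amount to a two-line computation. The gm output value below is hypothetical:

```python
# Sketch of the gm-based similarity (Equations 3 and 4): the raw edit cost is
# normalized by the size of the larger graph, then turned into a similarity.

def similarity(cost, vertices_largest, edges_largest):
    cost_normalized = cost / (vertices_largest + edges_largest)  # Eq. 3
    return 1 - cost_normalized                                   # Eq. 4

# Hypothetical gm output: 3 transformations between a 7-vertex, 5-edge
# subgraph and a smaller one.
print(similarity(3, 7, 5))  # 0.75
```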
      </sec>
      <sec id="sec-4-3">
<title>Implementation</title>
<p>This work has been implemented following the workflow explained next. The
different stages of this workflow have been implemented as independent tasks:
– Generation of IDs and replacement of subjects: as can be seen in
Listing 2, SUBDUE has its own format for representing graphs. This format
requires assigning a unique ID to each vertex. In this first step, the RDF
graphs are iterated, replacing the subject of each resource by its ontological
class if the rdf:type property is present, and unique, consecutive
IDs are assigned to each generated vertex.
– SUBDUE file generation: once the IDs are assigned, the relationships
among the generated vertices are analysed in order to generate edges. Once these
edges are generated, the final SUBDUE file of each graph is generated.
– Most frequent subgraph extraction: in this step the most frequent
subgraph of each RDF graph is extracted with SUBDUE and the previously
generated input files.
– Graph matching: finally, similarities among these subgraphs are found
with SUBDUE's Graph Matching (gm) tool.</p>
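The first two steps can be sketched together. The "v id label" / "d src dst label" line format is the one we assume SUBDUE expects for directed input graphs, and for brevity this sketch merges vertices that share a label (which the real pipeline would not necessarily do):

```python
# Sketch of the ID-generation and SUBDUE-file-generation steps: assign
# consecutive integer IDs to vertices and emit one line per vertex and per
# directed edge.

def to_subdue(triples):
    ids, lines = {}, []
    def vertex(label):
        if label not in ids:
            ids[label] = len(ids) + 1          # unique, consecutive IDs
            lines.append(f"v {ids[label]} {label}")
        return ids[label]
    edges = []
    for s, p, o in triples:
        edges.append(f"d {vertex(s)} {vertex(o)} {p}")
    return "\n".join(lines + edges)

print(to_subdue([("aktors:Article-Reference", "dcterms:creator",
                  "foaf:Person")]))
# v 1 aktors:Article-Reference
# v 2 foaf:Person
# d 1 2 dcterms:creator
```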
<p>The implementation of this work and the baselines (subsection 5.2) can be found
at https://github.com/memaldi/lod-fsm.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>
The presented approach has been evaluated against datasets from the Linked Open Data
Cloud. The evaluation follows these steps. First, a gold standard has
been created for determining the effectiveness of both the developed system
and the baseline solutions in terms of precision and recall. These baseline solutions
(or baselines) are simple solutions that solve the proposed problem in a straightforward
way, with the aim of establishing a baseline to be surpassed by the new solution.
Finally, the results given by the proposed solution are compared with the results
given by the baseline solutions. The evaluation has been done only in terms of
efficacy because the developed work has been designed to be launched in batch mode
and without the interaction of the end-user, so efficiency is not considered
a key factor to be evaluated.
Listing 2: Representation of the graph from Figure 3 in SUBDUE. In lines 1-6 the
vertices are represented, while in lines 7-11 the edges are represented.
For constituting the gold standard, two different sources have been checked. The
first source, inspired by [
        <xref ref-type="bibr" rid="ref13">13</xref>
], consists of checking already existing links among the
datasets used in this evaluation. The links among these datasets have been
extracted through the property links:&lt;target dataset id&gt; from The Datahub4
entry of each dataset, as this property is required for publishing datasets in the
LOD Cloud. However, when evaluating the proposed solution, many links that are
not described in The Datahub were discovered. These links may be missing from
The Datahub for many reasons: the related dataset was published after the
publication of the source dataset and the publisher did not check again, or,
simply, the publisher did not know of the existence of these related datasets. The
absence of these valid links could provoke a situation in which the developed
system recommends datasets that are, in fact, valid results but are considered
false positives by the gold standard.
      </p>
      <p>
To solve this issue, a second source has been used to form the gold standard.
This source consisted of surveying different researchers on the Semantic Web and
Linked Data to determine the validity of these new relations among datasets.
These surveys have been performed through a web application5 that shows
researchers different pairs of datasets, to determine if there is any possible
relationship between them. These datasets were represented by the title, description
and resources published in their The Datahub entries. Three options were
allowed for each pair of datasets: "yes" if they considered that there was a possible
relationship between them, "no" if they considered the opposite, and "undefined"
if they were not sure about the possible relationships. Each pair of datasets has
been evaluated by three different researchers. This approach raises another issue:
the number of different pairs resulting from the combination of all the datasets
4 http://datahub.io
5 https://github.com/memaldi/ld-similarity-survey
employed during the evaluation amounts to 2,346. Considering that each pair has
to be evaluated three times, this number increases to 7,038 evaluations to be
done by the selected researchers. Considering that this number of evaluations is too
high, the number of dataset pairs has been reduced considering the evidence
proposed by [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. The authors of that work consider that if a pair of datasets have
common links to the same datasets, they could be related. Based on this evidence,
only datasets that are linked to common datasets have been included, reducing
the number of evaluations to 594. Once all the evaluations have been done, the
Fleiss' Kappa [
        <xref ref-type="bibr" rid="ref7">7</xref>
] coefficient reveals an agreement among the reviewers of 41%,
which means moderate agreement according to [
        <xref ref-type="bibr" rid="ref12">12</xref>
]. Finally, for constituting
the gold standard, the relations extracted from The Datahub have been
complemented with the relations that were approved in the survey by at least two
reviewers.
      </p>
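The inter-rater agreement above can be reproduced with a short Fleiss' Kappa sketch. The rating counts below are invented for illustration (the paper's raw survey data is not reproduced here); each row gives, for one dataset pair, how many of the three raters chose each category:

```python
# Sketch of Fleiss' Kappa for the survey agreement (3 raters per pair,
# categories "yes" / "no" / "undefined").

def fleiss_kappa(tables):
    """tables: one row per rated dataset pair, with per-category counts."""
    n = len(tables)                     # number of rated pairs
    r = sum(tables[0])                  # raters per pair (here, 3)
    # Per-pair agreement P_i and per-category proportions p_j.
    p_i = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in tables]
    totals = [sum(row[j] for row in tables) for j in range(len(tables[0]))]
    p_j = [t / (n * r) for t in totals]
    p_bar = sum(p_i) / n                # mean observed agreement
    p_e = sum(p * p for p in p_j)       # chance agreement
    return (p_bar - p_e) / (1 - p_e)

ratings = [[3, 0, 0], [2, 1, 0], [1, 2, 0], [0, 0, 3]]  # [yes, no, undefined]
print(round(fleiss_kappa(ratings), 3))  # 0.467
```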
<p>At the time of writing, a single gold standard has been created for evaluating
the developed system. However, a more suitable solution could be to develop
a different gold standard depending on the topic of the datasets whose
similarities are going to be extracted (biology, statistical government data, academic
publications, etc.). This is going to be attempted as future work.</p>
      <sec id="sec-5-1">
        <title>Baselines</title>
<p>To weigh the results given by the proposed solution, three baselines have been
developed. The first baseline is based on the evidence that the more ontologies
a pair of datasets share, the more related they are. The relation degree
between a pair of datasets is calculated as follows, being Ni the set of ontologies
used to describe the dataset Di:
score(D1, D2) = |N1 ∩ N2| / max(|N1|, |N2|)  (5)</p>
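Equation 5 translates directly into code. The ontology sets below are hypothetical examples:

```python
# Sketch of the first baseline (Equation 5): ontology overlap between
# the sets of ontologies used by two datasets.

def score(n1, n2):
    # score(D1, D2) = |N1 intersect N2| / max(|N1|, |N2|)
    return len(n1.intersection(n2)) / max(len(n1), len(n2))

d1 = {"foaf", "dcterms", "aktors"}
d2 = {"foaf", "dcterms", "skos", "geo"}
print(score(d1, d2))  # 0.5
```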
<p>The second baseline, similarly to the first one, takes the common ontologies
between a pair of datasets to establish their relation degree, but establishes a
ranking based on the usage of the classes and properties of each ontology
within each dataset. The distance between each pair of rankings has been
calculated through a normalized Kendall's Tau:</p>
        <p>K(τ1, τ2) = Σ_{i,j ∈ P} K̄_ij(τ1, τ2)  (6)</p>
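A normalized Kendall's Tau distance can be sketched as the fraction of item pairs ranked in a different order by the two rankings. The example rankings are hypothetical:

```python
# Sketch of the second baseline's ranking distance: normalized Kendall's Tau
# over two rankings of the same ontologies.
from itertools import combinations

def kendall_distance(rank1, rank2):
    """Fraction of item pairs whose relative order differs between rankings."""
    pos1 = {item: i for i, item in enumerate(rank1)}
    pos2 = {item: i for i, item in enumerate(rank2)}
    pairs = list(combinations(rank1, 2))
    concordant = sum(
        1 for a, b in pairs
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) > 0
    )
    return (len(pairs) - concordant) / len(pairs)

# Swapping the top two of three ontologies disagrees on 1 of 3 pairs.
print(kendall_distance(["foaf", "dcterms", "aktors"],
                       ["dcterms", "foaf", "aktors"]))  # 0.333...
```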
<p>Finally, the third baseline calculates the relation degree between a pair of
datasets by computing the Jaccard distance among all the triples of each dataset.
Being T1 and T2 the sets of triples of the two datasets to be compared, the Jaccard
distance is calculated as J(T1, T2) = 1 - |T1 ∩ T2| / |T1 ∪ T2|.
6 The complete list of used datasets can be found at http://apps.morelab.deusto.
es/iesd2015/datasets.csv
In Figure 4, the results of both the proposed solution and the baselines are shown, in
terms of precision, recall, F1-score and accuracy. As can be seen, in terms of
precision, the proposed solution clearly outperforms the baselines, exceeding a
value of 0.8 from a threshold of 0.4 and reaching a maximum value of 0.9. On the
other hand, the maximum value of recall is about 0.51, decaying from a threshold
of 0.3, offering a result that is not as good as expected and being surpassed by one
of the baselines. This situation is explained by the fact that the higher the threshold
is, the higher the requested similarity between a pair of graphs is too. Thus, there are
pairs of datasets detected as related by our solution whose relation degree is
not as high as expected. These results show that the recommendations made by the
proposed solution are valid in a high percentage of cases (low number of false positives),
although there are still many related datasets that the solution omits.</p>
        <p>
Regarding the good results obtained by the first baseline, there is a
clarification to be made. As can be seen, many datasets used in the evaluation were
produced by the RKB Explorer project [
          <xref ref-type="bibr" rid="ref8">8</xref>
]. These datasets have been published
using the same ontologies and methodology, so they share the same ontologies in
similar proportions. As exposed in Section 6, providing a more diverse evaluation
set is one of the key tasks for future work.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
<p>In this work, a solution for recommending related datasets and easing the task of
dataset linking has been presented. As exposed in Section 5, the proposed solution
provides precise recommendations of candidate datasets to be linked. Although
the recall is not as good as expected, given that nowadays the help available
to a data publisher when selecting related datasets to link his datasets to is
very limited, we consider that it is more important to recommend valid
candidate datasets for interlinking, even if these are not all the available
datasets. However, the recall results are an issue on which we are
currently working. At present, to avoid false negatives provoked by
related datasets described by different ontologies, string similarity techniques are
being introduced, achieving an increase in recall of between 0.10 and 0.30 with
regard to the work exposed in this paper. Another task for future work is
to analyse how the links generated by the system itself can be used to improve
the results in an iterative way. Finally, regarding the evaluation, an important
future task is to include more diverse datasets in the evaluation set, to avoid
overfitting of the proposed model or any of the baselines, and to develop
different topic-based gold standards.</p>
<p>In conclusion, the promising results obtained show that frequent subgraph
mining techniques can be used to ease the task of interlinking datasets in
the Semantic Web.</p>
<p>Acknowledgments. This work has been developed within the WeLive project,
funded by the European Union's Horizon 2020 research and innovation
programme under grant agreement No 645845.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>1. Böhm, C., Kasneci, G., Naumann, F.: Latent topics in graph-structured data. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2663–2666. ACM (2012)</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>2. Borgelt, C., Berthold, M.R.: Mining molecular fragments: Finding relevant substructures of molecules. In: Proc. of the 2002 IEEE International Conference on Data Mining, pp. 51–58 (2002)</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>3. Cheng, G., Qu, Y.: Searching linked objects with Falcons. International Journal on Semantic Web and Information Systems 5(3), 49–70 (2009)</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>4. Dehaspe, L., Toivonen, H., King, R.D.: Finding frequent substructures in chemical compounds. In: Proc. of the 4th International Conference on Knowledge Discovery and Data Mining, vol. 98 (1998)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cost</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reddivari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doshi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sachs</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Swoogle: A Semantic Web search and metadata engine</article-title>
          .
          <source>In: Proceedings of the 13th ACM Conference on Information and Knowledge Management</source>
          . pp.
          <fpage>652</fpage>
          –
          <lpage>659</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fetahu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taibi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nejdl</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>A scalable approach for efficiently generating structured dataset topic profiles</article-title>
          .
          <source>In: The Semantic Web: Trends and Challenges</source>
          , pp.
          <fpage>519</fpage>
          –
          <lpage>534</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fleiss</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .
          <source>Psychological Bulletin</source>
          <volume>76</volume>
          (
          <issue>5</issue>
          ),
          <fpage>378</fpage>
          (
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Glaser</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Millard</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>RKB Explorer: Application and infrastructure</article-title>
          .
          <source>In: Proceedings of the Semantic Web Challenge, in conjunction with the 6th International Semantic Web Conference</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Holder</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cook</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Djoko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Substructure discovery in the SUBDUE system</article-title>
          .
          <source>In: Proc. of the AAAI Workshop on Knowledge Discovery in Databases</source>
          . pp.
          <fpage>169</fpage>
          –
          <lpage>180</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coenen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zito</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A survey of frequent subgraph mining algorithms</article-title>
          .
          <source>The Knowledge Engineering Review</source>
          <volume>28</volume>
          (
          <issue>1</issue>
          ),
          <fpage>75</fpage>
          –
          <lpage>105</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kramer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Mining for causes of cancer: machine learning experiments at various levels of detail</article-title>
          .
          <source>In: Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>223</fpage>
          –
          <lpage>226</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Landis</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>G.G.</given-names>
          </string-name>
          :
          <article-title>The measurement of observer agreement for categorical data</article-title>
          .
          <source>Biometrics</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <fpage>159</fpage>
          –
          <lpage>174</lpage>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Leme</surname>
            ,
            <given-names>L.A.P.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Identifying candidate datasets for data interlinking</article-title>
          .
          <source>In: Proceedings of the 13th International Conference on Web Engineering</source>
          . pp.
          <fpage>354</fpage>
          –
          <lpage>366</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leme</surname>
            ,
            <given-names>L.A.P.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Recommending tripleset interlinking through a social network approach</article-title>
          .
          <source>In: Proceedings of the 14th International Conference on Web Information Systems Engineering</source>
          . pp.
          <fpage>149</fpage>
          –
          <lpage>161</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>LIMES: a time-efficient approach for large-scale link discovery on the web of data</article-title>
          .
          <source>In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence</source>
          . pp.
          <fpage>2312</fpage>
          –
          <lpage>2317</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Identifying relevant sources for data linking using a Semantic Web index</article-title>
          .
          <source>In: Proceedings of the Linked Data on the Web workshop in conjunction with the 20th international World Wide Web conference</source>
          . vol.
          <volume>813</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tummarello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Sindice.com: Weaving the Open Linked Data</article-title>
          .
          <source>In: Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference</source>
          . pp.
          <fpage>552</fpage>
          –
          <lpage>565</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tummarello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catasta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danielczyk</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Sig.ma: Live views on the web of data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>8</volume>
          (
          <issue>4</issue>
          ),
          <fpage>355</fpage>
          –
          <lpage>364</lpage>
          (Nov
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Volz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaedke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Silk: A link discovery framework for the web of data</article-title>
          .
          <source>In: Proceedings of the Linked Data on the Web workshop in conjunction with the 18th international World Wide Web conference</source>
          . vol.
          <volume>583</volume>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>