<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Linked Data to Infer Semantic Relations within Structured Sources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohsen Taheriyan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig A. Knoblock</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Szekely</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Luis Ambite</string-name>
          <email>ambiteg@isi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yinyi Chen</string-name>
          <email>yinyic@usc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Southern California Department of Computer Science</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Southern California Information Sciences Institute and Department of Computer Science</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information sources such as spreadsheets and databases contain a vast amount of structured data. Understanding the semantics of this information is essential to automate searching and integrating it. Semantic models capture the intended meaning of data sources by mapping them to the concepts and relationships de ned by a domain ontology. Most of the e ort to automatically build semantic models is focused on labeling the data elds with ontology classes and/or properties, e.g., annotating the rst column of a table with dbpedia:Person and the second one with dbpedia:Film. However, a precise semantic model needs to explicitly represent the relationships too, e.g., stating that dbpedia:director is the relation between the rst and second column. In this paper, we present a novel approach that leverages the small graph patterns occurring in the Linked Open Data (LOD) to automatically infer the semantic relations within a given data source assuming that the source attributes are already annotated with semantic labels. We evaluated our approach on a dataset of museum sources using the linked data published by Smithsonian American Art Museum as background knowledge. Mining only patterns of length one and two, our method achieves an average precision of 78% and recall of 70% in inferring the relationships included in the semantic models associated with data sources.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic model</kwd>
        <kwd>semantic relation</kwd>
        <kwd>linked data</kwd>
        <kwd>semantic label</kwd>
        <kwd>semantic web</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Information sources such as relational databases, spreadsheets, XML, JSON,
and Web tables contain a signi cant amount of structured data. Given this
large amount of data available, we would like to be able to easily search over
this information or integrate di erent pieces of information. Understanding the
semantics of data sources enables us to provide a richer search experience and
automate the data integration task. In the Semantic Web, we can represent
dbpedia:Artwork
dbpedia:title
dbpedia:Person
foaf:name
dbpedia:Museum</p>
      <p>rdfs:label
title name location
Abraham Lincoln Mathew B. Brady National Portrait Gallery
Winter Landscape Mortimer L. Smith Detroit Institute of Art</p>
      <p>Zinnias Charles Demuth Dallas Museum of Art
the semantics of data by mapping it to the concepts and relationships de ned
by a domain ontology. In this paper, we use a graphical representation called
semantic model to show the mapping between a data source and an ontology.
In a semantic model, the nodes are ontology classes and the links are ontology
properties. Figure 1 depicts a semantic model for a sample data source including
information about some paintings. This model explicitly represents the meaning
of the data by mapping the source to the DBpedia ontology3 and the FOAF
ontology.4 Knowing this semantic model enables us to easily transform the data
in the table to RDF and publish it on the Web.</p>
      <p>The process of creating a semantic model for a source involves two steps:
(1) semantic labeling, and (2) establishing the semantic relations. In semantic
labeling, each data eld, or source attribute, is labeled with a semantic type, a
class or/and a data property of the domain ontology. In our example in Figure 1,
the semantic types of the rst, second, and third columns are title of Artwork,
name of Person, and label of Museum respectively. However, a semantic model
that only includes the semantic labels does not precisely represent the implicit
meaning of the data because it is not telling us how the source attributes are
related to each other. In our example, a Person could be the owner, painter,
or sculptor of an Artwork, but in the context of the given source, only painter
correctly interprets the relationship between Artwork and Person. To build a
semantic model that fully recovers the semantics of a data source, we need a
second step that determines the semantic relations between the source attributes
in terms of the properties in the ontology.</p>
      <p>There has been much e ort to automatically map data sources to
ontologies [4, 5, 7{11, 15, 16], but most focus on semantic labeling or are very limited
in automatically inferring the relationships. Our goal is to construct semantic
models that not only include the semantic types of the source attributes, but also
describe the relationships between them. Inferring the relationships between the
attributes is not a trivial task even when the correct semantic labels are given.
There might be multiple paths connecting two classes in the ontology and
without additional context, we do not know which one characterizes the intended
meaning of the data. This work focuses on the second step of building semantic</p>
    </sec>
    <sec id="sec-2">
      <title>3 http://dbpedia.org/ontology 4 http://xmlns.com/foaf/spec</title>
      <p>models. That is, we assume that source attributes are already annotated with
semantic labels and our work focuses on learning the relationships.</p>
      <p>We present a novel approach that exploits the knowledge from the domain
ontology and the Linked Open Data (LOD) to automatically infer the semantic
relations within a given data source. The LOD cloud contains a vast amount of
semantic data that can be used to learn how instances of di erent classes are
linked to each other. The main contribution of our work is exploiting the graph
patterns occurring in the linked data to disambiguate the relationships between
the source attributes. First, we use SPARQL to extract graph patterns with
di erent lengths occurring in the linked data. Next, we combine these patterns
into one graph and expand the resulting graph using the paths inferred from the
domain ontology. Then, we introduce a search algorithm that explores the graph
starting from the semantic labels of the source and heuristically nds the top k
semantic models connecting all the labels.</p>
      <p>We evaluated our approach on a dataset of well-known museum sources
modeled using the CIDOC-CRM5 ontology along with some other known
vocabularies including a total of 147 classes and 409 properties. We used the linked
data published by the Smithsonian American Art Museum6 as the background
knowledge. When we used only patterns of length one, our method achieved an
average precision of 65% and recall of 55% in inferring the relationships included
in the semantic models associated with data sources. Once we added patterns of
length two, the precision and recall improved to 78% and 70%. Our evaluation
shows that the longer the patterns extracted from the linked data, the more
accurate semantic models are generated.
2</p>
      <sec id="sec-2-1">
        <title>Motivating Example</title>
        <p>In this section, we provide an example to explain the problem of inferring
semantic relations within structured sources. In this example, we want to model a
data source using the CIDOC-CRM ontology and then use the created
semantic model to publish the source data as RDF. This source is a table containing
information about artworks in the Crystal Bridges Museum of American Art7
(Figure 2). We formally write the signature of this source as s(title, creation,
name) where s is the name of the source and title, creation, and name are the
names of the source attributes (columns).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 http://www.cidoc-crm.org 6 http://americanart.si.edu/collections/search/lod/about 7 http://crystalbridges.org</title>
      <p>E22_Man-Made_Object</p>
      <p>P102_has_title P108i_was_produced_by
E35_Title</p>
      <p>E12_Production</p>
      <p>P14_carried_out_by</p>
      <p>E21_Person
label
title</p>
      <p>P4_has_time-span</p>
      <p>P131_is_identified_by
E52_Time-Span</p>
      <p>E82_Actor_Appellation</p>
      <p>P82_at_some_time_within
creation</p>
      <p>label
name</p>
      <p>
        To model this source, we rst need to label the source attributes. This
task involves assigning a semantic type to each source attribute. We formally
de ne a semantic type to be a pair consisting of a domain class and one of
its data properties hclass uri,property urii [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this example, the correct
semantic types for the columns title, creation, and name are hE35 Title,labeli,
hE52 Time-Span, P82 at some time withini, and hE82 Actor Appellation,labeli.
Various techniques can be employed to automate the labeling task, nonetheless,
in this work, we assume that the labeling step is already done and we concentrate
on inferring the semantic relations.
      </p>
      <p>Figure 3 shows the correct semantic model of the source s. As we can see in
the gure, none of the semantic types corresponding to the source columns are
directly connected to each other, which makes the problem of nding the
correct semantic model more complex. There are many paths in the CIDOC-CRM
ontology connecting the assigned labels. For instance, we can use the classes
E39 Actor and E67 Birth to relate the semantic types E82 Actor Appellation
and E52 Time-Span:
(E39 Actor, P1 is identi ed by, E21 Actor Appellation)
(E39 Actor, P98i was born, E67 Birth)
(E67 Birth, P4 has time-span, E52 Time-Span)</p>
      <p>However, this way of modeling does not correctly represent the semantics
of this particular data source. In general, the ontology de nes a large space of
possible semantic models and without additional knowledge, we do not know
which one is a correct interpretation of the data.</p>
      <p>Now assume that we have access to a repository of linked data
containing the RDF triples published by some other museums. We can exploit this
linked data to bias the search space to prefer those models that are used for
related source. Once we have identi ed the semantic types of the source
attributes, we can search the linked data to nd the frequent patterns connecting
the corresponding classes. For example, by querying the linked data, we nd out
that P131 is identi ed by is more popular than P1 is identi ed by to connect
instances of E82 Actor Appellation and instances of E21 Person, and this makes
...
sense when we investigate the de nitions of these two properties in the
ontology. The property P1 is identi ed by describes the naming or identi cation of
any real world item by a name or any other identi er, and P131 is identi ed by
is a specialization of P1 is identi ed by that identi es a name used speci cally
to identify an instance of E39 Actor (superclass of E21 Person). We can query
the linked data to nd longer paths between entities. For instance, by
inspecting the paths with length two between the instances of E22 Man-Made Object
and E21 Person, we observe that the best way to connect these two classes
is through the path: E22 Man-Made Object P 108i was produced by
! E12 Production
P 14 is carried out b!y E21 Person.
3</p>
      <sec id="sec-3-1">
        <title>Inferring Semantic Relations</title>
        <p>In this section, we explain our approach to automatically deduce the attribute
relationships within a data source. The input to our approach are the domain
ontology, a repository of linked data in the same domain, and a data source whose
attributes are already labeled with semantic types. The output is a semantic
model expressing how the assigned labels are connected.
3.1</p>
        <p>Extracting Patterns from Linked Open Data
The Linked Open Data (LOD) includes a vast and growing collection of semantic
content published by various data providers in many domains. When modeling
a source in a particular domain, we can exploit the linked data published in that
domain to hypothesize attribute relationships within the source. We assume that
the source attributes are labeled with hclass uri,property urii pairs.</p>
        <p>We use SPARQL to query the linked data in order to extract the graph
patterns connecting the instances of the classes corresponding to the semantic
types. Each pattern consists of some nodes and links. The nodes correspond to
ontology classes and the links correspond to ontology properties. Suppose that
we want to nd the patterns connecting three classes c1, c2, and c3. Figure 4
exempli es some of the possible patterns to connect these classes. Depending on
the structure of the domain ontology, there might be a large number of possible
patterns for any given number of ontology classes. For simplicity, in this paper,
we only extract the tree patterns, the patterns in which the number of properties</p>
        <p>Algorithm 1 Construct Graph G</p>
        <p>Input: LOD Patterns, Semantic Types, Domain Ontology
Output: Graph G
. Add LOD patterns
1: sort the patterns descending based on their length
2: exclude the patterns contained in longer patterns
3: merge the nodes and links of the remaining patterns into G</p>
        <p>. Add Semantic Types
4: for each semantic type hclass uri,property urii do
5: add the class to the graph if it does not exist in G
6: end for</p>
        <p>. Add Ontology Paths
7: for each pair of classes ci and cj in G do
8: nd the directed and inherited properties between ci and cj in the ontology
9: add the properties that do not exist in G
10: end for</p>
        <p>return G
is exactly one less than the number of classes (the rst three patters in Figure 4).
That said, we de ne the length of a pattern as the number of links (ontology
properties) in a pattern. For example, a pattern with length one is in the form
of c1 p!c2 indicating that at least one instance of the class c1 is connected to
an instance of the class c2 with the property p. The following SPARQL query
extracts the patterns with length one and their frequencies from the linked data:
SELECT DISTINCT ?c1 ?p ?c2 (COUNT(*) as ?count)
WHERE f
?x ?p ?y.
?x rdf:type ?c1.
?y rdf:type ?c2.</p>
        <p>FILTER (?x != ?y).g
GROUP BY ?c1 ?p ?c2
ORDER BY DESC(?count);</p>
        <p>Querying a triple store containing a huge amount of linked data to mine long
patterns is not e cient. In our experiments, we only extracted the patterns with
length one and two. We will see later that even small graph patterns provide
enough evidence to infer rich semantic models.
3.2</p>
        <p>Merging LOD Patterns into a Graph
Once we extracted the LOD patterns, we combine them into a graph G that
will be used to infer the semantic models. Building the graph has three parts:
(1) adding the LOD patterns, (2) adding the semantic labels assigned to the
source attributes, and (3) expanding the graph with the paths inferred from the
ontology. Algorithm 1 shows the pseudocode of building the graph.</p>
        <p>The graph G is a weighted directed graph in which nodes correspond to
ontology classes and links correspond to ontology properties. The algorithm to
construct the graph is straightforward. However, we adopt a subtle approach
to weight the links. We assign a much lower weight to the links added from
the LOD patterns comparing to the links added from the ontology. Since we
are generating minimum-cost models in the next section, this weighting strategy
gives more priority to the links used more frequently in the linked data. The
weight of the links coming from the LOD patterns has an inverse relation with
the frequency of the patterns.</p>
        <p>The other important feature of the links in the graph is their tags. We assign
an identi er to each pattern added to the graph and annotate the links with the
identi ers of the supporting patterns. Suppose that we are adding two patterns
m1 : c1 p!1c2 p!2c3 and m2 : c1 p!1c2 p!3c4 to G. The link p1 from c1 to c2 will
be tagged with fm1; m2g, the link p2 from c2 to c3 will have only fm1g as its tag
set, and the link p3 from c2 to c4 will be tagged with fm2g. We use the link tags
later to prioritize the models containing larger segments from the LOD patterns.
3.3</p>
        <p>
          Generating and Ranking Semantic Models
The nal part of our approach is to compute the semantic models from the graph.
We map the semantic types to the nodes of the graph and then nd the top k
minimal trees connecting those nodes. The algorithm that nds the top k trees
is a customized version of the BANKS algorithms [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. It creates one iterator for
each of the nodes corresponding to the semantic types, and then the iterators
follow the incoming links to reach a common ancestor.
        </p>
        <p>The BANKS algorithm uses the iterator's distance to its starting point to
decide which link should be followed next. Because our weights have an inverse
relation with their popularity, the algorithm prefers more frequent links.
However, selecting more popular links does not always yield the correct semantic
model. The coherence of the patterns is another important factor that we need
to consider. Thus, we use a heuristic that prefers the links that are parts of the
same pattern even if they have higher weights. Suppose that m1 : c1 p!1c2 and
m2 : c1 p!2c2 p!3c3 are the only patterns used to build the graph G, and the
weight of the link p2 is higher than p1. Assume that c1 and c3 are the semantic
labels. The algorithm creates two iterators, one starting from c1 and one from
c3. The iterator that starts from c3 reaches c2 by following the incoming link
c2 p!3c3. At this point, it analyzes the incoming links of c2 and although p1 has
lower weight, it rst chooses p2 to traverse next. This is because p2 is part of
the pattern m2 which includes the previously traversed link p3.</p>
        <p>Once we computed top k trees, we rank them rst based on their coherence
and then their cost (sum of the weights of the links). The coherence metric gives
priority to the models that contain longer patterns. For example, a model that
includes one pattern with length 3 will be ranked higher than a model including
two patterns with length 2, and the latter in turn will be preferred over a model
with only one pattern with length 2.
4</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation</title>
        <p>To evaluate our approach, we used a dataset of 29 museum data sources in
CSV, XML, or JSON format containing data from di erent art museums in
the US. The total number of attributes for this dataset was 418 (on average
14 attributes per source). We applied our approach on this dataset to nd the</p>
        <p>Evaluation Dataset
number of sources 29
number of attributes 418
number of classes in the ontologies 147
number of properties in the ontologies 409
number of nodes in the gold-standard models 812
number of links in the gold-standard models 785
candidate semantic models for each source and then compared the rst ranked
models with the gold standard models created manually by an expert in CIDOC
Conceptual Reference Model (CIDOC-CRM). Table 1 shows more details of the
evaluation dataset. The dataset including the sources, the domain ontologies,
and the gold standard models is available on GitHub.8. The source code of our
approach is integrated into Karma which is available as open source.9</p>
        <p>The linked data that we used as the background knowledge is the RDF
data published by the Smithsonian American Art Museum. The museum has
made use of the CIDOC-CRM to map out the concepts and relationships that
exist within the artwork collection. This repository includes more than 3 million
triples (3,398,350). We injected the data into a Virtuoso triple store and then
used SPARQL to extract patterns of length one and two. There were 68 distinct
patterns with length one (two nodes and one link) and 634 distinct patterns with
length two (three nodes and two links).</p>
        <p>We assumed that the correct semantic labels for the source attributes are
known. The goal was to see how well our approach learns the attribute
relationships having the correct semantic types. We performed three experiments. First,
we only used the domain ontology to build a graph on top of the semantic labels
and then computed the semantic model connecting those labels. In the second
experiment, we took into account the patterns with length one extracted from
the linked data, and in the third experiment, we used the patterns of both length
one and two. For each source, we computed top 10 candidate semantic models
and compared the rst one with the correct model in the gold standard set.</p>
        <p>We measured the accuracy of the computed semantic models by comparing
them with the gold standard models in terms of precision and recall. Assuming
that the correct semantic model of the source s is sm and the semantic model
learned by our approach is sm0, we de ne the precision and recall as:
precision =
rel(sm) \ rel(sm0)
rel(sm0)
; recall =
rel(sm) \ rel(sm0)
rel(sm)
where rel(sm) is the set of triples (u; v; e) in which e is a link from the node u
to the node v in the semantic model sm. Consider the semantic model in Figure 3,
rel(sm)=f(E22 Man-Made Object, P108i was produced by, E12 Production), g.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>8 https://github.com/taheriyan/cold-2015 9 https://github.com/usc-isi-i2/Web-Karma</title>
      <p>Since the correct semantic types are given, we excluded their corresponding
triples in computing the precision and recall. For example, we do not consider
(E35 Title, label, title) in our evaluation. From the 785 links in the correct
models, 418 links correspond to the semantic types because we have 418 attributes.
Thus, our evaluation measures the accuracy of inferring the remaining 367 links.</p>
      <p>Table 2 shows the average precision and recall for all 29 sources. An
interesting observation is that when we do not use the linked data patterns, the precision
and recall are close to zero. This low accuracy comes from the fact that in most
of the gold standard models, the attributes are not directly connected and there
are multiple paths between each pair of classes in the ontology (and thus in our
graph), and without additional information, we cannot resolve the ambiguity.
When we exploit the patterns with length one, there is a boost in precision and
recall. Since we are using the pattern frequencies in assigning the weights to the
links of the graph, using patterns of length one means that we are only taking
into account the popularity of the links in computing the semantic models. Once
we added the patterns with length two, our approach achieved more than 10%
improvement in both precision and recall. This means that considering
coherence (even to a small extent) in addition to the link popularity empowers our
approach to derive more accurate semantic models.</p>
      <p>The column time in Table 2 shows the running time of our algorithm on a
single machine with a Mac OS X operating system and a 2.3 GHz Intel Core i7
CPU. This is the time from combining pre-extracted LOD patterns into a graph
until generating and ranking candidate semantic models. The algorithm is much
faster when we only use the domain ontology as the background knowledge,
because adding LOD patterns will be excluded from the graph construction
process (lines 1-3 in Algorithm 1). The reason why more time is required when
we use patterns of length one comparing to the case where we use patterns of
both length one and two is related to details of our algorithm to compute top k
minimal trees. Although it takes longer to create the graph when we add patterns
of length two, the algorithm to generate the candidate models nds top k trees
faster reducing the total running time.
5</p>
      <sec id="sec-4-1">
        <title>Related Work</title>
        <p>There has been many studies in the Semantic Web to automatically describe the
semantics of data sources as a mapping from the source to an ontology. Since the
focus of our work is on inferring the semantic relations, we compare our work
with the ones that pay attention to inferring semantic relationship and not only
semantic labeling.</p>
        <p>
          Limaye et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] used YAGO10 to annotate web tables and generate binary
relationships using machine learning approaches. Venetis et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] presented a
scalable approach to describe the semantics of tables on the Web. To recover the
semantics of tables, they leverage a database of class labels and relationships
automatically extracted from the Web. They attach a class label to a column if
a su cient number of the values in the column are identi ed with that label in
the database of class labels, and analogously for binary relationships. Although
these approaches are very useful in publishing semantic data from tables, they
are limited in learning the semantics of sources as a united model. Both of these
approaches only infer individual binary relationships between pair of columns.
They are not able to nd the relation between two columns if no relationship
is directly instantiated between the values of those columns. Our approach can
connect one column to another one through a path in the ontology.
        </p>
        <p>
          Carman and Knoblock [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] use known source descriptions to learn a semantic
description that precisely describes the relationship between the inputs and
outputs of a source, expressed as a Datalog rule. However, their approach is limited
in that it can only learn sources whose models are subsumed by the models of
known sources. That is, the description of a new source is a conjunctive
combinations of known source descriptions.
        </p>
        <p>
          In our earlier Karma work [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we build a graph from learned semantic types
and a domain ontology and use this graph to semi-automatically map a source
to the ontology. Since only using the ontology does not necessarily generates
accurate models, we had the user in the loop to interactively re ne the suggested
models. Later, we introduced an automatic approach that exploits the semantic
models of similar data sources in addition to the domain ontology to learn a
model for a new unknown source [
          <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
          ]. Our work in this paper complements our
previous work in cases where few, if any, known semantic models are available. In
fact, the number of known semantic models is limited in many domains, however,
there may be a huge amount of semantic data published in those domain. The
presented approach mines the small graph patterns from the available linked
data to infer the semantic relationships within data sources.
        </p>
        <p>
          Our work is closely related to other work leveraging the Linked Open Data
(LOD) cloud to capture the semantics of sources. Mulwad et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] used
Wikitology [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], an ontology which combines some existing manually built knowledge
systems such as DBPedia and Freebase [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], to link cells in a table to Wikipedia
entities. They query the background LOD to generate initial lists of candidate
classes for column headers and cell values and candidate properties for relations
between columns. Then, they use a probabilistic graphical model to nd the
correlation between the columns headers, cell values, and relation assignments. The
quality of the semantic data generated by this category of work is highly
dependent on how well the data can be linked to the entities in the LOD. While for
most popular named entities there are good matches in the LOD, many tables
contain domain-speci c information or numeric values (e.g., temperature and
age) that cannot be linked to the LOD. Moreover, these approaches are only
10 http://www.mpi-inf.mpg.de/yago-naga/yago
able to identify individual binary relationships between the columns of a table.
However, an integrated and united semantic model is more than fragments of
binary relationships between the columns. In a complete semantic model, the
columns may be connected through a path including the nodes that do not
correspond to any column in the table.
6
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussion</title>
        <p>We presented a novel approach to infer semantic relations within structured
sources. Understanding how the source attributes are related is an essential part
of building a precise semantic model for a source. Such models automate the
process of publishing semantic data. The core idea of our work is to exploit the
small graph patterns occurring in the Linked Open Data to hypothesize attribute
relationships within a data source.</p>
        <p>Our evaluation results support the theory that more accurate models can
be constructed when longer graph patterns from LOD are used. Although these
patterns can be pre-computed, using SPARQL queries to extract long patterns
from a large number of triples is challenging. For example, the Virtuoso response
time to the SPARQL query to extract patterns of length two with a chain shape
(c1 p!2c2 p!3c3) was approximately 90 seconds, and it was roughly 1 hour for the
query to extract patterns of length two having a V-shape (c1 p!2c2 p3 c3). We
were only able to collect a few patterns with length three and could not extract
any pattern with length four from our Virtuoso server in a 5-hour timeout.
E ciently mining more complex patterns from the linked data is one direction
of our future work.</p>
        <p>
          One limitation of the presented approach is that it heavily depends on the
linked data at hand. It assumes that there is su cient amount of linked data
available in the same domain that we are modeling the target data sources.
Another direction of our future work is to extend our approach to cases where
no or sparse data is available. We want to investigate how the current method
can be combined with our previous work that uses known semantic models to
learn semantic models of structured sources [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ].
        </p>
        <p>Our work plays a role in helping communities to produce consistent Linked
Data so that sources containing the same type of data use the same classes and
properties when published in RDF. Often, there are multiple correct ways to
model the same type of data. A community is better served when all the data
with the same semantics is modeled using the same classes and properties. Our
work encourages consistency because our algorithms bias selection of classes and
properties towards those used more frequently in existing data.</p>
        <p>Acknowledgements. This research was supported in part by the National
Science Foundation under Grant No. 1117913 and in part by Defense Advanced
Research Projects Agency (DARPA) via AFRL contract number
FA8750-14-C0240. The U.S. Government is authorized to reproduce and distribute reprints for
Governmental purposes notwithstanding any copyright annotation thereon. The
views and conclusions contained herein are those of the authors and should not
be interpreted as necessarily representing the o cial policies or endorsements,
either expressed or implied, of NSF, DARPA, AFRL, or the U.S. Government.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bhalotia</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hulgeri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakhe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sudarshan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Keyword Searching and Browsing in Databases Using BANKS</article-title>
          .
          <source>In: Proceedings of the 18th International Conference on Data Engineering</source>
          . pp.
          <volume>431</volume>
          {
          <issue>440</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bollacker</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paritosh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sturge</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , J.:
          <article-title>Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge</article-title>
          .
          <source>In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <volume>1247</volume>
          {
          <fpage>1250</fpage>
          . SIGMOD '08,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Carman</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          :
          <article-title>Learning Semantic De nitions of Online Information Sources</article-title>
          .
          <source>Journal of Arti cial Intelligence Research</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <volume>1</volume>
          {
          <fpage>50</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DiFranzo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michaelis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>TWC data-gov corpus: incrementally generating linked government data from data.gov</article-title>
          . In: Rappa,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Chakrabarti</surname>
          </string-name>
          , S. (eds.) WWW. pp.
          <volume>1383</volume>
          {
          <fpage>1386</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parr</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sachs</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>RDF123: From Spreadsheets to RDF</article-title>
          .
          <source>In: The Semantic Web - ISWC</source>
          <year>2008</year>
          , pp.
          <volume>451</volume>
          {
          <issue>466</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ambite</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lerman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muslea</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taheriyan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mallick</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Semi-Automatically Mapping Structured Sources into the Semantic Web</article-title>
          .
          <source>In: Proc. 9th Extended Semantic Web Conference</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Krishnamurthy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mittal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Assigning Semantic Labels to Data Sources</article-title>
          .
          <source>In: Proceedings of the 12th Extended Semantic Web Conference (ESWC) (May</source>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Limaye</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Annotating and Searching Web Tables Using Entities, Types and Relationships</article-title>
          .
          <source>PVLDB</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <volume>1338</volume>
          {
          <fpage>1347</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mulwad</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Semantic Message Passing for Generating Linked Data from Tables</article-title>
          .
          <source>In: The Semantic Web{ISWC</source>
          <year>2013</year>
          , pp.
          <volume>363</volume>
          {
          <fpage>378</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Pol iet, S.,
          <string-name>
            <surname>Ichise</surname>
          </string-name>
          , R.:
          <source>Automated Mapping Generation for Converting Databases into Linked Data</source>
          . In: Polleres,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Chen</surname>
          </string-name>
          , H. (eds.)
          <source>ISWC Posters&amp;Demos. CEUR Workshop Proceedings</source>
          , vol.
          <volume>658</volume>
          .
          <string-name>
            <surname>CEUR-WS.org</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Saquicela</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blazquez</surname>
            ,
            <given-names>L.M.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Lightweight Semantic Annotation of Geospatial RESTful Services</article-title>
          .
          <source>In: Proceedings of the 8th Extended Semantic Web Conference (ESWC)</source>
          . pp.
          <volume>330</volume>
          {
          <issue>344</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Syed</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Creating and Exploiting a Hybrid Knowledge Base for Linked Data</article-title>
          .
          <source>In: Agents and Arti cial Intelligence</source>
          , pp.
          <volume>3</volume>
          {
          <fpage>21</fpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Taheriyan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ambite</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>A Graph-based Approach to Learn Semantic Descriptions of Data Sources</article-title>
          .
          <source>In: Procs. 12th International Semantic Web Conference (ISWC)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Taheriyan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ambite</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>A Scalable Approach to Learn Semantic Models of Structured Sources</article-title>
          .
          <source>In: Semantic Computing (ICSC)</source>
          ,
          <year>2014</year>
          IEEE International Conference on. pp.
          <volume>183</volume>
          {
          <issue>190</issue>
          (
          <year>June 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Venetis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasca</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Recovering Semantics of Tables on the Web</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>4</volume>
          (
          <issue>9</issue>
          ),
          <volume>528</volume>
          {538 (
          <year>June 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          :
          <article-title>Understanding Tables on the Web</article-title>
          . In: Atzeni,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.W.</given-names>
            ,
            <surname>Ram</surname>
          </string-name>
          , S. (eds.)
          <source>ER. Lecture Notes in Computer Science</source>
          , vol.
          <volume>7532</volume>
          , pp.
          <volume>141</volume>
          {
          <fpage>155</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>