<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The GeoLink Framework for Pattern-based Linked Data Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adila Krisnadhi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yingjie Hu</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Janowicz</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Hitzler</string-name>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Arko</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suzanne Carbotte</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cynthia Chandler</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelle Cheatham</string-name>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Douglas Fils</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timothy Finin</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peng Ji</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Jones</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazifa Karima</string-name>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kerstin Lehnert</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Audrey Mickle</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Narock</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Margaret O'Brien</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lisa Raymond</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Shepherd</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Schildhauer</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Wiebe</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consortium for Ocean Leadership</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Computer Science, Universitas Indonesia</institution>
          ,
          <country country="ID">Indonesia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lamont-Doherty Earth Observatory, Columbia University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Marymount University</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of California</institution>
          ,
          <addr-line>Santa Barbara</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Maryland</institution>
          ,
          <addr-line>Baltimore County</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Woods Hole Oceanographic Institution</institution>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>Wright State University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>GeoLink is one of the building block projects within EarthCube, a major effort of the National Science Foundation to establish a next-generation knowledge infrastructure for the geosciences. Specifically, GeoLink aims to improve data reuse and integration across seven geoscience data repositories through the use of ontologies. In this paper, we present the approach taken by this project, which combines linked data publishing and modular ontology engineering based on ontology design patterns to realize integration while respecting the existing heterogeneity within the participating repositories.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the establishment of dozens of data repositories, data integration has become
a major challenge for the ocean science and wider geoscience research community. The
problem stems from the fact that data repositories were established to serve
specific parts of the community, which has led to a very high degree of data
heterogeneity in formats, methods of access, and conceptualization. The GeoLink project
(see www.geolink.org, schema.geolink.org, and data.geolink.org),
a part of EarthCube, the National Science Foundation (NSF)'s larger effort to
establish a next-generation knowledge infrastructure, aims to develop a flexible
and extendible data integration framework, starting from seven major ocean
science data repositories (BCO-DMO, DataONE, IEDA, IODP, LTER, MBLWHOI Library, and R2R),
by leveraging Linked Data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Ontology Design
Patterns (ODPs) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. With Linked Data, repositories describe and publish their
data using a standard model that includes links to data in other
repositories. Meanwhile, horizontal alignment across different repositories with possibly
independent semantic models can be achieved with the help of ODPs. In this
paper, we present the ODP-based data integration approach employed in GeoLink, and we
specifically invite the conference attendees to our corresponding poster
presentation, since this paper complements our ISWC 2015 ontology paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which
describes the details of the GeoLink ODP collection, with a description of the
GeoLink architecture for cross-repository discovery that still provides sufficient
flexibility and extendibility for data providers.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Integration Framework with ODPs</title>
      <p>The GeoLink data integration framework, depicted in Figure 1, essentially has
three layers: the data sources (repositories), the global schema, and the user
interface. The data sources are assumed to be linked data repositories, so RDF is
available as the standard data model at the very least. The more challenging problem
is integration at the semantic level. Intuitively, the role of the global schema is to
provide a common vocabulary that a user can use to perform data discovery.</p>
      <p>At first glance, this three-layered framework resembles a typical
ontology-based data integration framework. In such a framework, the global
schema is realized as a single ontology, typically a monolithic, upper-level
ontology. The problem with this typical approach is that such an ontology is
cumbersome to use and very hard to understand, especially for data providers
who may not have the necessary expertise. Moreover, semantic heterogeneity
in the data across different repositories greatly increases the difficulty of
using and maintaining the ontology. When a new data repository wishes to join
the framework, a complicated adjustment of the ontology may become necessary
to ensure the existing integration does not break.</p>
      <p>Figure 2. A SPARQL CONSTRUCT query that populates the Cruise ODP from BCO-DMO data:
<preformat>CONSTRUCT {
  ?x a :Cruise ;
     :providesAgentRole [ a :ChiefScientistRole ; :isPerformedBy ?p ] .
} WHERE {
  ?x a bcodmo:Deployment ;
     bcodmo:ofPlatform [ a bcodmo:Vessel ] ;
     bcodmo:hasChiefScientist ?p .
}</preformat></p>
      <p>
        To alleviate this problem, the GeoLink framework employs a set of ODPs as the
global schema instead of one monolithic ontology. Each ODP models a generic
notion in a particular domain. The key part of the approach is therefore identifying
a number of generic notions in ocean science that are relevant to the data repositories
involved in the project. The project then proceeded by collaboratively modeling each
notion, one by one and in a modular way, identifying widely reusable and
recurring aspects of those notions. Each notion yields one ODP, realized
as a self-contained, highly modular ontology that is sufficient to define the
given notion precisely without imposing overly strong ontological commitments. A
high-level overview of the set of ODPs for GeoLink can be seen in the middle
layer of Figure 1. Details of these ODPs are given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], including the collaborative
modeling efforts between the ontology engineers and the domain experts that were needed
to make sure the ODPs are grounded in representative, real use cases.
      </p>
      <p>Data providers can then flexibly join this integration framework by
populating the ODPs, that is, by ensuring that the RDF triples representing their data
are annotated with vocabulary from the ODPs and are available as linked data. This can
be done in different ways, e.g., by exposing the data via a SPARQL endpoint or by
providing dumps of RDF triples. This constitutes the intermediate layer of "alignment"
between the data repositories and the ODPs in Figure 1. From the users' perspective, if
the data from all data providers are annotated with the ODPs, they will in principle
see only one RDF graph (not necessarily residing in a central hub), which aggregates
data from all participating repositories. Vocabulary in the ODPs can then be
used as the language in which federated queries over the data are formulated.
For data providers, however, populating the ODPs may not be as straightforward.</p>
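      <p>As a rough illustration of the "one RDF graph" view described above, the following Python sketch merges triples from two providers into a single set and answers a discovery query phrased only in ODP terms. It is not GeoLink code: plain tuples stand in for RDF triples, a simple pattern matcher stands in for a federated SPARQL query, and the prefix gl:, the property gl:hasLabel, the class gl:Dataset, the instance identifiers, and the helper name match are all hypothetical.</p>
      <preformat>
```python
# Sketch only: tuples stand in for RDF triples, None acts as a wildcard.
# All identifiers below are made-up examples, not actual GeoLink data.

def match(graph, s=None, p=None, o=None):
    """Return the triples in graph that agree with every non-None position."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Two providers publish data already annotated with ODP vocabulary ("gl:").
repo_a = {
    ("a:cruise1", "rdf:type", "gl:Cruise"),
    ("a:cruise1", "gl:hasLabel", "Cruise One"),
}
repo_b = {
    ("b:cruise9", "rdf:type", "gl:Cruise"),
    ("b:dataset3", "rdf:type", "gl:Dataset"),
}

# From the user's perspective there is one graph: the union of all providers.
merged = repo_a | repo_b

# Discovery query in ODP terms: which resources are cruises?
cruises = sorted(t[0] for t in match(merged, p="rdf:type", o="gl:Cruise"))
print(cruises)
```
      </preformat>
      <p>The query mentions only the shared vocabulary and never the provider, which is the property the ODP layer is meant to give users. In an actual deployment the union would typically not be materialized in one place.</p>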
      <p>Essentially, the GeoLink framework offers two approaches for data providers
to populate the ODPs. In the first approach, data providers annotate their data by
directly employing the vocabulary defined by the ODPs. The specification of the
vocabulary can be easily obtained as OWL files from GeoLink. Unfortunately,
some data providers, especially those that already have their
own linked data schema or use their own vocabulary of choice, may be reluctant to
take the first approach. In the second approach, data providers therefore provide
a schema mapping to the GeoLink ODPs. Such a schema mapping can be expressed
as a SPARQL CONSTRUCT query, which can be used to generate
RDF triples either in batch or on the fly. For example, the schema in BCO-DMO does
not contain the class Cruise explicitly, but rather understands a Cruise as a
Deployment whose platform is a Vessel. The Cruise class in the Cruise ODP can
thus be populated by executing a query like the one in Figure 2. The query also
generates a new node for ChiefScientistRole, since BCO-DMO models the chief
scientist of a Deployment using a property hasChiefScientist. This query illustrates
the flexibility of the framework: data providers do not need to change their
schema, nor do the patterns need to be modified, so long as such a data
transformation query can be written. This second approach also opens the framework
to data providers who prefer a repository-specific schema over direct use of the
ODPs, hence lowering the barrier to integration.</p>
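      <p>To make the effect of the mapping in Figure 2 concrete, the sketch below re-expresses it in plain Python over tuples that stand in for RDF triples. It is an analogue written for illustration, not the SPARQL engine a provider would actually run: the function name map_deployments_to_cruises, the prefix gl: for the ODP vocabulary, the class bcodmo:Mooring, and all instance identifiers are assumptions introduced here.</p>
      <preformat>
```python
import itertools

# Sketch only: a plain-Python analogue of the CONSTRUCT mapping in Figure 2.
# Tuples stand in for RDF triples; all instance identifiers are made up.

def map_deployments_to_cruises(source):
    """Emit ODP-style triples for every Deployment whose platform is a Vessel."""
    types = {(s, o) for s, p, o in source if p == "rdf:type"}
    fresh = itertools.count(1)  # numbers the newly generated role nodes
    out = []
    for x, _ in sorted(t for t in types if t[1] == "bcodmo:Deployment"):
        platforms = [o for s, p, o in source
                     if s == x and p == "bcodmo:ofPlatform"]
        if not any((v, "bcodmo:Vessel") in types for v in platforms):
            continue  # the platform is not a vessel, so this is not a cruise
        scientists = sorted(o for s, p, o in source
                            if s == x and p == "bcodmo:hasChiefScientist")
        for person in scientists:
            role = f"_:role{next(fresh)}"  # plays the part of the blank node
            out += [
                (x, "rdf:type", "gl:Cruise"),
                (x, "gl:providesAgentRole", role),
                (role, "rdf:type", "gl:ChiefScientistRole"),
                (role, "gl:isPerformedBy", person),
            ]
    return list(dict.fromkeys(out))  # drop duplicates, keep order

bco_dmo = [
    ("d:dep1", "rdf:type", "bcodmo:Deployment"),
    ("d:dep1", "bcodmo:ofPlatform", "d:ship1"),
    ("d:ship1", "rdf:type", "bcodmo:Vessel"),
    ("d:dep1", "bcodmo:hasChiefScientist", "d:person1"),
    ("d:dep2", "rdf:type", "bcodmo:Deployment"),
    ("d:dep2", "bcodmo:ofPlatform", "d:buoy1"),
    ("d:buoy1", "rdf:type", "bcodmo:Mooring"),  # hypothetical non-vessel platform
    ("d:dep2", "bcodmo:hasChiefScientist", "d:person2"),
]

cruise_triples = map_deployments_to_cruises(bco_dmo)
for triple in cruise_triples:
    print(triple)
```
      </preformat>
      <p>Only the first deployment is reported as a cruise, and the source triples are left untouched, which mirrors the point that providers keep their own schema while the mapping produces the ODP view.</p>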
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Outlook</title>
      <p>We have presented a data integration framework based on ontology design
patterns. The use of ODPs allows us to achieve cross-repository discovery while
respecting the semantic heterogeneity residing within each repository. For future
work in the context of the GeoLink project, we plan to reach out to more partners
from other EarthCube projects to participate in the framework and to test the
effectiveness and robustness of the approach, which may include extending the
current set of ODPs. We also plan to explore possibilities for automating some
parts of the framework, for instance by leveraging advances in ontology alignment
to help data providers establish alignments to the patterns. We are also looking
at different computational issues in the implementation of the framework, as
well as bringing reasoning into the picture, e.g., for detecting inconsistency and
incompleteness in the data or for smarter discovery. Finally, we plan to conduct a
usability test from the perspective of the data consumers, i.e., the geoscientists.</p>
      <p>Acknowledgement. The presented work has been primarily funded by the
National Science Foundation under the award 1440202 "EarthCube Building Blocks:
Collaborative Proposal: GeoLink - Leveraging Semantics and Linked Data for
Data Sharing and Discovery in the Geosciences."</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Ontology design patterns for semantic web content</article-title>
          . In: Gil, Y., et al. (eds.)
          <source>The Semantic Web - ISWC 2005, 4th International Semantic Web Conference, ISWC 2005, Galway, Ireland, November 6-10, 2005, Proceedings</source>
          .
          <series>Lecture Notes in Computer Science</series>
          , vol.
          <volume>3729</volume>
          , pp.
          <fpage>262</fpage>
          -
          <lpage>276</lpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Krisnadhi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janowicz</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arko</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carbotte</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheatham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fils</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karima</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehnert</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mickle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Brien</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raymond</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shepherd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schildhauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiebe</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The GeoLink modular oceanography ontology</article-title>
          .
          In:
          <source>The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference</source>
          , Bethlehem, PA, USA, October 11-15, 2015 (
          <year>2015</year>
          ), accepted for publication.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>