<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adding Biodiversity Datasets from Argentinian Patagonia to the Web of Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcos Zarate</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>German Braun</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Fillottrani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro para el Estudio de Sistemas Marinos, Centro Nacional Patagonico</institution>
          ,
          <addr-line>CESIMAR-CENPAT</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Comisión de Investigaciones Científicas de la Provincia de Buenos Aires</institution>
          ,
          <addr-line>CIC</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Computer Science and Engineering Department, Universidad Nacional del Sur</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Consejo Nacional de Investigaciones Científicas y Técnicas</institution>
          ,
          <addr-line>CONICET</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Universidad Nacional de la Patagonia San Juan Bosco</institution>
          ,
          <addr-line>UNPSJB</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Universidad Nacional del Comahue</institution>
          ,
          <addr-line>UNCOMA</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This poster presents a framework to publish biodiversity data from Argentinian Patagonia as Linked Open Data (LOD). These datasets contain information on biological species (mammals, plants, parasites, among others) collected by researchers from the Centro Nacional Patagonico (CENPAT), and were initially made available as Darwin Core Archives (DwC-A).</p>
      </abstract>
      <kwd-group>
        <kwd>Biocollections</kwd>
        <kwd>Darwin Core</kwd>
        <kwd>Linked Open Data</kwd>
        <kwd>RDF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Animal, plant and marine biodiversity comprise the "natural capital" that keeps
our ecosystems functional and our economies productive. However, the world
is experiencing a dramatic loss of biodiversity, and its impact is being
analysed by digitising and publishing biological collections. To this end, the
biodiversity community has standardised shared vocabularies such as
Darwin Core (DwC) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], together with platforms such as the Integrated Publishing
Toolkit (IPT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], for publishing and sharing biodiversity data. As a
consequence, the biodiversity community now has hundreds of millions of records
published in common formats and aggregated into centralised portals.
This success raises new challenges for using such a large
volume of data effectively. In particular, as the numbers of species, geographic regions, and
institutions continue to grow, answering questions about the complex
interrelationships among these data becomes increasingly difficult. The Semantic Web
(SW) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provides possible solutions to these problems by enabling the Web of
Linked Data (LD) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where data objects are uniquely identified and the
relationships amongst them are explicitly defined. LD is a powerful and compelling
approach for disseminating and consuming scientific data. It involves publishing,
sharing, and connecting data on the Web, and offers a new way of achieving data
integration and interoperability. Moreover, the advantages of LD technologies
are increasingly recognised in the life sciences. In this same direction,
CENPAT<sup>1</sup> has started to share its data publicly under an Open Data licence<sup>2</sup> through
the IPT.<sup>3</sup> Data are available as Darwin Core Archives (DwC-A), a
biodiversity data standard that makes use of the DwC terms. A DwC-A is composed of a
set of files describing the structure and relationships of the raw data, along
with metadata files conforming to the DwC standard. Nevertheless, the well-known
IPT platform focuses on publishing content in unstructured or semi-structured
formats, which reduces the possibilities to interoperate with other datasets and
to make the data accessible to machines. To enhance this approach, we present a
transformation process that runs from data extraction to publication as RDF datasets. This
process uses OpenRefine<sup>4</sup> to generate RDF triples from semi-structured data
and to define URIs. It also uses GraphDB for storing, browsing and accessing the data
and for linking them with external RDF datasets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Architecture Overview</title>
      <p>Publishing data as LD involves data cleaning, mapping and a conversion process
from DwC-A to RDF triples. The architecture of this process is shown in
Fig. 1 and is structured in the four steps described below.</p>
      <p>(1) Data Extraction, Cleaning and Reconciliation Process. The DwC-A
are manually extracted from the IPT repository, and the occurrence files
are pre-processed (cleaning, conversion of data types, elimination of null
values, etc.) using the OpenRefine tool. OpenRefine also allows adding reconciliation
services based on SPARQL endpoints, which return candidate resources
from external datasets to be reconciled with fields from local datasets. In our
process we use the DBpedia endpoint to reconcile the Country column with the
dbo:country resource in DBpedia. Another reconciliation service, provided
by the Encyclopedia of Life (EOL),<sup>5</sup> reconciles taxonomic names. This
service is applied to the scientificName column to obtain the URL of the EOL
page describing the species.</p>
      <p>(2) RDF Schema Alignment and URI Definition. After data cleaning and
reconciliation, data are converted to RDF triples
using RDF Refine.<sup>6</sup> The RDF schema alignment skeleton specifies the subject,
predicate and object of the triples to be generated. The next step is to set
up prefixes for well-known vocabularies such as the W3C Basic Geo ontology,
DBpedia, FOAF, DwC and Darwin-SW to establish relationships between DwC
classes. Each resource must have a URI so that it can be linked to other resources,
both within this dataset and in others anywhere on the Web. The common base
URI for all the resources we define is http://crowd.fi.uncoma.edu.ar:3333/.</p>
      <p>(3) Interlinking. The OpenRefine reconciliation service is able to match some links
to DBpedia, but since it is still limited, our process relies on a more powerful
tool to discover links to other datasets. In this context, SILK<sup>7</sup> offers a graphical
editor that can be used to create linkage rules. For example, the links to DBpedia
have been generated by comparing the genus of the species, described by
the term dwc:genus in DwC and dbo:genus in DBpedia. The links between
our RDF data and DBpedia use the owl:sameAs predicate.</p>
      <p>(4) Publishing and Accessing Data. The transformed biodiversity data have
been published and can be accessed through GraphDB,<sup>8</sup> which allows users to
explore the hierarchy of RDF classes (Class hierarchy). Similarly, the relationships
among these classes can be explored (Class relationship), giving an overview of how many
links exist between instances of two classes.</p>
      <p><sup>1</sup> Patagonian National Research Centre, http://www.cenpat-conicet.gob.ar/</p>
      <p><sup>2</sup> https://creativecommons.org/licenses/by/4.0/legalcode</p>
      <p><sup>3</sup> https://www.gbif.org/ipt</p>
      <p><sup>4</sup> http://openrefine.org/</p>
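      <p>The cleaning, URI definition and interlinking steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OpenRefine, RDF Refine or SILK configuration used in this work: the sample rows, the taxon/ path under the base URI and the hard-coded DBpedia target are hypothetical, and only the base URI and the DwC and OWL terms come from the text.</p>
      <preformat><![CDATA[
```python
import csv
import io
from urllib.parse import quote

BASE_URI = "http://crowd.fi.uncoma.edu.ar:3333/"  # common base URI given in the text
DWC = "http://rs.tdwg.org/dwc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Illustrative occurrence rows (assumption: not taken from the CENPAT datasets).
SAMPLE_CSV = (
    "scientificName,genus,country\n"
    " Mirounga leonina ,Mirounga,Argentina\n"
    ",,Argentina\n"
)


def clean_rows(rows):
    """Step 1, simplified: trim whitespace, drop empty values and unnamed rows."""
    for row in rows:
        cleaned = {k: v.strip() for k, v in row.items()
                   if isinstance(v, str) and v.strip()}
        if "scientificName" in cleaned:
            yield cleaned


def taxon_uri(scientific_name):
    """Step 2: mint a URI under the base URI (the 'taxon/' path is hypothetical)."""
    return BASE_URI + "taxon/" + quote(scientific_name.replace(" ", "_"))


def literal(value):
    """Serialise a plain string as an N-Triples literal with basic escaping."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def to_ntriples(row, same_as=None):
    """Steps 2 and 3: emit triples for one taxon, plus an optional owl:sameAs link."""
    subject = "<" + taxon_uri(row["scientificName"]) + ">"
    triples = [f"{subject} <{RDF_TYPE}> <{DWC}Taxon> ."]
    for term in ("scientificName", "genus"):
        if term in row:
            triples.append(f"{subject} <{DWC}{term}> {literal(row[term])} .")
    if same_as:
        triples.append(f"{subject} <{OWL_SAME_AS}> <{same_as}> .")
    return triples


if __name__ == "__main__":
    rows = list(clean_rows(csv.DictReader(io.StringIO(SAMPLE_CSV))))
    for row in rows:
        # The DBpedia target would come from the interlinking step; hard-coded here.
        link = "http://dbpedia.org/resource/Southern_elephant_seal"
        print("\n".join(to_ntriples(row, same_as=link)))
```
]]></preformat>
      <p>The resulting N-Triples lines could then be loaded into a triple store such as GraphDB, which corresponds to step (4).</p>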
    </sec>
    <sec id="sec-3">
      <title>Case Study: Conservation Status of Species</title>
      <p>In this section we present a simple SPARQL query<sup>9</sup> to determine the
conservation status of the species in our dataset. Since this information is not present
in our dataset, we obtain it by following the owl:sameAs links to DBpedia; in
this way, our dataset benefits from information that it did not previously contain.</p>
      <preformat>PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt;
PREFIX owl: &lt;http://www.w3.org/2002/07/owl#&gt;
PREFIX dwc: &lt;http://rs.tdwg.org/dwc/terms/&gt;
PREFIX txn: &lt;http://lod.taxonconcept.org/ontology/txn.owl#&gt;

SELECT ?scname ?eol_page ?c_status
WHERE {
  # Local data: taxa with their scientific name, EOL page and DBpedia link
  ?s a dwc:Taxon .
  ?s dwc:scientificName ?scname .
  ?s txn:hasEOLPage ?eol_page .
  ?s owl:sameAs ?resource .
  # Federated part: fetch the conservation status from DBpedia
  SERVICE &lt;http://dbpedia.org/sparql&gt; {
    ?resource dbo:conservationStatus ?c_status .
  }
}</preformat>
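      <p>A client could submit the query above over the standard SPARQL protocol and read the JSON result bindings. The following is a minimal sketch using only the Python standard library; the endpoint URL (a local GraphDB repository named cenpat) is a hypothetical placeholder, since the text does not give the address of the query endpoint.</p>
      <preformat><![CDATA[
```python
import json
import urllib.parse
import urllib.request

# The query from this section, with whitespace normalised.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX txn: <http://lod.taxonconcept.org/ontology/txn.owl#>
SELECT ?scname ?eol_page ?c_status
WHERE {
  ?s a dwc:Taxon .
  ?s dwc:scientificName ?scname .
  ?s txn:hasEOLPage ?eol_page .
  ?s owl:sameAs ?resource .
  SERVICE <http://dbpedia.org/sparql> {
    ?resource dbo:conservationStatus ?c_status .
  }
}
"""


def build_request(endpoint, query):
    """Build a SPARQL protocol POST request with a URL-encoded 'query' parameter."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": "application/sparql-results+json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(endpoint, data=data, headers=headers)


def parse_bindings(payload):
    """Flatten a SPARQL JSON result document into a list of {variable: value} dicts."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query and return the parsed rows (requires network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=30) as resp:
        return parse_bindings(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint: adjust to the actual repository URL.
    for row in run_query("http://localhost:7200/repositories/cenpat", QUERY):
        print(row.get("scname"), row.get("eol_page"), row.get("c_status"))
```
]]></preformat>
      <p>Because the SERVICE clause is evaluated by the local endpoint, the client only needs to contact one server; the call to DBpedia happens on the server side.</p>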
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>To test our architecture we use the datasets belonging to CENPAT,
which are available as DwC-A on an institutional IPT server. These datasets
include collections of marine, terrestrial, parasite and plant species, mainly from
Argentinian Patagonia. As of July 2017, CENPAT owns 33 datasets
representing about 273,419 occurrence records, 80% of which have been
georeferenced. In this initial stage only three datasets were converted to RDF, and our
platform stores 202,119 RDF triples. To help users exploit the
dataset, we define some SPARQL queries and their corresponding visualisations
using the statistical software R. The scripts can be downloaded from the project repository,<sup>10</sup>
and a complete description of the proposed architecture can be found in the project wiki.<sup>11</sup></p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Outlook</title>
      <p>In this poster we have presented the CENPAT Linked Open Biodiversity
dataset, which exposes public biodiversity data, mainly on species from
Argentinian Patagonia, as LOD. The aim is to facilitate researchers' access
to important data and thus to support the scientific analysis
of biodiversity. In addition, this work is the first Argentinian initiative to
convert biodiversity data according to the criteria established by LOD.</p>
      <p>Finally, our approach has some limitations that we leave as future work.
Since the process is currently carried out manually, we need to provide more advanced
options and support automated execution of
the extraction and conversion pipelines using, for example, LinkedPipes ETL.<sup>12</sup></p>
      <p><sup>10</sup> https://github.com/cenpat-gilia/CENPAT-GILIA-LOD/tree/master/r-scripts</p>
      <p><sup>11</sup> https://github.com/cenpat-gilia/CENPAT-GILIA-LOD/wiki</p>
      <p><sup>12</sup> https://etl.linkedpipes.com/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>John</given-names>
            <surname>Wieczorek</surname>
          </string-name>
          , David Bloom,
          <string-name>
            <given-names>Robert</given-names>
            <surname>Guralnick</surname>
          </string-name>
          , Stan Blum, Markus Doring, Renato Giovanni, Tim Robertson, and David Vieglais.
          <article-title>Darwin core: An evolving community-developed biodiversity data standard</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Tim</given-names>
            <surname>Robertson</surname>
          </string-name>
          , Markus Doring, Robert Guralnick, David Bloom, John Wieczorek, Kyle Braak, Javier Otegui, Laura Russell, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Desmet</surname>
          </string-name>
          .
          <article-title>The GBIF integrated publishing toolkit: facilitating the efficient publishing of biodiversity data on the internet</article-title>
          .
          <source>PLoS One</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):e102623,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ora</given-names>
            <surname>Lassila</surname>
          </string-name>
          , et al.
          <article-title>The semantic web</article-title>
          .
          <source>Scientific American</source>
          ,
          <volume>284</volume>
          (
          <issue>5</issue>
          ):
          <fpage>28</fpage>
          –
          <lpage>37</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Tom Heath, and
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Semantic services, interoperability and web applications: emerging concepts</source>
          , pages
          <fpage>205</fpage>
          –
          <lpage>227</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>