<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LEAPS: A Semantic Web and Linked data framework for the Algal Biomass Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Monika Solanki</string-name>
          <email>m.solanki@aston.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Skarka</string-name>
          <email>johannes.skarka@kit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aston University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology</institution>
          ,
          <addr-line>ITAS</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present LEAPS, a Semantic Web and Linked data framework for searching and visualising datasets from the domain of algal biomass. LEAPS provides stakeholders in this domain with tailored interfaces for exploring algal biomass datasets via REST services and a SPARQL endpoint. The suite of datasets includes data about potential algal biomass cultivation sites, sources of CO2, the pipelines connecting the cultivation sites to the CO2 sources, and a subset of the biological taxonomy of algae derived from the world's largest online information source on algae.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
        Algal biomass holds great promise. The use of microalgae as a food source for
humans has been considered for overpopulated countries and for space travel
since as early as 1961 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. If algae are grown under suitable environmental
conditions, the protein yield can be quite high. Algae have been collected for
more than 4000 years in China and Japan for use as human food.
      </p>
      <p>
        Recently, the idea that algae-biomass-based biofuels could serve as an
alternative to fossil fuels has been embraced by councils across the globe. Major
companies [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], government bodies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and dedicated non-profit organisations such
as ABO (Algal Biomass Organisation) and EABA (European Algal Biomass
Association) have been pushing the case for research into clean energy sources,
including algae-biomass-based biofuels.
      </p>
      <p>Because of the extensive research being carried out, the domain itself is a
very rich source of information. Most of this knowledge is, however, buried in
images, spreadsheets, proprietary data sources and grey literature that are not
readily machine accessible or interpretable. A critical limitation that has been
identified is the lack of a knowledge-level infrastructure capable of providing
semantic grounding to algal biomass datasets so that they can be interlinked,
shared and reused within the biomass community.</p>
      <p>Integrating algal biomass datasets to enable knowledge representation and
reasoning requires a technology infrastructure based on formalised and shared
vocabularies. Stakeholders in the domain who would benefit from such a
structured, unambiguous and machine-interpretable representation of data include
researchers, algae producers and users, biofuel producers, oil companies, the
airline, automotive and aerospace industries, national public authorities,
international organisations and NGOs, amongst others.</p>
      <p>In this paper, we present LEAPS<sup>6</sup>, a Semantic Web/Linked data framework
for the representation and visualisation of knowledge in the domain of algal
biomass. One of the main goals of LEAPS is to enable the stakeholders of the
algal biomass domain to interactively explore, via linked data, potential algal
sites and sources of their consumables across NUTS (Nomenclature of Units for
Territorial Statistics)<sup>7</sup> regions in North-Western Europe.</p>
      <p>Some of the objectives of LEAPS are:
        <list list-type="bullet">
          <list-item><p>motivating the use of Semantic Web technologies and LOD for the algal biomass domain;</p></list-item>
          <list-item><p>laying out a set of ontological requirements for knowledge representation that support the publication of algal biomass data;</p></list-item>
          <list-item><p>elaborating on how algal biomass datasets are transformed to their corresponding RDF model representation;</p></list-item>
          <list-item><p>interlinking the generated RDF datasets along spatial dimensions with other datasets on the Web of data;</p></list-item>
          <list-item><p>visualising the linked datasets via an end-user LOD REST Web service;</p></list-item>
          <list-item><p>visualising the scientific classification of the algae species as large network graphs.</p></list-item>
        </list>
      </p>
      <p>The paper is structured as follows: Section 2 presents a brief overview of the
dataset transformation process. Section 3 presents a description of the system
architecture. Section 4 presents an overview of the querying mechanism
underlying the LEAPS interface.</p>
    </sec>
    <sec id="sec-2">
      <title>LEAPS Datasets</title>
      <p>The transformation of the raw datasets to linked data takes place in two steps.
In the first step, the initial data processing and the potential calculation are performed
in a GIS-based model that was developed for this purpose using ArcGIS<sup>8</sup> 9.3.1.</p>
      <p>The second step, lifting the data from XML to RDF, is carried out using
a bespoke parser that exploits XPath<sup>9</sup> to selectively query the XML datasets
and generate linked data using the ontologies. In most cases, XML datasets are
transformed to their linked data counterparts by assuming a simplistic
one-to-one mapping between the XML elements and RDF entities. In our
scenario, the original data sources had several limitations and a one-to-one
transformation was not possible. To produce a linked data representation
that directly interlinks the resources for sites, sources, pipelines
and region potential with each other and with the NUTS regions in which they
are located, we implemented a bespoke parser that uses a complex underlying
data structure to facilitate the transformation.</p>
      <p>6 http://www.semanticwebservices.org/enalgae
7 http://bit.ly/I7y5st
8 http://www.esri.com/software/arcgis/index.html
9 http://www.w3.org/TR/xpath/</p>
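      <p>The lifting step described above can be illustrated with a minimal sketch. It is not the LEAPS parser: the XML element and attribute names, the <monospace>example.org</monospace> namespaces, the predicate names and the NUTS URI pattern are all assumptions made for illustration. It only shows the general idea of querying the XML with path expressions and keeping an intermediate index so that pipelines can be linked to sites and sources declared elsewhere in the file.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical input: element and attribute names are illustrative only.
SAMPLE_XML = """
<dataset>
  <site id="s1" nuts="UKG31"><name>Site One</name></site>
  <source id="c1" nuts="UKG31"><name>Plant A</name></source>
  <pipeline id="p1" from="c1" to="s1"/>
</dataset>
"""

BASE = "http://example.org/leaps/"         # placeholder resource namespace
VOCAB = "http://example.org/leaps/vocab#"  # placeholder vocabulary
NUTS = "http://nuts.geovocab.org/id/"      # assumed NUTS URI pattern


def uri(kind, ident):
    """Build an N-Triples URI term for a resource of the given kind."""
    return f"<{BASE}{kind}/{ident}>"


def triple(subject, predicate, obj):
    """Serialise one statement as an N-Triples line."""
    return f"{subject} <{VOCAB}{predicate}> {obj} ."


def lift(xml_text):
    """Lift the XML into a list of N-Triples lines."""
    root = ET.fromstring(xml_text)
    index = {}    # intermediate structure: local id -> resource URI
    triples = []
    for kind in ("site", "source"):
        for el in root.findall(f"./{kind}"):
            subject = uri(kind, el.get("id"))
            index[el.get("id")] = subject
            # Literal escaping is omitted in this sketch.
            triples.append(triple(subject, "name", f'"{el.findtext("name")}"'))
            triples.append(triple(subject, "locatedIn", f"<{NUTS}{el.get('nuts')}>"))
    for el in root.findall("./pipeline"):
        subject = uri("pipeline", el.get("id"))
        # Resolve the references through the index rather than element by element.
        triples.append(triple(subject, "connectsSource", index[el.get("from")]))
        triples.append(triple(subject, "connectsSite", index[el.get("to")]))
    return triples


if __name__ == "__main__":
    print("\n".join(lift(SAMPLE_XML)))
```
]]></preformat>
      <p>The index is what distinguishes this from a one-to-one mapping: a pipeline element carries only local identifiers, and the links to the site and source resources are produced by looking those identifiers up.</p>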
      <p>The transformation process yielded four datasets, which were stored in
distributed triple store repositories: biomass production sites, CO2 sources, pipelines
and region potential. We stored the datasets in separate repositories to simulate
the realistic scenario of these datasets being made available by distinct and
dedicated dataset providers in the future. Although a linked data representation of the
NUTS regions data<sup>10</sup> was already available, there was no SPARQL endpoint or
service to query the dataset for region names. We therefore retrieved the dataset dump and
curated it in our local triple store as a separate repository. The NUTS dataset
was required to link the biomass production sites and the CO2 sources to the
regions where they would be located and to the dataset about the region potential
of biomass yields. The transformed datasets interlink resources defining sites,
CO2 sources, pipelines, regions and NUTS data using link predicates defined in
the ontology network.</p>
      <p>Datasets about algae cultivation can become more meaningful and useful to
the biomass community if they are integrated with datasets about algal strains.
This can help plant operators make informed decisions about which
strain to cultivate at a specific geospatial location. Algaebase<sup>11</sup> provides the
largest online database of algae information. While Algaebase does not make
RDF versions of its datasets directly available through its website, they can
be programmatically retrieved via their LSIDs (Life Science Identifiers) from
the LSID Web resolver<sup>12</sup> made available by the Biodiversity Information Standards
(TDWG)<sup>13</sup> working group.</p>
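      <p>As a rough illustration of LSID-based retrieval, the sketch below validates an LSID of the general form <monospace>urn:lsid:authority:namespace:object[:revision]</monospace> and builds a resolver address for it. The <monospace>taxname</monospace> namespace, the object identifier and the convention of appending the LSID to the resolver base address are assumptions made for this example, not details taken from the LEAPS retrieval algorithm. No network request is made.</p>
      <preformat><![CDATA[
```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID into its components, raising ValueError if malformed."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:5])  # authority, namespace and object must be non-empty
    )
    if not well_formed:
        raise ValueError(f"not a valid LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


RESOLVER = "http://lsid.tdwg.org/"  # resolver base address from the paper's footnote


def resolver_url(lsid_text):
    """Build the address to request; the URL layout is an assumption."""
    parse_lsid(lsid_text)  # validate before building the address
    return RESOLVER + lsid_text


if __name__ == "__main__":
    example = "urn:lsid:algaebase.org:taxname:12345"  # hypothetical identifier
    print(parse_lsid(example))
    print(resolver_url(example))
```
]]></preformat>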
      <p>We retrieved RDF metadata for 113061 species of algae<sup>14</sup> and curated it in our
triple store. We then used the Semantic Import plugin for Gephi to visualise
the biological taxonomy of the algae species.</p>
    </sec>
    <sec id="sec-3">
      <title>System Description</title>
      <p>LEAPS provides an integrated view over multiple heterogeneous datasets of
potential algal sites and sources of their consumables across NUTS regions in
North-Western Europe. Figure 1 illustrates the conceptual architecture of LEAPS. The
main components of the application are:
        <list list-type="bullet">
          <list-item><p>Parsing modules: As shown in Figure 1, the parsing modules are responsible for lifting the data from their original formats to RDF. The lifting process takes place in two stages to ensure uniformity in transformation.</p></list-item>
          <list-item><p>Linking engine: The linking engine, along with the bespoke XML parser, is responsible for producing the linked data representation of the datasets. The linking engine uses ontologies, dataset-specific rules and heuristics to generate interlinking between the five datasets. From the LOD cloud, we currently provide outgoing links to DBpedia<sup>15</sup> and Geonames<sup>16</sup>.</p></list-item>
          <list-item><p>Triple store: The linked datasets are stored in a triple store. We use OWLIM SE 5.0<sup>17</sup>.</p></list-item>
          <list-item><p>Web services: Several REST Web services have been implemented to provide access to the linked datasets.</p></list-item>
          <list-item><p>SPARQL endpoints: SPARQL endpoints that provide access to individual dataset repositories are available. Snorql has been customised as the front end for the endpoint. An endpoint for federated queries is planned as part of future work.</p></list-item>
          <list-item><p>Ontologies: A suite of OWL ontologies for the algal biomass domain has been designed and made available.</p></list-item>
          <list-item><p>Interfaces: The Web interface provides an interactive way to explore various facets of sites, sources, pipelines, regions, ontologies and SPARQL endpoints. The map visualisation has been rendered using Google Maps. Besides the SPARQL endpoint and the interactive Web interface, a REST client has been implemented for access to the datasets. Query results are available in RDF/XML, JSON, Turtle and XML formats.</p></list-item>
          <list-item><p>Biological taxonomy visualisation: A subset of the Algaebase database, which is the largest information source on algae on the Web, has been retrieved and curated in our triple store. When integrated with the dataset of algal cultivation sites, this dataset can inform stakeholders about the strains of algae that can be harvested at a given site. Further, the Semantic Import plugin<sup>18</sup> of Gephi<sup>19</sup> has been exploited to visualise the biological taxonomy of algae. This visualisation is also made available via the LEAPS interface.</p></list-item>
        </list>
      </p>
      <p>10 http://nuts.geovocab.org/
11 http://www.algaebase.org/about/
12 http://lsid.tdwg.org/
13 http://www.tdwg.org/
14 The retrieval algorithm ran on an Ubuntu server for three days.
15 http://dbpedia.org/About
16 http://sws.geonames.org/
17 http://www.ontotext.com/owlim/editions</p>
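      <p>To make the access path through a SPARQL endpoint concrete, the following sketch prepares a query request in the style of the SPARQL Protocol and selects a result format through the HTTP Accept header. The endpoint address, the vocabulary namespace and the class and property names in the query are placeholders rather than the actual LEAPS identifiers, and the request is only constructed, not sent.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/leaps/sparql"  # placeholder endpoint address

# Standard media types for the four result formats named in the text.
# RDF/XML and Turtle apply to CONSTRUCT/DESCRIBE results,
# JSON and XML to SELECT/ASK results.
ACCEPT = {
    "rdfxml": "application/rdf+xml",
    "json": "application/sparql-results+json",
    "turtle": "text/turtle",
    "xml": "application/sparql-results+xml",
}

# Hypothetical query: the ex: vocabulary is illustrative only.
QUERY = """
PREFIX ex: <http://example.org/leaps/vocab#>
SELECT ?site ?region WHERE {
  ?site a ex:Site ;
        ex:locatedIn ?region .
} LIMIT 10
"""


def build_request(query, fmt="json", endpoint=ENDPOINT):
    """Prepare (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": ACCEPT[fmt]})


if __name__ == "__main__":
    request = build_request(QUERY, fmt="xml")
    print(request.get_method(), request.full_url)
    print("Accept:", request.get_header("Accept"))
```
]]></preformat>
      <p>Passing the prepared request to <monospace>urllib.request.urlopen</monospace> would execute the query against a live endpoint.</p>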
    </sec>
    <sec id="sec-4">
      <title>Application access</title>
      <p>LEAPS<sup>20</sup> is available on the Web. The interface currently provides visualisation
and navigation of the algae cultivation datasets in the way most intuitive for
phycologists. The application has been demonstrated to several stakeholders of
the community at various algae-related workshops and congresses. They
found the navigation very useful and made suggestions for future dataset
aggregation. At the time of writing, data retrieval is relatively slow for some
queries because of their federated nature; however, optimisation work on the
retrieval mechanism is in progress to enable faster retrieval of information.</p>
      <p><bold>Acknowledgments</bold> The research described in this paper is partly supported by the Energetic Algae project
(EnAlgae), a four-year Strategic Initiative of the INTERREG IVB North West Europe
Programme.</p>
      <p>18 http://wiki.gephi.org/index.php/SemanticWebImport
19 https://gephi.org/
20 http://www.semanticwebservices.org/enalgae</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A. H. Claire</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Research needs in ecosystem services to support algal biofuels, bioenergy and commodity chemicals production in the UK</article-title>
          .
          <source>Technical report, NNFCC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Oilgae</surname>
          </string-name>
          .
          <article-title>Oilgae comprehensive report, energy from algae: Products, market, processes and strategies</article-title>
          .
          <source>Technical report, Oilgae</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Powell</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Nevels</surname>
          </string-name>
          .
          <article-title>Algae feeding in humans</article-title>
          .
          <source>Journal of Nutrition</source>
          ,
          <year>1961</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. U.S. Department of Energy.
          <source>National Algal Biofuels Technology Roadmap</source>. Technical report, accessed June
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>