Linked Environment Data

                            Getting Things Connected


                          Thomas Bandholtz (1), Joachim Fock (2)

              (1) innoQ Deutschland GmbH, Monheim am Rhein, Germany
                             thomas.bandholtz@innoq.com
              (2) Federal Environment Agency (UBA), Dessau-Roßlau, Germany
                                 joachim.fock@uba.de


       Abstract. After three years of discussion and early prototypes, the Federal
       Environment Agency (UBA), Germany, has now launched a two-year research &
       development project on Linked Environment Data (LED) with innoQ Deutschland
       GmbH as contractor. This project will set up a core cloud of environment
       data with a well-elaborated domain terminology as its semantic backbone. Data
       will be taken from the “Environmental Specimen Bank”, the “German Metadata
       Portal on Soil” and further databases such as the “Joint Substance Data Pool of
       the German Federal Government and the German Federal States” as well as the
       environmental library and research databases. The infrastructure will support a
       sustainable process of keeping the data permanently up to date, and there will
       be a dynamic and intuitive user interface. All the work will be fully Semantic
       Web compliant, based on vocabularies such as SKOS, SCOVO or Data Cube,
       and Dublin Core.


       Keywords. Environmental protection, domain terminology, observation data,
       linking open data.


1      Introduction

Networking among comprehensive observation data and domain terminology has
been a basic concern of the UBA since the 1990s with various project generations
(named UMPLIS, UDK, GEIN, SNS and PortalU). All these implementations so far
have two common weaknesses:

 • The linkage established by these systems has connected data containers (databases,
   information systems, complex Web pages) but not individual data records.
 • There was no shared data structure to be accessed for exploitation, so that every
   link ended, so to speak, at the door of the referenced database, at best on a
   Web page describing the respective data access.
Linked Data, however, stands for linking individual data records that can be easily
dereferenced. Tim Berners-Lee summarized its four principles as early as 2006
[1]:
1. “Use URIs as names for things
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards
   (RDF*, SPARQL)
4. Include links to other URIs. so that they can discover more things.”

In 2009 he added a “5 star rating” to make this clearer and to acknowledge the
Linking Open Data movement:

        *    “Available on the web (whatever format) but with an open license, to be
             Open Data
       **    Available as machine-readable structured data (e.g. excel instead of im-
             age scan of a table)
     ***     as (2) plus non-proprietary format (e.g. CSV instead of excel)
    ****     All the above plus, Use open standards from W3C (RDF and SPARQL)
             to identify things, so that people can point at your stuff
    *****    All the above, plus: Link your data to other people’s data to provide
             context”

Here we see that Linked Data was envisioned without an explicit requirement of
“openness”, and Linked Data can indeed be applied perfectly well within closed
communities.
    The environmental authorities in Europe have a strong tradition of publishing open
data, as expressed by the Aarhus Convention [2] in 1998 and Directive
2003/4/EC on public access to environmental information [3] in 2003. These
authorities have therefore been providing 1-star, and in part 2- and 3-star, data for
years. While some discussion certainly remains about the legal limitations of this
openness, the real novelty in this domain is the “Linked” aspect, which has been
described in more depth by Tom Heath and Chris Bizer in 2011 [4].
    The vision of Linked Environment Data came up at the eTerminology workshop
[5] at the e-Envi conference in Prague in March 2009 and was elaborated during the
5th Ecoterm meeting [6] in Rome in October of the same year.
    In 2010 the European Environment Agency made the General Environmental
Multilingual Thesaurus (GEMET)1 and the European Nature Information System
(EUNIS)2 available as Linked Open Data, followed by the Environmental
Applications Reference Thesaurus (EARTh)3 provided by Istituto Inquinamento
Atmosferico in Italy. In December there was a 2-day Ecoinformatics International
Webinar on Linked Open Data4. LED was also discussed by the W3C eGovernment
Interest Group5 and was the topic of several conference contributions.

1 http://www.eionet.europa.eu/gemet/
2 http://eunis.eea.europa.eu
3 http://uta.iia.cnr.it/earth_eng.htm

   In 2011, the German “Umwelt-Thesaurus” UMTHES6 was published as
Linked Data as well, together with a (strictly non-open) species taxonomy used in the
context of substance approval. There was an early (open) Linked Data test-bed of the
German Environmental Specimen Bank (ESB)7, which was not deployed into production.
The annual EnviroInfo8 conference hosted a full-day session on “Linked Open Data,
Semantic Search and Interoperability”, and there will be a follow-up in 2012: “Linked
Environment Data – Getting Things Connected”.
   However, these early implementations have been rather scattered and have focused
predominantly on domain terminology rather than on observation data. The “Use Case
Crosslinking Environment Data and the Library”9 says about the German
contributions: “The most prominent obstacle is the lack of a dedicated funding for this
initiative. There are some projects of the participating systems that draw up some of
their budget for pieces of the puzzle, but there is no overall plan of the agency so far.”
   This use case drafts a scenario in which observation and library data are cross-linked
with each other and with the domain terminology. The scenario has been taken up by the
Linked Environment Data research & development project (UFOPLAN 3712 12 100),
which the German agency has finally launched at the time of writing.


2      Strategic Issues of the LED Project

2.1    Master Plan and Project Portfolio
By the end of 2012 there will be a master plan, coordinated with all stakeholders,
which provides a strategic foundation beyond the boundaries of the two-year project.
There will be prioritised work packages, some of which may already be implemented
in 2012.
   The overall portfolio will depend heavily on how far the corresponding projects
can work on their interfaces themselves or have to delegate this to the LED project.
At present we cannot make firm assumptions about this.
   In any case we aim for a more or less comprehensive pilot system (or pilot
cloud) that makes the aspired “added information value through interlinked data” a
real experience. Moreover, there must be a demonstration of how the standardised
RDF interfaces and the LED workbench simplify the integration of further data.


4 http://projects.eionet.europa.eu/ecoinformatics/library/ecoinformatics_indicator/meeting_6-7122010
5 http://www.w3.org/egov/wiki/Linked_Environment_Data
6 http://data.uba.de
7 http://umweltprobenbank.de
8 http://www.ec-gis.org/Workshops/EnviroInfo2011
9 http://www.w3.org/2005/Incubator/lld/wiki/Use_Case_Crosslinking_Environment_Data_and_the_Library

2.2    Project Infrastructure
During the first month we will decide on the project infrastructure together with the
computer centre of the agency. It will consist of:

 • Production system with human/machine interface (content negotiation)
 • Triple store
 • Registry based on the Vocabulary of Interlinked Datasets (VoID)10
 • Cross-database data retrieval client
 • (Geo-)graphic visualisation services
 • Workbench with tools enabling RDF interfaces and data linking
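
The content negotiation named in the first item can be illustrated with a minimal Python sketch. It assumes a hypothetical server that offers HTML for human readers and Turtle or RDF/XML for machines. The list of supported media types and the fallback to the first entry are assumptions made for illustration; a production system might instead answer with HTTP 406, and wildcard ranges such as text/* are not handled here.

```python
# Media types a hypothetical LED server could offer (an assumption, not a
# project decision). The first entry doubles as the default representation.
SUPPORTED = ["text/html", "text/turtle", "application/rdf+xml"]


def parse_accept(header):
    """Return (media_type, q) pairs from an HTTP Accept header, best first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # the quality value defaults to 1 when absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(header, supported=SUPPORTED):
    """Pick the supported media type the client prefers most."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return supported[0]  # simplification: fall back instead of refusing
```

A browser sending "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8" would receive the HTML page, while a crawler sending "text/turtle" would receive RDF for the same URI.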

One special part of this infrastructure is iQvoc11, an open source terminology
management tool that we have developed jointly over the last two years.
   All this is glued together by a careful selection and extension of standardised RDF
vocabularies such as VoID, SKOS12, SCOVO13 or Data Cube14, which are
“understood” and interpreted by the machine.
   The registry will know which participant uses which standard and can even
describe local extensions, so that code extensions are not necessary. Of course, the
freedom for such extensions is limited, and these limits need to be defined and
communicated.
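
As an illustration of what such a registry could hold, the following Python sketch keeps a few dataset descriptions in memory, answers the question "which participant uses which standard", and renders the entries with the VoID terms void:Dataset and void:vocabulary. The dataset URIs under example.org and the registry layout are hypothetical; only the VoID, Dublin Core, SKOS and Data Cube namespace URIs are real.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOID = "http://rdfs.org/ns/void#"
DCT = "http://purl.org/dc/terms/"

# Hypothetical registry content; the dataset URIs are placeholders.
REGISTRY = {
    "http://example.org/dataset/esb": {
        "title": "Environmental Specimen Bank",
        "vocabularies": ["http://purl.org/linked-data/cube#"],
    },
    "http://example.org/dataset/umthes": {
        "title": "UMTHES",
        "vocabularies": ["http://www.w3.org/2004/02/skos/core#"],
    },
}


def datasets_using(vocabulary, registry=REGISTRY):
    """Return the URIs of all registered datasets that declare a vocabulary."""
    return sorted(
        uri for uri, entry in registry.items()
        if vocabulary in entry["vocabularies"]
    )


def to_ntriples(registry=REGISTRY):
    """Render the registry as N-Triples lines (literal escaping is omitted)."""
    lines = []
    for uri, entry in sorted(registry.items()):
        title = entry["title"]
        lines.append(f"<{uri}> <{RDF_TYPE}> <{VOID}Dataset> .")
        lines.append(f'<{uri}> <{DCT}title> "{title}" .')
        for vocabulary in entry["vocabularies"]:
            lines.append(f"<{uri}> <{VOID}vocabulary> <{vocabulary}> .")
    return lines
```

A generic client could call datasets_using() with the SKOS namespace to find every terminology source it is able to display without system-specific code.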


2.3    Integration and Extension of Existing Approaches
The existing LED prototypes of the agency have to be aligned with the LED master
plan. They all include native methods for RDF data rendering and can synchronise
with a triple store incrementally. However, these methods were developed for the
prototypes and need to be revisited, refactored, and extended. The same applies to the
RDF formats and the linkage.

Environmental Specimen Bank (ESB)

The Environmental Specimen Bank records the accumulation of (harmful) substances
in defined samples at certain locations and times. However, the ESB itself is not
responsible for the comprehensive description of all relevant elements, so specialized
information should be referenced instead. For substances such data is provided by
GSBL, for species there is EUNIS, and for locations and times there are the SNS geo
thesaurus and environmental chronicle, respectively. The environmental thesaurus
(UMTHES) provides an overarching envelope, which is in turn linked with the
international GEMET.


10 http://www.w3.org/TR/void/
11 https://github.com/innoq/iqvoc
12 http://www.w3.org/2004/02/skos/
13 http://vocab.deri.ie/scovo
14 http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html

   In the early test-bed the ESB data model was represented in SCOVO, but today we
are considering the Data Cube vocabulary instead; this still needs to be decided. Some
extensions are required to represent the domain-specific dimensions (specimen,
analyte, location). Each record in the ESB can link directly to the information from
those specialized systems. Ideally those provide a back-reference, enabling two-way
navigation.
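
To make the idea of a linked observation record concrete, here is a minimal Python sketch of one ESB measurement expressed with the Data Cube terms qb:Observation and qb:dataSet. The base URI, the dimension and measure properties, and all target URIs are placeholders invented for this example; the actual extension vocabulary remains to be designed.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
EX = "http://example.org/esb/"  # placeholder base URI, not the real ESB


def observation_triples(obs_id, dataset, dimensions, value):
    """Build triples for one observation; dimension values are external URIs."""
    subject = f"{EX}observation/{obs_id}"
    triples = [
        (subject, RDF_TYPE, QB + "Observation"),
        (subject, QB + "dataSet", dataset),
    ]
    # Hypothetical domain-specific dimensions: specimen, analyte, location.
    for name, target in sorted(dimensions.items()):
        triples.append((subject, f"{EX}dimension/{name}", target))
    triples.append((subject, f"{EX}measure/concentration", float(value)))
    return triples


def to_ntriples(triples):
    """Serialise triples; float objects become xsd:double literals."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, float):
            obj = f'"{o}"^^<{XSD_DOUBLE}>'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return lines
```

Each dimension value is a URI in another system (a substance, a species, a place), which is what lets a client follow the link from the record to the specialized description.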
   In addition to the information systems mentioned so far, there are numerous
specialized systems operated independently of governmental agencies, e.g. Chemical
Entities of Biological Interest (ChEBI)15 or GeoNames16. Whether those should be
referenced is merely a matter of policy; the technical opportunity exists.

Semantic Network Service (SNS)
SNS17 has been developed since 2001 based on ISO Topic Maps18 and the XML Topic
Maps interface. Unfortunately, the Topic Maps community has rejected a fusion
with the Semantic Web, which means we have to abandon their paradigm.
   SNS includes a thesaurus, a gazetteer, and a chronicle. The thesaurus has already
been implemented based on iQvoc, the Simple Knowledge Organization System
(SKOS) and the complementing “eXtension for Labels” (SKOS-XL). The gazetteer is
currently being implemented in a similar way, combining SKOS and the GeoNames
Ontology. The chronicle will have to follow, based on SKOS and the Linked Events
Ontology19.
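
The difference between plain SKOS labels and SKOS-XL can be sketched in a few lines of Python. In SKOS-XL a label is a resource of its own, attached with skosxl:prefLabel and carrying its text in skosxl:literalForm, and the plain skos:prefLabel follows from chaining these two properties. The concept and label URIs below are placeholders and do not reflect the URI scheme used by iQvoc or UMTHES.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"


def concept_with_xl_label(concept, label, text, lang):
    """Describe a concept whose preferred label is a resource in its own right."""
    return [
        (concept, RDF_TYPE, SKOS + "Concept"),
        (concept, SKOSXL + "prefLabel", label),
        (label, RDF_TYPE, SKOSXL + "Label"),
        # A literal is modelled here as a (text, language) tuple.
        (label, SKOSXL + "literalForm", (text, lang)),
    ]


def derive_pref_labels(triples):
    """Derive plain skos:prefLabel triples from the SKOS-XL label resources."""
    literal_forms = {
        s: o for s, p, o in triples if p == SKOSXL + "literalForm"
    }
    return [
        (s, SKOS + "prefLabel", literal_forms[o])
        for s, p, o in triples
        if p == SKOSXL + "prefLabel" and o in literal_forms
    ]
```

Because the label has its own URI, further statements (for example an editorial note or a link to a translation) can be attached to the label itself rather than to the concept.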


2.4     Data Lifting
Most databases at the Agency are not able to render RDF, and many of them do not
even have a defined interface such as a Web Service or CSV export. We have to take
some examples and look for reference solutions for typical cases.
   One example should be the library metadata system, which is also used to describe
research projects. This legacy system is no longer maintained and may be replaced
in the future, possibly based on an RDF representation of the data. It provides a
classical OPAC interface, and this may be the key to accessing the data from outside.
   Another example is the already mentioned GSBL, which has a Web Service
interface that supplies its Web client with data and may serve LED as well.
   Currently under development is the Soil Metadata Portal, which will include an
INSPIRE20 compliant Catalogue Service for the Web (CSW). This year’s INSPIRE
conference, which will take place in Istanbul at the end of June, will host a tutorial on
Geographical Linked Data21, and we will carefully observe the patterns presented
there, as implementing INSPIRE through Linked Data is not yet regulated (and in
INSPIRE everything has to be regulated).

15 http://www.ebi.ac.uk/chebi/
16 http://www.geonames.org/
17 http://www.semantic-network.de
18 http://isotopicmaps.org/
19 http://linkedevents.org/ontology/
20 http://inspire.jrc.ec.europa.eu/
21 http://datalift.org/en/node/21

   If there is absolutely no existing data interface, we have to go down to the physical
data model and use D2RQ22, although most of the legacy data models are badly
documented and rather cryptic.
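
The principle behind lifting data straight from the physical model can be shown with a small Python sketch. This is not D2RQ and does not use its mapping language; it is a hand-written stand-in that applies the same basic idea (one resource per row, one property per column) to an SQLite table. The base URI, the table and the column names are invented for the example.

```python
import sqlite3

BASE = "http://example.org/legacy/"  # placeholder base URI


def lift_table(conn, table, key_column):
    """Turn each row of a table into triples: one subject per row,
    one predicate per column. NULL values produce no triple."""
    # The table name is interpolated directly, so it must come from a
    # trusted mapping definition, never from user input.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            triples.append((subject, f"{BASE}vocab/{table}#{column}", value))
    return triples
```

The generated predicates inherit the cryptic column names, which is exactly where the documentation problem mentioned above shows up: a meaningful mapping to SKOS or Data Cube terms still requires someone who knows what each column means.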


2.5       Front End
At this point we will have millions (or even billions) of HTTP URIs that can be
resolved in RDF, we have links to be followed, and we have a SPARQL endpoint. This
is not enough to convince humans (and especially decision makers) of any added
information value. We need a human-oriented interface so they can explore the data
and visualize the results in tables, diagrams, and maps. This should be generic enough
to work on any data that conforms to the supported standards (SKOS, SCOVO, etc.),
but should also be specific enough to compete with the native user interfaces of the
integrated systems.
   Some of these systems have very elaborate interfaces dealing with all the
subtleties of their respective data models, and we will have to leave some of this to
them. We cannot go into every individual detail, but we offer a transparent integration
point for all.
   As the registry knows all the properties, and notably which properties link between
databases, it should be possible to demonstrate walk-throughs such as starting with a
specimen in the ESB, looking up the characteristics of the observed substance in the
GSBL, and then retrieving all the soil observation programs that deal with the same
substance and perhaps share a location with the ESB specimen. This is something that
decision makers have envisioned many times, but it has never come true.
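
Such a walk-through can be sketched over a handful of in-memory triples. In the real system this would more likely be a SPARQL query or a series of URI look-ups; plain Python is used here only to show the navigation pattern. All identifiers, prefixes and property names below are invented for the example.

```python
# Toy data: an ESB specimen, a GSBL substance and two soil programmes.
TRIPLES = {
    ("esb:specimen/1", "ex:analyte", "gsbl:substance/lead"),
    ("esb:specimen/1", "ex:location", "sns:place/saar"),
    ("gsbl:substance/lead", "ex:label", "Lead"),
    ("soil:programme/7", "ex:analyte", "gsbl:substance/lead"),
    ("soil:programme/7", "ex:location", "sns:place/saar"),
    ("soil:programme/9", "ex:analyte", "gsbl:substance/lead"),
    ("soil:programme/9", "ex:location", "sns:place/elbe"),
}


def objects(triples, subject, predicate):
    """All objects reachable from a subject via a predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    """All subjects pointing at an object via a predicate."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def walk_through(triples, specimen):
    """From a specimen, find soil programmes observing the same substance
    and flag whether they also share a location with the specimen."""
    places = objects(triples, specimen, "ex:location")
    result = []
    for substance in objects(triples, specimen, "ex:analyte"):
        for programme in subjects(triples, "ex:analyte", substance):
            if not programme.startswith("soil:"):
                continue  # skip the specimen itself and other sources
            shared = bool(places & objects(triples, programme, "ex:location"))
            result.append((programme, substance, shared))
    return sorted(result)
```

For the toy data, walk_through(TRIPLES, "esb:specimen/1") lists both soil programmes for lead and marks only programme 7 as sharing the specimen's location.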


2.6       Sustainability
In the domain of environmental protection, sustainability is a strategic asset, and this
should also hold for information systems. Many of the systems we are talking about
have been in operation for 10 years or more, and the outcome of LED should
be able to do the same.
   In part this is an organisational matter that cannot be regulated by the LED
project, but the implementation can support easy continuation and evolution.
   Linked Data contributions that make data available once and then move on to the
next node will not survive. So the key issue is implementing self-updating interfaces,
either by direct live access to the native production data or by continuous incremental
one-way synchronisation into the LED triple store.
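
The incremental one-way synchronisation can be reduced to a set difference, as in the following Python sketch. It assumes that each source can deliver its current triples as a complete snapshot and that the triple store can be modelled as a set; both are simplifications made for illustration, and the function names are our own.

```python
def diff(previous, current):
    """Compare two snapshots of a source; return (added, removed) triples."""
    return current - previous, previous - current


def apply_delta(store, added, removed):
    """Apply a delta to the store in place. Data only ever flows from the
    source into the store, never back (one-way synchronisation)."""
    store.difference_update(removed)
    store.update(added)
    return store
```

Transferring only the delta keeps the store current without a full reload after every change in the production data.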
   The second key issue is the transparency of the integration workbench, so that
further systems can be easily integrated even after the LED project has been completed.


22 http://d2rq.org/

3      Summary and Conclusion

The launch of a dedicated R&D project by the German agency will raise the previous
LED initiatives to a new level by:

 • implementing a national core cloud with links to the EEA terminology and nature
   information system;
 • developing sustainable integration patterns and tools;
 • producing reusable software components that may be adopted by others;
 • establishing a comprehensive reference terminology on the national level;
 • providing an intuitive user interface on top of the most convenient RDF standards
   (SKOS, SCOVO …);
 • generating added information value by cross-database walk-through patterns.

As usual in research, we cannot anticipate the outcome in detail, and new ideas may
arise at any time during the contract period. Nevertheless, as the data is
provided by a governmental agency, LED will offer the public a reliable and always
up-to-date information source.


References

See also the Web links in the footnotes.

 1. Berners-Lee, T.: Linked Data. W3C Design Issues. (2006/9).
    http://www.w3.org/DesignIssues/LinkedData.html
 2. United Nations Economic Commission for Europe (UNECE): Convention on Access to
    Information, Public Participation in Decision-making and Access to Justice in
    Environmental Matters. (1998).
    http://www.unece.org/fileadmin/DAM/env/pp/documents/cep43e.pdf
 3. Directive 2003/4/EC of the European Parliament and of the Council of 28 January 2003 on
    public access to environmental information and repealing Council Directive 90/313/EEC.
    http://europa.eu/legislation_summaries/environment/general_provisions/l28091_en.htm
 4. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. (1st
    edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136.
    Morgan & Claypool (2011). http://linkeddatabook.com/editions/1.0/
 5. Bandholtz, T., Schleidt, K.: Summary of W4 eEnvironment Terminology. In: Hřebíček, J.,
    Hradec, J., Pelikán, E., Mírovský, O., Pillmann, W., Holoubek, I., Bandholtz, T. (Eds.):
    Proceedings of the European conference of the Czech Presidency of the Council of the EU
    TOWARDS eENVIRONMENT. Opportunities of SEIS and SISE: Integrating
    Environmental Knowledge in Europe. Prague (2009).
    http://www.e-envi2009.org/SummaryTerminologyW4.pdf
 6. Hodge, G.: Report on the Outcome of the Ecoterm V Workshop, U.N. Food and
    Agriculture Organization, Rome 5-6 October 2009. (2010).
    http://projects.eionet.europa.eu/ecoinformatics/library/ecoinformatics_indicator/ecoterm_5-6102009