
Elevating Natural History Museums' Cultural Collections to the Linked Data Cloud

Giannis Skevakis

skevakis@ced.tuc.gr

Konstantinos Makris

makris@ced.tuc.gr

Polyxeni Arapi

xenia@ced.tuc.gr

Stavros Christodoulakis

stavros@ced.tuc.gr

Laboratory of Distributed Multimedia Information Systems and Applications, Technical University of Crete (TUC/MUSIC), 73100 Chania, Greece

2013

40 51

An impressive abundance of high quality scientific content about Natural History and Biodiversity is produced in a distributed, open fashion by Natural History Museums (NHMs) using their own established standards and best practices. Managing publication of such richness and variety of content on the Web, and also supporting distributed, interoperable content creation processes, poses challenges that traditional publication approaches are not adequate to meet. The Natural Europe project offers a coordinated solution to those challenges at European level that aims to improve the availability, discoverability and relevance of environmental cultural content for education and life-long learning use, in a multilingual and multicultural context. Cultural heritage content is collected from six Natural History Museums around Europe into a federation of European Natural History Digital Libraries that is directly connected with Europeana. In this paper we present the architecture of the semantic infrastructure developed for the transition of the Natural Europe federation of NHMs' cultural repositories to the Semantic Web, as well as the methodology followed for ingesting and converting the NHMs' cultural heritage metadata into Linked Data.

Keywords: digital curation, preservation metadata, Europeana, Linked Data

1 Introduction

Cultural heritage and biodiversity data are syntactically and semantically heterogeneous, multilingual, semantically rich, and highly interlinked. They are produced in a distributed, open fashion by organizations like museums, libraries, and archives, using their own established standards and best practices [ 4 ]. As a result, an impressive abundance of high quality scientific content available around the world remains largely unexploited. Managing the publication of rich content on the Web and supporting distributed, interoperable content creation processes pose challenges that traditional publication approaches are not adequate to meet.

The Semantic Web and Linked Data offer a promising approach to address these problems. The Semantic Web standards and best practices provide a basis on which interoperable Web systems can be built in a well-defined manner. W3C recommendations like RDF(S), SKOS, SPARQL, and OWL are considered cornerstones for cross-domain and domain-independent interoperability. Moreover, the exploitation of common ontologies, taxonomies and published datasets makes the reuse of existing data possible. The exploitation of the aforementioned standards and practices in the Cultural Heritage domain and their adoption by collaborative tools allowing open content publishing on the Semantic Web lead to: (a) semantically richer content, (b) creation of large national and international Cultural Heritage portals, such as Europeana, (c) large open data repositories, such as the Linked Open Data Cloud, and (d) massive publications of linked library data [ 4 ].

The Natural Europe project [ 11 ] offers a coordinated solution at European level that aims to improve the availability and relevance of environmental cultural content for education and life-long learning use, in a multilingual and multicultural context. Cultural heritage content related to natural history, natural sciences, and nature/environment preservation, is collected from six Natural History Museums (NHMs) around Europe into a federation of European Natural History Digital Libraries, directly connected with Europeana. The Natural Europe project adopts and integrates the strong requirements for metadata management and interoperability with cultural heritage, biodiversity, and learning repositories. It offers appropriate tools and services that allow the participating NHMs to: (a) uniformly describe and semantically annotate their content according to international standards and specifications, as well as (b) interconnect their digital libraries and expose their Cultural Heritage Object (CHO) metadata records to Europeana.eu.

In this paper we present the architecture of the semantic infrastructure developed for the transition of the Natural Europe Cultural Digital Libraries Federation to the Semantic Web, as well as the methodology followed for ingesting and converting the NHMs’ cultural heritage metadata to Linked Data, supporting the Europeana Data Model (EDM) [ 2 ].

2 The Natural Europe Cultural Digital Libraries Federation and the Transition to the Semantic Web

In the context of Natural Europe, the participating NHMs provide metadata descriptions about a large number of Natural History related CHOs. These descriptions are semantically enriched with Natural Europe shared knowledge (vocabularies, taxonomies, etc.) using project provided annotation tools and services. The enhanced metadata are aggregated by the project, harvested by Europeana and exploited for educational purposes. The architecture of the Natural Europe Cultural Federation (Fig. 1) consists of the following components:

• The Natural Europe Cultural Environment (NECE) [ 6 ], which facilitates the complete metadata management lifecycle (i.e., ingestion, maintenance, curation, and dissemination) of CHO metadata and specifies how legacy metadata are migrated into Natural Europe. NECE provides (among others) the following tools and services for each participating NHM:
  ─ The MultiMedia Authoring Tool (MMAT) is a multilingual web-based management system for museums, archives and digital collections, which facilitates the authoring and metadata enrichment of CHOs. It employs modules for CHO/multimedia manipulation, persistency and vocabulary management.
  ─ The CHO Repository is the underlying repository of MMAT, responsible for the ingestion, maintenance and dissemination of both content and metadata. It is backed by an eXist XML database and exposes an OAI-PMH interface, able to disseminate metadata records complying with the Natural Europe CHO Application Profile.
  ─ The Vocabulary Management module enables access to taxonomic terms, vocabularies, and authority files (persons, places, etc.).
• The Natural Europe Cultural History Infrastructure (NECHI) [ 7 ] interconnects the NHM digital libraries and exposes their metadata records to Europeana.eu. It provides (among others) the following tools and services:
  ─ The Natural Europe Harvester manages the harvesting of metadata records provided by Natural Europe content providers. It employs modules for persistent identification, metadata transformation and metadata validation.
  ─ The Metadata Repository is the underlying repository of the Natural Europe Harvester, responsible for the maintenance of the harvested metadata records. It is backed by an RDBMS and exposes an OAI-PMH service interface, able to disseminate metadata records complying with Europeana Semantic Elements (ESE) [ 3 ].

Our aim was to develop a semantically rich cultural heritage infrastructure for NHMs, providing a Semantic Web perspective to the Natural Europe cultural content in terms of: (a) creating the Natural Europe Ontology in order to introduce semantics to the current Natural Europe Schema for inferring new knowledge, (b) using the RDF data model to publish the Natural Europe data on the Web, (c) linking the Natural Europe cultural content to external commonly used vocabularies, thesauri and published datasets, (d) enabling data retrieval through SPARQL, and (e) supporting interoperability with the Europeana Semantic Layer by offering the appropriate Europeana Data Model (EDM) [ 2 ] dissemination mechanisms.

In order to achieve the above objectives, the modules of the federated instances (NECE) and the federal node (NECHI) of the Natural Europe Cultural Federation have been enhanced with software components supporting the Semantic Web technologies. The Natural Europe RDF data are aggregated to the federal node in order to allow the inference of new knowledge from all NHM federated nodes. This data format allows the execution of domain-specific queries. Each federated node provides an RDF store which allows the retrieval of a single museum’s data through SPARQL, and enables the future connection with MMAT, which will be modified to provide up-to-date triples. Specifically, NECE has been enhanced with the following modules/functionality:

• The Vocabulary Server has been extended with published taxonomies expressed in RDF/SKOS format.
• The RDF Store is managed by the CHO Repository and keeps the triples generated from the conversion of the XML metadata to the RDF format.
• The SPARQL Endpoint, exposed by the CHO Repository, enables semantic queries on top of the triples stored in the RDF Store.
• The OAI-PMH Target has been refactored to support the harvesting of OAI-ORE packages by NECHI, which in turn allows the further exploitation of the data.

NECHI has been enhanced with the following modules/functionality:

• The RDF Store, managed by the Metadata Repository, keeps the harvested triples of NECE instances, along with knowledge inferred from the aggregated datasets.
• The SPARQL Endpoint, exposed by the Metadata Repository, allows external systems to query the aggregated data from the entire Natural Europe infrastructure.
• The OAI-PMH Target has been refactored to support the dissemination of the aggregated linked data to Europeana, when the respective Europeana services become available.

2.1 From the Natural Europe Schema to the Natural Europe Ontology

The Natural Europe data comply with the Natural Europe CHO Application Profile [ 10 ], which is a superset of the Europeana Semantic Elements (ESE) [ 3 ] metadata format. The Natural Europe CHO Application Profile describes the cultural heritage objects as records and consists of the following parts:

• The Cultural Heritage Object (CHO) information that provides metadata information about the analog resource or born digital object (specimen, exhibit, cast, painting, documentary, etc.). It is composed of the following sub-categories:
  ─ The Basic information holds general descriptive information (mostly scientific) about a Cultural Heritage Object.
  ─ The Species information holds information related to the species of a described specimen (animals, plants, minerals, etc.) in the context of Natural Europe.
  ─ The Geographical information contains metadata for the location in which the specimen has been collected.
• The Digital Object information that provides metadata information about a digital (photo, video, etc.) or digitized resource (scanned image, photo, etc.) in the context of Natural Europe. It is composed of the following sub-categories:
  ─ The Basic information deals with general descriptive information about a digital or digitized resource.
  ─ The Content information is related to the physical characteristics and technical information exclusive to a digital or digitized resource (URL, Format, etc.).
  ─ The Rights information describes the intellectual property rights and the accessibility to a digital or digitized resource.
• The Meta-metadata information that provides metadata information for a CHO record. These include the creator of the record, the languages that appear in the metadata, the history of the record during its evolution in the MMAT, etc.
• The Collection information that provides metadata information for logical groupings of contributed CHOs within a museum.

When creating a rich cultural heritage infrastructure that aims to provide a Semantic Web perspective to the Natural Europe cultural content, it is not sufficient to use a flat Schema or a Schema providing weak semantics. With this objective in mind we described the Natural Europe Schema as an OWL Ontology, exploiting the use of class and property axioms in order to enable the inference of new knowledge out of the existing data. Notions such as CHO, CHO collection, specimen, observation, multimedia object, person, and organization have been described as OWL classes, while the underlying attributes have been described using object and datatype properties. As a result, the contributed flat Natural Europe records can be organized in aggregations of different kinds of objects, e.g., a specimen may be described by multiple observations and an observation may contain multiple multimedia objects.
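As a rough illustration of describing such notions as OWL classes and object properties, the following sketch emits a few declarations in Turtle syntax. The namespace `http://example.org/natural-europe#`, the property names `hasObservation` and `hasMultimediaObject`, and the subclass axiom are assumptions made for the example; the text does not give the actual IRIs or axioms of the Natural Europe Ontology.

```python
# Hypothetical namespace: the real Natural Europe Ontology IRIs are not given in the text.
NE = "http://example.org/natural-europe#"

# A few of the notions named in the text, modelled as OWL classes.
CLASSES = ["CHO", "Specimen", "Observation", "MultimediaObject"]

# (property, domain, range): hypothetical object properties expressing the
# aggregation "specimen -> observations -> multimedia objects".
OBJECT_PROPERTIES = [
    ("hasObservation", "Specimen", "Observation"),
    ("hasMultimediaObject", "Observation", "MultimediaObject"),
]


def ontology_to_turtle() -> str:
    """Serialize the class and property declarations as Turtle text."""
    lines = [
        f"@prefix ne: <{NE}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for cls in CLASSES:
        lines.append(f"ne:{cls} a owl:Class .")
    # Assumed class axiom: every specimen is also a CHO.
    lines.append("ne:Specimen rdfs:subClassOf ne:CHO .")
    for prop, domain, range_ in OBJECT_PROPERTIES:
        lines.append(
            f"ne:{prop} a owl:ObjectProperty ; "
            f"rdfs:domain ne:{domain} ; rdfs:range ne:{range_} ."
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(ontology_to_turtle())
```

Domain and range declarations of this kind are what let a reasoner type the objects found at either end of a property, which a flat record schema cannot express.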

The Natural Europe Ontology references other well-known Ontologies/Schemas (e.g., SKOS) and has been aligned with EDM, allowing any system supporting the Natural Europe Ontology to work seamlessly with other systems/organizations supporting EDM.

3 Vocabularies

Exposing data to the Linked Data cloud is not only about creating an Ontology with sufficient semantics and converting the data to RDF. The quality of the exposed Linked Data is measured by their linkage with already published, external data. To this end, we looked for external sources providing data that can be linked to our datasets. Most of the vocabularies that we came across already provided RDF data and ways to access/query them. Nevertheless, the Catalogue of Life (CoL), which is used extensively in the biodiversity context, did not expose any data in RDF format. To overcome this issue, we supported the publishing of its database to RDF (Section 3.1). In the context of the Natural Europe infrastructure, we chose the following services/datasets:

• GeoNames: Geographical database containing over 10 million geographical names and consisting of over 8 million unique features, including 2.8 million populated places, and 5.5 million alternate names. The GeoNames Ontology is described using OWL, exploiting its inferencing capabilities for extracting new knowledge. In addition, it supports interoperability by being mapped to several well-known ontologies, including schema.org, linkedgeodata.org, dbpedia.org, and INSEE. The GeoNames website offers numerous web services for searching any kind of information available in the system. The response data is available in multiple formats, like XML and JSON.
• DBpedia: A knowledge base describing more than 3.64 million things, including 764,000 persons, 573,000 places, and 202,000 species. The dataset has been mainly created by extracting structured data from Wikipedia and has been classified in a consistent cross-domain Ontology. The data can be accessed through web services using the provided SPARQL Endpoint, or the XML-based search API.
• Catalogue of Life (CoL): A comprehensive catalogue of all known species of organisms on Earth. It is compiled from 99 taxonomic databases from around the world, providing critical species information on: (a) synonymy, enabling the effective referral of alternative species names to an accepted name, (b) higher taxa, within which a species is clustered, and (c) distribution, identifying the global regions from which a species is known. It is being used by several Global Biodiversity Programmes including GBIF and EoL. The CoL website offers web services for searching the latest taxonomy, and the available response formats are JSON, XML and PHP-based.
• Uniprot: A comprehensive, high-quality and freely accessible database of protein sequence and functional information including, among others, a taxonomic classification, literature citations and keywords. The dataset is also available in the RDF format, conforming to the highly structured Uniprot OWL Ontology, while the Uniprot taxonomic classification has been described with SKOS. The Uniprot database can be downloaded or queried through the provided RESTful services.
• GEMET: A general multilingual thesaurus aimed at defining a common language and core terminology for the environment. GEMET’s data is available in SKOS (RDF/XML) format and can be either downloaded (footnote 10) or accessed through the provided RESTful and XML-RPC interfaces (footnote 11).

3.1 SKOSification of Catalogue of Life

The extensive use of taxonomies in the biodiversity domain calls for a formal way of describing complex vocabularies and taxonomies, in compliance with the Semantic Web standards. The most popular standard for describing these types of controlled vocabularies is SKOS (Simple Knowledge Organization System) [ 8 ]. SKOS is formally described as an OWL Full Ontology, providing the basic notions and semantics needed for describing knowledge in knowledge organization systems (KOS). Its use facilitates the semantic linkage of museum objects to well-established KOS, including GEMET and Uniprot, which have already been expressed in SKOS. Another system that is widely used in biological classification but is not available in SKOS format is the Catalogue of Life, which has been described above.

The current implementation of CoL provides a web-based system for browsing the taxonomy of the species, as well as services for searching, but lacks persistent URIs that external applications can reference and an RDF representation of its data. To address this, we have worked on a method of exposing the taxonomy of CoL as RDF, and more specifically SKOS, using the annual checklist, a downloadable package containing the relational database of the CoL. For the conversion of the CoL dataset to SKOS we used the D2R Server [ 1 ], which allows the publishing of relational databases in RDF format. The features of the SKOS model that we employed are: (a) the class Concept, and (b) the properties broader, narrower, prefLabel and altLabel. The first step was the representation of all the taxonomy nodes as Concepts. The scientific name of each node was transformed into a prefLabel, and the common names into altLabels. Finally, the hierarchy of the taxonomy was retained by connecting the parent and children nodes with the properties broader and narrower. An example of the CoL SKOSified data in the form of a graph is shown in Fig. 2.

10 http://www.eionet.europa.eu/gemet/rdf
11 http://taskman.eionet.europa.eu/projects/zope/wiki/GEMETWebServiceAPI
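The mapping of taxonomy nodes to SKOS can be sketched in a few lines. The conversion reported here was carried out with the D2R Server; the snippet below is not that setup but a plain Python stand-in that applies the same rules (node to Concept, scientific name to prefLabel, common names to altLabel, parent/child links to broader/narrower) and emits N-Triples. The row layout and the URI base `http://example.org/col/taxon/` are assumptions made for illustration.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/col/taxon/"  # hypothetical URI base for taxonomy nodes

# Hypothetical rows: (taxon_id, parent_id, scientific_name, common_names)
ROWS = [
    (1, None, "Canis", []),
    (2, 1, "Canis lupus", ["Gray wolf", "Grey wolf"]),
]


def literal(text):
    """Render a plain N-Triples literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rows_to_skos(rows):
    """Turn taxonomy rows into a list of N-Triples statements."""
    triples = []
    for taxon_id, parent_id, name, common_names in rows:
        node = f"<{BASE}{taxon_id}>"
        # Every taxonomy node becomes a skos:Concept.
        triples.append(f"{node} <{RDF_TYPE}> <{SKOS}Concept> .")
        # Scientific name -> prefLabel, common names -> altLabel.
        triples.append(f"{node} <{SKOS}prefLabel> {literal(name)} .")
        for common in common_names:
            triples.append(f"{node} <{SKOS}altLabel> {literal(common)} .")
        # The hierarchy is kept in both directions.
        if parent_id is not None:
            parent = f"<{BASE}{parent_id}>"
            triples.append(f"{node} <{SKOS}broader> {parent} .")
            triples.append(f"{parent} <{SKOS}narrower> {node} .")
    return triples


if __name__ == "__main__":
    print("\n".join(rows_to_skos(ROWS)))
```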

4 Methodology

The methodology for the transition of the Natural Europe Cultural Federation and cultural data to the Semantic Web and the Linked Data Cloud includes the following stages: (1) enrichment of Natural Europe metadata records with knowledge from well-known vocabularies and thesauri (e.g., GeoNames, DBpedia, GEMET and CoL/Uniprot), (2) conversion of metadata from XML to RDF, (3) connection of Natural Europe LOD node to the Linked Data cloud (RDF store, SPARQL endpoint), and (4) transition to EDM. These stages are described in the following sections.

4.1 Metadata Enrichment

Metadata enrichment is a crucial step in the production of rich Linked Data, especially when data already exist in legacy formats (Relational Databases, XML Databases, etc.). Existing data in these systems are rarely connected to external data because of the way the information is stored and because most of these systems were created long before the introduction of Open Data.

The Natural Europe datasets have been linked to the above vocabularies/thesauri by executing customized batch operations that exploit the services exposed by the datasets. More specifically, the spatial information of a Natural Europe CHO record that generally describes places is matched to place names in GeoNames. This provides unique references for places and enables spatial information enhancement with: (a) multiple multilingual versions of place names, (b) geographic coordinates and (c) broader geographic areas associated with the places. Unique references for places have also been retrieved from GeoNames by exploiting any available geographic coordinates associated with the CHO.

The CHO scientific names of the species information are matched to the accepted scientific names of CoL/Uniprot. By doing so, unique references to well-known taxonomic databases are established and scientific information regarding species common names, species distribution and literature citations is added to the CHO records.
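A batch matching step of this kind might look like the sketch below. It assumes a local index of taxa in place of the remote CoL/Uniprot services, and the record fields (`scientificName`, `taxonUri`, `acceptedName`, `commonNames`), the `Taxon` structure and the example URI are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    """One entry of a taxonomic source (hypothetical local stand-in for a service)."""
    uri: str
    accepted_name: str
    synonyms: list = field(default_factory=list)
    common_names: list = field(default_factory=list)


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so that trivial variants match."""
    return " ".join(name.lower().split())


def build_index(taxa):
    """Index every accepted name and synonym by its normalized form."""
    index = {}
    for taxon in taxa:
        for name in [taxon.accepted_name, *taxon.synonyms]:
            index[normalize(name)] = taxon
    return index


def enrich(record: dict, index: dict) -> dict:
    """Return a copy of the record with a taxon reference and extra names added."""
    taxon = index.get(normalize(record.get("scientificName", "")))
    if taxon is None:
        return record  # no match: the record is left unchanged
    enriched = dict(record)
    enriched["taxonUri"] = taxon.uri
    enriched["acceptedName"] = taxon.accepted_name
    enriched["commonNames"] = list(taxon.common_names)
    return enriched


if __name__ == "__main__":
    taxa = [Taxon("http://example.org/col/taxon/2", "Canis lupus",
                  synonyms=["Canis lupus Linnaeus, 1758"],
                  common_names=["Gray wolf"])]
    print(enrich({"scientificName": "canis  lupus"}, build_index(taxa)))
```

Exact matching after normalization is the simplest possible strategy; real names often need fuzzy matching or authority-string handling, which is one reason manual inspection of automatic results is useful.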

The scientific information of CHO records is further enriched with knowledge retrieved from the DBpedia database. To this end, the scientific names appearing in the Natural Europe CHO records are matched to DBpedia resources, providing links to external bibliographic references, as well as additional information such as the abstract description and conservation status of the CHO’s referred species.

The keywords describing CHOs and CHO collections are matched to terms in the GEMET thesaurus. This provides unique references for keywords, and enhances the CHO information with terms in multiple languages, as well as labels or references of broader terms. Apart from the use of external vocabularies and thesauri, person authority files have been created using information about CHO record creators and contributors. This way, information about persons in the CHO records has been replaced with references to the created authority files. It is worth noting that the authority file metadata are available in RDF, becoming resolvable and linkable by other external applications. In addition, information regarding CHO relations within each federated node’s repository is enriched by matching existing CHO records based on their scientific name. An example of a semantically enriched Natural Europe record is presented in Fig. 3. Although the CHO metadata enrichment process has been performed automatically, we plan to support the inspection of the results using MMAT.

4.2 Conversion of Metadata from XML to RDF

Generally, the basic operations that have to take place in order to convert XML data to RDF include: (a) mapping of every complex XML element to a resource (often a blank node) and of every atomic attribute to an attribute of this resource, and (b) assignment of a namespace prefix to each XML name to create fully qualified URIs.
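These two generic operations can be illustrated with a short sketch based on Python's standard `xml.etree.ElementTree`. It is a naive structural mapping, not the ontology-aware transformation used in Natural Europe: complex elements become blank nodes, atomic elements and XML attributes become literal-valued properties, and every name is qualified with a namespace. The namespace `http://example.org/ne/schema#` and the sample record are assumptions.

```python
import itertools
import xml.etree.ElementTree as ET

NS = "http://example.org/ne/schema#"  # hypothetical namespace for XML names

SAMPLE = (
    '<record id="r1"><title>Gray wolf skull</title>'
    "<species><scientificName>Canis lupus</scientificName></species></record>"
)


def literal(text):
    """Render a plain literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def xml_to_triples(xml_text):
    """Map an XML document to (subject, predicate, object) triples."""
    counter = itertools.count(1)
    triples = []

    def qualify(name):
        # Drop any "{namespace}" prefix added by ElementTree and build a full URI.
        return f"<{NS}{name.split('}')[-1]}>"

    def visit(element):
        node = f"_:b{next(counter)}"  # each complex element gets a blank node
        for name, value in element.attrib.items():
            triples.append((node, qualify(name), literal(value)))
        for child in element:
            if len(child) == 0 and not child.attrib:
                # Atomic element: its text becomes a literal property value.
                text = (child.text or "").strip()
                triples.append((node, qualify(child.tag), literal(text)))
            else:
                # Complex element: recurse and link the two resources.
                triples.append((node, qualify(child.tag), visit(child)))
        return node

    visit(ET.fromstring(xml_text))
    return triples


if __name__ == "__main__":
    for triple in xml_to_triples(SAMPLE):
        print(" ".join(triple), ".")
```

Blank nodes produced this way cannot be referenced from outside the dataset, which is why assigning stable identifiers to such resources matters for Linked Data publication.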

In the case of Natural Europe, the XML to RDF data conversion has been performed through automatic transformation processes, taking into account the Natural Europe Ontology. The Identification module of the Natural Europe’s federal node has a central role in this process by providing unique identifiers for previously anonymous objects. The generated RDF data have been persisted in an RDF store and can be queried through SPARQL. The use of the Natural Europe Ontology allows the inference of new RDF statements by applying well known reasoning techniques that exploit OWL axioms. An example of the Natural Europe RDF data in the form of a graph is shown in Fig. 4.
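To give a flavour of how class axioms yield new statements, the sketch below applies a single RDFS-style rule (an instance of a class is also an instance of its superclasses) until no new triples appear. It is a toy forward-chaining loop, not the reasoner used in the project, and the class hierarchy in the usage example, including the alignment of a Natural Europe class with `edm:ProvidedCHO`, is an assumption.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the input triples plus all rdf:type statements implied by subclass axioms."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until a fixed point is reached
        changed = False
        subclass = {(s, o) for s, p, o in inferred if p == SUBCLASS_OF}
        typed = {(s, o) for s, p, o in inferred if p == RDF_TYPE}
        for instance, cls in typed:
            for sub, sup in subclass:
                new = (instance, RDF_TYPE, sup)
                if cls == sub and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


if __name__ == "__main__":
    data = {
        ("ne:Specimen", SUBCLASS_OF, "ne:CHO"),       # assumed axiom
        ("ne:CHO", SUBCLASS_OF, "edm:ProvidedCHO"),   # assumed alignment with EDM
        ("ex:wolf1", RDF_TYPE, "ne:Specimen"),
    }
    for triple in sorted(infer_types(data) - data):
        print(triple)
```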

4.3 Connection of Natural Europe LOD Node to the Linked Data Cloud

The exposure of the Natural Europe data in RDF format as well as the availability of the semantic services (SPARQL Endpoint, Resolvable URIs) allows all the museums’ specimens to be available on the Linked Data cloud. This way anyone can reference any node of the knowledge graph, based on the Linked Data paradigm.

All the data in the Natural Europe environment (even those coming from different institutes) have been automatically interconnected using the aforementioned vocabularies/taxonomies. As an example, consider two museums that have each described a specimen of a gray wolf (Canis lupus). During the enrichment step the CHOs are connected to the SKOS Concept describing “Canis lupus”, and as soon as the data are available as RDF triples, there will be at least two resources of the class Specimen that are linked to “Canis lupus”. Using the SPARQL endpoint of the federal node, or the federated query feature of the SPARQL 1.1 specification, we can exploit the relation between these two specimens.
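The gray wolf scenario can be made concrete as follows. The query string shows how the SPARQL 1.1 `SERVICE` keyword could join specimens held at two endpoints; the endpoint URLs, the `ne:` namespace and the `ne:taxon` property are hypothetical. Since running it would need live endpoints, the accompanying function reproduces the same join over an in-memory list of triples.

```python
from collections import defaultdict

# Hypothetical endpoints and vocabulary; only the SERVICE syntax is standard SPARQL 1.1.
FEDERATED_QUERY = """
PREFIX ne: <http://example.org/natural-europe#>
SELECT ?a ?b ?concept WHERE {
  SERVICE <http://museum-a.example.org/sparql> {
    ?a a ne:Specimen ; ne:taxon ?concept .
  }
  SERVICE <http://museum-b.example.org/sparql> {
    ?b a ne:Specimen ; ne:taxon ?concept .
  }
}
"""


def specimens_by_concept(triples, taxon_property="ne:taxon"):
    """Group subjects by the concept they link to; keep concepts shared by several subjects."""
    groups = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == taxon_property:
            groups[obj].add(subject)
    return {
        concept: sorted(subjects)
        for concept, subjects in groups.items()
        if len(subjects) > 1
    }


if __name__ == "__main__":
    aggregated = [
        ("museumA:cho1", "ne:taxon", "col:canis-lupus"),
        ("museumB:cho7", "ne:taxon", "col:canis-lupus"),
        ("museumB:cho9", "ne:taxon", "col:vulpes-vulpes"),
    ]
    print(specimens_by_concept(aggregated))
```

Against an aggregating node the `SERVICE` clauses are unnecessary, because all triples are already in one store; the federated form trades that convenience for fresher data.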

4.4 Transition to EDM

From a technical point of view, EDM adheres to the modeling principles that underpin the approach of the Web of Data ("Semantic Web"). In this approach, there is no such thing as a fixed schema that dictates just one way to represent the data. A common model like EDM can be seen instead as an anchor to which various finer-grained models can be attached, making them at least partly interoperable at the semantic level, while the data retain their original expressivity and richness. It does not require changes in the local approaches, although any changes that increase the cross-domain usefulness of the data are encouraged (e.g., the usage of publicly accessible vocabularies for persons, places, subjects etc.).

Nevertheless, an EDM ingestion mechanism is yet to be provided by Europeana. Until such an option is available, the only way to expose external data to the system is through the ingestion of XML records in ESE format. Our approach ensures that the generated data comply with the EDM specification, thus allowing their immediate dissemination to the Europeana infrastructure once this becomes possible. To this end, we plan to support the ingestion of EDM (OAI-ORE) packages through the OAI-PMH protocol on the federated node. This will be implemented in much the same way as the aggregation of data from the federated nodes to the federal node.

5 Related Work

The STERNA project [ 12 ] focuses on the enrichment of existing content in the natural history domain. It has developed a methodology on how to integrate one’s content into the STERNA information space. Its Reference Network Architecture (RNA) is a web-based information architecture that allows connecting various knowledge resources and provides an accessible and unambiguous way of retrieving the heterogeneous content within those resources. RNA’s architecture is based on RDF and SKOS. In an RNA environment, content items can be stored in several different RDF stores that can be located on different servers and on various locations. However, they can still be approached as one integrated environment when using the RNA Toolset or when searching the RNA environment.

The MultimediaN E-Culture project [ 9 ] developed a search portal and engine that served as a joint prototype Semantic Web application for subsets of digital collections and thesauri from a number of heritage institutions. Several datasets from Dutch art and ethnographic collections have been ported to the Semantic Web. The core Getty vocabularies (AAT, TGN and ULAN) have been converted from the Getty XML files into RDF, and together with the SKOSified thesauri and other controlled vocabularies form the RDF graph underlying the E-Culture semantic search portal demonstrator. The project has developed a generic Java-based framework for converting collection metadata and controlled vocabularies into RDF/SKOS (AnnoCultor).

The STITCH project [ 13 ] examined the extent to which current Semantic Web techniques can solve issues presented by the heterogeneity of cultural heritage collection databases and controlled vocabularies. For this purpose, STITCH developed methods for aligning and browsing reference structures such as SKOSified thesauri and classification systems. It created SKOS representations of Iconclass and the Aria thesaurus, aligned these representations using state-of-the-art mapping tools, and implemented a faceted Web browsing environment to visualize and examine the results.

6 Conclusion

We presented a semantic infrastructure and a methodology that make possible the transition of the Natural Europe Cultural Digital Libraries Federation, which provides cultural and biodiversity content, to the Semantic Web and the Linked Data Cloud. The methodology includes the following stages: (a) enrichment of Natural Europe metadata, (b) conversion of metadata from XML to RDF, (c) connection of Natural Europe LOD node to the Linked Data cloud (RDF store, SPARQL endpoint), and (d) transition to EDM. This methodology can be applied in other domains as well, exploiting their schemas and the related domain vocabularies/taxonomies.

Our current research focuses on investigating the integration of the Natural Europe NHM federated nodes with cultural heritage and biodiversity RDF data providers, utilizing different metadata schemas (e.g., ABCD), in an ontology-based mediator system. Such an infrastructure is extremely important for Semantic Web applications and end users, since it will enable the retrieval of up-to-date triples, unlike the data warehousing approaches applied by data aggregators. To this end, the SPARQL-RW Framework [ 5 ], developed by TUC/MUSIC Lab, is considered a cornerstone component for transparently accessing federated RDF data sources complying with different Ontology Schemas.

Acknowledgements. This work has been carried out in the scope of the Natural Europe Project (Grant Agreement 250579) funded by EU ICT Policy Support Programme.

References

1. Bizer, Cyganiak, R.: D2R Server - Publishing Relational Databases on the Semantic Web. In: Proceedings of the 5th International Semantic Web Conference (ISWC) (2006).
2. Europeana Data Model Definition v5.2.3, http://pro.europeana.eu/documents/900548/bb6b51df-ad11-4a78-8d8a-44cc41810f22
3. Europeana Semantic Elements Specification v3.4.1, http://pro.europeana.eu/documents/900548/dc80802e-6efb-4127-a98e-c27c95396d57
4. Hendler, Ding: Publishing and Using Cultural Heritage Linked Data on the Semantic Web. Synthesis Lectures on Semantic Web: Theory and Technology. Morgan & Claypool Publishers series (2012).
5. Makris, Bikakis, Gioldasis, Christodoulakis, S.: SPARQL-RW: Transparent Query Access over Mapped RDF Data Sources. In: Proceedings of the 15th International Conference on Extending Database Technology (EDBT), Berlin (2012).
6. Makris, Skevakis, Kalokyri, Arapi, Christodoulakis: Metadata Management and Interoperability Support for Natural History Museums. In: Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL), Malta (2013).
7. Makris, Skevakis, Kalokyri, Gioldasis, Kazasis, Christodoulakis: Bringing Environmental Culture Content into the Europeana.eu Portal: The Natural Europe Digital Libraries Federation Infrastructure. In: Proceedings of the 5th Metadata and Semantics Research Conference (MTSR), Izmir (2011).
8. Miles, Matthews, Wilson, Brickley: SKOS Core: Simple knowledge organization for the Web (2005).
9. MultimediaN E-Culture project, http://e-culture.multimedian.nl
10. Natural Europe Cultural Heritage Object Application Profile, http://wiki.naturaleurope.eu/index.php?title=Natural_Europe_Cultural_Heritage_Object_Application_Profile
11. Natural Europe Project, http://www.natural-europe.eu
12. STERNA project, http://www.sterna-net.eu/
13. STITCH project, http://www.cs.vu.nl/STITCH/