Adding Biodiversity Datasets from Argentinian Patagonia to the Web of Data

Marcos Zárate1,2,4, Germán Braun3,4, Pablo Fillottrani5,6

1 Centro para el Estudio de Sistemas Marinos, Centro Nacional Patagónico (CESIMAR-CENPAT), Argentina
2 Universidad Nacional de la Patagonia San Juan Bosco (UNPSJB), Argentina
3 Universidad Nacional del Comahue (UNCOMA), Argentina
4 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina
5 Universidad Nacional del Sur (UNS), Argentina
6 Comisión de Investigaciones Científicas de la provincia de Buenos Aires (CIC), Argentina

Abstract. In this work we present a framework to publish biodiversity data from Argentinian Patagonia as Linked Open Data (LOD). These datasets contain information on biological species (mammals, plants and parasites, among others) collected by researchers from the Centro Nacional Patagónico (CENPAT), and initially made available as Darwin Core Archive (DwC-A) files. We introduce and detail a transformation process and explain how to access and exploit the resulting data, promoting integration with other repositories.

Keywords: Biocollections, Darwin Core, Linked Data, RDF, SPARQL

1 Introduction

Animal, plant and marine biodiversity comprise the "natural capital" that keeps our ecosystems functional and our economies productive. However, since the world is experiencing a dramatic loss of biodiversity [1,2], its impact is being analysed by digitising and publishing biological collections [3]. To this end, the biodiversity community has standardised shared vocabularies such as Darwin Core (DwC) [4], together with platforms such as the Integrated Publishing Toolkit (IPT) [5], aimed at publishing and sharing biodiversity data. As a consequence, the biodiversity community now has hundreds of millions of records published in common formats and aggregated into centralised portals.
Nevertheless, new challenges have emerged from this initiative for effectively using such a large volume of data. In particular, as the number of species, geographic regions and institutions continues to grow, answering questions about the complex interrelationships among these data becomes increasingly difficult. The Semantic Web (SW) [6] provides possible solutions to these problems by enabling the Web of Linked Data (LD) [7], where data objects are uniquely identified and the relationships among them are explicitly defined. LD is a powerful and compelling approach for spreading and consuming scientific data. It involves publishing, sharing and connecting data on the Web, and offers a new way of achieving data integration and interoperability. The driving force behind implementing LD spaces is the RDF technology, and there is increasing recognition of the advantages of LD technologies in the life sciences [8,9]. In this same direction, CENPAT1 has started to publicly share its data under an Open Data licence.2 Data are available as Darwin Core Archives (DwC-A) [10], each a set of files describing the structure and relationships of the raw data, along with metadata files conforming to the DwC standard. Nevertheless, the well-known IPT platform focuses on publishing content in unstructured or semi-structured formats, which limits the possibilities of interoperating with other datasets and making the data accessible to machines. To enhance this approach, we present a transformation process to publish these data as RDF datasets. This process uses OpenRefine [11] for generating RDF triples from semi-structured data and defining URIs. It also uses GraphDB, previously known as OWLIM [12], for storing, browsing, accessing and linking data with external RDF datasets. Throughout this process, we follow the stages defined in the LOD Life-Cycle proposed in [13].
We claim that this work is an opportunity to exploit biodiversity data from Argentina, since such data have never before been published as LOD. This work is structured as follows. Section 2 describes the main features of the selected datasets and their relationship with DwC. Section 3 describes the transformation process to RDF, while Section 4 presents its publication and access. Section 5 shows the framework used to discover links to other datasets. Next, Section 6 presents the exploitation of the dataset. Finally, we draw conclusions and suggest future improvements.

2 CENPAT Data Sources

In this section, before describing our datasets, we briefly explain the DwC standard and DwC-A, on which these datasets are based.

2.1 Darwin Core Terms and Darwin Core Archive

DwC [4] is a body of standards for biodiversity informatics. It provides stable terms and vocabularies for sharing biodiversity data. DwC is maintained by TDWG3 (Biodiversity Information Standards, formerly The International Working Group on Taxonomic Databases). Its terms are organised into nine categories (often referred to as classes), six of which cover broad aspects of the biodiversity domain. Occurrence refers to the existence of an organism at a particular place and time. Location is the place where the organism was observed (normally a geographical region or place). Event is the relationship between Occurrence and Location, and registers protocols and methods, dates, times and field notes. Finally, Taxon refers to the scientific names, vernacular names, etc. of the organism observed. The remaining categories cover relationships to other resources, measurements, and generic information about records. DwC also makes use of Dublin Core terms [14], for example: type, modified, language, rights, rightsHolder, accessRights, bibliographicCitation and references.

1 http://www.cenpat-conicet.gob.ar/
2 https://creativecommons.org/licenses/by/4.0/legalcode
3 http://www.tdwg.org/
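To make the category structure concrete, the following sketch (values taken from examples appearing later in this paper; the grouping into categories is a simplification, not the full DwC term list) shows how the fields of one occurrence record fall into the classes just described:

```python
# Illustrative occurrence record; field values reuse examples from this paper.
occurrence_record = {
    # Occurrence: the organism at a particular place and time
    "occurrenceID": "f6bbf85d-85ea-4605-87fa-d81aca73a1cd",
    "basisOfRecord": "PreservedSpecimen",
    "individualCount": 1,
    # Location: where the organism was observed
    "decimalLatitude": -42.53,
    "decimalLongitude": -63.6,
    "country": "Argentina",
    # Event: when and how it was recorded
    "eventDate": "2004-10-22",
    # Taxon: scientific classification
    "scientificName": "Mirounga leonina Linnaeus, 1758",
    "kingdom": "Animalia",
    "class": "Mammalia",
    # Dublin Core term reused by DwC
    "rightsHolder": "CENPAT-CONICET",
}

def category_of(term: str) -> str:
    """Return the DwC category a term belongs to (simplified lookup)."""
    categories = {
        "Occurrence": {"occurrenceID", "basisOfRecord", "individualCount"},
        "Location": {"decimalLatitude", "decimalLongitude", "country"},
        "Event": {"eventDate"},
        "Taxon": {"scientificName", "kingdom", "class"},
    }
    for cat, terms in categories.items():
        if term in terms:
            return cat
    return "Record-level"

print(category_of("scientificName"))  # Taxon
```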
In the same direction, Darwin Core Archive (DwC-A) [10] is a biodiversity informatics data standard that uses the DwC terms to produce a single, self-contained dataset, thus allowing both species-level (taxonomic) and species-occurrence data to be shared. Each DwC-A includes the following files. Firstly, the core data file (mandatory) consists of a standard set of DwC terms together with the raw data. This file is formatted as fielded text, where data records are expressed as rows of text and data elements (columns) are separated by a standard delimiter such as a tab or comma. Its first row specifies the header for each column. Secondly, the descriptor metafile defines how the core data file is organised and maps each data column to a corresponding DwC term. Lastly, the resource metadata provides information about the dataset itself, such as its description (abstract), the agents responsible for authorship, publication and documentation, bibliographic and citation information, and the collection method, among others.

2.2 Dataset Features

The datasets analysed belong to CENPAT and are available as DwC-A on an IPT server of this institution. They include collections of marine, terrestrial, parasite and plant species, mainly recorded at several points of the Argentinian Patagonia. Data are generated in different ways: some by means of electronic devices placed on different animals to study environmental variables, while others are observations of species in their natural habitat or species studied in laboratories. To ensure the quality of these data, the records have been structured according to the procedure described in [15]. As of May 2017, CENPAT owns 33 datasets comprising about 273,419 occurrence records, 80% of which are also georeferenced. Some of these collections contain unique data never published before because of the age of the records (1970s).
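Because the core data file is plain fielded text with a header row, it can be parsed with ordinary CSV tooling. A minimal sketch, with illustrative file contents (the real occurrence.txt files contain many more columns):

```python
import csv
import io

# Illustrative contents of a tab-delimited DwC-A core file (occurrence.txt);
# the first row names the DwC terms, each following row is one record.
sample = (
    "occurrenceID\tscientificName\tdecimalLatitude\tdecimalLongitude\n"
    "f6bbf85d-85ea-4605-87fa-d81aca73a1cd\tMirounga leonina\t-42.53\t-63.6\n"
)

def read_core_file(text: str, delimiter: str = "\t"):
    """Parse a DwC-A core data file into a list of {term: value} dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

records = read_core_file(sample)
print(records[0]["scientificName"])  # Mirounga leonina
```

In a real pipeline the delimiter and column-to-term mapping come from the descriptor metafile rather than being hard-coded.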
As a consequence, making this information available as LOD is particularly important for researchers studying species conservation and the human impact on biodiversity over recent years [16,17].

3 Linked Data Creation

Publishing data as LD involves data cleaning, mapping and conversion processes from DwC-A to RDF triples. The architecture of this process is shown in Fig. 1 and is structured as described in the following subsections.

Figure 1. Transformation process for converting biodiversity datasets.

3.1 Data Extraction, Cleaning and Reconciliation Process

The DwC-A files are manually extracted from the IPT repository and their occurrence files (occurrence.txt) are processed with the OpenRefine tool [11]. There, occurrences are cleaned and converted to standardised data types (dates, numerical values, etc.), and empty columns are removed. OpenRefine also allows adding reconciliation services based on SPARQL endpoints, which return candidate resources from external datasets to be matched against fields in the local datasets. In our process, we use the DBpedia [18] endpoint4 to reconcile the Country column with the corresponding dbo:country resource in DBpedia; the link between the resources is made through the property owl:sameAs. If the reconciliation succeeds, we create a new column for the corresponding URI of the resource; in particular, we add a column named dbpediaCountryURI for the original Country. A second reconciliation service5 is based on the Encyclopedia of Life (EOL),6 a taxonomic database, and allows accepted names in the EOL database to be reconciled. Specifically, the reconciliation is applied to the column scientificName, and we create a new column named EOL page holding the EOL page that describes the species. Unfortunately, this whole process is time-consuming because not all values are matched automatically, and ambiguous suggestions must be fixed by hand.
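At its core, a reconciliation step proposes candidate resources for a local value and keeps the match only when the similarity score is good enough. The following pure-Python sketch mimics that behaviour; the candidate list and threshold are illustrative and do not reproduce the actual DBpedia or EOL services:

```python
import difflib

# Illustrative candidate labels, as a reconciliation service might return them.
CANDIDATES = {
    "Mirounga leonina": "http://eol.org/pages/328639",
    "Argentina": "http://dbpedia.org/resource/Argentina",
}

def reconcile(value: str, threshold: float = 0.9):
    """Return (matched URI, score) for the best candidate, or (None, score)."""
    best_uri, best_score = None, 0.0
    for label, uri in CANDIDATES.items():
        score = difflib.SequenceMatcher(None, value.lower(), label.lower()).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    return (best_uri, best_score) if best_score >= threshold else (None, best_score)

# Case differences still score 1.0, so the match is accepted.
print(reconcile("Mirounga Leonina"))
```

Values scoring below the threshold come back unmatched, which is exactly the ambiguous case that, as noted above, must be resolved manually.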
Moreover, in this phase only two columns could be reconciled, because the process returned unsuitable results from the DBpedia service for columns such as institutionCode or locality.

4 https://dbpedia.org/sparql
5 http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_eol.php
6 http://www.eol.org/

3.2 RDF Schema Alignment and URI Definition

After cleaning and reconciling, data are converted to RDF triples using RDF Refine,7 an extension of the OpenRefine tool. Through a graphical interface, RDF Refine lets users describe the RDF schema alignment skeleton to be shared among different datasets. The RDF skeleton specifies the subject, the predicate and the object of the triples to be generated. The next step in the process is to set up prefixes. Since the datasets include localities, locations and research institutes, we set up prefixes for well-known vocabularies such as the W3C Basic Geo ontology [19], Geonames [20], DBpedia, FOAF [21], Darwin-SW [22] (for establishing relationships among DwC classes) and Taxon Concept.8 Table 1 shows the prefixes used.

Table 1. Prefixes used in the mapping process.
Prefix      Description               URI
cnp-gilia   Base URI                  http://crowd.fi.uncoma.edu.ar:3333/
dwc         Darwin Core               http://rs.tdwg.org/dwc/terms/
dsw         Darwin-SW                 http://purl.org/dsw/
foaf        Friend of a Friend        http://xmlns.com/foaf/0.1/
dc          Dublin Core               http://purl.org/dc/terms/
geo-pos     WGS84 lat/long vocab      http://www.w3.org/2003/01/geo/wgs84_pos#
geo-ont     GeoNames                  http://www.geonames.org/ontology#
wd          Entities in Wikidata      http://www.wikidata.org/entity/
wdt         Properties in Wikidata    http://www.wikidata.org/prop/direct/
txn         Taxon Concept Ontology    http://lod.taxonconcept.org/ontology/txn.owl#

To generate a URI for each resource, we used GREL (General Refine Expression Language), also provided by OpenRefine. The general structure of the URIs is:

http://[base uri]/[DwC class]/[value]

where [base uri] is the one specified in Table 1, [DwC class] is the respective DwC class, and [value] is the value of the cell in the occurrence file. It is also important to note that the generated URIs are instances of the classes defined in the DwC standard. Finally, the resulting RDF triple for an occurrence is:

SUBJECT:   <base_uri/occurrence/f6bbf85d-85ea-4605-87fa-d81aca73a1cd>
PREDICATE: rdf:type
OBJECT:    dwc:Occurrence

Table 2 describes the mapping performed and which columns have been used to generate the main URIs.

7 http://refine.deri.ie/
8 http://lod.taxonconcept.org/ontology/txn.owl

Table 2. The first part of the table shows the main classes corresponding to the categories of the DwC standard, together with the columns of the DwC-A file used to generate URIs. The second part shows the properties used and an example of the literals obtained from the columns of the occurrence.txt file.
For simplicity, the table shows only the main properties; see the complete scheme at https://github.com/cenpat-gilia/CENPAT-GILIA-LOD/blob/master/Open_refine_scripts/rdf_skelton.json

Class            Columns used to create URI
dwc:Taxon        genus + specificEpithet
dwc:Occurrence   id
dwc:Event        id
dwc:Dataset      dataset
dc:Location      id
foaf:Agent       institutionCode

Property                Columns used                   Example
dwc:class               class                          "Mammalia"^^xsd:string
dwc:family              family                         "Phocidae"^^xsd:string
dwc:genus               genus                          "Mirounga"^^xsd:string
dwc:kingdom             kingdom                        "Animalia"^^xsd:string
dwc:order               order                          "Carnivora"^^xsd:string
dwc:phylum              phylum                         "Chordata"^^xsd:string
dwc:scientificName      scientificName                 "Mirounga leonina Linnaeus, 1758"^^xsd:string
txn:hasEOLPage          EOL page                       "http://eol.org/pages/328639"^^xsd:string
dwc:basisOfRecord       basisOfRecord                  "PreservedSpecimen"^^xsd:string
dwc:occurrenceRemarks   occurrenceRemarks              "craneo completo"^^xsd:string
dwc:individualCount     individualCount                1^^xsd:int
dwc:catalogNumber       catalogNumber                  "100751-1"^^xsd:string
geo-pos:lat             decimalLatitude                -42.53^^xsd:decimal
geo-pos:long            decimalLongitude               -63.6^^xsd:decimal
geo-ont:countryCode     country                        "Argentina"^^xsd:string
dwc:verbatimEventDate   verbatimEventDate              "2004-10-22"^^xsd:date
foaf:name               recordedBy or institutionCode  "CENPAT-CONICET"@en

4 Publishing and Accessing Data

The transformed biodiversity data have been published, and can be accessed, through GraphDB. GraphDB is a highly efficient and robust graph database with RDF and SPARQL support. It allows users to explore the hierarchy of RDF classes (Class hierarchy), where each class can be browsed to explore its instances. Similarly, relationships among these classes can also be explored, giving an overview of how many links exist between the instances of two classes (Class relationship). Each link is an RDF statement whose subject and object are class instances and whose predicate is the link itself.
Lastly, users can also explore resources by providing URIs representing the subject, predicate or object of a triple (View resource). Finally, Fig. 2 shows the resulting graph for the description of a southern elephant seal skull, which is part of the CENPAT collection of marine mammals and contains information about where it was found, who collected it, its sex and its scientific name, among others. Another way to access the same information is through the View resource page of the GraphDB repository, http://crowd.fi.uncoma.edu.ar:3333/resource/find, for the specific occurrence f6bbf85d-85ea-4605-87fa-d81aca73a1cd, while the serialization of the complete graph in Turtle syntax can be consulted online.9

Figure 2. Links between instances of classes; rdf:type assertions are shown in light grey, and reconciled values in blue.

9 https://github.com/cenpat-gilia/CENPAT-GILIA-LOD/blob/master/rdf/graph.ttl, accessed September 2017

5 Interlinking

Interlinking other datasets in a semi-automated way is crucial for facilitating data integration. In this context, the OpenRefine reconciliation service is able to match some links to DBpedia, but since it is still limited, our process should use more powerful tools to discover links to other datasets. For this task, our approach preliminarily integrates the SILK framework,10 which uses the Silk Link Specification Language (Silk-LSL) to express heuristics for deciding whether a semantic relationship exists between two entities. For interlinking species between DBpedia and our dataset, we used the Levenshtein distance, a comparison operator that evaluates two inputs and computes their similarity based on a user-defined distance measure and a user-defined threshold. This comparator receives as input two strings: dbp:binomial (binomial nomenclature in DBpedia) and the combination dwc:genus + dwc:specificEpithet (the concatenation of these two defines the scientific name of the species).
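The comparison SILK performs here can be illustrated with a plain-Python edit distance. This is only a sketch of the idea: the function names are ours, and the acceptance threshold is an assumption, since the exact value used in the link specification is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def link_candidates(genus: str, epithet: str, dbp_binomial: str,
                    max_distance: int = 2) -> bool:
    """Accept an owl:sameAs link when genus + specificEpithet is within
    max_distance edits of the DBpedia binomial name (assumed threshold)."""
    local_name = f"{genus} {epithet}"
    return levenshtein(local_name.lower(), dbp_binomial.lower()) <= max_distance

print(link_candidates("Mirounga", "leonina", "Mirounga leonina"))  # True
```

A small tolerance like this absorbs minor spelling variants while rejecting unrelated names, which mirrors why some links below 100% accuracy still need manual review.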
After executing SILK with this comparator, 15 links to DBpedia were discovered with an accuracy of 100%, and 85 links with an accuracy between 65% and 75%. In this case, we permit only one outgoing owl:sameAs link from each resource. The complete Silk-LSL script can be downloaded online.11 However, although a set of links has been successfully generated, users' feedback is needed to filter some species wrongly matched by the tool. Finally, we must identify further candidates for interlinking and test other properties or classes from our dataset in order to increase the automatic capabilities of the framework.

6 Exploitation

This section shows how the different types of species observations can be retrieved, complemented with information from other datasets, and filtered by submitting SPARQL queries to the GraphDB endpoint. Moreover, it provides some experiments in R using the SPARQL12 package. Each SPARQL query in the following examples assumes the prefixes defined in Table 1.

Total Number of Species in the CENPAT Dataset. The following query retrieves the species of the dataset, including the scientific name of each species and its number of occurrences; to execute this query in GraphDB see.13 Fig. 3 shows only the first resulting records.

10 http://silkframework.org/
11 https://github.com/cenpat-gilia/CENPAT-GILIA-LOD/blob/master/SILK/link-spec.xml, accessed September 2017
12 https://cran.r-project.org/web/packages/SPARQL/SPARQL.pdf
13 http://crowd.fi.uncoma.edu.ar:3333/sparql?savedQueryName=species-count

SELECT ?scname (COUNT(?s) AS ?observations)
{ ?s a dwc:Occurrence .
  ?s dsw:toTaxon ?taxon .
  ?taxon dwc:scientificName ?scname }
GROUP BY ?scname
ORDER BY DESC(COUNT(?s))

Figure 3. Occurrences of each species contained in the dataset.

Occurrences by Year. The following query shows the temporality of the occurrences; its results are visualised using R, as shown in Fig. 4.
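The aggregation this query performs, counting occurrences per year, can be sketched offline in Python over a list of event dates (the sample dates below are illustrative):

```python
from collections import Counter

# Illustrative verbatim event dates, as stored via dwc:verbatimEventDate.
event_dates = ["2004-10-22", "2004-03-15", "1978-01-02", "1978-11-30", "1978-06-07"]

# Equivalent of GROUP BY (year(?date) AS ?year) with COUNT(?s):
# extract the year component and tally occurrences per year.
occurrences_by_year = Counter(date[:4] for date in event_dates)

for year, count in sorted(occurrences_by_year.items()):  # ORDER BY ASC(?year)
    print(year, count)
```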
The R script is available online.14

SELECT ?year (COUNT(?s) AS ?count)
{ ?s a dwc:Event .
  ?s dwc:verbatimEventDate ?date }
GROUP BY (year(?date) AS ?year)
ORDER BY ASC(?year)

Figure 4. Simple plot using SPARQL and the ggplot2 package for R.

Conservation Status of Species. Conservation statuses are defined by The IUCN Global Species Programme15 and are taken as a global reference. Information about conservation status is missing from the CENPAT datasets, so providing these data by linking other RDF datasets is highly desirable. To this end, the following query captures the missing data through the owl:sameAs property. The results are shown in Fig. 5; to execute this query in GraphDB, see.16

14 https://github.com/cenpat-gilia/CENPAT-GILIA-LOD/blob/master/r-scripts/occurrences-by-year.R, accessed September 2017
15 http://www.iucnredlist.org/

SELECT ?scname ?eol_page ?c_status
WHERE {
  ?s a dwc:Taxon .
  ?s dwc:scientificName ?scname .
  ?s txn:hasEOLPage ?eol_page .
  ?s owl:sameAs ?resource .
  SERVICE <http://dbpedia.org/sparql> {
    ?resource dbo:conservationStatus ?c_status . }
}

Figure 5. Conservation status associated to the species: LC (Least Concern), DD (Data Deficient), EN (Endangered), VU (Vulnerable).

Locations of Marine Mammals. The last query retrieves the locations (latitude and longitude) of the species Mirounga leonina. The results are depicted in Fig. 6 using R; the script is available online.17

SELECT ?lat ?long
WHERE {
  ?s a dwc:Occurrence .
  ?s dsw:toTaxon ?taxon .
  ?taxon dwc:scientificName ?s_name .
  ?s dsw:atEvent ?event .
  ?event dsw:locatedAt ?loc .
  ?loc geo-pos:lat ?lat .
  ?loc geo-pos:long ?long
  FILTER (?lat >= "-58.4046"^^xsd:decimal && ?lat <= "-32.4483"^^xsd:decimal)
  FILTER (?long >= "-69.6095"^^xsd:decimal && ?long <= "-52.631"^^xsd:decimal)
  FILTER regex(STR(?s_name), "Mirounga leonina") }

7 Conclusions and Further Works

In this work we have presented a framework to publish biodiversity data from Argentinian Patagonia as LOD, data which had initially been made available as Darwin Core Archive files. The aim is to facilitate researchers' access to important data and thus give valuable support to the scientific analysis of biodiversity. In addition, this work is the first Argentinian initiative to convert biodiversity data according to the criteria established by LOD. We have detailed the transformation process and explained how to access and exploit the data, promoting integration with other repositories. Moreover, we have illustrated this process with queries extracted from the application domain. The RDF repository is hosted at http://crowd.fi.uncoma.edu.ar:3333/ together with a SPARQL endpoint; at this initial stage we store 202,119 triples. As future work, we plan to automate some tasks of the process, interlink with more datasets, and provide easier SPARQL access for non-expert users. Finally, we are analysing other ontologies such as ENVO [23], NCBI [24] and OWL Time [25], and working on a suite of complementary ontologies for describing every aspect of semantic biodiversity.

16 http://crowd.fi.uncoma.edu.ar:3333/sparql?savedQueryName=conservation-status
17 https://github.com/cenpat-gilia/CENPAT-GILIA-LOD/blob/master/r-scripts/positions-ml.R, accessed September 2017

Figure 6. Visualization of animal movements using R.

References

1. Craig Moritz, James L Patton, Chris J Conroy, Juan L Parra, Gary C White, and Steven R Beissinger. Impact of a century of climate change on small-mammal communities in Yosemite National Park, USA. Science, 2008.
2. Adriana Vergés, Peter D Steinberg, Mark E Hay, Alistair GB Poore, Alexandra H Campbell, Enric Ballesteros, Kenneth L Heck, David J Booth, Melinda A Coleman, and Feary.
The tropicalization of temperate marine ecosystems: climate-mediated changes in herbivory and community phase shifts. In Proc. R. Soc. B. The Royal Society, 2014.
3. Malcolm Scoble. Rationale and value of natural history collections digitisation. Biodiversity Informatics, 2010.
4. John Wieczorek, David Bloom, Robert Guralnick, Stan Blum, Markus Döring, Renato Giovanni, Tim Robertson, and David Vieglais. Darwin Core: An evolving community-developed biodiversity data standard. PLoS ONE, 2012.
5. Tim Robertson, Markus Döring, Robert Guralnick, David Bloom, John Wieczorek, Kyle Braak, Javier Otegui, Laura Russell, and Peter Desmet. The GBIF Integrated Publishing Toolkit: facilitating the efficient publishing of biodiversity data on the internet. PLoS ONE, 2014.
6. Tim Berners-Lee, James Hendler, Ora Lassila, et al. The Semantic Web. Scientific American, 2001.
7. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data: the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts, 2009.
8. François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 2008.
9. Jouni Tuominen, Nina Laurenne, and Eero Hyvönen. Biological Names and Taxonomies on the Semantic Web: Managing the Change in Scientific Conception. Springer, 2011.
10. David Remsen, Kyle Braak, Markus Döring, and Tim Robertson. Darwin Core Archive How-To Guide. 2011.
11. Ruben Verborgh and Max De Wilde. Using OpenRefine. Packt Publishing Ltd, 2013.
12. Barry Bishop, Atanas Kiryakov, Damyan Ognyanoff, Ivan Peikov, Zdravko Tashev, and Ruslan Velkov. OWLIM: A family of scalable semantic repositories. Semantic Web, 2011.
13. Sören Auer, Lorenz Bühmann, Christian Dirschl, Orri Erling, Michael Hausenblas, Robert Isele, Jens Lehmann, Michael Martin, Pablo N. Mendes, Bert Van Nuffelen, Claus Stadler, Sebastian Tramp, and Hugh Williams.
Managing the Life-Cycle of Linked Data with the LOD2 Stack. In International Semantic Web Conference (2), Lecture Notes in Computer Science, 2012.
14. Dublin Core Metadata Initiative et al. Dublin Core Metadata Element Set, Version 1.1. 2012.
15. Mark J Costello and John Wieczorek. Best practice for biodiversity data management and publication. Biological Conservation, 2014.
16. Reed S Beaman and Nico Cellinese. Mass digitization of scientific collections: New opportunities to transform the use of biological specimens and underwrite biodiversity science. ZooKeys, 2012.
17. Ana Vollmar, James Alexander Macklin, and Linda Ford. Natural history specimen digitization: challenges and concerns. Biodiversity Informatics, 2010.
18. Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A Nucleus for a Web of Open Data. The Semantic Web, 2007.
19. Dan Brickley. W3C Semantic Web Interest Group: Basic Geo (WGS84 lat/long) Vocabulary, 2011.
20. Marc Wick, B Vatant, and B Christophe. Geonames Ontology. http://www.geonames.org/ontology, accessed September 2017, 2015.
21. Dan Brickley and Libby Miller. The Friend Of A Friend (FOAF) Vocabulary Specification, 2007.
22. Steven J Baskauf and Campbell O Webb. Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF. Semantic Web, 2016.
23. Pier Luigi Buttigieg, Evangelos Pafilis, Suzanna E. Lewis, Mark P. Schildhauer, Ramona L. Walls, and Christopher J. Mungall. The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation. Journal of Biomedical Semantics, 2016.
24. Scott Federhen. The NCBI Taxonomy database. Nucleic Acids Research, 2012.
25. Time Ontology in OWL, 2006. http://www.w3.org/TR/owl-time, accessed September 2017.