R2R+BCO-DMO – Linked Oceanographic
                 Datasets

    Adila Krisnadhi1,2, Robert Arko3, Suzanne Carbotte3, Cynthia Chandler4,
     Michelle Cheatham1, Pascal Hitzler1, Yingjie Hu5, Krzysztof Janowicz5,
         Peng Ji3, Nazifa Karima1, Adam Shepherd4, and Peter Wiebe4

                        1 Wright State University
             2 Faculty of Computer Science, Universitas Indonesia
            3 Lamont-Doherty Earth Observatory, Columbia University
                   4 Woods Hole Oceanographic Institution
                  5 University of California, Santa Barbara


        Abstract. The Biological and Chemical Oceanography Data Manage-
        ment Office (BCO-DMO) and the Rolling Deck to Repository (R2R)
        program are two key data repositories for oceanographic research, sup-
        ported by the U.S. National Science Foundation (NSF). R2R curates dig-
        ital data and documentation generated by environmental sensor systems
        installed on vessels from the U.S. academic research fleet, with support
        from the NSF Oceanographic Technical Services and Arctic Research
        Logistics Programs. BCO-DMO human-curates and maintains data and
        metadata including biological, chemical, and physical measurements and
        results from projects funded by the NSF Biological Oceanography, Chem-
        ical Oceanography, and Antarctic Organisms & Ecosystems Programs.
        These two repositories have a strong connection, and document several
        thousand U.S. oceanographic research expeditions conducted since the
        1970s. Recently, R2R and BCO-DMO have made their metadata collections
        available as Linked Data, accessible via public SPARQL endpoints. In
        this paper, we report on these datasets.


1     Introduction
Researchers in the geosciences are challenged by the volume and heterogeneity
of data types and formats, and by the difficulty of discovering, accessing, and
integrating data sets from multiple sources [2, 6]. At the same time, this
diversity and heterogeneity are unavoidable in a discipline as active and
multi-faceted as the geosciences.
    Geoscience researchers are therefore seeking methods and tools that allow
them to more easily share, discover, access, and reuse data. Currently,
large-scale data repositories, which warehouse data for redistribution and
inspection, play a very important role to this end. Each repository usually
caters to a specialized subcommunity of researchers and focuses on particular
purposes.
    Meanwhile, repositories of this kind abound on the World Wide Web. It thus
comes as no surprise that they each come
with their own modes of access, visualizations, tools, data structures, etc. So,
while access to relevant research data is now much easier in principle, diversity
and heterogeneity continue to provide significant barriers to discovery and access.
    At the same time, global issues such as climate change and deforestation,
together with a growing understanding of the many interrelationships between
different subdisciplines, impose the necessity to consider Earth as a single but
very complex system. This drives the need to not only discover and access data,
but also to integrate information across fields and disciplines. The importance
of this is witnessed, e.g., by the National Science Foundation’s funding of the
EarthCube program, which aims at providing “unprecedented data sharing” across
the geosciences.1
    Linked data, of course, provides a basic means to this end. Unfortunately,
while the uptake of linked data in the earth sciences is growing, it remains
relatively slow. But as more repository metadata is published as linked data,
adoption gathers momentum, because publishing in this shared format lowers the
barrier to reuse and thereby creates additional opportunities.
    Another advantage of advancing linked data solutions for the geosciences
emerges when considering the sociocultural benefits. For example, existing data
compilations such as the Global Multi-Resolution Topography synthesis [8],
Petrological Database [5], and Long Term Ecological Research Network [9] de-
pend upon contributions from hundreds of individual stakeholders such as sci-
entists and engineers on oceanographic cruises, geological surveys and mapping
agencies, and students and postdocs working in laboratories. Providing attribu-
tion (credit) to contributors is imperative for the success of such syntheses. Pub-
lishing content as linked open data, including links to investigators and field ex-
peditions, which, in turn, can be linked to journal articles and conference/award
abstracts, will provide greater incentive to contributors. Combining linked data
with greater semantic integration will not only facilitate connections between
global/gridded synthesis data and expedition-based (point-, track-, time-series-)
data, and make it easier for scientists to discover and access those data in a
consistent manner for multi-disciplinary investigations; it will also generate en-
thusiasm among scientists to contribute their data.
    In this paper, we present linked datasets providing content from two key
ocean science repositories in the U.S., the Biological and Chemical
Oceanography Data Management Office (BCO-DMO) and the Rolling Deck to Repository
(R2R) program. We will first discuss the specific relevance of these repositories
and their datasets for their research fields (Section 2), then provide more details
about the corresponding linked datasets and their availability (Section 3), before
concluding (Section 4).


1 http://earthcube.org/

                         Fig. 1. R2R online user interface


2     Repository Description and Relevance
2.1    The R2R Program
With its global capability and diverse array of sensors, the U.S. academic
research fleet is an essential mobile observing platform for ocean science. Data
collected on every expedition are of high value, especially given the high costs
and increasingly limited resources for ocean exploration. The Rolling Deck to
Repository (R2R) program2 is funded by NSF to provide stewardship of envi-
ronmental sensor data routinely collected by the U.S. academic research fleet,
working in close collaboration with the University-National Oceanographic Lab-
oratory System (UNOLS) and the NOAA National Data Centers.
    R2R maintains a catalog of vessels, instrument systems, expeditions, datasets,
investigators, organizations, funding awards, cruise reports, and navigation tracks
(see Figure 1) – every NSF-funded oceanographic cruise on a vessel in the aca-
demic fleet creates records in R2R. As such, R2R ensures preservation of and
2 http://www.rvdata.us/
access to U.S. national oceanographic research data resources, and provides a
central gateway through which data from oceanographic expeditions is routinely
cataloged and securely transmitted to national long-term archives including the
National Geophysical Data Center (NGDC) and National Oceanographic Data
Center (NODC). R2R thus provides essential data documentation for each expe-
dition, and tools to improve documentation of the wide array of shipboard data
acquisition activities typical of modern expeditions.
    R2R also conducts post-cruise quality assessment to document the quality of
data as originally delivered from vessels and provides feedback to cruise operators
regarding the data quality. The main objective is to identify occurrences of
suspicious data, not to assess the scientific value of the data. That is, R2R
aims to preserve the data and the accompanying metadata so as to capture, as far
as possible, the original intent with which they were collected or acquired
during the expedition. The quality assessment is realized through a series of
(mostly) automated tests such as checking whether appropriate metadata exists,
searching
for possible errors in file formats, as well as collecting summaries of record-level
testing of data. All of these are done without making changes to the original raw
data files.
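
    The kind of non-destructive testing described above might be sketched in
Python as follows. This is only an illustration and not R2R's actual pipeline:
the required metadata fields, the text-file extension rule, and the function
name assess_fileset are all hypothetical. Each test only reads the metadata and
the file bytes and returns a report, so the raw files are never changed.

```python
import os

# Hypothetical required fields; R2R's real checklist is not given in the text.
REQUIRED_FIELDS = ("cruise_id", "vessel", "instrument")


def assess_fileset(metadata, paths):
    """Return a list of human-readable issues; never modifies the input files."""
    issues = []

    # Test 1: does the appropriate metadata exist?
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):
            issues.append(f"missing metadata field: {field}")

    # Test 2: crude file-level checks. Files are opened read-only.
    for path in paths:
        if not os.path.isfile(path):
            issues.append(f"file not found: {path}")
            continue
        with open(path, "rb") as handle:
            head = handle.read(1024)
        if not head:
            issues.append(f"empty file: {path}")
        elif b"\x00" in head and path.endswith(".csv"):
            # Assumed rule: a .csv file should not contain NUL bytes.
            issues.append(f"binary content in text file: {path}")

    return issues
```

A report with no entries means that none of these (assumed) tests flagged the
file set; it says nothing about the scientific value of the data.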
    As of April 28, 2015, R2R hosts data from 24 in-service vessels, 4,356 cruises,
and a total of 18,238,775 archived files. The R2R website has an average of over
60,000 page views per month.


2.2    BCO-DMO

The Biological and Chemical Oceanography Data Management Office (BCO-
DMO)3 was created to serve principal investigators funded by the NSF’s Biolog-
ical Oceanography, Chemical Oceanography and Antarctic Organisms & Ecosys-
tems Programs as a facility where marine biogeochemical and ecological data
and information developed in the course of scientific research can easily be dis-
seminated, protected, and stored on short and intermediate time-frames. The
Data Management Office also provides research scientists and others with the
tools and systems necessary to work with marine biogeochemical and ecological
data from heterogeneous sources with increased efficacy. To accomplish this, two
data management offices were united in 2006 and enhanced to provide a venue
for submission of electronic data and metadata and other information for open
distribution via the World Wide Web. The BCO-DMO data system can accom-
modate many different types of data including biological, chemical, and physical
measurements and results. The system provides access to the data (numbers,
images, and/or documents) in a consistent manner, with sufficient metadata, so
that others can make full use of these data for their own purposes. The existence
of sufficient metadata enables the discovery and accurate reuse of data by more
than just the initial investigators who collect and process the data. The BCO-
DMO data system is not simply a catalog of data resources, but a system that
takes full advantage of a MySQL database storing documentation (metadata)
3 http://bco-dmo.org/

                      Fig. 2. BCO-DMO online map interface


for each data set, and a data management backend that allows data to reside at
multiple sites (including the originating investigator’s location if they wish).
    The office manages existing and new data sets from individual scientific inves-
tigators and collaborative groups of investigators, and continues to make these
available online. The office works with principal investigators and other data con-
tributors on data quality control; maintains an inventory and program thesaurus
of strictly defined field names; generates metadata Directory Interchange Format
records required by federal agencies; ensures submission of data to national data
centers; supports and encourages data synthesis by providing new, online, web-
based display tools; and facilitates regional, national, and international data and
information exchange. The data being served provide the scientific investigators
with an opportunity to explore the complex and multifaceted data sets wherever
they reside world-wide and to collaborate with colleagues in addressing pressing
environmental questions, problems, and challenges. The BCO-DMO collection
of data sets supports synthesis and modeling activities, reuse of oceanographic
data for new research endeavors, and the use of “real data” by teachers and
students at school and college level in their classes; it also provides
decision-support field data for policy-relevant issues. Figure 2 shows a sample
screenshot.
    In terms of data quality, BCO-DMO employs an approach that is largely
people-intensive. Here, BCO-DMO provides data managers who work closely
with investigators to ensure sufficient metadata are collected and preserved to
assist discovery, use, and reuse tasks. Collected metadata include information
regarding design of experiments, instruments employed, as well as all the steps
in processing field measurements into the final form of the data. Beyond the
collection, data managers also coordinate closely with data contributors to decide
how to organize and present the data in the best way possible. By employing
this approach, BCO-DMO feels that higher quality data can be obtained and
reused effectively.
    As of April 28, 2015, BCO-DMO hosts 7,490 datasets including information
about 1,799 researchers, 2,127 deployments, and 512 projects, that span the
full range of oceanographic measurements from research cruises, timeseries sites,
laboratory and mesocosm experiments, and synthesis and modeling projects.
The BCO-DMO site typically has over 6,500 page views each month.

3     The Linked Datasets
3.1   R2R
The R2R linked dataset currently consists of over 530,000 triples, which are
accessible via a SPARQL endpoint.4 Machine-readable metadata is available at
http://data.rvdata.us/.well-known/void. A Snorql interface is also provided5
for exploring the SPARQL endpoint, and an entry point URL is provided for
Semantic Web browsers.6 A navigable HTML view is also available.7 The SPARQL
endpoint is fed from the internal R2R database and is therefore up-to-date.
Bulk download is possible at
http://www.rvdata.us/outgoing/lod/rvdata.us.20150430.ttl.gz.
R2R data are currently licensed under the Creative Commons CC BY-NC-SA 3.0 US
license.
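
    As an illustration of how the endpoint described above could be queried
programmatically, the following Python sketch builds a request according to the
SPARQL 1.1 Protocol (HTTP GET with a query parameter). It assumes that the
endpoint is still reachable at the URL from footnote 4 and that it returns
results in the SPARQL JSON format; neither is guaranteed. The query is
deliberately vocabulary-agnostic, since the ontology patterns are described
elsewhere, and the helper names build_request and run_query are our own.

```python
import json
import urllib.parse
import urllib.request

R2R_ENDPOINT = "http://data.rvdata.us/sparql"  # endpoint URL from footnote 4

# Lists the classes in use together with their instance counts.
COUNT_BY_CLASS = """
SELECT ?class (COUNT(?s) AS ?n)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?n)
LIMIT 20
"""


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_query(endpoint, query, timeout=30):
    """Send the query and return the list of result bindings.

    Requires network access; assumes the server honours the Accept header.
    """
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Calling run_query(R2R_ENDPOINT, COUNT_BY_CLASS) would then yield one binding
per class, provided the endpoint is online.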
    The RDF graph structure underlying the R2R linked dataset uses a set of
interlinked ontology design patterns which are described elsewhere [3, 4]. A
conceptual view of the schema can be found in Figure 3. Note that the
triplification is done only on the metadata, and not down to each observation
datum, which would require resources far beyond the current capacity of the R2R
program. The ontology design patterns themselves are a recent and still
evolving outcome of the National Science Foundation’s EarthCube program, more
precisely of the GeoLink project8 [10] and its precursor OceanLink [7]. They
have been developed with ease of information integration in mind.

3.2   BCO-DMO
The BCO-DMO linked dataset9 has machine-readable metadata accessible at
http://www.bco-dmo.org/.well-known/void. The whole dataset currently con-
4 http://data.rvdata.us/sparql
5 http://data.rvdata.us/snorql/
6 http://data.rvdata.us/all
7 http://data.rvdata.us/
8 http://www.geolink.org/
9 http://www.bco-dmo.org/linked-open-data

                      Fig. 3. R2R conceptual schema diagram


                        Fig. 4. BCO-DMO schema diagram


sists of over 2,170,000 triples. The triples are accessible via a SPARQL Endpoint
and a Virtuoso SPARQL Browser10 is provided for exploring the SPARQL endpoint.
The SPARQL endpoint is fed from the internal BCO-DMO database and is therefore
up-to-date. Bulk download is also possible via the URIs pointed to by the
void:dataDump property within the machine-readable metadata. BCO-DMO data are
currently licensed under the Creative Commons CC BY-SA 3.0 license.
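
    The void:dataDump lookup mentioned above can be sketched as a SPARQL query
plus a small parser for the standard SPARQL JSON results format. The sketch
assumes that the VoID description has been loaded into a store that can be
queried with SPARQL, which the text does not state. The sample response and its
example.org URLs are invented purely for illustration, and dump_urls is a
hypothetical helper name.

```python
import json

# void: is the standard VoID namespace; the query itself is our own sketch.
VOID_DUMP_QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?dump
WHERE { ?dataset a void:Dataset ; void:dataDump ?dump }
"""


def dump_urls(results_json):
    """Extract the ?dump URIs from a SPARQL JSON results document."""
    document = json.loads(results_json)
    return [
        binding["dump"]["value"]
        for binding in document["results"]["bindings"]
        if binding.get("dump", {}).get("type") == "uri"
    ]


# Invented sample response; real dump locations must be read from the VoID file.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["dataset", "dump"]},
  "results": {"bindings": [
    {"dataset": {"type": "uri", "value": "http://example.org/void#dataset"},
     "dump": {"type": "uri", "value": "http://example.org/dumps/all.ttl.gz"}},
    {"dataset": {"type": "uri", "value": "http://example.org/void#dataset"},
     "dump": {"type": "literal", "value": "not a URI"}}
  ]}
}
"""

if __name__ == "__main__":
    for url in dump_urls(SAMPLE_RESPONSE):
        print(url)
```

Bindings whose value is not a URI are skipped, since a void:dataDump value is
expected to be a resource that can be downloaded.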
    BCO-DMO uses a manually designed ontology for data organization, which
was reported on in [1]. The schema diagram can be seen in Figure 4. As in R2R,
triples in BCO-DMO are essentially only on the metadata level, and not down
to individual measurements. Meanwhile, for the purpose of better integration,
not just with R2R but possibly also with other data repositories in the
geosciences, BCO-DMO is additionally triplifying its metadata according to the
GeoLink design patterns, an effort that is currently ongoing [4].

10 http://lod.bco-dmo.org/sparql
3.3     The Overlap between R2R and BCO-DMO

The reader may suspect that some overlap exists between R2R and BCO-DMO, given
that only a few dozen oceanographic research vessels are deployed for field
observation. The map-based interfaces also look similar. Indeed, there is a
strong partnership between R2R and BCO-DMO, which makes linking their content
particularly attractive and potentially impactful. R2R houses data about the
vessels, the route navigated during an expedition, as well as narrative
descriptions of activities performed during the expedition. It also hosts data
obtained from on-board sensors and devices fixed to the vessels, such as those
from CTD11 instruments or multibeam sensors. On the other hand, data obtained
from devices personally brought by the researchers (and thus not fixed
permanently to the vessels) are not kept by R2R, but rather by other
repositories, particularly BCO-DMO. In this context, R2R and BCO-DMO are linked
to each other via (meta)data about persons, and they agree on oceanographic
cruise identifiers. This linking is of high quality, as the maintainers of both
repositories cooperate closely to identify the overlap. For cruise identifiers
in particular, only a few dozen research vessels are actively used for U.S.
oceanographic research, and R2R essentially acts as the gateway for data from
the whole fleet before the data are deposited and catalogued in other long-term
archives. As such, determining the mapping between the two datasets and checking
for redundancy become relatively manageable. Furthermore, both linked datasets
provide external links to DBpedia: they map affiliations (organizations),
scientific instruments (devices), and research programs to DBpedia using
skos:exactMatch links, which were discovered through string matching.
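
    The text does not specify the string matching procedure, so the following
Python sketch shows only one plausible variant: labels are normalized and then
compared for equality, and a skos:exactMatch triple is emitted when exactly one
DBpedia candidate remains. The normalization rule, the function names, and the
example.org URI in the usage note are assumptions made for illustration.

```python
import re

SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def normalize(label):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())


def exact_match_triples(local_labels, dbpedia_labels):
    """Return N-Triples lines linking local URIs to DBpedia URIs.

    Both arguments map a URI to its label. A link is produced only when the
    normalized label matches exactly one DBpedia entry, so that ambiguous
    labels are left for manual review.
    """
    index = {}
    for uri, label in dbpedia_labels.items():
        index.setdefault(normalize(label), []).append(uri)

    triples = []
    for uri, label in sorted(local_labels.items()):
        candidates = index.get(normalize(label), [])
        if len(candidates) == 1:
            triples.append(f"<{uri}> <{SKOS_EXACT_MATCH}> <{candidates[0]}> .")
    return triples
```

For example, a local organization record with the hypothetical URI
http://example.org/org/whoi and the label “Woods Hole Oceanographic
Institution” would be linked to a DBpedia resource carrying the same label,
regardless of differences in case, punctuation, or spacing.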
    It is important to note that although BCO-DMO has information about
cruises, it does not host the detailed navigation data and other kinds of
vessel-related data for which the vessel operators are responsible; these are
hosted by R2R. BCO-DMO is more focused on data from specific researchers
who run research projects. This means that BCO-DMO would have more detailed
data about observations and measurements made during a research expedition.
In addition, BCO-DMO does not limit its operation solely to oceanographic
data coming from expeditions aboard research vessels, but also covers data from
deployments via other platforms, such as moorings, satellites, land-based
platforms, or submarine-based platforms, although oceanographic data from
vessel-based expeditions constitute a significant share of the BCO-DMO
repository.


4      Conclusion

As Semantic Web technologies are on the rise in applications, the publication of
metadata as linked datasets by major geoscience data repositories is likely going
to be a driver of future developments. As data becomes available as linked data,
its reusability increases, and this includes the development of linked data based
11 conductivity, temperature, and depth of the ocean
data discovery and access. In this paper, we have presented the linked datasets
providing metadata for the two major oceanographic data repositories, R2R and
BCO-DMO.
    Besides the obvious potential these linked datasets have for leveraging
Semantic Web technologies for the geosciences, these datasets also lend
themselves to Semantic Web research, as they pose interesting and challenging
problems while at the same time being “real” datasets, as opposed to the often
artificial or academically produced benchmarks. For example, they provide an
excellent playground for investigations into ontology matching due to the
various degrees of overlap between sub-domains, widely different scales, and the
fact that the utilization of spatio-temporal aspects will likely be critical.
They also provide a realistic setting for co-reference resolution problems,
solutions to which would be of immediate benefit to the data repositories.
Particularly interesting is the fact that, while the datasets are of significant
size, they still center around a relatively clearly defined research community,
so certain variables can more easily be controlled. Different ways to refer to
places, e.g. via coordinates or gazetteer names, and different ways to refer to
chemicals, e.g. by name or formula, provide additional challenging dimensions
for co-reference resolution research.
    From a much wider perspective, of course, the development of Semantic Web
methods and tools for on-the-fly integration of major geoscience data repositories
would have immediate major impact on the work of geoscientists in practice.
Providing linked data for some repositories – or even for most repositories – can
only be a very small first step in this endeavour, which requires major advances
in methods. Some EarthCube projects, among them the GeoLink project which
the authors are part of, already pursue this vision.

Acknowledgement. The presented work has been partially funded by the National
Science Foundation under the award 1440202 “EarthCube Building Blocks: Col-
laborative Proposal: GeoLink-Leveraging Semantics and Linked Data for Data
Sharing and Discovery in the Geosciences.”


References

 1. Chandler, C.L., Groman, R.C., Shepherd, A., Allison, M.D., Kinkade, D., Rauch,
    S., Wiebe, P.H., Glover, D.M.: Using Controlled Vocabularies and Semantics to
    Improve Ocean Data Discovery (Invited). AGU Fall Meeting Abstracts p. B5 (2013)
 2. Heidorn, P.: Shedding light on the dark data in the long tail of science. Library
    Trends 57(2), 280–299 (2008)
 3. Krisnadhi, A.A., Arko, R., Carbotte, S., Chandler, C., Cheatham, M., Finin, T.,
    Hitzler, P., Janowicz, K., Narock, T., Raymond, L., Shepherd, A., Wiebe, P.: Ontol-
    ogy pattern modeling for cross-repository data integration in the ocean sciences:
    The oceanographic cruise example. In: Narock, T., Fox, P. (eds.) The Semantic
    Web in Earth and Space Science: Current Status and Future Directions. Studies
    on the Semantic Web, IOS Press (2015), to appear
 4. Krisnadhi, A.A., Hu, Y., Janowicz, K., Hitzler, P., Arko, R., Carbotte, S., Chandler,
    C., Cheatham, M., Fils, D., Finin, T., Ji, P., Jones, M., Karima, N., Mickle, A.,
    Narock, T., O’Brien, M., Raymond, L., Shepherd, A., Schildhauer, M., Wiebe, P.:
    The GeoLink modular Oceanography ontology. Submitted to ISWC 2015 (2015),
    available from http://daselab.cs.wright.edu/topics/publications.html
 5. Lehnert, K., Su, Y., Langmuir, C., Sarbas, B., Nohl, U.: A global geochemical
    database structure for rocks. Geochemistry, Geophysics, Geosystems 1(5) (2000)
 6. Malik, T., Foster, I.T.: Addressing data access needs of the long-tail distribution
    of geoscientists. In: 2012 IEEE International Geoscience and Remote Sensing Sym-
    posium, Munich, Germany, July 22-27, 2012. pp. 5348–5351. IEEE (2012)
 7. Narock, T., Arko, R.A., Carbotte, S., Krisnadhi, A., Hitzler, P., Cheatham, M.,
    Shepherd, A., Chandler, C., Raymond, L., Wiebe, P., Finin, T.W.: The OceanLink
    project. In: Lin, J., Pei, J., Hu, X., Chang, W., Nambiar, R., Aggarwal, C., Cercone,
    N., Honavar, V., Huan, J., Mobasher, B., Pyne, S. (eds.) 2014 IEEE International
    Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30,
    2014. pp. 14–21. IEEE (2014)
 8. Ryan, W., Carbotte, S., Coplan, J., O’Hara, S., Melkonian, A., Arko, R., Weissel,
    R., Ferrini, V., Goodwillie, A., Nitsche, F., Bonczkowski, J., Zemsky, R.: Global
    Multi-Resolution Topography synthesis. Geochemistry, Geophysics, Geosystems
    10(3) (2009)
 9. Waide, R., Thomas, M.: Long-Term Ecological Research Network. In: Meyers,
    R.A. (ed.) Encyclopedia of Sustainability Science and Technology, pp. 6216–6240.
    Springer, Heidelberg (2012)
10. You, J.: Geoscientists aim to magnify specialized Web searching. Science 347(6217),
    11 (2015)