COPO - Linked Open Infrastructure for Plant
                     Data

     F Shaw1 , A Etuk1 , A Gonzalez-Beltran2 , P Rocca-Serra2 , D Johnson2 ,
                      P Kersey3 , R Bastow4 , S Sansone2 ,
                          V Schneider1 , and R Davey1
                         1
                            The Genome Analysis Centre, UK
               2
                   Oxford e-Research Centre, University of Oxford, UK
                    3
                      European Bioinformatics Institute, Cambridge
                               4
                                 University of Warwick

       Abstract. Collaborative Open Plant Omics (COPO) is a brokering ser-
       vice between plant scientists and public repositories, enabling manage-
       ment, aggregation and publication of research outputs. COPO consoli-
       dates access to services and disparate information sources via web inter-
       faces and Application Programming Interfaces (APIs). Users will deposit
       and view open access data, as well as seamlessly pull such data into anal-
       ysis environments. Subsequent accessions and associated metadata will
       be tracked in COPO, thus creating a provenance trail from data to pub-
       lication.


1    Introduction
In plant science, high throughput “-omics” technologies have resulted in more
and larger datasets. Researchers are realizing the benefits of data sharing to
promote their work and to accelerate discovery in science based on aggregated
data. Many funding bodies and journals now require that data be made publicly
available. Despite the opportunities that data sharing offers for recognition and
reuse, many scientists still do not use public repositories, choosing instead to
store data in private infrastructure. This is may be due to unfamiliarity with
services and technology, lack of standards and common metadata, or a lack of
funding to support archiving. The large number and size of datasets make them
difficult to store, let alone download, making cloud-based analysis tools essen-
tial. However, submission formats to public repositories are heterogeneous, often
requiring manual authoring of complex markup documents, taking scientists out
of their fields of expertise.
    COPO aims to streamline the process of data deposition to public repositories
and data journals, by hiding much of the complexity of meta data capture and
data management from the end-user. The ISA (Investigation/Study/Assay) in-
frastructure (www.isa-tools.org) provides the interoperability between metadata
formats required for deposition to repositories. Logical groupings of artifacts (e.g.
experimental meta data and results, PDFs, raw data, contextual supplementary
information) relating to a body of work are stored in COPO “collections” and
represented by common open standards, which are publicly searchable. Bundles
of data objects can be deposited directly into public repositories (such as the
European Nucleotide Archive, Figshare and F1000) through COPO interfaces.
2       Authors Suppressed Due to Excessive Length

2   Metadata Management
The ISA model enables experimental metadata attribution and management of
metadata formats, where scientific metadata comprises information about in-
vestigators, objectives, hypotheses, publications, subjects, experimental design,
experimental workflow, and assays and related experimental data. ISA meta-
data is represented in ISA-JSON, and integrated within a broader subset of
metadata, COPO-JSON, that encompasses infrastructural information relative
to the platform itself. Both JSON implementations can be extended to JSON-LD
linked data schemas. All JSON metadata fragments are stored in a MongoDB
document-based database. Where required, ISA converters allow traversal be-
tween representations of the same metadata, e.g. ISATab to/from ISA-JSON,
and public repository formats are expressed as ISA configurations which are
mapped to a COPO-JSON user interface (UI) model to power the COPO UI
itself. In this way, we can quickly and easily adapt to new repositories or changes
to existing repository schemas all the way from data representation to UI design.

3   Platform in Development

The COPO framework is being built using Python, Django, MongoDB, JSON-
LD, ISATools, jQuery and Bootstrap technologies. A single sign-on (SSO) mech-
anism provided via ORCiD, allows COPO to track service integration and rich
user profile data. Anonymous users are able to search the COPO index for re-
search artifacts. Deposition functionality is available to authenticated users only.
The complexity of deposition services is hidden from end users, who simply fill
out clean, intuitive web forms and story-driven wizards that use the semantic
level metadata to make inferences about what a user is submitting, subsequently
making suggestions based on previous submissions.
    So far we have developed initial EMBL-EBI repository deposition support
(European Nucleotide Archive (ENA), MetaboLights) facilitated by Aspera-
powered data transfer and ISA API integration. Figshare deposition of secondary
research artifacts (PDFs, images, figures, supplementary data, etc) is also sup-
ported.

4   Future Work
The large network of linked metadata that COPO will gather allows semantic
meaning to be attached to research artifacts. Semantic inferences can then be
made over artifacts providing a richer search experience than with text based
search alone, enabling researchers to quickly find and use well-described publicly
available datasets linked by inter-connected network of metadata. The provision
of visualization for graphs of linked metadata will aid discovery of useful con-
nections between datasets, investigations and protocols. Support for more repos-
itories and open publishing platforms are planned, as well as integration with
cloud-based analysis services such as Galaxy and iPlant.