=Paper=
{{Paper
|id=Vol-1752/paper02
|storemode=property
|title=
Metadata for Nanoscience Experiments
|pdfUrl=https://ceur-ws.org/Vol-1752/paper02.pdf
|volume=Vol-1752
|authors=Vasily Bunakov,Tom Griffin,Brian Matthews,Stefano Cozzini
|dblpUrl=https://dblp.org/rec/conf/rcdl/BunakovGMC16
}}
==
Metadata for Nanoscience Experiments
==
© Vasily Bunakov © Tom Griffin © Brian Matthews

Science and Technology Facilities Council, Harwell, Oxfordshire, UK

vasily.bunakov@stfc.ac.uk tom.griffin@stfc.ac.uk brian.matthews@stfc.ac.uk

© Stefano Cozzini

Istituto Officina dei Materiali, Trieste, Italy

cozzini@iom.cnr.it

Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia, October 11 - 14, 2016

===Abstract===

Metadata is a key aspect of data management. This paper describes the work of the NFFA project on the design of a metadata standard for the nanoscience community. The methodology and the resulting high-level metadata model are presented. The paper explains and illustrates the principles of metadata design for data-intensive research. This is of value to data management practitioners in all branches of research and technology that imply a so-called “visitor science” model, where multiple researchers apply for a share of a certain resource on large facilities (instruments).

===1 Introduction===

The Nanostructures Foundries and Fine Analysis (NFFA-Europe) project www.nffa.eu brings together European nanoscience research laboratories that aim to provide researchers with seamless access to equipment and computation. This will support a single entry point for research proposals supported by the project, and a common platform to support the access and integration of the resulting experimental data. Both physical and computational experiments are in scope, with a vision that they complement each other and can be mixed in the same identifiable piece of research.

The project requires setting up the IT infrastructure for managing research proposals and substantial amounts of data resulting from physical and computational experiments. A common metadata model that supports different stages of the nanoscience research lifecycle is essential to a unified researcher experience across locations, and also for the design and operation of IT infrastructure components.

Metadata design is a part of a joint research activity within NFFA that takes empirical input from the project participants and also takes into account state-of-the-art standards and practices. Metadata design is an incremental effort of the project; this work presents the first stage, resulting in a high-level metadata model that is agnostic to the actual data management situation in participating organizations yet is able to capture significant features of nanoscience physical and computational experiments.

===2 Approach and methodology===

====2.1 General approach====

The major purpose of any metadata is satisfying the information needs of a certain community. “Community” should be understood in broad terms and includes machine agents, to ensure human-to-human, human-to-machine and machine-to-machine interoperability.

The information needs may be generic (common with other communities) or specific to a particular community. From the project perspective, the information needs should be expressed as clearly formulated Use Cases for the existing or proposed information and data management systems (IT platforms). A good metadata design should take into account user requirements and IT architecture, and in turn should feed considerations for the IT architecture.

The IT architecture, the use cases and practices, and the metadata design can be considered pillars of an enterprise architecture that includes both technological and organizational aspects of the loosely coupled virtual enterprise that the NFFA project is going to deliver for the European nanoscience community.

The main purpose of the metadata design effort in the NFFA project can then be formulated as giving adequate support for that widely defined enterprise architecture for nanoscience. This implies metadata design from “first principles”, i.e. by considering existing best practices of information management, use cases for nanoscience and information technology opportunities (and limitations), rather than adopting any existing metadata standard.

====2.2 Top-down input: relevant information management frameworks====

The case for metadata collection and use can be specific to nanoscience, yet there are general information needs that are typical for a wide variety of users and that have been developed in other branches of science and information management.

One of the mature information design frameworks is Functional Requirements for Bibliographic Records (FRBR) [2], which considers four basic information needs (user tasks) with regard to information: “Find”, “Identify”, “Select” and “Obtain”. The ultimate goal is of course getting the information resource, yet between searching for it and obtaining it, the resource should be identified as the one being sought, and selected as being useful for the user [1]. Each task may involve certain subtasks, e.g. selection may require checks on the resource context and on its relevance to the actual user’s needs.

Another mature information design framework of relevance is the Reference Model for an Open Archival Information System (OAIS) [3], a popular functional model for long-term digital preservation. If expressed in terms of information practitioner needs (user tasks) similarly to FRBR, the OAIS basically deals with three categories of them: “Ingest (into the archive)”, “Manage (within the archive)” and “Disseminate (from archive)”. Each of these tasks may be complex and involve a number of interrelated subtasks, e.g. managing information in the archive may imply provenance and integrity checks, managing access to information, and administration / reporting.

Overall, the OAIS framework should be able to provide good coverage of what NFFA needs to consider for sensible data collection, archiving and provision towards the end users (researchers in nanoscience), and the FRBR framework should be able to cover the end user needs for information retrieval. The respective areas of coverage and user categories relevant to NFFA are illustrated by the following table:

{| class="wikitable"
|+ Table 1 Information management frameworks and their coverage of NFFA scope.
! Framework (a source of best practices) !! OAIS !! FRBR
|-
| General use case || Data collection, management and dissemination || Data retrieval
|-
| User categories || Data archives administrators, IT specialists || End users (nanoscience researchers)
|-
| Information needs (user tasks) || Ingest data, Manage data, Disseminate data || Find data, Identify data, Select data, Obtain data
|}

Being general in nature, OAIS and FRBR are still able to provide good recommendations for NFFA practices of information and data management. In particular, OAIS emphasizes the need for a clear agreement between the data producer and the archive, and for a clearly defined format for data exchange between them (the so-called Submission Information Package), whilst FRBR emphasizes the importance of having a clear identity for data assets.

====2.3 Bottom-up input: questionnaire responses and common vocabulary====

A questionnaire was used to collect the NFFA partners’ responses about their data management practices and most popular data management solutions. The questionnaire inquired about the following aspects of data management in nano-facilities:

* Intensity of experiments and of resulting data flow
* Popular data formats
* Data catalogue software
* Data catalogue openness
* Data management policy
* Metadata standards for data catalogue
* Persistent identifiers for data
* User management platform
* Popular third-party databases and information systems

In total, seventeen responses from the 20 project partners were received and reviewed. They showed very different levels of data management maturity. From the responses, the following priorities for the metadata design were identified:

* One experiment to many samples and one sample to many data files relationships should be supported.
* A common set of metadata fields for data discoverability should be agreed upon, possibly based on an existing popular standard or recommendation for data discovery.
* User roles with different permissions for access to metadata should be developed. This means the metadata model will need to represent users as well as data.
* It is reasonable to develop a common data management policy for NFFA, or a set of policies with different flavours of access to data.
* Having links to external reference databases is valuable to ensure the high quality of metadata, yet this will mean additional effort, so it should be de-scoped from the initial design of metadata.

In addition to the questionnaire, where responses were collected from research offices or relevant research programme representatives, a common vocabulary of terms and definitions relevant to nanoscience data management was compiled and then refined by the IT teams of participating NFFA organizations ([5]). The vocabulary contains about twenty commonly agreed terms with definitions; it serves as a basis for the design of information entities (groups of metadata elements) and contributes to the earlier mentioned NFFA “virtual enterprise” architecture.

A particularly important use case to be supported by the metadata model is the situation when the same researcher (or a research group) applies for experimental time on more than one facility, as the nature of the experiment may require this, yet the researcher wants a seamless experience across nanoscience facilities, with a single entry point for data management.

Another conclusion based on responses to the questionnaire is that computational experiments in nanoscience are becoming common and can be mixed with physical experiments, so there should not be an artificial division between the two.

====2.4 Side input: IT architecture considerations====

As an additional consideration for principal metadata design, we used the draft NFFA Data System Architecture that defines the outline design of the NFFA portal, which considered the generic use case of the same user performing a measurement on two different facilities. Generic use cases when one user wants to access data produced by another user, or wants to release data into the public domain, are currently not being considered. These may be considered in future, so should be taken into account within an extensible metadata design.

The draft architecture suggests that data should be harvested from individual facilities in a suitable “packaged” format, with METS [6] as a potential candidate as it supports the provision of descriptive, administrative, structural and file metadata. For the descriptive part of metadata, the purpose of having the data assets discoverable is emphasized in the draft architecture. For the administrative metadata, the importance of intellectual property information and information about the data source (provenance) is emphasized. For the structural metadata, having the information about the organization, perhaps structured in a hierarchical way, is suggested. For the file metadata, having the list of files that constitute a digital object (data asset) and having pointers to external metadata files are deemed most important.

After considering the draft architecture, the conclusion was that we could take METS as “the role model” metadata standard for data packaging that corresponds to a specific entity in the NFFA generic metadata model, the Data Asset. As to particular elements of metadata suggested by the IT architecture draft, the fields for capturing intellectual property information and provenance are easily the most important ones, as they affect the reusability of data assets, which should be one of the important outcomes of the NFFA project.

===3 Implementation===

====3.1 Metadata groups and elements====

The top-down, bottom-up and side requirements resulted in the basic structure of the proposed metadata model that is illustrated by Figure 1.

Figure 1 Metadata groups of elements and their purpose.

The suggested metadata elements are presented as a matrix in Table 2 to make explicit the coverage of the identified information entities (common vocabulary terms) and of the earlier identified information needs (for the categories of them, see Section 2.2).

Certain elements are in common with the Core Scientific Metadata Model ([4]) already in use in some of the facilities. Mandatory and optional metadata fields (attributes) for each element were defined and shared amongst project participants for further discussion ([5]).

Table 2 columns: Ingest data; Manage data (within NFFA portal); Disseminate data; Find data; Identify data; Obtain data.

{| class="wikitable"
|+ Table 2 Metadata elements and information needs coverage.
! Information entity !! Needs covered (Y marks, column alignment not preserved)
|-
| Research User || Y Y Y Y
|-
| Instrument Scientist || Y Y
|-
| Project || Y Y Y Y
|-
| Proposal || Y Y
|-
| Facility || Y Y Y Y Y Y
|-
| Instrument || Y Y Y
|-
| Experiment || Y Y Y
|-
| Sample || Y Y Y
|-
| Data Asset || Y Y Y Y Y Y
|-
| Raw Data || Y Y Y Y Y Y
|-
| Analysed Data || Y Y Y Y Y Y
|-
| Data Analysis || Y Y Y
|-
| Data Analysis Software || Y Y Y
|-
| Data Archive || Y Y Y
|-
| Data Manager || Y Y Y
|-
| Data Policy || Y Y
|-
| NFFA Portal || Y Y
|}

====3.2 Entity-relationship diagram====

As a basis for further, more detailed metadata design and as a contribution to the IT architecture design, the Entity-Relationship diagram presented by Figure 2 has been agreed.

Figure 2 Entity-Relationship diagram for NFFA high-level metadata model.

====3.3 Metadata operational recommendations====

The suggested metadata elements are not all that is needed for a successful metadata framework in NFFA. In addition, there should be established metadata management practices, ideally assisted by clear recommendations for NFFA partner organizations on how to assign and curate metadata.

For example, there are choices of how to aggregate data: all data files for all samples measured in a particular Experiment can be assembled in one package, and then the package is given common descriptions like Facility name, research User name, Data Policy etc. However, this may not suit the actual data management practices or policies of certain Facilities, e.g. they may want to make a Sample rather than an Experiment the focal point of their metadata descriptions.

These operational aspects of NFFA metadata implementation will require further engagement and discussions with data practitioners in NFFA.

===4 Conclusion===

The NFFA metadata development so far has produced an agreed common approach with its mapping to the existing metadata frameworks and best practices. It has defined a common vocabulary, the provisional list of mandatory and optional attributes, and the ER diagram that can be used both in metadata design and in IT architecture design. The high-level metadata model will be further refined through project work in NFFA and through discussions in the wider nanoscience community. Also, the state-of-the-art metadata development for nanoscience that may cover specific entities in our generic metadata model, e.g. CODATA UDS [7] for Sample, should be looked into in more detail, to see the opportunities for mutual mapping and cross-walks between different metadata models.

===Acknowledgements===

This work is supported by Horizon 2020 NFFA-Europe project www.nffa.eu under grant agreement no. 654360.

===References===

[1] Philip Hider. Information resource description: Creating and managing metadata. Facet Publishing, 2012.

[2] Functional Requirements for Bibliographic Records (FRBR). Final Report. http://archive.ifla.org/archive/VII/s13/frbr/ Retrieved 20 May 2016.

[3] Reference Model for an Open Archival Information System (OAIS), Recommended Practice, CCSDS 650.0-M-2 (Magenta Book). Issue 2, June 2012. http://public.ccsds.org/publications/archive/650x0m2.pdf Retrieved 20 May 2016.

[4] The Core Scientific Metadata Model (CSMD). https://icatproject.org/user-documentation/csmd/ Retrieved 20 May 2016.

[5] Draft metadata standard for nanoscience data. NFFA project deliverable D11.2. February 2016.

[6] METS: Metadata Encoding and Transmission Standard. http://www.loc.gov/standards/mets/

[7] CODATA UDS: Uniform Description System for Materials on the Nanoscale. http://www.codata.org/uploads/Uniform_Description_System_Nanomaterials-Published-v01-15-02-01.pdf
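The one-to-many relationships prioritised in Section 2.3 (one Experiment to many Samples, one Sample to many data files) and the aggregation choice discussed in Section 3.3 can be illustrated with a minimal Python sketch. The class names follow entity names used in the paper; every attribute name, both function names and all sample values are hypothetical and are not taken from the NFFA deliverable [5].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    path: str  # hypothetical attribute: location of one data file


@dataclass
class Sample:
    sample_id: str  # hypothetical identifier
    data_files: List[DataFile] = field(default_factory=list)  # one Sample, many files


@dataclass
class Experiment:
    experiment_id: str  # hypothetical identifier
    facility: str
    research_user: str
    data_policy: str
    samples: List[Sample] = field(default_factory=list)  # one Experiment, many Samples


def common_description(exp: Experiment) -> Dict[str, str]:
    """Descriptions shared by every package built from one Experiment."""
    return {
        "facility": exp.facility,
        "research_user": exp.research_user,
        "data_policy": exp.data_policy,
    }


def package_by_experiment(exp: Experiment) -> dict:
    """Experiment as focal point: one package holding all files of all samples."""
    files = [f.path for s in exp.samples for f in s.data_files]
    return {**common_description(exp), "files": files}


def package_by_sample(exp: Experiment) -> Dict[str, dict]:
    """Sample as focal point: one package per Sample, each with the common descriptions."""
    return {
        s.sample_id: {**common_description(exp), "files": [f.path for f in s.data_files]}
        for s in exp.samples
    }


# Illustrative values only.
exp = Experiment(
    experiment_id="E1",
    facility="Facility A",
    research_user="User X",
    data_policy="Policy P",
    samples=[
        Sample("S1", [DataFile("s1_a.dat"), DataFile("s1_b.dat")]),
        Sample("S2", [DataFile("s2_a.dat")]),
    ],
)

print(package_by_experiment(exp)["files"])
print(sorted(package_by_sample(exp)))
```

Both functions read the same underlying records, which suggests that the focal point of a package can remain a per-facility operational decision without changing the high-level model.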