Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) Integrating Research Data Management into Geographical Information Systems Christian T. Jacobs, Alexandros Avdis, Simon L. Mouradian, and Matthew D. Piggott Department of Earth Science and Engineering, South Kensington Campus, Imperial College London, London SW7 2AZ, United Kingdom {c.jacobs10,a.avdis,simon.mouradian06,m.d.piggott}@imperial.ac.uk http://www.imperial.ac.uk/engineering/departments/earth-science Abstract. Ocean modelling requires the production of high-fidelity com- putational meshes upon which to solve the equations of motion. The production of such meshes by hand is often infeasible, considering the complexity of the bathymetry and coastlines. The use of Geographical Information Systems (GIS) is therefore a key component to discretising the region of interest and producing a mesh appropriate to resolve the dynamics. However, all data associated with the production of a mesh must be provided in order to contribute to the overall recomputability of the subsequent simulation. This work presents the integration of re- search data management in QMesh, a tool for generating meshes using GIS. The tool uses the PyRDM library to provide a quick and easy way for scientists to publish meshes, and all data required to regenerate them, to persistent online repositories. These repositories are assigned unique identifiers to enable proper citation of the meshes in journal articles. Keywords: Geographical Information Systems, Research Data Man- agement, Digital Curation, Reproducibility, Digital Object Identifier, Online Repositories 1 Introduction Computer simulations of ocean dynamics are becoming ever more important to predict the effects of global-scale hazards such as tsunamis [13], the influence of marine renewable energy turbines on sediment transport [20], and the dispersal range of nuclear contaminants [6], to name just a few applications. The underly- ing numerical model behind such simulations often requires a mesh upon which the equations describing the flow dynamics are solved, thereby transitioning from a continuous description of the region of interest (also known as the domain) to a discrete one. An example focussing on the area around the Orkney and Shetland Isles is shown in Figure 1. A mesh for ocean simulations must be of high enough quality to resolve the intricate coastlines and bathymetry [12]. However, creating such a mesh manually is infeasible for large-scale, high-resolution simulations. Geographical Information Systems (GIS) offer an effective way of processing bathymetry and coastline data to create a geometry with which to work [19]. 7 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) 2 Data Management in Geographical Information Systems Fig. 1. An example of an unstructured computational mesh which discretises the ma- rine area around the North-East coast of Scotland. The resolution is highest around the Scottish coastline and around the Orkney and Shetland Isles. A method of producing a computational mesh from this geometry is then re- quired to perform a simulation on it. QMesh [2] is a software package currently being developed at Imperial College London for this purpose. QMesh reads in a geometry defined in the QGIS Geographical Information System software [21], and then converts the geometry into a readable format for the Gmsh mesh gen- eration software [11], which in turn generates the mesh to provide a discrete representation of the domain. Ocean simulations may then be performed with a computational fluid dynamics package. Publications that are dependant on numerical simulation often provide de- tails of the simulation setups to improve reproducibility and indeed recom- putability. However, while a description of the domain may also be given, the mesh that discretises this domain is rarely provided as a supplementary material. This lack of data availability has also been highlighted in many other areas of science [27], [1], [26]. Furthermore, citations to the software used to produce the mesh typically only refer to a generic user manual and contain no information about which version was used. For the purpose of recomputability and repro- ducibility, it is crucial that researchers provide all the data files, as well as the precise version of the software’s source code used to produce the output in the first place [8], [5]. In the case of this work, the input data is the geographical information defining the domain, the output data is the computational mesh, and the software is QMesh (and its dependencies). 8 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) Data Management in Geographical Information Systems 3 Despite the need for a more open research environment where software and datasets are shared freely, the level of motivation amongst researchers to do this is generally quite low. This is in part due to the extra effort and time required to gather and publish the data [18], whilst typically gaining little from the process. To encourage the sharing of data and improve its reproducibility and recomputability, it is therefore important to make the publication process more straight forward and swift. This can be effected by the development of research data management tools that readily capture the datasets involved and information about the software being used [25], [18]. This paper describes the integration of a research data management tool, which uses the PyRDM library [14], into the QMesh software. The tool au- tomates the publication of the QMesh source code, as well as the input and output data for a specified QGIS project, to online, citable and persistent repos- itories such as those provided by Figshare (figshare.com), Zenodo (zenodo.org) and DSpace-based (dspace.org) hosting services. The tool has both a command line and a graphical user interface, and allows users to publish the software and data at the ‘push of a button’, thereby facilitating sharing and a more open research environment. In contrast to other software tools that also facilitate the publication of code and datasets, such as Fidgit [24], rfigshare [4], and dvn [17], the QMesh publishing tool incorporates application-specific knowledge to provide a greater amount of automation. For example, the tool is able to parse QGIS project files to automatically determine the relevant input data to publish, rather than the user having to specify the data files manually. Furthermore, this work represents a novel application of research data management and curation software within a GIS environment. Section 2 describes in greater detail the extensions made to the QMesh soft- ware to automate the publication process for the software itself, the input files (for a given QGIS project) and any output files (i.e. the computational mesh). Section 3 presents a realistic example of a scientific workflow involving produc- tion of a mesh of a UK coastal region. The data files are read in to QGIS and a mesh is produced. Both the QGIS data and mesh are subsequently published to an online repository provided by Figshare, and a DOI is assigned which can be used to properly cite the data in journal articles. Finally, some concluding remarks are made in Section 4. 2 Integration with QMesh QMesh features a command line interface (CLI), as well as a graphical user inter- face (GUI) via a QGIS plugin through which users can select relevant geometry objects and produce a mesh. The integration of research data management tech- niques into QMesh was achieved by adding a PyRDM-based publishing tool to both of these interfaces. The tool provides the option of publishing the QMesh software source code and data required to reproduce the mesh to separate online repositories. Users are presented with a simple interface and only have to provide a minimal amount 9 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) 4 Data Management in Geographical Information Systems of information; this is illustrated in Figure 2. The publication process itself is handled by the PyRDM library [14] which communicates with an online reposi- tory hosting service via its Application Programming Interface (API). The pub- lication process results in a Digital Object Identifier (DOI) [7] being assigned to the repository, with which users can properly cite their research outputs. Fig. 2. The QMesh publisher tool, which is part of the QMesh QGIS plugin. Users choose the online repository service that they wish to use; by default this is set to Figshare. In addition to the input data files associated with the QGIS project, users may also publish the output data file (i.e. the resulting computational mesh) produced by QMesh, if they so desire. By default, the publication is made public unless the user decides otherwise; in the case of private publication, a DOI is still assigned to the repository, but will not be made active/‘live’ until the repository is made public. The publication of data is handled separately to the publication of the QMesh software. In the former case, when a suitable mesh has been produced and is ready to be published, users simply have to provide the QMesh publishing tool with the location of the QGIS project file on the computer’s file system when using the CLI. When using the GUI, this location is provided automatically when the project is opened in QGIS. The tool then searches for the tags in the XML-based project file to determine the location of all the files that the project comprises; these may include shape files that define various layers in the 10 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) Data Management in Geographical Information Systems 5 geometry, data files in NetCDF format [23] which define the bathymetry of the ocean, and a multitude of other data formats. Optionally, the location of the Gmsh mesh file may also be provided, thereby publishing the resultant output data along with the files required to produce it. The locations of all these data files, including the QGIS project file itself, are then provided to PyRDM which automatically creates a repository on the hosting service and uploads the files via the service’s API. The service then returns a publication ID and a DOI, which is presented to the user for citation purposes. This process is illustrated in Figure 3. The publication of software involves a similar process, but can currently only be accomplished via the CLI. The user only has to provide the QMesh publishing tool with the location of the software’s source code on the computer’s file system. The PyRDM library then handles the rest; it determines the exact version of QMesh currently in use using the Git version control system (git-scm.com) [22], and then checks to see whether that version has been published already1 . If it has, PyRDM retrieves the existing DOI for re-use. If it has not, then PyRDM publishes the source code in a similar fashion to the case of publishing data, as shown in Figure 3. Note that publications in journals would need to reference both the software repository’s DOI and the data repository’s DOI. There is currently no explicit link that is made between the software and data repositories, unless specified manually. As demonstrated by Figure 3, the QMesh publishing tool requires minimal user interaction and is largely automated by the PyRDM library. This is impor- tant for encouraging the sharing of software and data files, in order to achieve a more open research environment. 3 Workflow Example To demonstrate an example of a scientific workflow involving mesh generation using GIS, the Orkney and Shetland Isles considered in [2] and [3] are used. The researcher first has to describe the geography of the domain in QGIS and then decide on the area they wish to create a mesh for. The QGIS project for the Orkney and Shetland Isles comprises a number of geometrical layers which define the coastlines (and potentially coastal engineering structures such as marine power turbines), in addition to a NetCDF file which defines the bathymetry of the ocean floor, and another NetCDF file which defines the desired resolution throughout the mesh. These files are shown in Figure 4 beside the area that will be meshed. The mesh that QMesh produces for this domain (shown in Figure 1) is then used by the researcher in their marine simulations. Once the researcher is ready to publish their results, they upload the data files associated with the production 1 Repository searching is only available when using the Figshare repository service, due to API limitations explained later in Section 4. PyRDM will publish the software regardless of whether it has been published before when Zenodo or a DSpace-based service is chosen. 11 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) 6 Data Management in Geographical Information Systems Fig. 3. The processes behind publishing the QGIS data files (left) and QMesh software source code (right) to Figshare. Fig. 4. Screenshot of the UK region visualised in QGIS. The solid dark purple line defines the area that will be meshed (in this case it contains the Orkney and Shetland Isles). The different files that make up the layers of the geometry are specified in the column on the left-hand side. 12 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) Data Management in Geographical Information Systems 7 of the simulation’s mesh to an online repository using the QMesh publishing tool shown in Figure 2 (the CLI may also be used instead of the graphical interface). In this example, it uploads all the files previously mentioned to Figshare. Once uploaded, the files can be downloaded from the Figshare website (see Figure 5) and a DOI is presented to the researcher to share with colleagues and for use in journal publications (see Figure 6). Fig. 5. A screenshot of the resulting repository on the Figshare website, with the files readily available to download. The QMesh publishing tool automatically assigns a title and tags to the repository based on the QGIS project’s name. The version of the QMesh software’s source code that is used should also be published, in a separate repository to the data. However, it should be noted that publishing the QMesh source code may not be enough to reproduce the exact same mesh without also knowing the versions of its dependencies. For example, different versions of Gmsh may produce slightly different meshes as a result of algorithmic improvements within the software. It is therefore important that such information be recorded in some way to further improve the degree of reproducibility. For example, ideally Gmsh would also have a similar system for publishing the current version of its source code in use. 13 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) 8 Data Management in Geographical Information Systems Fig. 6. A Figshare publication ID and a DOI are assigned to each repository, and presented to the researcher once the publication process is complete. 4 Discussion and Conclusions Throughout the production of the PyRDM-based publishing tool for QMesh, several issues were encountered which largely stemmed from a lack of standard- isation and support in the repository hosting services’ APIs. For example, in order for PyRDM to attribute authors to the software repository on Figshare, all authors of QMesh must provide their Figshare author IDs in the AUTHORS file that is part of the QMesh source code. Unfortunately, another different set of author IDs would need to be provided when using a different repository ser- vice such as Zenodo, which is inconvenient and requires all authors of QMesh to have accounts across all the supported services. A more standardised way of identifying and attributing authors to research software and data would be to use ORCID (orcid.org) researcher IDs. Figshare has recently added support for authenticating with ORCID IDs via its web interface [9], and it is hoped that ORCID authentication via the Figshare API will also be added for the benefit of PyRDM. Another example, this time involving lack of API support, is the current inability to search for an existing repository with the Zenodo API. Fur- ther developments are necessary in this area to enrich the publication process and improve automation. The production of meshes can involve proprietary and/or private data which cannot be published openly, but at the same time sharing all research output is becoming a common requirement imposed by research funders. The QMesh publishing tool comes with the option of publishing the data to private reposi- tories. However, with some services the private storage space is rather limited, and typically not large enough to store high quality mesh files for realistic ocean simulations. For example, the free private storage space offered by Figshare is 1 GB at the time of writing this paper, with a 250 MB individual file size limit2 . Furthermore, only a maximum of 5 collaborators can be given access to a private repository. In contrast, the integration of Figshare for Institutions [10] offers a more suitable platform for larger-scale research data management. This project enables researchers at an institution to publish to private repositories hosted in the cloud. This is considerably more sustainable for GIS projects and mesh 2 http://figshare.com/pricing 14 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) Data Management in Geographical Information Systems 9 generation that can involve very large file sizes, both public and private data, and collaboration amongst many researchers and research groups. In conclusion, the integration of a publishing tool in a Geographical Informa- tion System has helped to mitigate one of the reasons why researchers tend not to publish their software and data; that is, it is time-consuming to do so with little reward. The new QMesh publishing tool makes publishing a computational mesh and associated data files easy and largely effortless through the addition of a significant amount of automation. Furthermore, the use of online repository services enable more formal citation of all research outputs through the use of DOIs. However, it is the responsibility of the scientific community to encourage and provide incentives for the openness and public availability of this software and data, in order to overcome the barrier of lack of motivation to publish. Acknowledgments CTJ was funded by an internal grant entitled “Research data management: Where software meets data” from the Research Office at Imperial College London. Part of the work presented in this paper is based on work first presented in poster form at the International Digital Curation Conference (IDCC) in February 2015 [16], and in a PyRDM project report [15]. The authors would like to thank the two anonymous reviewers of this paper for their feedback. References 1. Alsheikh-Ali, A.A., Qureshi, W., Al-Mallah, M.H., Ioannidis, J.P.A.: Public Avail- ability of Published Research Data in High-Impact Journals. PLoS ONE 6(9), e24357 (2011) 2. Avdis, A., Hill, J., Jacobs, C.T., Kramer, S.C., Candy, A.S., Gorman, G.J., Pig- gott, M.D.: Efficient unstructured mesh generation for renewable tidal energy using Geographical Information Systems (In Preparation) 3. Avdis, A., Jacobs, C.T., Hill, J., Piggott, M.D., Gorman, G.J.: Shoreline and Bathymetry Approximation in Mesh Generation for Tidal Renewable Simulations. In: Proceedings of the 11th European Wave and Tidal Energy Conference (Ac- cepted) 4. Boettiger, C., Chamberlain, S., Ram, K., Hart, E.: rfigshare: an R interface to figshare.com. (2014), http://CRAN.R-project.org/package=rfigshare, r package version 0.3-1 5. Buckheit, J.B., Donoho, D.L.: WaveLab and Reproducible Research. In: Anto- niadis, A., Oppenheim, G. (eds.) Wavelets and Statistics, Lecture Notes in Statis- tics, vol. 103, pp. 55–81. Springer, New York (1995) 6. Choi, Y., Kida, S., Takahashi, K.: The impact of oceanic circulation and phase transfer on the dispersion of radionuclides released from the Fukushima Dai-ichi Nuclear Power Plant. Biogeosciences 10, 4911–4925 (2013) 7. Davidson, L.A., Douglas, K.: Digital Object Identifiers: Promise and Problems for Scholarly Publishing. Journal of Electronic Publishing 4(2) (1998) 8. de Leeuw, J.: Reproducible Research: the Bottom Line. Department of Statis- tics Papers, University of California (2001), http://escholarship.org/uc/item/ 9050x4r4 15 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) 10 Data Management in Geographical Information Systems 9. Figshare: figshare ORCID integration. Figshare blog, http://figshare.com/blog (2013) 10. Figshare: Loughborough University, figshare, Arkivum and Symplectic an- nounce pioneering research data management solution. Figshare blog, http://figshare.com/blog (2014) 11. Geuzaine, C., Remacle, J.F.: Gmsh: A 3-D finite element mesh generator with built- in pre- and post-processing facilities. International Journal for Numerical Methods in Engineering 79(11), 1309–1331 (2009) 12. Gorman, G.J., Piggott, M.D., Wells, M.R., Pain, C.C., Allison, P.A.: A systematic approach to unstructured mesh generation for ocean modelling using GMT and Terreno. Computers & Geosciences 34(12), 1721–1731 (2008) 13. Hill, J., Collins, G.S., Avdis, A., Kramer, S.C., Piggott, M.D.: How does multiscale modelling and inclusion of realistic palaeobathymetry affect numerical simulation of the Storegga Slide tsunami. Ocean Modelling 83, 11–25 (2014) 14. Jacobs, C.T., Avdis, A., Gorman, G.J., Piggott, M.D.: PyRDM: A Python-based library for automating the management and online publication of scientific software and data. Journal of Open Research Software 2(1), e28 (2014) 15. Jacobs, C.T., Avdis, A., Gorman, G.J., Piggott, M.D.: RDM Green Shoots Project Report: Research data management: Where software meets data (2014), http: //dx.doi.org/10.6084/m9.figshare.1269127 16. Jacobs, C.T., Avdis, A., Gorman, G.J., Piggott, M.D.: PyRDM: A library to fa- cilitate the automated publication of software and data in computational science. Poster presentation at the 10th International Digital Curation Conference (2015), http://dx.doi.org/10.6084/m9.figshare.1318710 17. Leeper, T.J.: Archiving Reproducible Research with R and Dataverse. The R Jour- nal 6(1) (2014), http://journal.r-project.org/archive/2014-1/leeper.pdf 18. LeVeque, R.J., Mitchell, I.M., Stodden, V.: Reproducible Research for Scientific Computing: Tools and Strategies for Changing the Culture. Computing in Science & Engineering 14(4), 13–17 (2012) 19. Li, R.: Data Models for Marine and Coastal Geographic Information Systems, chap. 3. CRC Press (2000) 20. Martin-Short, R., Hill, J., Kramer, S.C., Avdis, A., Allison, P.A., Piggott, M.D.: Tidal resource extraction in the Pentland Firth, UK: potential impacts on flow regime and sediment transport in the Inner Sound of Stroma. Renewable Energy 76, 596–607 (2015) 21. QGIS Development Team: QGIS Geographic Information System. Open Source Geospatial Foundation (2009), http://qgis.osgeo.org 22. Ram, K.: Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine 8(7) (2013) 23. Rew, R.K., Davis, G.P.: NetCDF: an interface for scientific data access. IEEE Computer Graphics and Applications 10(4), 76–82 (1990) 24. Smith, A.: Fidgit - DOIs for code. figshare (2013), http://dx.doi.org/10.6084/ m9.figshare.828487 25. Stodden, V., Bailey, D., Borwein, J., LeVeque, R.J., Rider, W., Stein, W.: Set- ting the Default to Reproducible: Reproducibility in Computational and Exper- imental Mathematics. Tech. rep., Institute for Computational and Experimen- tal Research in Mathematics (ICERM) (2013), http://www.davidhbailey.com/ dhbpapers/icerm-report.pdf 26. Vines, T.H., Andrew, R.L., Bock, D.G., Franklin, M.T., Gilbert, K.J., Kane, N.C., Moore, J.S., Moyers, B.T., Renaut, S., Rennison, D.J., Veen, T., Yeaman, S.: Man- 16 Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015) Data Management in Geographical Information Systems 11 dated data archiving greatly improves access to research data. The FASEB Journal 27(4), 1304–1308 (2013) 27. Whitlock, M.C., McPeek, M.A., Rausher, M.D., Rieseberg, L., Moore, A.J.: Data Archiving. The American Naturalist 175(2), 145–146 (2010) 17