Repeatability and Re-usability in Scientific Processes: Process Context, Data Identification and Verification

Andreas Rauber, Tomasz Miksa, Rudolf Mayer, Stefan Proell
SBA Research and Vienna University of Technology, Vienna, Austria
{arauber, tmiksa, rmayer, sproell}@sba-research.org

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015), Obninsk, Russia, October 13-16, 2015

Abstract

eScience offers huge potential for speeding up scientific discovery, being able to flexibly re-use, combine and build on top of existing tools and results. Yet, to reap the benefits we must be able to actually perform these activities, i.e. having the data, processing components etc. available for redeployment and being able to trust them. Thus, repeatability of eScience experiments is a requirement for validating work and establishing trust in results. This proves challenging, as procedures currently in place are not set up to meet these goals.

Several approaches have tackled this issue from various angles. This paper reviews these building blocks and ties them together. It starts from the capture and description of entire research processes and ways to document them. Regarding data, we review the recommendations of the Research Data Alliance on how to precisely identify arbitrary subsets of potentially high-volume and highly dynamic data used in a process. Last, we present mechanisms for verifying the correctness of process re-executions.

1 Introduction

New means of performing research and sharing results offer huge potential for speeding up scientific discovery, enabling scientists to flexibly re-use, combine and build on top of results without geographical or time limitations and across discipline boundaries. Yet, to reap the benefits promised by eScience [13], we must be able to actually perform these activities, i.e. having the data and processing components available for re-deployment. Funding agencies such as the EC (ec.europa.eu/digital-agenda/en/open-data-0) are committed to data re-use and open data initiatives. As a result, all research data from publicly funded projects needs to be made available to the public. Not only does this entail that the data must be equipped with useful and stable metadata, comprehensive descriptions and documentation, but also that the data must be preserved for the long term. Yet, from an eScience perspective, mere availability of data is not sufficient, as data as such is barely useful. First of all, eScience benefits not only from the availability of data, but also from the re-use and re-purposing of tools and entire experimental workflows. Secondly, and more importantly, data never exists solely on its own, but is usually the result of more or less complex (pre-)processing chains. This commences with the processing happening at the sensor level or during data capture, continues via analysis processes resulting in processed data, and leads up to experimental results serving as input for further meta-studies. Thus, in both cases, we need to ensure that we have the underlying tools and processes available. This is necessary to understand their impact on the result and any potential bias they introduce, and to apply identical processing to new data to ensure comparability of results. To re-use such processing tools we need to trust them and any underlying components to produce identical (comparable) results under identical (similar) conditions.
From a scientific point of view, the validation of such research results (or, in fact, of the result of every individual processing step) is a core requirement for establishing trust in the scientific community, its tools and data, specifically in data-intensive domains. This proves challenging, as procedures currently in place are not set up to meet these goals. Experiments are often complex chains of processing, involving a number of data sources, computing infrastructure, software tools, or external and third-party services, all of which are subject to dynamic change. In scientific research, external influences can have a large impact on the outcome of an experiment. Human factors, the tools and equipment used, the configuration of soft- and hardware, and the execution environment and its properties are important factors which need to be considered. The impact of such dependencies has proven to be graver than expected. While many approaches rely on documenting the individual processing steps performed during an experiment and on storing the data as well as the code used to perform an analysis, the impact of the underlying software and hardware stack is often ignored. Yet, beyond the challenges posed by the actual experiment/analysis, it is the complexity of the computing infrastructure (both the processing workflows and their dependencies on hardware and software environments, as well as the enormous amounts of data being processed) that renders research results in many domains hard to verify. As a recent study in the medical domain has prominently shown [11], even assumedly minute differences such as the specific version of the operating system used can have a massive impact: different results were obtained in cortical thickness and volume measurements of neuroanatomical structures when the software setup of FreeSurfer, a popular software package for processing MRI scans, was varied. More dramatically, though, there was also a difference in the results if not the primary software but only the operating system version (in this case Mac OS X 10.5 vs. 10.6) differed. This indicates the presence of dependencies from FreeSurfer on functions provided by the operating system, causing instabilities and misleading results. As these dependencies are hidden from the physician, such side-effects of the ICT infrastructure need to be detected and resolved transparently if we want to be able to trust results based on computational analyses.

A number of approaches have tackled this issue from various angles, ranging from initiatives for data sharing, code versioning and publishing as open source, via the use of workflow engines to formalize the steps taken in an experiment, to ways of describing the complex environment an experiment is executed in. In addition, the data that is created, but also the processing algorithms, scripts and other software tools used in the experiment, need to remain accessible for longer time periods to facilitate data reuse and to allow peers to retrieve and verify experiments. Keeping these assets accessible is not only a technical challenge, but requires institutional commitment and defined procedures.

Repeatability and reproducibility are two fundamental concepts in science. An experiment is repeatable if it produces the exact same results under the very same preconditions. An experiment is reproducible if the same results can be obtained even under somewhat different conditions, e.g. when performed by a different team in a different location. There are several factors which have an influence on the variance of experiments. The ISO standard 5725-1:1994 [14] lists the following factors: (1) operator, (2) equipment, (3) calibration of the equipment, (4) environment and (5) time elapsed between measurements. The standard defines an experiment as repeatable if the influences (1)-(4) are constant and (5) is a reasonable time span between two executions of the experiment and its verification. Reproducibility allows variance in these factors, as they cannot be avoided if different research teams want to compare results.

To tackle these issues we proposed to introduce Process Management Plans (PMPs) [23]. They extend Data Management Plans by taking a process-centric view, viewing data simply as the result of underlying processes such as capture, (pre-)processing, transformation, integration and analysis. The general objective of PMPs is to foster identification, description, sharing and preservation of scientific processes. To embody the concept of PMPs we need to solve the challenges related to the description of computational processes, verification and validation, monitoring of external dependencies, as well as data citation. This paper reviews these building blocks and ties them together to demonstrate the feasibility of sharing and preserving not only datasets, but also scientific processes.

Section 2 summarizes related work from the areas of Data Management Plans (describing the result of data capturing/production processes), digital preservation of processes, and several eScience research infrastructures. Section 3 presents the Context Model that is captured automatically, describing the process implementation including all software and hardware dependencies. Ways to precisely identify and cite arbitrary subsets of dynamic data are described in Section 4, presenting the recommendations of the RDA Working Group on Data Citation. Section 5 discusses the verification and validation of the re-execution of computational processes. These concepts are illustrated via a use case from the machine learning domain in Section 6, followed by conclusions in Section 7.
2 Related Work

2.1 Data Management Plans

A prominent reason for the non-reproducibility of scientific experiments is poor data management, as criticized in several disciplines. Different data sets scattered across different machines, with no track of the dependencies between them, are a common landscape for particle physicists, who move quickly from one research activity to another [5]. Several institutions have reacted by publishing templates and recommendations for DMPs, such as the Digital Curation Centre (DCC) [9], the Australian National Data Service (ANDS) [3] and the National Science Foundation (NSF) [24], amongst many others. These are very similar, containing a set of advice, mainly lists of questions which researchers should consider when developing a DMP. Attention is directed to what happens with the data after it has been created, rather than to the way in which it was obtained. The description is provided in free-text form, and in the case of the NSF it is limited to two pages. Thus, it is unlikely that anybody will be able to reuse, or at least reproduce, the process which created the data. Furthermore, the correctness of data is taken for granted, and thus DMPs do not provide sufficient information that would allow validating the data. Finally, the quality and detail of the information strongly depends on the good will of the researchers. There is no formal template for the specification of DMPs which would ensure that all the important information is covered comprehensively. Several tools are available, like DMPonline (dmponline.dcc.ac.uk) for the DCC or DMPtool (dmp.cdlib.org) for the NSF, which aid the researcher in the process of DMP creation, but they are rather simple interactive questionnaires which generate a textual document at the end, not the complex tools required to validate at least the appropriateness of the provided information. The main conclusion from this analysis is that DMPs focus on describing the results of experiments. This is a consequence of their data-centric view, which enforces a focus on access and correct interpretation (metadata) of data and does not pay much attention to the processing of data. While these constitute a valuable step in the right direction, we need to move beyond this, taking a process-centric view.
2.2 Digital Preservation

The area of digital preservation is shifting its focus from collections of simple objects to the long-term preservation of entire processes and workflows.

WF4Ever (wf4ever-project.org) addressed the challenges of preserving scientific experiments by using abstract workflows that are reusable in different execution environments [26]. The abstract workflow specifies conceptual and technology-independent representations of the scientific process. The project further developed new approaches to sharing workflows by using an RDF repository and making the workflows and data sets accessible from a SPARQL endpoint [10]. The TIMBUS project (timbusproject.net) addressed the preservation of business processes by ensuring continued access to the services and software necessary to properly render, validate and transform information. The approach centers on a context model [20] of the process, which is an ontology for describing the process components and their dependencies. It allows storing rich information, ranging from software and hardware to organizational and legal aspects. The model can be used to develop preservation strategies and to redeploy the process in a new environment in the future. The project developed a verification and validation method for redeployed processes [12] that evaluates the conformance and performance quality of processes redeployed in new environments. This is important when we want to reuse a process to build other processes.

2.3 eScience and Research Infrastructures

Several projects nowadays benefit from sharing and reusing data [6]. In [7] the evolution of research practices through the sharing of tools, techniques and resources is discussed. myExperiment [31] is a platform for sharing scientific workflows. This is already one step beyond just sharing data. Workflows created and run within the Taverna workflow engine can be published and reused by other researchers. However, the workflows do not always specify all the information (e.g. tools to run the steps, descriptions of parameters) required to re-run them [19].

An environment which enables scientists to collaboratively conduct their research and publish it in the form of an executable paper was presented in [25]. The solution requires working in a specific environment, limiting its applicability to the tools and software supported by that environment. PMPs do not have such a requirement and can be used in every case. There is a strong move towards "providing a consistent platform, software and infrastructure, for all users in the European Research Area to gain access to suitable and integrated computing resources" [2].
3 Documenting eScience Processes

To enable analysis, repeatability and reuse of processes, they must be well described and documented. As most processes are rather complex in nature, a precise description is needed to re-enact the execution of the process. Thus, formalized models are useful for a detailed representation of critical aspects such as the hardware, software, data and execution steps supporting the process, as well as their relationships and dependencies. Several models can be considered for this type of documentation.

Workflow-Centric Research Objects (ROs) [15] are a means to aggregate or bundle resources used in a scientific investigation, such as a workflow, provenance from results of its execution, and other digital resources such as publications and data sets. In addition, annotations are used to further describe these digital objects. The model of Research Objects takes the form of an OWL ontology and incorporates several existing ontologies. At its core, the Research Object model extends the Object Exchange and Reuse model (ORE) [33] (openarchives.org/ore/1.0) to formalize the aggregation of digital resources. Annotations are realized using the Annotation Ontology (AO) [4], which allows e.g. for comment and tag-style textual annotations. Specifying the structure of an abstract workflow is enabled by the wfdesc ontology. Finally, the provenance of a specific execution of a workflow is described using the wfprov ontology. Research Objects have also been presented as a means to preserve scientific processes [8], proposing archiving and autonomous curation solutions that would monitor the decay of workflows.

Enterprise architecture (EA) modelling languages provide a holistic framework to describe several aspects of a process. For example, the Archimate [30] language supports description, analysis and visualization of the process architecture on three distinct but interrelated layers: the business, application and technology layers. On each of these layers, active structures, behavior and passive structures can be modelled. Thus the process can be specified not only as a high-level sequence of steps, but also as a low-level sequence of inputs and outputs of the software and hardware components needed to run the process, e.g. database software, libraries, software device drivers, fonts, codecs, or dedicated hardware created for the purpose of the experiment. Enterprise architectures do not address any specific domain-dependent concerns. They rather cut across the whole organization running the process [16].

While models such as Archimate or Research Objects are extensive, they often do not provide enough detail on the technology aspects of the process, and thus offer only little guidance to researchers aiming to produce a solid description of their technical infrastructure. One approach to alleviate this issue is realized in the Process Context Model [17], which builds on top of Archimate and extends it with domain-specific languages to address specific requirements of a given domain. Wherever possible, the extension ontologies are based on already existing languages. The development of the model was driven by the requirement to preserve and re-execute complete processes. The context a process is embedded in ranges from immediate and local aspects, such as the software and hardware supporting the process, to aspects such as the organization the process is executed in, the people involved, service providers, and even laws and regulations. The exact context can differ significantly depending on the domain the process stems from.

The model uses the domain-independent Archimate language as a core model to integrate the domain-specific extension languages. It is implemented in the Web Ontology Language (OWL) [34], and the integration is performed via ontology mappings from the extensions to the core model. An overview of this architecture and the provided domain-specific extensions is given in Figure 1; the extensions are the following.

Fig. 1. Overview of the Context Model architecture: core and extensions.

Software Dependencies cover dependencies between different types of software, including information on which versions are compatible or conflicting with each other. It is, for example, important to know that a specific version of a Java Virtual Machine is required to run a certain piece of software, or that a particular application is required to view a digital object. This is important when considering the preservation of specific parts of the software stack utilized in the process. Beyond repeatability, this information may be used during preservation planning to identify alternative software applications that can be utilized. Technical dependencies on software and operating systems in the Context Model can be captured and described via the Common Upgradeability Description Format (CUDF) [32].
Data Formats: In a process execution, a number of digital objects are created, modified or read. This section includes information on the data/file formats in which these objects are stored. It is used for preservation actions and for selecting appropriate comparator modules during the validation process described in Sec. 5. Our implementation of the Context Model uses the PREMIS Data Dictionary [27] to represent this information.

Hardware contains a comprehensive description of the computational hardware, from desktop systems and server infrastructure components to specialized hardware used for certain tasks. Even though in many processes the hardware employed to host the software applications might be standard commodity hardware, its exact specifications can still influence the run-time behavior of a process. This might be critical in certain circumstances, such as execution speed, or when specific functionalities and characteristics of the hardware, such as precision limits or analog/digital conversion thresholds, are part of the computation. Further, certain processes might use particular hardware capabilities for computation, such as graphical processing units (GPUs) for large-scale experiments in scientific processes. These types of hardware, and the software that can utilize them, are not yet as standardized and abstracted, thus an exact description is needed in many cases.

Legal aspects cover legal restrictions imposed on the processes. License information focuses specifically on software licenses. Relevant aspects are e.g. the types of licenses under which software was made available, and the clauses they contain. Patent information describes the owner of a specific patent, or when it was granted.
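Returning to the Software Dependencies extension above: CUDF describes packages as plain-text stanzas of name, version and dependency fields. As a rough illustration only, the Python sketch below renders captured dependencies in such CUDF-style stanzas; the package data is invented for this example, and a real extractor would derive it from the monitored environment rather than from hard-coded values.

```python
# Illustrative only: serialize captured software dependencies as CUDF-style
# stanzas. CUDF proper uses positive integer version numbers; the package
# data below is invented, not taken from a real capture.

def to_cudf_stanza(pkg: dict) -> str:
    """Render one captured package as a CUDF-like text stanza."""
    lines = [f"package: {pkg['name']}", f"version: {pkg['version']}"]
    if pkg.get("depends"):
        lines.append("depends: " + ", ".join(pkg["depends"]))
    if pkg.get("conflicts"):
        lines.append("conflicts: " + ", ".join(pkg["conflicts"]))
    return "\n".join(lines) + "\n"

captured = [
    {"name": "genre-classifier", "version": 1,
     "depends": ["openjdk-7-jre", "weka >= 3"]},
    {"name": "weka", "version": 3, "depends": ["openjdk-7-jre"]},
]
print("\n".join(to_cudf_stanza(p) for p in captured))
```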
Large parts of the Context Model of a process can be extracted automatically [17], especially the aspects of software dependencies and data formats. Other aspects may still require significant manual work to obtain a proper representation. For example, the communication with a web service has to be described by providing its exact address and interface type. Databases usually run as independent server processes; they are typically detected, but not fully captured, when running a tool to monitor a specific research process execution.

We created a set of tools processing eScience workflows modeled for the Taverna workflow engine to extract the above-mentioned information and represent it within the Context Model. A Taverna2Archi extractor (ifs.tuwien.ac.at/dp/process/projects/tavernaExtractor.html) reads Taverna workflow files (t2flow format) and generates Archimate models. These are further transformed into an OWL ontology representation using the Archi2OWL converter (ifs.tuwien.ac.at/dp/process/projects/archi2OWL.html). This way, all information that can be extracted from the static workflow definition is captured. This is complemented by monitoring the execution of one or more process execution instances using the extractor of the Process Migration Framework (PMF, ifs.tuwien.ac.at/dp/process/projects/pmf.html), which is based on the strace tool (sourceforge.net/projects/strace). This way, all dependencies are explored, and all files and ports touched by the process are detected and added to the Context Model as dependencies.

These process traces will usually detect an enormous number of libraries and other files used by a process. To refine the model and make it more compact, the PMF can resolve the Debian packages to which an identified file belongs and thereby create a smaller, concise list of dependencies. It also removes files from the model that are not used for data exchange, for example log and cache files.
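The package resolution step can be pictured with a short sketch: given the file paths touched by a monitored run, obvious noise such as log and cache files is discarded, and dpkg is asked which Debian package owns each remaining file. This is an illustration of the idea only, not the PMF implementation; the helper names and the noise heuristics are ours.

```python
import subprocess

NOISE_SUFFIXES = (".log", ".tmp", ".cache")  # illustrative noise filter

def owning_package(path):
    """Ask dpkg which Debian package owns a file (None if it is unpackaged)."""
    try:
        out = subprocess.run(["dpkg", "-S", path], capture_output=True,
                             text=True, check=True).stdout
        return out.split(":", 1)[0].strip()   # "libc6:amd64: /lib/..." -> "libc6"
    except subprocess.CalledProcessError:
        return None

def compact_dependencies(touched_files):
    """Collapse a raw strace-style file list into a concise set of packages."""
    packages, unresolved = set(), []
    for path in touched_files:
        if path.endswith(NOISE_SUFFIXES) or "/cache/" in path:
            continue                           # drop files not used for data exchange
        pkg = owning_package(path)
        if pkg:
            packages.add(pkg)
        else:
            unresolved.append(path)
    return packages, unresolved
```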
4 Data Citation

Processes frequently process large volumes of data. To be able to repeat any such process we need to ensure that precisely the same sequence of data is fed into it as input. Storing a dump of such huge volumes of data, e.g. as part of the validation data in the Context Model, is not feasible in big data settings. We need to ensure that we can refer to the original data source or data repository for providing the data upon re-execution. While this may be rather trivial for static data sources analysed in their entirety, precise identification turns into a challenge when researchers use only a specific subset of the entire data collection, and when this data collection is dynamic, i.e. subject to changes.

Most research datasets are not static, but highly dynamic in nature. New data is read from sensors or added from continuous experiments. Additional dynamics arise from the need to correct errors in the data, to remove erroneous data values, or to re-calibrate and thus re-compute values at later points in time. Thus, researchers require a mechanism to retrieve a specific state of the data again, to compare the results with previous iterations of an experiment. Freezing the databases at specific points in time, batch-releases of versions, etc. all provide rather inconvenient work-arounds, wasting storage space by keeping multiple copies of unchanged data in different releases, and delaying the release of new data by aggregating continuous streams of data into batch releases. Additionally, most processes will not analyse the entire database, but a very specific subset of it. We thus need to ensure that precisely the same subset can be fed into the process again. Current approaches either waste space by storing explicit dumps of the subset used as input, or require human intervention by providing (sometimes rather ambiguous) natural language descriptions of the subset of data used.

To address this issue, the Working Group on Dynamic Data Citation (WGDC, rd-alliance.org/groups/data-citation-wg.html) of the Research Data Alliance (RDA) has devised a set of recommendations addressing this challenge. In a nutshell, it relies on time-stamped and versioned storage of the data. Subsets are identified by assigning persistent identifiers (PIDs) to time-stamped queries resolving to the subset. Hash keys of the queries and of the result sets are stored as metadata to allow verification of the resulting data sets upon re-execution [29, 28]. By shifting the focus from citing static data sets towards citing queries, which allow retrieving reproducible data sets from versioned data sources on demand, the problem of referencing accurate data sets can be addressed more flexibly. It also provides additional provenance information on the data set, as the query contains a semantic description of the subset in the form of its filter parameters. It furthermore allows retrieving the semantically identical data set, including all corrections applied to it afterwards, by re-executing the time-stamped query with a later time-stamp. As the process can be automated, it allows integrating data citation capabilities into existing workflows.
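A minimal sketch of this recommendation, assuming a DB-API style connection (e.g. sqlite3) to a versioned store: the query text is normalized, hashes of the query and of its result set are recorded, and a persistent identifier is minted for the time-stamped query. The function, field names and PID scheme are our own illustration, not the RDA reference implementation.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def citable_subset(connection, sql, params=()):
    """Execute a subset query against versioned data and record it for citation."""
    executed_at = datetime.now(timezone.utc).isoformat()
    rows = connection.execute(sql, params).fetchall()
    normalized = " ".join(sql.split())                        # normalized query text
    record = {
        "pid": f"pid:demo-{uuid.uuid4()}",                    # placeholder PID scheme
        "query": normalized,
        "executed_at": executed_at,
        "query_hash": hashlib.sha256(normalized.encode()).hexdigest(),
        "result_hash": hashlib.sha256(
            json.dumps(rows, sort_keys=True, default=str).encode()).hexdigest(),
    }
    return rows, record   # the record would be stored in the query store / Context Model
```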
The persistent identifier serves as a handle which, in addition to representing the input data of a specific process, can be shared with peers and used in publications. As the system is aware of updates and evolving data, researchers have transparent access to specific versions of data in their workflows. There is no need to store multiple versions of a dataset externally for the long term, as the system can reproduce them on demand. As hashing methods are in place, the integrity of the datasets can be verified. Thus the exact data set used during a specified workflow execution can be referenced as part of an experiment description/specification within the parts of the Context Model describing specific process instances. These can later on be used for validation.

We implemented several prototypes to demonstrate the feasibility of this data identification and citation approach, including solutions for relational databases such as MySQL, as well as for comma-separated value (CSV) files (datacitation.eu). An example demonstrating the query re-writing required to create re-executable queries against time-stamped data is provided in Fig. 2: audio fulfilling certain requirements (classical music with a minimum length of 120 seconds) is selected as it was available at a given timestamp, removing those tracks that had been deleted by that timestamp.

Fig. 2. SQL query selecting data for the music classification experiment, supporting data citation.
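The kind of re-writing shown in Fig. 2 can be sketched as follows, assuming the versioned store maintains valid_from/valid_to columns for every record; the table and column names are assumptions made for this illustration.

```python
# Sketch of re-writing a subset query so that it can be re-executed against
# time-stamped, versioned data. Table and column names are illustrative.

SUBSET_QUERY = """
    SELECT track_id, filename, genre
    FROM music
    WHERE genre = 'classical' AND length_sec >= 120
"""

def as_of(query: str, timestamp: str) -> str:
    """Append temporal predicates so the query sees the data as of `timestamp`.

    Assumes the query already contains a WHERE clause; a full implementation
    would parse the statement instead of appending text.
    """
    temporal = (f"valid_from <= '{timestamp}' "
                f"AND (valid_to IS NULL OR valid_to > '{timestamp}')")
    return f"{query.strip()} AND {temporal}"

print(as_of(SUBSET_QUERY, "2015-10-13T12:00:00Z"))
```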
5 Verification and Validation

Upon re-executing a process (be it a simple reproduction, or a repeatability setting after applying preservation actions), we need to verify its correct behavior in a potentially changed environment. To verify and validate a replicated process that was extracted from a source system and run in a target system, we follow the guidelines of [1], which describe the verification and validation of such a transition activity. We devised guidelines forming the VFramework [22], which are specifically tailored to processes and describe what conditions must be met and what actions need to be taken to compare the executions of two processes in different environments. This process of verification and validation (V&V) does not check the scientific correctness of the processes. It rather helps in obtaining evidence on whether the replicated process has the same characteristics and performs in the same way as the original process.

According to these guidelines, verification checks whether the process set-up and configuration in the new environment is identical to the original one, i.e. whether the same software, operating system, library versions etc. are used in the according configurations. Any changes made to run the process in the new environment (re-compilations, newer versions of individual components) will be detected and reported as potential causes for differences in any re-execution.
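Conceptually, this verification step reduces to a comparison of two captured environment descriptions. The sketch below shows such a diff over package-to-version mappings; the inputs are hypothetical and would in practice come from the Context Models of the two systems.

```python
def verify_environment(original, redeployed):
    """Compare package->version maps of the original and the redeployed system."""
    report = {"missing": [], "added": [], "changed": []}
    for pkg, version in original.items():
        if pkg not in redeployed:
            report["missing"].append(pkg)
        elif redeployed[pkg] != version:
            report["changed"].append((pkg, version, redeployed[pkg]))
    report["added"] = [p for p in redeployed if p not in original]
    return report

# Hypothetical captures of the two environments.
original   = {"openjdk-7-jre": "7u79", "weka": "3.6.10", "libc6": "2.19"}
redeployed = {"openjdk-7-jre": "7u95", "weka": "3.6.10", "libc6": "2.19", "curl": "7.35"}
print(verify_environment(original, redeployed))
```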
Following the static verification, the validation step analyses the actual computations by comparing all interim and final results produced at each input/output point (files, ports) for the original and the re-executed process. This validation data (as well as the according metrics) is defined when preparing the VPlan for the process.

The VFramework consists of two sequences of actions. The first is performed in the original environment, i.e. the system that the process is initially deployed in. The results obtained from the execution of each step are written into the VPlan. The VPlan is another modular extension of the Context Model described in Sec. 3. It contains the information needed to validate whether a process is re-executed correctly. In a nutshell, it comprises measurement points (usually all input/output happening between the individual process steps), associated metrics (usually testing whether the data at an input/output point is identical upon re-execution), and the according reference process instance data (i.e. the expected values for specific process test runs to compare against), captured at process runs in the original environment.
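Read as a data structure, a VPlan entry couples a measurement point with a metric and the reference value captured in the original environment. The sketch below is our own rendering of that idea in Python, not the VPlan ontology itself.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MeasurementPoint:
    step: str    # workflow step the observation belongs to
    port: str    # input/output port (or file) being observed

@dataclass
class VPlanEntry:
    point: MeasurementPoint
    metric: Callable[[Any, Any], bool]   # e.g. exact equality or a tolerance check
    reference: Any                       # value captured in the original environment

def validate(entries, observed):
    """Check every measurement point of a re-execution against its reference."""
    return {(e.point.step, e.point.port):
                e.metric(e.reference, observed[(e.point.step, e.point.port)])
            for e in entries}
```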
The second sequence is performed in the redeployment environment, at any time in the future when the original platform may not be available anymore. The migration of an entire process, i.e. the set-up of a minimal environment required to run the process in an identical configuration, is supported by the second part of the Process Migration Framework. The information needed for such a migration is read from the VPlan. It may, however, be necessary to re-engineer the process to fit it into a new system (in which case the verification step will report all elements in the resulting dependency tree that differ from the original setting).

Subsequently, the validation data is captured again from the re-executed process and compared to the information stored in the VPlan module of the Context Model using the specified metrics (usually requiring the values to be identical, or within certain tolerance intervals, depending on the significant properties of the process step/output to be compared).

We developed the Provenance Extractor (ifs.tuwien.ac.at/dp/process/projects/ProvenanceExtractor.html), which extracts relevant process instance information from the provenance files produced by workflows executed in the Taverna workflow engine. It converts these into an OWL representation linked to the Context Model via the VPlan module.

We investigated several workflows to define requirements, metrics and measurement points for each of them. The analysis revealed that the majority of functional requirements deal with the correctness of a single workflow step execution, and that the best way to validate it is to check each of its output ports. In the case of the non-functional metrics, the prevailing requirement was the computation time, which should be similar to the original or at least not exceed a 'reasonable time'.

Based on this analysis we validate the workflow by validating all of its steps, comparing the data on the outputs of the workflow steps and also checking their execution duration. The comparison is made taking into account the format of the data, using appropriate tools. For example, if two JPEG images depicting the same phenomenon are compared by computing a hash value, they may be detected as being different due to different creation timestamps in the metadata. While this could be fixed by identifying the date of a computation (i.e. the system clock) as one system input being used (which would then need to be set to the same constant value), it may also be modelled explicitly by performing a dedicated comparison that checks the identity of two JPEG files relying only on the image content. A correct way to perform this comparison in general would be to compare the features of the images using software for image analysis. We developed a set of comparator tools for prominent file formats (e.g. MP3) supporting such evaluations. As the Context Model contains information about the file format of the data produced/read by a specific step, a suitable comparator is selected.
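The JPEG example can be made concrete with a small comparator sketch: raw bytes (and therefore hashes) may differ only because of embedded metadata, so the images are instead compared on their decoded pixel content. The sketch assumes the Pillow library and merely illustrates the comparator idea; it is not one of the tools mentioned above.

```python
import hashlib
from pathlib import Path
from PIL import Image  # Pillow, used only to decode the pixel content

def bytes_identical(a, b):
    """Naive check: fails if only embedded metadata (e.g. timestamps) differs."""
    return (hashlib.sha256(Path(a).read_bytes()).hexdigest()
            == hashlib.sha256(Path(b).read_bytes()).hexdigest())

def jpeg_content_identical(a, b):
    """Format-aware check: decode both images and compare the pixels only."""
    with Image.open(a) as img_a, Image.open(b) as img_b:
        return (img_a.size == img_b.size
                and img_a.convert("RGB").tobytes() == img_b.convert("RGB").tobytes())
```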
There are a number of challenges that need to be taken into account during V&V. Some processes exchange data with external sources using a variety of network connections. These resources must also be available during the validation process so that the process can interact with them. A solution that allows monitoring external services for changes, as well as replacing them for the purpose of verification and validation, is described in [21]. Another challenge influencing the verification is the lack of determinism of components. This can apply both to external resources that provide random values and to internal software components that, for example, depend on the system clock or the current CPU speed. In such cases the exact conditions must be re-created in both environments. Potentially, such components need to be substituted with deterministic equivalents [12].

The Context Model contains information about the dependencies required to run the software. If any of them was not identified by the automated tools or modelled manually, then the process will not execute. In the course of verification and validation the Context Model gets improved until the process operates correctly. This is achieved either by repeating the capturing of the process using different process instances, or by manually adding identified process dependencies. By verifying and validating the process automatically recreated in the target system, we also indirectly verify and validate the Context Model. We determine its correctness and completeness, as the process is re-created via the information stored in the Context Model, re-creating all elements stored there in the target system. If the representation in the Context Model were incomplete, the process could not be repeated and run correctly in the target system.

This methodology can be applied to all situations in which a process is re-run, reproduced, or re-used. To support verification and validation for reproduction and reuse, it is important to also publish the verification data, as other researchers may not have access to the source system. They can then perform V&V using the validation data provided by the experiment owner.

6 Use Cases

We will use an example from the domain of music information retrieval (MIR) to illustrate the concepts presented in the preceding sections. A common task in MIR is the automatic classification of audio into a set of pre-defined categories, e.g. genres such as jazz, pop, rock or classical, at different levels of granularity. A process reflecting this task is depicted in Fig. 3. It requires the acquisition of both the actual audio files and the ground truth information (i.e. pre-assigned genre labels for the training and test data in the music collection) from some source. Next, numeric descriptors (e.g. MFCCs, Rhythm Patterns, SSDs) are extracted from the individual audio files via a range of signal processing routines and the application of psycho-acoustic models, to obtain feature vector representations of the audio. These are subsequently fed into a machine learning algorithm to train a classifier such as Support Vector Machines (SVM) or Random Forests, which is then evaluated using performance measures such as recall and precision.

Fig. 3. Music Genre Classification Process [18].
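For illustration, the training and evaluation stage of such a process can be sketched in a few lines of Python with scikit-learn. The actual setup described below used WEKA and a feature extraction web service, so this sketch, with randomly generated placeholder features, only mirrors the structure of that stage.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Placeholder feature vectors standing in for extracted audio descriptors
# (e.g. Rhythm Patterns); real features would come from the extraction service.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))        # 200 tracks, 60-dimensional feature vectors
y = rng.integers(0, 3, size=200)      # ground-truth genre labels (3 classes)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
classifier = SVC(kernel="linear").fit(X_tr, y_tr)
predicted = classifier.predict(X_te)

print("precision:", precision_score(y_te, predicted, average="macro"))
print("recall:   ", recall_score(y_te, predicted, average="macro"))
```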
In one of our experiment settings this process was implemented using a web service for the feature extraction, WEKA as a third-party machine learning package, and a set of dedicated scripts and Java applications for tasks such as data acquisition and transformation. These were orchestrated manually via the command line, or partially automated via shell scripts, deployed on a Linux system. To increase repeatability and to ease automatic analysis, we migrated this process into a proper workflow representation using the Taverna workflow engine, as depicted in Fig. 4. It explicitly lists the data sources (URLs) from which the audio files and ground truth labels are read, as well as the authentication codes for the web service that the audio files are sent to for feature extraction. The vector files are merged and fed into the classifier, which returns the actual classification results and the overall accuracy.

Fig. 4. Music Genre Classification Process modelled in Taverna.

Applying a process monitoring tool, we are able to automatically capture all resources (files, ports) accessed or created by one instance of the process, as depicted in Fig. 5. This includes, amongst others, a whole range of libraries (depicted in the upper left corner), the set of MP3 audio files (depicted in the lower left corner), a range of processes being called (e.g. wget to download the audio files and the ground truth information, depicted in the upper right corner), the user id of the person calling the process, and others.

Fig. 5. Dependencies extracted from the Music Genre Classification Process.

The raw information extracted bottom-up is subsequently enhanced, both automatically and manually, by structuring it according to the concepts provided by Archimate and by adding additional information, such as file format information obtained by performing file format analysis using tools such as DROID and by contacting file format registries such as PRONOM. The resulting structure is depicted in Fig. 6. Fig. 6a captures, at the bottom, the basic process and the objects (the music files, the features extracted and passed on to the classifier, the ground truth annotations, and the final results). Stacked above it are the services being called, i.e. the audio feature extractor. In Fig. 6b the basic software (Java Virtual Machine, WEKA, the data fetchers) is provided, with additional dependencies (e.g. the Unix Bash shell, Base64 encoders, Ubuntu Linux in a specific version), with the data objects in different representations (e.g. the audio files as MP3 as well as base64-encoded MP3 files), and with license information for the various tools (different versions of the GPL, the Apache License, the Oracle Binary Code License, the MP3 patent). On top of these, the detailed application components and services, both internal and external, are represented. This way, a comprehensive and well-structured documentation of the process can be obtained in a semi-automatic manner. This information forms the Process Context Model and can be used for verification and validation.

Fig. 6. Annotated Context Model of the Music Genre Classification Process.

When applying the VFramework, this information is used in two ways: first, to verify the environment in which the workflow is re-executed, to confirm that it is configured correctly; second, to validate that the results conform to the original workflow execution. The report summarizing the verification result is provided in Fig. 7a. It provides an aggregated summary of the libraries, specifically the jar files of WEKA for machine learning and of the SOMToolbox for the vector format migration, as well as the remote service call for the feature extraction web service.

We use the Process Migration Framework (PMF) tools to generate the VPlan module of the Context Model of a workflow execution and compare it with a Context Model obtained in the same way for its re-execution in a different environment. We use the data captured for each of the workflow steps and compare it using appropriate comparators. For the MIR case study we compare 16 metrics related to the outputs of the workflow steps, thus evaluating 13 functional requirements. We also use 12 metrics related to workflow execution time to evaluate 2 non-functional requirements. All of them are fulfilled, therefore the workflow re-execution is established as being repeatable. An excerpt of the validation report is depicted in Fig. 7b, confirming that the output at three of the measurement points is identical.

Fig. 7. (a) Verification and (b) validation report (excerpt) for the MIR process.

7 Conclusions and Future Work

This paper describes a way to move beyond data-centric research evaluation and re-use by addressing the capture and description of entire research processes using Process Management Plans (PMPs), which foster the identification, description, sharing and preservation of scientific processes. To demonstrate how the core elements of a PMP can be implemented, we described how the capturing of computational processes and their context can be performed. We also reviewed the recommendations of the Research Data Alliance on how to precisely identify arbitrary subsets of potentially high-volume and highly dynamic data. Last, we presented mechanisms for the verification and validation of process re-executions.

Current work focuses on evaluating the individual components of the PMP with stakeholders from different scientific communities. A specific focus is on tool support to automate the documentation steps, specifically the capturing and monitoring of low-level process characteristics and performance aspects. We incorporate all suggestions into a prototype implementation which fosters the actionability and enforceability of Process Management Plans.

Acknowledgments

This research was co-funded by COMET K1, FFG - Austrian Research Promotion Agency.
References

[1] IEEE Std 1012-2012: IEEE Standard for Software Verification and Validation. Technical report, 2012.
[2] Cristina Aiftimiei, Alberto Aimar, Andrea Ceccanti, Marco Cecchi, Alberto Di Meglio, Florida Estrella, Patrick Fuhrmann, Emidio Giorgio, Balázs Kónya, Laurence Field, Jon Kerr Nilsen, Morris Riedel, and John White. Towards next generations of software for distributed infrastructures: The European Middleware Initiative. In 8th IEEE Intl Conf on E-Science, 2012.
[3] Australian National Data Service. ANDS Guides Awareness Level - Data Management Planning. Technical report, 2011.
[4] Paolo Ciccarese, Marco Ocana, Leyla Garcia Castro, Sudeshna Das, and Tim Clark. An open annotation ontology for science on web 3.0. Journal of Biomedical Semantics, 2(Suppl 2):S4, 2011.
[5] Andrew Curry. Rescue of old data offers lesson for particle physicists. Science, 331(6018):694-695, 2011.
[6] R. Darby, S. Lambert, B. Matthews, M. Wilson, K. Gitmans, S. Dallmeier-Tiessen, S. Mele, and J. Suhonen. Enabling scientific data sharing and re-use. In IEEE 8th Intl Conf on E-Science, 2012.
[7] D. De Roure. Machines, methods and music: On the evolution of e-research. In 2011 Intl Conf on High Performance Computing and Simulation (HPCS), pages 8-13, 2011.
[8] David De Roure, Khalid Belhajjame, Paolo Missier, José Manuel, Raúl Palma, José Enrique Ruiz, Kristina Hettne, Marco Roos, Graham Klyne, and Carole Goble. Towards the preservation of scientific workflows. In 8th Intl Conf on Preservation of Digital Objects, 2011.
[9] Martin Donnelly and Sarah Jones. Checklist for a Data Management Plan. DCC, 2011.
[10] Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and linked data. In 6th Workshop on Workflows in Support of Large-Scale Science, 2011.
[11] Ed Gronenschild, Petra Habets, Heidi Jacobs, Ron Mengelers, Nico Rozendaal, Jim van Os, and Machteld Marcelis. The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements. PLoS ONE, 7(6), 2012.
[12] Mark Guttenbrunner and Andreas Rauber. A measurement framework for evaluating emulators for digital preservation. ACM Transactions on Information Systems (TOIS), 30(2), 2012.
[13] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[14] ISO. ISO 5725-1:1994 Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions. Technical report, ISO, December 1994.
[15] K. Belhajjame, O. Corcho, D. Garijo, et al. Workflow-centric research objects: First class citizens in scholarly discourse. In Workshop on Semantic Publishing, 9th Extended Semantic Web Conf, May 28 2012.
[16] M. Lankhorst. Enterprise Architecture at Work. Springer, 2005.
[17] Rudolf Mayer, Gonçalo Antunes, Artur Caetano, Marzieh Bakhshandeh, Andreas Rauber, and José Borbinha. Using ontologies to capture the semantics of a (business) process for digital preservation. Intl J. of Digital Libraries (IJDL), 15:129-152, April 2015.
[18] Rudolf Mayer and Andreas Rauber. Towards time-resilient MIR processes. In 13th Intl Society for Music Information Retrieval Conf (ISMIR), 2012.
[19] Rudolf Mayer and Andreas Rauber. A quantitative study on the re-executability of publicly shared scientific workflows. In 11th IEEE Intl Conf on eScience, 2015.
[20] Rudolf Mayer, Andreas Rauber, Martin Alexander Neumann, John Thomson, and Gonçalo Antunes. Preserving scientific processes from design to publication. In 16th Intl Conf on Theory and Practice of Digital Libraries (TPDL 2012). Springer, 2012.
[21] Tomasz Miksa, Rudolf Mayer, and Andreas Rauber. Ensuring sustainability of web services dependent processes. Intl J. of Computational Science and Engineering, 10(1/2):70-81, 2015.
[22] Tomasz Miksa, Stefan Proell, Rudolf Mayer, Stephan Strodl, Ricard Vieira, Jose Barateiro, and Andreas Rauber. Framework for verification of preserved and redeployed processes. In 10th Conf on Preservation of Digital Objects (iPRES), 2013.
[23] Tomasz Miksa, Stephan Strodl, and Andreas Rauber. Process management plans. Intl J. of Digital Curation, 9(1), 2014.
[24] National Science Foundation. Data Management for NSF EHR Directorate. NSF, 2011.
[25] Piotr Nowakowski, Eryk Ciepiela, Daniel Harezlak, Joanna Kocot, Marek Kasztelnik, Tomasz Bartynski, Jan Meizner, Grzegorz Dyk, and Maciej Malawski. The Collage authoring environment. Procedia CS, 4:608-617, 2011.
[26] Kevin Page, Raul Palma, Piotr Holubowicz, Graham Klyne, Stian Soiland-Reyes, Don Cruickshank, Rafael Gonzalez Cabero, Esteban Garcia Cuesta, David De Roure, and Jun Zhao. From workflows to research objects: an architecture for preserving the semantics of science. In 2nd Intl Workshop on Linked Science, 2012.
[27] PREMIS Editorial Committee. PREMIS Data Dictionary for Preservation Metadata. Technical report, March 2008.
[28] Stefan Proell and Andreas Rauber. A scalable framework for dynamic data citation of arbitrary structured data. In 3rd Intl Conf on Data Management Technologies and Applications (DATA 2014), Vienna, Austria, August 29-31 2014.
[29] Stefan Pröll and Andreas Rauber. Data citation in dynamic, large databases: Model and reference implementation. In IEEE Intl Conf on Big Data, Santa Clara, CA, USA, October 2013.
[30] Van Haren Publishing and A.J.E. Al. Archimate 2.0: A Pocket Guide. TOGAF series. Van Haren Publishing, 2012.
[31] D. De Roure, C. Goble, S. Aleksejevs, S. Bechhofer, J. Bhagat, D. Cruickshank, P. Fisher, N. Kollara, D. Michaelides, P. Missier, D. Newman, M. Ramsden, M. Roos, K. Wolstencroft, E. Zaluska, and Jun Zhao. The evolution of myExperiment. In IEEE 6th Intl Conf on eScience, pages 153-160, 2010.
[32] Ralf Treinen and Stefano Zacchiroli. Description of the CUDF Format. Technical report, 2008. http://arxiv.org/abs/0811.3621.
[33] Herbert Van de Sompel and Carl Lagoze. Interoperability for the discovery, use, and re-use of units of scholarly communication. CTWatch Quarterly, 3(3), August 2007.
[34] W3C. OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax. W3C Recommendation, 2012.