Repeatability and Re-usability in Scientific Processes: Process Context, Data Identification and Verification

Andreas Rauber, Tomasz Miksa, Rudolf Mayer, Stefan Proell
SBA Research and Vienna University of Technology, Vienna, Austria
{arauber, tmiksa, rmayer, sproell}@sba-research.org

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015), Obninsk, Russia, October 13-16, 2015

Abstract

eScience offers huge potential for speeding up scientific discovery, being able to flexibly re-use, combine and build on top of existing tools and results. Yet, to reap the benefits we must be able to actually perform these activities, i.e. having the data, processing components etc. available for redeployment and being able to trust them. Thus, repeatability of eScience experiments is a requirement for validating work and establishing trust in results. This proves challenging, as procedures currently in place are not set up to meet these goals.

Several approaches have tackled this issue from various angles. This paper reviews these building blocks and ties them together. It starts from the capture and description of entire research processes and ways to document them. Regarding data, we review the recommendations of the Research Data Alliance on how to precisely identify arbitrary subsets of potentially high-volume and highly dynamic data used in a process. Last, we present mechanisms for verifying the correctness of process re-executions.

1 Introduction

New means of performing research and sharing results offer huge potential for speeding up scientific discovery, enabling scientists to flexibly re-use, combine and build on top of results without geographical or time limitations and across discipline boundaries. Yet, to reap the benefits promised by eScience [13], we must be able to actually perform these activities, i.e. having the data and processing components available for re-deployment. Funding agencies such as the EC (ec.europa.eu/digital-agenda/en/open-data-0) are committed to data re-use and open data initiatives. As a result, all research data from publicly funded projects needs to be made available to the public. Not only does this entail that the data must be equipped with useful and stable metadata, comprehensive descriptions and documentation, but also that the data must be preserved for the long term. Yet, from an eScience perspective, mere availability of data is not sufficient, as data as such is barely useful. First of all, eScience benefits not only from the availability of data, but also from the re-use and re-purposing of tools and entire experimental workflows. Secondly, and more importantly, data never exists solely on its own, but is usually the result of more or less complex (pre-)processing chains. This commences with the processing happening at the sensor level or during data capture, continues via analysis processes resulting in processed data, and leads up to experimental results serving as input for further meta-studies. Thus, in both cases, we need to ensure that we have the underlying tools and processes available. This is necessary to understand their impact on the result and any potential bias they introduce, and to apply identical processing to new data to ensure comparability of results. To re-use such processing tools we need to trust them and any underlying components to produce identical (comparable) results under identical (similar) conditions.
From a scientific point of view, the validation of such research results (or, in fact, of the result of every individual processing step) is a core requirement for establishing trust in the scientific community, its tools and data, specifically in data-intensive domains. This proves challenging, as procedures currently in place are not set up to meet these goals. Experiments are often complex chains of processing, involving a number of data sources, computing infrastructure, software tools, or external and third-party services, all of which are subject to dynamic change. In scientific research, external influences can have a large impact on the outcome of an experiment. Human factors, the tools and equipment used, the configuration of soft- and hardware, and the execution environment and its properties are important factors which need to be considered. The impact of such dependencies has proven to be graver than expected. While many approaches rely on documenting the individual processing steps performed during an experiment and on storing the data as well as the code used to perform an analysis, the impact of the underlying software and hardware stack is often ignored. Yet, beyond the challenges posed by the actual experiment/analysis, it is the complexity of the computing infrastructure (both the processing workflows and their dependencies on hardware and software environments, as well as the enormous amounts of data being processed) that renders research results in many domains hard to verify. As a recent study in the medical domain has prominently shown [11], even assumedly minute differences such as the specific version of the operating system used can have a massive impact: different results were obtained in cortical thickness and volume measurements of neuroanatomical structures when the software setup of FreeSurfer, a popular software package for processing MRI scans, was varied. More dramatically, though, there was also a difference in the results if not the primary software but only the operating system version (in this case Mac OS X 10.5 vs. 10.6) differed. This indicates the presence of dependencies from FreeSurfer on functions provided by the operating system, causing instabilities and misleading results. As these dependencies are hidden from the physician, such side-effects of the ICT infrastructure need to be detected and resolved transparently if we want to be able to trust results based on computational analyses.

A number of approaches have tackled this issue from various angles, ranging from initiatives for data sharing, code versioning and publishing as open source, via the use of workflow engines to formalize the steps taken in an experiment, to ways of describing the complex environment an experiment is executed in. In addition, the data that is created, but also the processing algorithms, scripts and other software tools used in the experiment, need to remain accessible for longer time periods to facilitate data reuse and to allow peers to retrieve and verify experiments. Keeping these assets accessible is not only a technical challenge, but requires institutional commitment and defined procedures.

Repeatability and reproducibility are two fundamental concepts in science. An experiment is repeatable if it produces the exact same results under the very same preconditions. An experiment is reproducible if the same results can be obtained even under somewhat different conditions, e.g. when performed by a different team in a different location. There are several factors which have an influence on the variance of experiments. The ISO standard 5725-1:1994 [14] lists the following factors: (1) operator, (2) equipment, (3) calibration of the equipment, (4) environment and (5) time elapsed between measurements. The standard defines an experiment as repeatable if the influences (1)-(4) are constant and (5) is a reasonable time span between two executions of the experiment and its verification. Reproducibility allows variance in these factors, as they cannot be avoided if different research teams want to compare results.

To tackle these issues we proposed to introduce Process Management Plans (PMPs) [23]. They extend Data Management Plans by taking a process-centric view, viewing data simply as the result of underlying processes such as capture, (pre-)processing, transformation, integration and analysis. The general objective of PMPs is to foster identification, description, sharing and preservation of scientific processes. To embody the concept of PMPs we need to solve the challenges related to the description of computational processes, verification and validation, monitoring of external dependencies, as well as data citation. This paper reviews these building blocks and ties them together to demonstrate the feasibility of sharing and preserving not only datasets, but also scientific processes.

Section 2 summarizes related work from the areas of Data Management Plans (describing the result of data capturing/production processes), digital preservation of processes, and several eScience research infrastructures. Section 3 presents the Context Model that is captured automatically, describing the process implementation including all software and hardware dependencies. Ways to precisely identify and cite arbitrary subsets of dynamic data are described in Section 4, presenting the recommendations of the RDA Working Group on Data Citation. Section 5 discusses the verification and validation of the re-execution of computational processes. These concepts are illustrated via a use case from the machine learning domain in Section 6, followed by conclusions in Section 7.
2 Related Work

2.1 Data Management Plans

A prominent reason for the non-reproducibility of scientific experiments is poor data management, as criticized in several disciplines. Different data sets scattered across different machines, with no track of the dependencies between them, are a common landscape for particle physicists, who move quickly from one research activity to another [5]. Several institutions have reacted by publishing templates and recommendations for DMPs, such as the Digital Curation Centre (DCC) [9], the Australian National Data Service (ANDS) [3] and the National Science Foundation (NSF) [24], amongst many others. These are very similar, containing a set of advice, mainly lists of questions which researchers should consider when developing a DMP. Attention is directed to what happens with the data after it has been created, rather than to the way in which it was obtained. The description is provided in free-text form, and in the case of the NSF it is limited to two pages. Thus, it is unlikely that anybody will be able to reuse, or at least reproduce, the process which created the data. Furthermore, the correctness of data is taken for granted, and thus DMPs do not provide sufficient information that would allow validating the data. Finally, the quality and detail of the information strongly depends on the good will of the researchers. There is no formal template for the specification of DMPs which would ensure that all the important information is covered comprehensively. Several tools are available, like DMPonline (dmponline.dcc.ac.uk) for the DCC or DMPtool (dmp.cdlib.org) for the NSF, which aid the researcher in the process of DMP creation, but they are rather simple interactive questionnaires which generate a textual document at the end, not the complex tools required to validate at least the appropriateness of the provided information. The main conclusion from this analysis is that DMPs focus on describing the results of experiments. This is a consequence of their data-centric view, which enforces a focus on access and correct interpretation (metadata) of data and does not pay much attention to the processing of data. While these constitute a valuable step in the right direction, we need to move beyond this, taking a process-centric view.
2.2 Digital Preservation

The area of digital preservation is shifting its focus from collections of simple objects to the long-term preservation of entire processes and workflows.

WF4Ever (wf4ever-project.org) addressed the challenges of preserving scientific experiments by using abstract workflows that are reusable in different execution environments [26]. The abstract workflow specifies conceptual and technology-independent representations of the scientific process. The project further developed new approaches to sharing workflows by using an RDF repository and making the workflows and data sets accessible from a SPARQL endpoint [10]. The TIMBUS project (timbusproject.net) addressed the preservation of business processes by ensuring continued access to the services and software necessary to properly render, validate and transform information. The approach centers on a context model [20] of the process, which is an ontology for describing the process components and their dependencies. It allows storing rich information, ranging from software and hardware to organizational and legal aspects. The model can be used to develop preservation strategies and to redeploy the process in a new environment in the future. The project developed a verification and validation method for redeployed processes [12] that evaluates the conformance and performance quality of processes redeployed in new environments. This is important when we want to reuse a process to build other processes.

2.3 eScience and Research Infrastructures

Several projects nowadays benefit from sharing and reusing data [6]. In [7] the evolution of research practices through the sharing of tools, techniques and resources is discussed. myExperiment [31] is a platform for sharing scientific workflows. This is already one step beyond just sharing data. Workflows created and run within the Taverna workflow engine can be published and reused by other researchers. However, the workflows do not always specify all the information (e.g. tools to run the steps, descriptions of parameters) required to re-run them [19].

An environment which enables scientists to collaboratively conduct their research and publish it in the form of an executable paper was presented in [25]. The solution requires working in a specific environment, limiting its applicability to the tools and software supported by that environment. PMPs do not have such a requirement and can be used in every case. There is a strong move towards "providing a consistent platform, software and infrastructure, for all users in the European Research Area to gain access to suitable and integrated computing resources" [2].
3 Documenting eScience Processes

To enable analysis, repeatability and reuse of processes, they must be well described and documented. As most processes are rather complex in nature, a precise description is needed to re-enact the execution of the process. Thus, formalized models are useful for a detailed representation of critical aspects such as the hardware, software, data and execution steps supporting the process, as well as their relationships and dependencies. Several models can be considered for this type of documentation.

Workflow-Centric Research Objects (ROs) [15] are a means to aggregate or bundle resources used in a scientific investigation, such as a workflow, provenance from results of its execution, and other digital resources such as publications and data sets. In addition, annotations are used to further describe these digital objects. The model of Research Objects takes the form of an OWL ontology and incorporates several existing ontologies. At its core, the Research Object model extends the Object Exchange and Reuse model (ORE) [33] (openarchives.org/ore/1.0) to formalize the aggregation of digital resources. Annotations are realized using the Annotation Ontology (AO) [4], which allows e.g. for comment and tag-style textual annotations. Specifying the structure of an abstract workflow is enabled by the wfdesc ontology. Finally, the provenance of a specific execution of a workflow is described using the wfprov ontology. Research Objects have also been presented as a means to preserve scientific processes [8], proposing archiving and autonomous curation solutions that would monitor the decay of workflows.

Enterprise architecture (EA) modelling languages provide a holistic framework to describe several aspects of a process. For example, the Archimate [30] language supports description, analysis and visualization of the process architecture on three distinct but interrelated layers: the business, application and technology layers. On each of these layers, active structures, behavior and passive structures can be modelled. Thus the process can be specified not only as a high-level sequence of steps, but also as a low-level sequence of inputs and outputs of the software and hardware components needed to run the process, e.g. database software, libraries, software device drivers, fonts, codecs, or dedicated hardware created for the purpose of the experiment. Enterprise architectures do not address any specific domain-dependent concerns. They rather cut across the whole organization running the process [16].

While models such as Archimate or Research Objects are extensive, they often do not provide enough detail on the technology aspects of the process, and thus offer only little guidance to researchers aiming to produce a solid description of their technical infrastructure. One approach to alleviate this issue is realized in the Process Context Model [17], which builds on top of Archimate and extends it with domain-specific languages to address specific requirements of a given domain. Wherever possible, the extension ontologies are based on already existing languages. The development of the model was driven by the requirement to preserve and re-execute complete processes. The context a process is embedded in ranges from immediate and local aspects, such as the software and hardware supporting the process, to aspects such as the organization the process is executed in, the people involved, service providers, and even laws and regulations. The exact context can differ significantly depending on the domain the process stems from.

The model uses the domain-independent Archimate language as a core model to integrate the domain-specific extension languages. It is implemented in the Web Ontology Language (OWL) [34], and the integration is performed via ontology mappings from the extensions to the core model. An overview of this architecture and the provided domain-specific extensions is given in Figure 1; the extensions are the following.

Fig. 1. Overview of the Context Model architecture: core and extensions.

Software Dependencies cover dependencies between different types of software, including information on which versions are compatible or conflicting with each other. It is, for example, important to know that a specific version of a Java Virtual Machine is required to run a certain piece of software, or that a particular application is required to view a digital object. This is important when considering the preservation of specific parts of the software stack utilized in the process. Beyond repeatability, this information may be used during preservation planning to identify alternative software applications that can be utilized. Technical dependencies on software and operating systems in the Context Model can be captured and described via the Common Upgradeability Description Format (CUDF) [32].
Data Formats: In a process execution, a number of digital objects are created, modified or read. This section includes information on the data/file formats in which these objects are stored. It is used for preservation actions and for selecting appropriate comparator modules during the validation process described in Sec. 5. Our implementation of the Context Model uses the PREMIS Data Dictionary [27] to represent this information.

Hardware contains a comprehensive description of the computational hardware, from desktop systems and server infrastructure components to specialized hardware used for certain tasks. Even though in many processes the hardware employed to host the software applications might be standard commodity hardware, its exact specifications can still influence the run-time behavior of a process. This might be critical in certain circumstances, such as execution speed, or when specific functionalities and characteristics of the hardware, such as precision limits or analog/digital conversion thresholds, are part of the computation. Further, certain processes might use particular hardware capabilities for computation, such as graphical processing units (GPUs) for large-scale experiments in scientific processes. These types of hardware, and the software that can utilize them, are not yet as standardized and abstracted, thus an exact description is needed in many cases.

Legal aspects cover legal restrictions imposed on the processes. License information focuses specifically on software licenses. Relevant aspects are e.g. the types of licenses under which software was made available, and the clauses they contain. Patent information describes the owner of a specific patent, or when it was granted.
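Returning to the Software Dependencies extension above: CUDF describes packages as plain-text stanzas of name, version and dependency fields. As a rough illustration only, the Python sketch below renders captured dependencies in such CUDF-style stanzas; the package data is invented for this example, and a real extractor would derive it from the monitored environment rather than from hard-coded values.

```python
# Illustrative only: serialize captured software dependencies as CUDF-style
# stanzas. CUDF proper uses positive integer version numbers; the package
# data below is invented, not taken from a real capture.

def to_cudf_stanza(pkg: dict) -> str:
    """Render one captured package as a CUDF-like text stanza."""
    lines = [f"package: {pkg['name']}", f"version: {pkg['version']}"]
    if pkg.get("depends"):
        lines.append("depends: " + ", ".join(pkg["depends"]))
    if pkg.get("conflicts"):
        lines.append("conflicts: " + ", ".join(pkg["conflicts"]))
    return "\n".join(lines) + "\n"

captured = [
    {"name": "genre-classifier", "version": 1,
     "depends": ["openjdk-7-jre", "weka >= 3"]},
    {"name": "weka", "version": 3, "depends": ["openjdk-7-jre"]},
]
print("\n".join(to_cudf_stanza(p) for p in captured))
```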
Large parts of the Context Model of a process can be extracted automatically [17], especially the aspects of software dependencies and data formats. Other aspects may still require significant manual work to obtain a proper representation. For example, the communication with a web service has to be described by providing its exact address and interface type. Databases usually run as independent server processes; they are typically detected, but not fully captured, when running a tool to monitor a specific research process execution.

We created a set of tools processing eScience workflows modeled for the Taverna workflow engine to extract the above-mentioned information and represent it within the Context Model. A Taverna2Archi extractor (ifs.tuwien.ac.at/dp/process/projects/tavernaExtractor.html) reads Taverna workflow files (t2flow format) and generates Archimate models. These are further transformed into an OWL ontology representation using the Archi2OWL converter (ifs.tuwien.ac.at/dp/process/projects/archi2OWL.html). This way, all information that can be extracted from the static workflow definition is captured. This is complemented by monitoring the execution of one or more process execution instances using the extractor of the Process Migration Framework (PMF, ifs.tuwien.ac.at/dp/process/projects/pmf.html), which is based on the strace tool (sourceforge.net/projects/strace). This way, all dependencies are explored, and all files and ports touched by the process are detected and added to the Context Model as dependencies.

These process traces will usually detect an enormous number of libraries and other files used by a process. To refine the model and make it more compact, the PMF can resolve the Debian packages to which an identified file belongs and thereby create a smaller, concise list of dependencies. It also removes files from the model that are not used for data exchange, for example log and cache files.
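The package resolution step can be pictured with a short sketch: given the file paths touched by a monitored run, obvious noise such as log and cache files is discarded, and dpkg is asked which Debian package owns each remaining file. This is an illustration of the idea only, not the PMF implementation; the helper names and the noise heuristics are ours.

```python
import subprocess

NOISE_SUFFIXES = (".log", ".tmp", ".cache")  # illustrative noise filter

def owning_package(path):
    """Ask dpkg which Debian package owns a file (None if it is unpackaged)."""
    try:
        out = subprocess.run(["dpkg", "-S", path], capture_output=True,
                             text=True, check=True).stdout
        return out.split(":", 1)[0].strip()   # "libc6:amd64: /lib/..." -> "libc6"
    except subprocess.CalledProcessError:
        return None

def compact_dependencies(touched_files):
    """Collapse a raw strace-style file list into a concise set of packages."""
    packages, unresolved = set(), []
    for path in touched_files:
        if path.endswith(NOISE_SUFFIXES) or "/cache/" in path:
            continue                           # drop files not used for data exchange
        pkg = owning_package(path)
        if pkg:
            packages.add(pkg)
        else:
            unresolved.append(path)
    return packages, unresolved
```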
4 Data Citation

Processes frequently process large volumes of data. To be able to repeat any such process we need to ensure that precisely the same sequence of data is fed into it as input. Storing a dump of such huge volumes of data, e.g. as part of the validation data in the Context Model, is not feasible in big data settings. We need to ensure that we can refer to the original data source or data repository for providing the data upon re-execution. While this may be rather trivial for static data sources analysed in their entirety, precise identification turns into a challenge when researchers use only a specific subset of the entire data collection, and when this data collection is dynamic, i.e. subject to changes.

Most research datasets are not static, but highly dynamic in nature. New data is read from sensors or added from continuous experiments. Additional dynamics arise from the need to correct errors in the data, to remove erroneous data values, or to re-calibrate and thus re-compute values at later points in time. Thus, researchers require a mechanism to retrieve a specific state of the data again, to compare the results with previous iterations of an experiment. Freezing the databases at specific points in time, batch-releases of versions, etc. all provide rather inconvenient work-arounds, wasting storage space by keeping multiple copies of unchanged data in different releases, and delaying the release of new data by aggregating continuous streams of data into batch releases. Additionally, most processes will not analyse the entire database, but a very specific subset of it. We thus need to ensure that precisely the same subset can be fed into the process again. Current approaches either waste space by storing explicit dumps of the subset used as input, or require human intervention by providing (sometimes rather ambiguous) natural language descriptions of the subset of data used.

To address this issue, the Working Group on Dynamic Data Citation (WGDC, rd-alliance.org/groups/data-citation-wg.html) of the Research Data Alliance (RDA) has devised a set of recommendations addressing this challenge. In a nutshell, it relies on time-stamped and versioned storage of the data. Subsets are identified by assigning persistent identifiers (PIDs) to time-stamped queries resolving to the subset. Hash keys of the queries and of the result sets are stored as metadata to allow verification of the resulting data sets upon re-execution [29, 28]. By shifting the focus from citing static data sets towards citing queries, which allow retrieving reproducible data sets from versioned data sources on demand, the problem of referencing accurate data sets can be addressed more flexibly. It also provides additional provenance information on the data set, as the query contains a semantic description of the subset in the form of its filter parameters. It furthermore allows retrieving the semantically identical data set, including all corrections applied to it afterwards, by re-executing the time-stamped query with a later time-stamp. As the process can be automated, it allows integrating data citation capabilities into existing workflows.
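A minimal sketch of this recommendation, assuming a DB-API style connection (e.g. sqlite3) to a versioned store: the query text is normalized, hashes of the query and of its result set are recorded, and a persistent identifier is minted for the time-stamped query. The function, field names and PID scheme are our own illustration, not the RDA reference implementation.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def citable_subset(connection, sql, params=()):
    """Execute a subset query against versioned data and record it for citation."""
    executed_at = datetime.now(timezone.utc).isoformat()
    rows = connection.execute(sql, params).fetchall()
    normalized = " ".join(sql.split())                        # normalized query text
    record = {
        "pid": f"pid:demo-{uuid.uuid4()}",                    # placeholder PID scheme
        "query": normalized,
        "executed_at": executed_at,
        "query_hash": hashlib.sha256(normalized.encode()).hexdigest(),
        "result_hash": hashlib.sha256(
            json.dumps(rows, sort_keys=True, default=str).encode()).hexdigest(),
    }
    return rows, record   # the record would be stored in the query store / Context Model
```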
The persistent identifier serves as a handle which, in addition to representing the input data of a specific process, can be shared with peers and used in publications. As the system is aware of updates and evolving data, researchers have transparent access to specific versions of data in their workflows. There is no need to store multiple versions of a dataset externally for the long term, as the system can reproduce them on demand. As hashing methods are in place, the integrity of the datasets can be verified. Thus the exact data set used during a specified workflow execution can be referenced as part of an experiment description/specification within the parts of the Context Model describing specific process instances. These can later on be used for validation.

We implemented several prototypes to demonstrate the feasibility of this data identification and citation approach, including solutions for relational databases such as MySQL, as well as for comma-separated value (CSV) files (datacitation.eu). An example demonstrating the query re-writing required to create re-executable queries against time-stamped data is provided in Fig. 2: audio fulfilling certain requirements (classical music with a minimum length of 120 seconds) is selected as it was available at a given timestamp, removing those tracks that had been deleted by that timestamp.

Fig. 2. SQL query selecting data for the music classification experiment, supporting data citation.
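The kind of re-writing shown in Fig. 2 can be sketched as follows, assuming the versioned store maintains valid_from/valid_to columns for every record; the table and column names are assumptions made for this illustration.

```python
# Sketch of re-writing a subset query so that it can be re-executed against
# time-stamped, versioned data. Table and column names are illustrative.

SUBSET_QUERY = """
    SELECT track_id, filename, genre
    FROM music
    WHERE genre = 'classical' AND length_sec >= 120
"""

def as_of(query: str, timestamp: str) -> str:
    """Append temporal predicates so the query sees the data as of `timestamp`.

    Assumes the query already contains a WHERE clause; a full implementation
    would parse the statement instead of appending text.
    """
    temporal = (f"valid_from <= '{timestamp}' "
                f"AND (valid_to IS NULL OR valid_to > '{timestamp}')")
    return f"{query.strip()} AND {temporal}"

print(as_of(SUBSET_QUERY, "2015-10-13T12:00:00Z"))
```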
5 Verification and Validation

Upon re-executing a process (be it a simple reproduction, or a repeatability setting after applying preservation actions), we need to verify its correct behavior in a potentially changed environment. To verify and validate a replicated process that was extracted from a source system and run in a target system, we follow the guidelines of [1], which describe the verification and validation of such a transition activity. We devised guidelines forming the VFramework [22], which are specifically tailored to processes and describe what conditions must be met and what actions need to be taken to compare the executions of two processes in different environments. This process of verification and validation (V&V) does not check the scientific correctness of the processes. It rather helps in obtaining evidence on whether the replicated process has the same characteristics and performs in the same way as the original process.

According to these guidelines, verification checks whether the process set-up and configuration in the new environment is identical to the original one, i.e. whether the same software, operating system, library versions etc. are used in the according configurations. Any changes made to run the process in the new environment (re-compilations, newer versions of individual components) will be detected and reported as potential causes for differences in any re-execution.
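Conceptually, this verification step reduces to a comparison of two captured environment descriptions. The sketch below shows such a diff over package-to-version mappings; the inputs are hypothetical and would in practice come from the Context Models of the two systems.

```python
def verify_environment(original, redeployed):
    """Compare package->version maps of the original and the redeployed system."""
    report = {"missing": [], "added": [], "changed": []}
    for pkg, version in original.items():
        if pkg not in redeployed:
            report["missing"].append(pkg)
        elif redeployed[pkg] != version:
            report["changed"].append((pkg, version, redeployed[pkg]))
    report["added"] = [p for p in redeployed if p not in original]
    return report

# Hypothetical captures of the two environments.
original   = {"openjdk-7-jre": "7u79", "weka": "3.6.10", "libc6": "2.19"}
redeployed = {"openjdk-7-jre": "7u95", "weka": "3.6.10", "libc6": "2.19", "curl": "7.35"}
print(verify_environment(original, redeployed))
```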
Following the static verification, the validation step analyses the actual computations by comparing all interim and final results produced at each input/output point (files, ports) for the original and the re-executed process. This validation data (as well as the according metrics) is defined when preparing the VPlan for the process.

The VFramework consists of two sequences of actions. The first is performed in the original environment, i.e. the system that the process is initially deployed in. The results obtained from the execution of each step are written into the VPlan. The VPlan is another modular extension of the Context Model described in Sec. 3. It contains the information needed to validate whether a process is re-executed correctly. In a nutshell, it comprises measurement points (usually all input/output happening between the individual process steps), associated metrics (usually testing whether the data at an input/output point is identical upon re-execution), and the according reference process instance data (i.e. the expected values for specific process test runs to compare against), captured at process runs in the original environment.
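Read as a data structure, a VPlan entry couples a measurement point with a metric and the reference value captured in the original environment. The sketch below is our own rendering of that idea in Python, not the VPlan ontology itself.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MeasurementPoint:
    step: str    # workflow step the observation belongs to
    port: str    # input/output port (or file) being observed

@dataclass
class VPlanEntry:
    point: MeasurementPoint
    metric: Callable[[Any, Any], bool]   # e.g. exact equality or a tolerance check
    reference: Any                       # value captured in the original environment

def validate(entries, observed):
    """Check every measurement point of a re-execution against its reference."""
    return {(e.point.step, e.point.port):
                e.metric(e.reference, observed[(e.point.step, e.point.port)])
            for e in entries}
```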
The second sequence is performed in the redeployment environment, at any time in the future when the original platform may not be available anymore. The migration of an entire process, i.e. the set-up of a minimal environment required to run the process in an identical configuration, is supported by the second part of the Process Migration Framework. The information needed for such a migration is read from the VPlan. It may, however, be necessary to re-engineer the process to fit it into a new system (in which case the verification step will report all elements in the resulting dependency tree that differ from the original setting).

Subsequently, the validation data is captured again from the re-executed process and compared to the information stored in the VPlan module of the Context Model using the specified metrics (usually requiring the values to be identical, or within certain tolerance intervals, depending on the significant properties of the process step/output to be compared).

We developed the Provenance Extractor (ifs.tuwien.ac.at/dp/process/projects/ProvenanceExtractor.html), which extracts relevant process instance information from the provenance files produced by workflows executed in the Taverna workflow engine. It converts these into an OWL representation linked to the Context Model via the VPlan module.

We investigated several workflows to define requirements, metrics and measurement points for each of them. The analysis revealed that the majority of functional requirements deal with the correctness of a single workflow step execution, and that the best way to validate it is to check each of its output ports. In the case of the non-functional metrics, the prevailing requirement was the computation time, which should be similar to the original or at least not exceed a 'reasonable time'.

Based on this analysis we validate the workflow by validating all of its steps, comparing the data on the outputs of the workflow steps and also checking their execution duration. The comparison is made taking into account the format of the data, using appropriate tools. For example, if two JPEG images depicting the same phenomenon are compared by computing a hash value, they may be detected as being different due to different creation timestamps in the metadata. While this could be fixed by identifying the date of a computation (i.e. the system clock) as one system input being used (which would then need to be set to the same constant value), it may also be modelled explicitly by performing a dedicated comparison that checks the identity of two JPEG files relying only on the image content. A correct way to perform this comparison in general would be to compare the features of the images using software for image analysis. We developed a set of comparator tools for prominent file formats (e.g. MP3) supporting such evaluations. As the Context Model contains information about the file format of the data produced/read by a specific step, a suitable comparator is selected.
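The JPEG example can be made concrete with a small comparator sketch: raw bytes (and therefore hashes) may differ only because of embedded metadata, so the images are instead compared on their decoded pixel content. The sketch assumes the Pillow library and merely illustrates the comparator idea; it is not one of the tools mentioned above.

```python
import hashlib
from pathlib import Path
from PIL import Image  # Pillow, used only to decode the pixel content

def bytes_identical(a, b):
    """Naive check: fails if only embedded metadata (e.g. timestamps) differs."""
    return (hashlib.sha256(Path(a).read_bytes()).hexdigest()
            == hashlib.sha256(Path(b).read_bytes()).hexdigest())

def jpeg_content_identical(a, b):
    """Format-aware check: decode both images and compare the pixels only."""
    with Image.open(a) as img_a, Image.open(b) as img_b:
        return (img_a.size == img_b.size
                and img_a.convert("RGB").tobytes() == img_b.convert("RGB").tobytes())
```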
There are a number of challenges that need to be taken into account during V&V. Some processes exchange data with external sources using a variety of network connections. These resources must also be available during the validation process so that the process can interact with them. A solution that allows monitoring external services for changes, as well as replacing them for the purpose of verification and validation, is described in [21]. Another challenge influencing the verification is the lack of determinism of components. This can apply both to external resources that provide random values and to internal software components that, for example, depend on the system clock or the current CPU speed. In such cases the exact conditions must be re-created in both environments. Potentially, such components need to be substituted with deterministic equivalents [12].

The Context Model contains information about the dependencies required to run the software. If any of them was not identified by the automated tools or modelled manually, then the process will not execute. In the course of verification and validation the Context Model gets improved until the process operates correctly. This is achieved either by repeating the capturing of the process using different process instances, or by manually adding identified process dependencies. By verifying and validating the process automatically recreated in the target system, we also indirectly verify and validate the Context Model. We determine its correctness and completeness, as the process is re-created via the information stored in the Context Model, re-creating all elements stored there in the target system. If the representation in the Context Model were incomplete, the process could not be repeated and run correctly in the target system.

This methodology can be applied to all situations in which a process is re-run, reproduced, or re-used. To support verification and validation for reproduction and reuse, it is important to also publish the verification data, as other researchers may not have access to the source system. They can then perform V&V using the validation data provided by the experiment owner.

6 Use Cases

We will use an example from the domain of music information retrieval (MIR) to illustrate the concepts presented in the preceding sections. A common task in MIR is the automatic classification of audio into a set of pre-defined categories, e.g. genres such as jazz, pop, rock or classical, at different levels of granularity. A process reflecting this task is depicted in Fig. 3. It requires the acquisition of both the actual audio files and the ground truth information (i.e. pre-assigned genre labels for the training and test data in the music collection) from some source. Next, numeric descriptors (e.g. MFCCs, Rhythm Patterns, SSDs) are extracted from the individual audio files via a range of signal processing routines and the application of psycho-acoustic models, to obtain feature vector representations of the audio. These are subsequently fed into a machine learning algorithm to train a classifier such as Support Vector Machines (SVM) or Random Forests, which is then evaluated using performance measures such as recall and precision.

Fig. 3. Music Genre Classification Process [18].
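For illustration, the training and evaluation stage of such a process can be sketched in a few lines of Python with scikit-learn. The actual setup described below used WEKA and a feature extraction web service, so this sketch, with randomly generated placeholder features, only mirrors the structure of that stage.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Placeholder feature vectors standing in for extracted audio descriptors
# (e.g. Rhythm Patterns); real features would come from the extraction service.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))        # 200 tracks, 60-dimensional feature vectors
y = rng.integers(0, 3, size=200)      # ground-truth genre labels (3 classes)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
classifier = SVC(kernel="linear").fit(X_tr, y_tr)
predicted = classifier.predict(X_te)

print("precision:", precision_score(y_te, predicted, average="macro"))
print("recall:   ", recall_score(y_te, predicted, average="macro"))
```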
In one of our experiment settings this process was implemented using a web service for the feature extraction, WEKA as a third-party machine learning package, and a set of dedicated scripts and Java applications for tasks such as data acquisition and transformation. These were orchestrated manually via the command line, or partially automated via shell scripts, deployed on a Linux system. To increase repeatability and to ease automatic analysis, we migrated this process into a proper workflow representation using the Taverna workflow engine, as depicted in Fig. 4. It explicitly lists the data sources (URLs) from which the audio files and ground truth labels are read, as well as the authentication codes for the web service that the audio files are sent to for feature extraction. The vector files are merged and fed into the classifier, which returns the actual classification results and the overall accuracy.

Fig. 4. Music Genre Classification Process modelled in Taverna.

Applying a process monitoring tool, we are able to automatically capture all resources (files, ports) accessed or created by one instance of the process, as depicted in Fig. 5. This includes, amongst others, a whole range of libraries (depicted in the upper left corner), the set of MP3 audio files (depicted in the lower left corner), a range of processes being called (e.g. wget to download the audio files and the ground truth information, depicted in the upper right corner), the user id of the person calling the process, and others.

Fig. 5. Dependencies extracted from the Music Genre Classification Process.

The raw information extracted bottom-up is subsequently enhanced, both automatically and manually, by structuring it according to the concepts provided by Archimate and by adding additional information, such as file format information obtained by performing file format analysis using tools such as DROID and by contacting file format registries such as PRONOM. The resulting structure is depicted in Fig. 6. Fig. 6a captures, at the bottom, the basic process and the objects (the music files, the features extracted and passed on to the classifier, the ground truth annotations, and the final results). Stacked above it are the services being called, i.e. the audio feature extractor. In Fig. 6b the basic software (Java Virtual Machine, WEKA, the data fetchers) is provided, with additional dependencies (e.g. the Unix Bash shell, Base64 encoders, Ubuntu Linux in a specific version), with the data objects in different representations (e.g. the audio files as MP3 as well as base64-encoded MP3 files), and with license information for the various tools (different versions of the GPL, the Apache License, the Oracle Binary Code License, the MP3 patent). On top of these, the detailed application components and services, both internal and external, are represented. This way, a comprehensive and well-structured documentation of the process can be obtained in a semi-automatic manner. This information forms the Process Context Model and can be used for verification and validation.

Fig. 6. Annotated Context Model of the Music Genre Classification Process.

When applying the VFramework, this information is used in two ways: first, to verify the environment in which the workflow is re-executed, to confirm that it is configured correctly; second, to validate that the results conform to the original workflow execution. The report summarizing the verification result is provided in Fig. 7a. It provides an aggregated summary of the libraries, specifically the jar files of WEKA for machine learning and of the SOMToolbox for the vector format migration, as well as the remote service call for the feature extraction web service.

We use the Process Migration Framework (PMF) tools to generate the VPlan module of the Context Model of a workflow execution and compare it with a Context Model obtained in the same way for its re-execution in a different environment. We use the data captured for each of the workflow steps and compare it using appropriate comparators. For the MIR case study we compare 16 metrics related to the outputs of the workflow steps, thus evaluating 13 functional requirements. We also use 12 metrics related to workflow execution time to evaluate 2 non-functional requirements. All of them are fulfilled, therefore the workflow re-execution is established as being repeatable. An excerpt of the validation report is depicted in Fig. 7b, confirming that the output at three of the measurement points is identical.

Fig. 7. (a) Verification and (b) validation report (excerpt) for the MIR process.

7 Conclusions and Future Work

This paper describes a way to move beyond data-centric research evaluation and re-use by addressing the capture and description of entire research processes using Process Management Plans (PMPs), which foster the identification, description, sharing and preservation of scientific processes. To demonstrate how the core elements of a PMP can be implemented, we described how the capturing of computational processes and their context can be performed. We also reviewed the recommendations of the Research Data Alliance on how to precisely identify arbitrary subsets of potentially high-volume and highly dynamic data. Last, we presented mechanisms for the verification and validation of process re-executions.

Current work focuses on evaluating the individual components of the PMP with stakeholders from different scientific communities. A specific focus is on tool support to automate the documentation steps, specifically the capturing and monitoring of low-level process characteristics and performance aspects. We incorporate all suggestions into a prototype implementation which fosters the actionability and enforceability of Process Management Plans.

Acknowledgments

This research was co-funded by COMET K1, FFG - Austrian Research Promotion Agency.
References

[1] IEEE Std 1012-2012: IEEE Standard for Software Verification and Validation. Technical report, 2012.
[2] Cristina Aiftimiei, Alberto Aimar, Andrea Ceccanti, Marco Cecchi, Alberto Di Meglio, Florida Estrella, Patrick Fuhrmann, Emidio Giorgio, Balázs Kónya, Laurence Field, Jon Kerr Nilsen, Morris Riedel, and John White. Towards next generations of software for distributed infrastructures: The European Middleware Initiative. In 8th IEEE Intl Conf on E-Science, 2012.
[3] Australian National Data Service. ANDS Guides Awareness Level - Data Management Planning. Technical report, 2011.
[4] Paolo Ciccarese, Marco Ocana, Leyla Garcia Castro, Sudeshna Das, and Tim Clark. An open annotation ontology for science on web 3.0. Journal of Biomedical Semantics, 2(Suppl 2):S4, 2011.
[5] Andrew Curry. Rescue of old data offers lesson for particle physicists. Science, 331(6018):694-695, 2011.
[6] R. Darby, S. Lambert, B. Matthews, M. Wilson, K. Gitmans, S. Dallmeier-Tiessen, S. Mele, and J. Suhonen. Enabling scientific data sharing and re-use. In IEEE 8th Intl Conf on E-Science, 2012.
[7] D. De Roure. Machines, methods and music: On the evolution of e-research. In 2011 Intl Conf on High Performance Computing and Simulation (HPCS), pages 8-13, 2011.
[8] David De Roure, Khalid Belhajjame, Paolo Missier, José Manuel, Raúl Palma, José Enrique Ruiz, Kristina Hettne, Marco Roos, Graham Klyne, and Carole Goble. Towards the preservation of scientific workflows. In 8th Intl Conf on Preservation of Digital Objects, 2011.
[9] Martin Donnelly and Sarah Jones. Checklist for a Data Management Plan. DCC, 2011.
[10] Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and linked data. In 6th Workshop on Workflows in Support of Large-Scale Science, 2011.
[11] Ed Gronenschild, Petra Habets, Heidi Jacobs, Ron Mengelers, Nico Rozendaal, Jim van Os, and Machteld Marcelis. The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements. PLoS ONE, 7(6), 2012.
[12] Mark Guttenbrunner and Andreas Rauber. A measurement framework for evaluating emulators for digital preservation. ACM Transactions on Information Systems (TOIS), 30(2), 2012.
[13] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[14] ISO. ISO 5725-1:1994 Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions. Technical report, ISO, December 1994.
[15] K. Belhajjame, O. Corcho, D. Garijo, et al. Workflow-centric research objects: First class citizens in scholarly discourse. In Workshop on Semantic Publishing, 9th Extended Semantic Web Conf, May 28 2012.
[16] M. Lankhorst. Enterprise Architecture at Work. Springer, 2005.
[17] Rudolf Mayer, Gonçalo Antunes, Artur Caetano, Marzieh Bakhshandeh, Andreas Rauber, and José Borbinha. Using ontologies to capture the semantics of a (business) process for digital preservation. Intl J. of Digital Libraries (IJDL), 15:129-152, April 2015.
[18] Rudolf Mayer and Andreas Rauber. Towards time-resilient MIR processes. In 13th Intl Society for Music Information Retrieval Conf (ISMIR), 2012.
[19] Rudolf Mayer and Andreas Rauber. A quantitative study on the re-executability of publicly shared scientific workflows. In 11th IEEE Intl Conf on eScience, 2015.
[20] Rudolf Mayer, Andreas Rauber, Martin Alexander Neumann, John Thomson, and Gonçalo Antunes. Preserving scientific processes from design to publication. In 16th Intl Conf on Theory and Practice of Digital Libraries (TPDL 2012). Springer, 2012.
[21] Tomasz Miksa, Rudolf Mayer, and Andreas Rauber. Ensuring sustainability of web services dependent processes. Intl J. of Computational Science and Engineering, 10(1/2):70-81, 2015.
[22] Tomasz Miksa, Stefan Proell, Rudolf Mayer, Stephan Strodl, Ricard Vieira, Jose Barateiro, and Andreas Rauber. Framework for verification of preserved and redeployed processes. In 10th Conf on Preservation of Digital Objects (iPRES), 2013.
[23] Tomasz Miksa, Stephan Strodl, and Andreas Rauber. Process management plans. Intl J. of Digital Curation, 9(1), 2014.
[24] National Science Foundation. Data Management for NSF EHR Directorate. NSF, 2011.
[25] Piotr Nowakowski, Eryk Ciepiela, Daniel Harezlak, Joanna Kocot, Marek Kasztelnik, Tomasz Bartynski, Jan Meizner, Grzegorz Dyk, and Maciej Malawski. The Collage authoring environment. Procedia CS, 4:608-617, 2011.
[26] Kevin Page, Raul Palma, Piotr Holubowicz, Graham Klyne, Stian Soiland-Reyes, Don Cruickshank, Rafael Gonzalez Cabero, Esteban Garcia Cuesta, David De Roure, and Jun Zhao. From workflows to research objects: an architecture for preserving the semantics of science. In 2nd Intl Workshop on Linked Science, 2012.
[27] PREMIS Editorial Committee. PREMIS Data Dictionary for Preservation Metadata. Technical report, March 2008.
[28] Stefan Proell and Andreas Rauber. A scalable framework for dynamic data citation of arbitrary structured data. In 3rd Intl Conf on Data Management Technologies and Applications (DATA 2014), Vienna, Austria, August 29-31 2014.
[29] Stefan Pröll and Andreas Rauber. Data citation in dynamic, large databases: Model and reference implementation. In IEEE Intl Conf on Big Data, Santa Clara, CA, USA, October 2013.
[30] Van Haren Publishing and A.J.E. Al. Archimate 2.0: A Pocket Guide. TOGAF series. Van Haren Publishing, 2012.
[31] D. De Roure, C. Goble, S. Aleksejevs, S. Bechhofer, J. Bhagat, D. Cruickshank, P. Fisher, N. Kollara, D. Michaelides, P. Missier, D. Newman, M. Ramsden, M. Roos, K. Wolstencroft, E. Zaluska, and Jun Zhao. The evolution of myExperiment. In IEEE 6th Intl Conf on eScience, pages 153-160, 2010.
[32] Ralf Treinen and Stefano Zacchiroli. Description of the CUDF Format. Technical report, 2008. http://arxiv.org/abs/0811.3621.
[33] Herbert Van de Sompel and Carl Lagoze. Interoperability for the discovery, use, and re-use of units of scholarly communication. CTWatch Quarterly, 3(3), August 2007.
[34] W3C. OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax. W3C Recommendation, 2012.