<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Semantic Provenance in CRISTAL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jetendr Shamdasani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Branson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard McClatchey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CCCS Research Centre, CEMS Faculty, University of the West of England, Coldharbour Lane</institution>
          ,
          <addr-line>Frenchay, Bristol BS16 1QY</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Traceability is an important feature of workflow-based systems and a key source of provenance data. This paper presents CRISTAL, a mature software platform developed and used at CERN for experiment construction at the LHC. It is entirely workflow-based, capturing provenance on every aspect of its use, from application development to end-user interaction. In this paper we summarize some initial work towards adapting CRISTAL to a more semantic orientation, in particular compliance with the Open Provenance Model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Provenance, as in the documentation of the origin or source of something [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
and in particular data and workflow provenance, is an important concern within
the area of computer science. This paper describes a stable, production-level
system known as CRISTAL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and some early attempts to make its underlying
provenance model compliant with semantic web technologies. CRISTAL has been
used in a variety of projects over the last decade, such as neuGRID [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
in the construction of the CMS ECAL at CERN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and as the core of the
Agilium commercial software suite [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It is also currently being employed in
the N4U (neuGRID for You) EU FP7 project [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A key requirement of the N4U
project is the capture of provenance information. In N4U, clinicians
execute analyses on the Grid; provenance information can help them recreate
previous experiments that they or others may have carried out. By providing
complete traceability of a clinical analysis, provenance information can also help in
finding where a clinician's work may have failed.
      </p>
      <p>
        The current CRISTAL model is already capable of capturing the
information required by the N4U project. However, the core of CRISTAL was developed
more than ten years ago, before the semantic web, and so, although the data is
captured, tools to communicate that information easily to other
systems are missing. Consequently, we have decided to modernise it by using emerging
semantic-web-based provenance models. This paper presents initial work that has been
undertaken at CERN to export the current CRISTAL provenance model in a form
compliant with semantic provenance vocabularies such as the Open Provenance
Model (OPM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This work is research in progress.
      </p>
      <p>This paper is structured as follows: Section 2 presents the CRISTAL platform
and how it has been used previously, especially in the context of provenance.
Section 3 presents the current CRISTAL provenance model and Section 4 describes
a preliminary attempt to convert it to the OPM. Finally, Section 5 presents
conclusions with directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>The CRISTAL workbench</title>
      <p>
        CRISTAL is a system designed to provide data management, workflow tracking
and change management to an agreed set of user requirements. It is a distributed
data and workflow management system which uses a database for its repository,
a multi-layered architecture for its component abstraction and dynamic object
modelling for the design of its objects and components. These techniques were
deemed critical in handling the complexity of data-intensive systems and in
providing the flexibility to adapt to the changing production scenarios typical
of any research production system. CRISTAL is based on a so-called
description-driven approach in which all logic and data structures are described
by meta-data, which can be modified and versioned online as the description
of the object, component, item or application changes. A Description-Driven
System (DDS) architecture, as advocated previously in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], is an example of a
reflective meta-layer architecture (figure 1).
      </p>
      <p>The meta-data along with the instantiated elements of data are stored in the
database and the evolution of the design is tracked by versioning the changes in
the meta-data over time. Thus DDSs make use of meta-objects to store
domain-specific system descriptions that control and manage the lifecycles of domain
objects. The separation of descriptions from their instances allows specification
and management to evolve independently and asynchronously. This separation
is essential in handling the complexity issues facing many web-based computing
applications and facilitates interoperability, reusability and system evolution.
Separating descriptions from their instantiation allows new versions of defined
objects (and in turn their descriptions) to coexist with older versions.</p>
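      <p>The coexistence of description versions can be illustrated with a minimal sketch. It is not CRISTAL code: the class names (Description, Instance, DescriptionRegistry), the "Scan" description and its field names are assumptions chosen only to show how versioned meta-data lets older instances remain valid while the description evolves.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Description:
    """Meta-data describing a kind of domain object (hypothetical shape)."""
    name: str
    version: int
    fields: tuple


@dataclass
class Instance:
    """A domain object that remembers which description version created it."""
    description: Description
    values: dict


@dataclass
class DescriptionRegistry:
    """Keeps every version of every description so old and new can coexist."""
    _versions: dict = field(default_factory=dict)

    def publish(self, name, fields):
        # A change to the meta-data never overwrites; it adds a new version.
        history = self._versions.setdefault(name, [])
        description = Description(name, len(history) + 1, tuple(fields))
        history.append(description)
        return description

    def latest(self, name):
        return self._versions[name][-1]

    def instantiate(self, name, **values):
        # New instances are always created from the latest description.
        description = self.latest(name)
        missing = set(description.fields) - set(values)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return Instance(description, values)


registry = DescriptionRegistry()
registry.publish("Scan", ["subject", "modality"])
old_scan = registry.instantiate("Scan", subject="s01", modality="MRI")

# The description evolves; the earlier instance keeps its original version.
registry.publish("Scan", ["subject", "modality", "scanner"])
new_scan = registry.instantiate("Scan", subject="s02", modality="MRI", scanner="3T")

print(old_scan.description.version, new_scan.description.version)
```
      </preformat>
      <p>Because each instance holds a reference to the exact description version it was built from, publishing a new version does not invalidate existing data, which is the property the description-driven approach relies on.</p>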
      <p>
        Neuroimaging is constantly developing new algorithms and workflows; these
may require variations to the provenance data that is collected. At the same
time, provenance data, to be useful, needs to remain consistent over time and to be
traceable, queryable and easily accessible, and scientists need to be able to
conduct their analyses on those data. CRISTAL is designed to meet these requirements. The reader is directed
to previous publications on DDS for further background ([
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). CRISTAL is
essentially a provenance tracking system which has previously been used to track
the construction of large-scale experiments such as the CMS project [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] at the
CERN LHC; it has also more recently been used in the neuGRID project and
its follow-up, N4U. It is both a process modelling and provenance capture tool
which addresses the harmonisation of processes so that multiple, potentially
heterogeneous processes can be integrated with each other and have their workflows
tracked in the CRISTAL provenance database.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Provenance in CRISTAL</title>
      <p>The current CRISTAL model is shown in figure 2. A collection of all these
objects is known as an Item in CRISTAL terminology. An Item contains:</p>
      <list list-type="bullet">
        <list-item><p>Workflows, i.e. complete layouts of every action that can be performed on that Item, connected in a directed graph that enforces the execution order of the constituent Activities.</p></list-item>
        <list-item><p>Activities, which capture the parameters of each atomic execution step, defining what data is to be supplied and by whom. The execution is performed by Agents.</p></list-item>
        <list-item><p>Agents, which are either human users or mechanical/computational agents (via an API), and which generate Events.</p></list-item>
        <list-item><p>Events, which detail each change of state of an Activity. Completion Events generate data, stored as Outcomes. Provenance information is stored whenever an Event is generated.</p></list-item>
        <list-item><p>Outcomes, which are XML documents resulting from each execution (i.e. the data from completion Events), for which Viewpoints arise.</p></list-item>
        <list-item><p>Viewpoints, which refer to particular versions of an Item's Outcome (e.g. the latest version or, in the case of descriptions, a particular version number).</p></list-item>
        <list-item><p>Properties, which are name/value pairs that name and type Items. Properties also denormalize collected data for more efficient querying.</p></list-item>
        <list-item><p>Collections, which enable Items to be linked to each other.</p></list-item>
      </list>
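      <p>The structure of an Item listed above can be sketched as a small data model. This is an illustrative approximation only: the field names, the representation of a Workflow as a dictionary of prerequisite Activities, and the "last" Viewpoint label are assumptions, not CRISTAL's actual classes or storage layout.</p>
      <preformat>
```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Event:
    """One change of state of an Activity, attributed to an Agent."""
    event_id: int
    activity: str
    agent: str
    transition: str


@dataclass
class Outcome:
    """XML document produced by a completion Event."""
    event_id: int
    activity: str
    version: int
    xml: str


@dataclass
class Item:
    name: str
    # Workflow as a directed graph: activity -> activities that must finish first.
    workflow: dict
    properties: dict = field(default_factory=dict)    # name/value pairs
    events: list = field(default_factory=list)
    outcomes: list = field(default_factory=list)
    viewpoints: dict = field(default_factory=dict)     # (activity, label) -> Outcome
    collections: dict = field(default_factory=dict)    # collection name -> linked Item names

    def complete(self, activity, agent, data):
        """Record a completion Event for an Activity and store its Outcome."""
        done = {e.activity for e in self.events if e.transition == "complete"}
        blocked = [a for a in self.workflow[activity] if a not in done]
        if blocked:
            raise RuntimeError(f"{activity} must wait for {blocked}")
        event = Event(len(self.events) + 1, activity, agent, "complete")
        self.events.append(event)
        version = 1 + sum(1 for o in self.outcomes if o.activity == activity)
        # Serialise the supplied data as a one-element XML document.
        xml = ET.tostring(ET.Element(activity, data), encoding="unicode")
        outcome = Outcome(event.event_id, activity, version, xml)
        self.outcomes.append(outcome)
        # The "last" Viewpoint always tracks the newest Outcome of an Activity.
        self.viewpoints[(activity, "last")] = outcome
        return outcome


item = Item("scan-001", workflow={"Acquire": [], "Analyse": ["Acquire"]},
            properties={"Type": "Scan"})
item.complete("Acquire", "scanner-api", {"file": "scan-001.nii"})
item.complete("Analyse", "alice", {"volume": "1450"})
item.complete("Analyse", "alice", {"volume": "1452"})

latest = item.viewpoints[("Analyse", "last")]
print(latest.version, latest.event_id)
```
      </preformat>
      <p>In this sketch every Outcome carries the identifier of the Event that produced it, so the history of an Item can be replayed from its Events alone, while Viewpoints give direct access to a chosen version.</p>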
      <p>Provenance is captured through the generation of Events. For an Agent to
write anything within CRISTAL, it needs to generate an Event. This is done
by altering a state in an Activity. The actual provenance information required
is application dependent. CRISTAL allows application designers to define their
own backend with respect to how they wish to store provenance. Thus a key
function of the CRISTAL system is its ability to adapt to changing requirements in
terms of provenance storage. The domain of e-science is constantly changing as
new workflows, algorithms and research studies are developed. The underlying
CRISTAL model allows the system to evolve to handle such challenges whilst
retaining provenance information in a consistent and traceable manner. For
example, in the neuGRID project the CRISTAL provenance service captured:</p>
      <list list-type="bullet">
        <list-item><p>Workflow specifications: XML-based specifications of workflow descriptions which were external to CRISTAL. These were serialised and stored in a relational database.</p></list-item>
        <list-item><p>Data or inputs supplied to each workflow component: the parameters to each workflow component. In the case of neuGRID and N4U these are images; however, they can be any piece of data.</p></list-item>
        <list-item><p>Annotations added to the workflow and individual workflow components: simple name/value pairs which allowed users to store extra information about a workflow.</p></list-item>
        <list-item><p>Links and dependencies between workflow components: these are part of the workflow specification.</p></list-item>
        <list-item><p>Execution errors generated during analysis: a necessary component of any provenance model.</p></list-item>
        <list-item><p>Output produced by the workflow and each workflow component: in the case of neuGRID and N4U these are images; however, as with inputs, they can be any piece of data that the application requires.</p></list-item>
      </list>
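      <p>The capture mechanism described above, in which an Agent alters the state of an Activity and thereby generates an Event that is written to an application-defined backend, can be sketched as follows. The ProvenanceStore interface, the state names and the record keys are hypothetical and serve only to illustrate the idea of a pluggable provenance store.</p>
      <preformat>
```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone


class ProvenanceStore(ABC):
    """Application-defined backend (hypothetical interface)."""

    @abstractmethod
    def write(self, record): ...


class InMemoryStore(ProvenanceStore):
    """Simplest possible backend; a relational store would implement write() too."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)


class Activity:
    # Allowed state changes; the state names are illustrative only.
    TRANSITIONS = {("waiting", "started"), ("started", "finished"), ("started", "failed")}

    def __init__(self, name, store):
        self.name = name
        self.state = "waiting"
        self.store = store

    def transition(self, agent, new_state, **details):
        """Change state on behalf of an Agent; every change emits an Event."""
        if (self.state, new_state) not in self.TRANSITIONS:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        event = {
            "activity": self.name,
            "agent": agent,
            "from": self.state,
            "to": new_state,
            "time": datetime.now(timezone.utc).isoformat(),
            **details,
        }
        self.state = new_state
        self.store.write(event)
        return event


store = InMemoryStore()
step = Activity("SkullStrip", store)
step.transition("alice", "started", inputs=["scan-001.nii"])
step.transition("alice", "failed", error="out of memory")

print([(r["from"], r["to"]) for r in store.records])
```
      </preformat>
      <p>Swapping InMemoryStore for another implementation of the same interface changes where provenance is kept without touching the Activity logic, which mirrors the adaptability to storage requirements discussed in the text.</p>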
      <p>These were all added onto CRISTAL as a relational database model. In the
neuGRID project CRISTAL was exposed as a provenance service where
information is captured in real time and stored in the current CRISTAL provenance
database. However, to enhance CRISTAL we have decided to apply semantic web
technologies to the current CRISTAL provenance model. The next section
explains some early work on converting the CRISTAL model to be more compliant
with the OPM.</p>
    </sec>
    <sec id="sec-4">
      <title>Towards an OPM compliant description of CRISTAL provenance</title>
      <p>The OPM is a provenance model which is currently gaining popularity, and
implementations are becoming available in OWL, RDF and Java. We
feel that the OPM is a good choice for modernizing the current CRISTAL
provenance model since the "top layer" of its graphical model
fits in well with the current CRISTAL provenance model. This top layer includes
Artifact (a physical state such as the result of an action performed), Process (an
action that is caused by an artifact and may generate a new artifact) and Agent
(an entity which causes the execution of a process). Figure 3 shows the mapping
of the tentative N4U CRISTAL Provenance Model onto the OPM.</p>
      <p>The colours in the figure map onto the conceptual ideas of artifacts (e.g. A1, A2) and
processes (e.g. P1) in the OPM. OPM Agents map onto CRISTAL Agents since they
initiate Processes. Processes from the OPM are equivalent to Workflows and
Activities from CRISTAL since they generate Artifacts. Artifacts from the OPM
correspond to CRISTAL Events and Outcomes because they are various forms of data.
At the time that this initial mapping between the OPM and the CRISTAL
provenance model was conceived, no implementation of the OPM was
available. We are currently in the process of converting the CRISTAL provenance
model into the OWL version of the OPM implementation. We believe that once
this conversion has been completed it will allow us to demonstrate further how
Semantic Web technologies can aid in the provenance of workflows.</p>
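      <p>The mapping described in this section can be sketched as a small conversion routine. The node types (Agent, Process, Artifact) and the edge names used, wasGeneratedBy and wasControlledBy are taken from the OPM; the record layout, the identifiers and the function name to_opm are assumptions made for illustration and do not reflect the OWL-based conversion that is in progress.</p>
      <preformat>
```python
# CRISTAL concept -> OPM node type, following the mapping described above.
CRISTAL_TO_OPM = {
    "Agent": "Agent",
    "Workflow": "Process",
    "Activity": "Process",
    "Event": "Artifact",
    "Outcome": "Artifact",
}


def to_opm(records):
    """Turn CRISTAL-style completion records into OPM nodes and edges.

    Each record is a dict with the keys 'activity', 'agent', 'inputs' and
    'outcome' (an illustrative layout, not CRISTAL's storage format).
    """
    nodes, edges = {}, []
    for record in records:
        process = record["activity"]
        nodes[process] = CRISTAL_TO_OPM["Activity"]
        nodes[record["agent"]] = CRISTAL_TO_OPM["Agent"]
        edges.append((process, "wasControlledBy", record["agent"]))
        for artifact in record["inputs"]:
            nodes[artifact] = CRISTAL_TO_OPM["Outcome"]
            edges.append((process, "used", artifact))
        nodes[record["outcome"]] = CRISTAL_TO_OPM["Outcome"]
        edges.append((record["outcome"], "wasGeneratedBy", process))
    return nodes, edges


records = [
    {"activity": "Acquire", "agent": "scanner-api", "inputs": [], "outcome": "scan-001"},
    {"activity": "Analyse", "agent": "alice", "inputs": ["scan-001"], "outcome": "report-001"},
]
nodes, edges = to_opm(records)
for edge in edges:
    print(edge)
```
      </preformat>
      <p>Each edge is a plain (source, relation, target) triple, so the same structure could later be serialised to RDF once an OPM vocabulary has been chosen.</p>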
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this research-in-progress paper we presented CRISTAL, a system
that is able to capture provenance information based on workflows. CRISTAL
is currently being used in many different projects and is in the process of being
commercialised. An initial mapping of the provenance aspect of CRISTAL to
the OPM was shown.</p>
      <p>
        Future work consists of expanding our initial mapping of the OPM to CRISTAL
from a simple and preliminary model to one using the OWL implementation of the
OPM. We believe that this will aid compatibility with other workflow-based
systems such as Taverna [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which has an export-to-OPM option available. As
further work we are exploring how to convert the current CRISTAL model
into a full RDF-based implementation. This work has already begun and is
ongoing. This paper has simply demonstrated how the current provenance aspect
of CRISTAL can be made OPM compliant.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Luc</given-names>
            <surname>Moreau</surname>
          </string-name>
          .
          <article-title>The foundations for provenance on the web</article-title>
          .
          <source>Foundations and Trends in Web Science</source>
          ,
          <volume>2</volume>
          (
          <issue>2–3</issue>
          ):
          <fpage>99</fpage>
          –
          <lpage>241</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Branson</surname>
          </string-name>
          et al.
          <article-title>Evolving Requirements: Model-Driven Design for Change</article-title>
          .
          <source>Information Systems</source>
          ,
          <year>2012</year>
          . Under Final Review.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Anjum</surname>
          </string-name>
          et al.
          <article-title>Reusable Services from the neuGRID Project for Grid-Based Health Applications</article-title>
          .
          <source>Studies in Health Technology and Informatics</source>
          ,
          <volume>147</volume>
          :
          <fpage>88</fpage>
          –
          <lpage>99</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <collab>The CMS Collaboration</collab>
          .
          <article-title>The CMS experiment at the CERN LHC</article-title>
          .
          <source>Journal of Instrumentation</source>
          ,
          <volume>3</volume>
          :
          <fpage>S08004</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Agilium. http://www.agilium.com. Accessed on 03/
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <article-title>neuGRID for You (N4U)</article-title>
          . http://neugrid4you.eu/. Accessed on 03/
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          et al.
          <article-title>The Open Provenance Model core specification (v1.1)</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>27</volume>
          (
          <issue>6</issue>
          ):
          <fpage>743</fpage>
          –
          <lpage>756</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>F.</given-names>
            <surname>Estrella</surname>
          </string-name>
          et al.
          <article-title>Pattern Reification as the Basis for Description-Driven Systems</article-title>
          .
          <source>Journal of Software and System Modelling</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>108</fpage>
          –
          <lpage>119</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>F.</given-names>
            <surname>Estrella</surname>
          </string-name>
          et al.
          <article-title>Meta-data Objects as the Basis for System Evolution</article-title>
          .
          <source>In Proceedings of the Second International Conference on Advances in Web-Age Information Management, WAIM '01</source>
          , pages
          <fpage>390</fpage>
          –
          <lpage>399</lpage>
          . Springer-Verlag,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>T.</given-names>
            <surname>Oinn</surname>
          </string-name>
          et al.
          <article-title>Taverna: a tool for the composition and enactment of bioinformatics workflows</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>20</volume>
          (
          <issue>17</issue>
          ):
          <fpage>3045</fpage>
          –
          <lpage>3054</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>