Data policy as activity network

© Vasily Bunakov

Science and Technology Facilities Council, Harwell Campus, United Kingdom

vasily.bunakov@stfc.ac.uk

Abstract. The work suggests using a network of semantically clear interconnected activities for a formal yet flexible definition of policies in data archives and data infrastructures. The work is inspired by the needs of the EUDAT Collaborative Data Infrastructure and by the case of long-term digital preservation, but the suggested policy modelling technique is universal and can be considered for all sorts of data management that require clearly defined policies linked to machine-executable policy implementations.

Keywords: data management, long-term digital preservation, data policy, semantic modelling.

Proceedings of the XIX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2017), Moscow, Russia, October 10-13, 2017

1 Introduction

The problematics of advanced long-term digital preservation [1] have been in the focus of many collaborative projects and popular recommendations. However, relatively little attention has been paid to them in domain-specific projects that rely on data archiving, or in projects that develop scalable e-infrastructures aggregating data that comes from different user communities.

One of the problems that long-term digital preservation aims to address is having clear policies for the entire data lifecycle: from data ingestion by an archive or e-infrastructure service, through years-long data management with sensible data checks, transformations and moves, to data access and data dissemination to the end users.

One can argue that without clear data policies and means of their validation there is no such thing as long-term digital preservation, even in cases when the technology foundation used for an archive or an e-infrastructure is sound and well-supported. At the end of the day, every technology evolves – and at a brisk pace compared to the relatively long time during which many data assets are going to be useful – so data policies and the means of their expression should be semantically clear and, in a way, more permanent than the technology that underpins data management.

A strong case for policy-driven digital preservation, with extensive references to the prominent projects and popular methodologies, was made in [2].

In practice, quite a few data archives and e-infrastructures end up in a situation when they have a sound technology for managing data bits and acquire a decent number of users (a popular measure used by funders for judging e-infrastructure success) but do not have a reasonable data policy, let alone any machine-assisted reasoning over it. The users' trust in the archive or the e-infrastructure may be enough for their daily use, but there can be a substantial conceptual and technological gap in regard to data policy formulation, expression and execution.

Some larger projects and e-infrastructures are aware of this gap and do make efforts to close it by working on data policy implementation. An example of such an e-infrastructure is EUDAT [3] that has developed a number of operational services [4] and data pilots with user communities, and is now trying to express and apply policies to these services.

The prime candidate for applying data policies in EUDAT is the B2SAFE service [5] based on the iRODS platform [6]. B2SAFE developers are doing a very good job of building geographically and organizationally distributed data storage with data replication, integrity checks and other routine tasks of data management guided by iRODS machine-executable rules. They have also made their own effort on policies with the development of the Data Policy Manager [7], a software module with policies expressed via XML templates. There is a perceived need, though, for a more universal solution for policy management across all EUDAT services. The possible policy modelling approaches under consideration are RuleML [8], SWRL [9] and the ProvONE ontology [10]; the latter seems suitable not only for capturing data provenance after the execution of certain actions but also for the forward-looking design of data processing workflows, which can then potentially serve as a means of data policy modelling.

This work presents an alternative approach to those mentioned; it is based on the Research Activity Model [11], which is in fact quite universal and suitable for the expression of all sorts of activities, not necessarily related to research. The Research Activity Model is slightly extended and applied to the case of data policy modelling.

The main advantage of this alternative approach is its high modularity, which allows modelling policy elements and using them as building blocks for the semantically clear representation of a whole policy. The modularity of policy design is especially important in data infrastructures that commonly aggregate data coming from different user communities, often having their own business models, technical requirements, data formats and data lifecycles, which makes it difficult to design and adequately express the crosswalks between community-specific data policies and those for the data infrastructure.

Another advantage of the suggested approach is its ability to address the conceptual gap between policy formulation and policy implementation, as it may not be easy to translate a high-level policy (often in a textual form) into a machine-executable policy.

The modularity should allow high levels of inheritance and reuse of policy elements; it also helps to solve specific problems of policy formulation and validation when the textually same policy can be executed in different ways leading to different states of a data archive – a situation for which we provide an example.

The conceptual gap between policy formulation and policy implementation is addressed by the possibility to define policy-related Activities as "black boxes" with (initially) only interfaces defined; this can hopefully be done by policy makers themselves, without entirely delegating this policy design phase to policy implementers (software developers).
Implementation of a sensible data policy is a challenging task even within the boundaries of a particular organization. In a situation when the organization is using a collaborative data infrastructure along with its own organization-specific IT services, the implementation of a data policy is going to be even more intricate and is likely to rely on loosely coupled services. The approach to data policy modelling suggested in this work is going to address this challenge, along with alleviating the earlier mentioned problems of policy element reusability and policy application result predictability.

The work is inspired by the needs of the EUDAT Collaborative Data Infrastructure [3] and refers to it for illustration of certain ideas; the main incentive for the work was modelling policies for the case of long-term digital preservation. However, the suggested modelling technique is universal and can be considered for all archives or e-infrastructures that are interested in all sorts of data management (not only long-term digital preservation) requiring a clearly defined policy linked to machine-executable policy implementations.

Conceptual challenges of data policy modelling are discussed first, specifically the problem of policy decomposition into policy elements; then an example is given of how the Activity Model can be used for policy modelling. This is followed by suggestions on what IT architecture for data policy management will be required to support the suggested modelling techniques.

2 Data policy and a problem of its decomposition

2.1 Insufficiency of granular policy definition

Data policy is often created as a conventional textual document that contains certain statements about what should or should not be done with data, with implied or sometimes explicit logical "ANDs" and "ORs" that glue the statements together into an aggregated policy. This composite nature of policies is why it seems natural to break the policy document down into granular statements, model each statement using some formalism, and then execute the statements using some IT solution.

One of the most advanced efforts on data policy decomposition was performed by the SCAPE project [12] that created an extensive catalogue of preservation policy elements [13] which are in fact granular textual statements. These granular statements, which can be converted in a pretty straightforward way into machine-executable statements, are called control policies in SCAPE. Examples of control policies are: "information on preservation events should use the PREMIS metadata schema" or "original object creation date must be captured". The granular control policies relate to a higher-level procedural policy (a procedural policy on Provenance for the current example), which in turn relates to an even higher-level and most abstract guidance policy (a policy on Authenticity for the current example). The three-level structure of guidance policies, procedural policies and control policies constitutes a very well developed SCAPE digital preservation policy framework.

SCAPE stopped short of the actual implementation of control policies, so when EUDAT [3] decided to use the SCAPE framework for policy considerations, it was also decided to supplement this framework with the catalogue of practical data policies [14] developed by the RDA (Research Data Alliance) Practical Policy Working Group. The practical data policies in this catalogue are expressed as iRODS [6] functions specifically suitable for implementation in the EUDAT B2SAFE service [5] based on the iRODS platform.

Having well-defined control policies or practical policies is not enough, though, for semantically clear modelling of a data policy as a whole, as the application (execution) of a policy composed of granular machine-executable statements may lead to quite different outcomes depending on the order in which the granular policies are applied.

The problem of policy decomposition is in fact interrelated with the problem of policy validation. To illustrate this, let us consider a simple case when there are two easily identifiable policy statements contained in the same policy document which we want to decompose and validate through the execution of two granular policies. Let the statements in the composite policy (perhaps, but not necessarily, added one to another through some policy update by different policy managers) be:

[1] Image files having a size of more than X gigabytes should be stored in file storage A; otherwise they should be stored in file storage B.

[2] Image files of type RAW should be converted into JPG format.
If a certain file of type RAW is more than X gigabytes in size but becomes less than X when converted into JPG then, depending on the higher-level guiding policy and on the order in which these granular policies are applied in the actual service implementation, the result of the combined application of the two granular policies can be any of the following:

1. The file is moved as RAW to storage A and remains stored in A as RAW.
2. The file is moved as RAW to storage A, then converted into JPG, and remains stored in A.
3. The file is converted into JPG and stored in B.
4. The file is moved as RAW to storage A and remains stored in A as RAW; also a copy of it converted into JPG is stored in B.

This is to illustrate that validation of the data policy implementation is hard, as any of the listed outcomes may be considered right or wrong depending on the validator's point of view.

Let us also take into account that policy validation can be based on some statistical selection of samples (so that problematic boundary cases of RAW data sized only slightly over the X gigabytes threshold may not be selected in a sample and hence go unnoticed), or that a policy validation procedure may allow some tolerance towards a small amount of failed policy checks (so that even if a few files have ended up somewhere that a particular policy interpretation considers to be a wrong place, this does not trigger a policy violation alert).

So even if the data policy can be, seemingly successfully, decomposed into granular policies that are easy to define and validate as machine-executable statements, the actual result of the policy implementation does not necessarily match the intentions of policy designers or policy managers, as the backwards process of policy composition – assembling it from the granular policies (policy elements) – can be performed with substantial variations.
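The order dependence behind the outcomes listed above can be made concrete with a small sketch; the threshold value and the compression ratio below are assumptions made purely for the illustration, not part of the policy text:

    X = 1.0  # the "X gigabytes" threshold of statement [1]; value assumed

    def apply_placement(f):
        # Statement [1]: image files over X GB go to file storage A, otherwise to B.
        f["storage"] = "A" if f["size_gb"] > X else "B"

    def apply_conversion(f):
        # Statement [2]: RAW image files are converted into JPG format;
        # the compression ratio is an assumption made for the illustration.
        if f["format"] == "RAW":
            f["format"], f["size_gb"] = "JPG", f["size_gb"] * 0.1

    original = {"format": "RAW", "size_gb": 1.5}

    first = dict(original)
    apply_placement(first)    # 1.5 GB > X, so the RAW file goes to storage A
    apply_conversion(first)   # converted in place: outcome 2 of the list above

    second = dict(original)
    apply_conversion(second)  # now 0.15 GB
    apply_placement(second)   # 0.15 GB <= X, so the JPG goes to B: outcome 3

    print(first["storage"], second["storage"])  # A B

Which of the two end states is "correct" cannot be decided from the two statements alone; it is the composition, not the granular policies themselves, that is underspecified.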
2.2 Possible responses to the challenge of granular policies insufficiency

One possible response to the outlined challenge could be setting up an elaborated policy governance framework, i.e. well-defined business processes that allow human agents (policy managers) to look after the policy implementation, i.e. accumulate and analyse feedback from the environment where the policy is applied and supply the result of this analysis as updated requirements to the software developers who work on the actual software implementation of the policy. This approach requires a good organizational culture and a substantial human resource involved in data policy management and in policy implementation; documented requirements will serve as an interface between policy managers and policy implementers. Some "magic" should happen in between so that high-level policy definitions translate into an actual policy implementation in software code; this is why policy validation is likely to demand extensive software testing with specific policy-related test cases.

Another possible response is having an elaborated means of expression for the entire data policy (a sophisticated policy modelling language): both for the definition of granular policies and for the definition of the logic that binds the granular policies into the whole. An example of this approach is RuleML [8] that is considered a candidate for a detailed expression of data policy in the EUDAT e-infrastructure [3]. This approach requires a skilled human resource for policy modelling; the modeller, and the sophisticated model produced by her, then becomes an interface between policy managers and policy implementers (the role of the latter is less prominent than in the first approach, in the sense that software developers should not interpret requirements but just implement – or adopt – a certain engine that executes the formal rules defined by the savvy policy modeller).

The third possible response is that a certain formalism is used for the expression and, where necessary, recomposition of granular policies (policy elements) and for their assembly into the whole, with that formalism being reasonably friendly to machines as well as to humans. The humans – policy managers themselves or a not-so-skilled modeller – can use the formalism for a flexible policy definition that can be fairly easily modified depending on the true policy intentions and on the feedback received from the archive or e-infrastructure where the policy is implemented. The role of software developers is then to implement an engine for the formalism (quite similarly to the second approach). The machine just executes the policy expressed using that formalism.

The differences amongst the approaches are presented in Table 1; in essence, they are different "weights" (different levels of demand) for the skills of policy managers, policy modellers and policy implementers.

Table 1 Differences amongst policy modelling approaches

| Policy modelling approach | Demands for policy manager skills | Demands for policy modeller skills | Demands for policy implementer skills |
|---|---|---|---|
| Policy governance framework + requirements management + specific software testing | High | None (the policy modeller can be replaced by a business analyst or/and a software tester) | High |
| Policy modelling language | Low | High | Medium |
| Formalism for granular policy elements definition and composition | Medium | Medium | Medium |

The preferable approach could easily be the third one, as it empowers policy modellers themselves with reasonable means of policy expression and therefore can reduce the overheads and risks of communicating a policy from policy managers through modellers to implementers. A remote analogy of the third approach could be the proliferation of the SQL language which, despite its sophistication, has become a lingua franca not only of software engineers but is widely used by logistics and even sales departments in all sorts of business.
The formalism to be used for data policy expression should not be something as developed as SQL, though; neither should it be purely textual: it can be based on the idea of "building blocks" with a possible graphical representation of them, hence providing an easy-to-operate semantic wrapper for machine-executable statements. On the other hand (unlike SQL, which allows the actual data manipulation), these "building blocks" for data policy definition are likely to remain only a wrapper for the actual machine-executable implementations of granular policies, which will inevitably be specific to a particular service even within the same archive or e-infrastructure. As an example, for EUDAT B2SAFE [5] that is based on the iRODS platform [6], these granular implementations can be iRODS functions, while for other EUDAT services based on other software platforms the policy implementations can be something else. A common semantic wrapper will then be a reasonable means of clear policy modelling and of a clear definition of interfaces between policy "building blocks" across a variety of different IT services.

This work strongly prefers the third approach and suggests considering the Activity Model [11] for semantically clear modelling of data policies in all IT services within the same data archive or e-infrastructure, as well as for policy interoperability across different data archives and e-infrastructures.

3 Activity Model as a semantic wrapper for machine-executable policies

3.1 Activity Model in a nutshell

The Activity Model [11] was initially suggested for modelling granular research activities and combining them in networks so that, as an example, the output of one Activity can be the input of another one; such combined Activities may represent, e.g., certain phases in research data analysis. It has been clear, though, that the Activity Model can suit all sorts of activities as it is pretty generic; as an example, it may well suit modelling data provenance across different IT services within an e-infrastructure.

The main "building block" of the Activity Model is an "activity cell" represented by Figure 1, with its aspects (that can be thought of as incoming and outcoming relations) explained in Table 2.

Figure 1 Research activity "cell"; it can be used for semantic definition of any activity

Table 2 Activity Model aspects explained

| Aspect | Description | Example (research per se) | Example (research data analysis) |
|---|---|---|---|
| Input | Something that is taken in or operated on by Activity | Previous research | Raw data |
| Output | Something that is intentionally produced by Activity | Raw data | Derived (analyzed) data |
| Scope | Something that Activity is aimed at or deals with | Sample properties | One or more experiments |
| Condition | Something that affects or supports Activity, or gives it a specific context | Scientific instrument | IT environment |
| Actor | Something or somebody who participates in Activity | Investigator | Data analyst |
| Effect | Something that is a consequence of Activity | Environment pollution | New software module |

The full RDF serialization of the Activity Model is published in [11]; it is really simple and requires only RDF Schema and an "inverseOf" OWL statement for its expression, i.e. what is often referred to as RDFS Plus.

Activity "cells" can be combined in chains or networks, and not necessarily in a way that the Output of one Activity is the Input of another. As an example, a data management policy can be the Output of one Activity (policy design) and the Condition that affects another Activity, e.g. data replication in the archive.
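As a sketch of this kind of chaining – a policy produced by one Activity acting as the Condition of another – such a combination could be serialized along the following lines; the namespace URIs and instance names are illustrative, while the aspect properties follow the aspect names used in the listing of Section 3.3:

    from rdflib import Graph, Namespace, RDF

    AM = Namespace("http://example.org/am#")       # Activity Model namespace; URI illustrative
    EX = Namespace("http://example.org/example#")  # example instances; URI illustrative
    g = Graph()
    g.bind("am", AM)
    g.bind("ex", EX)

    # Policy design is an Activity whose Output is a data management policy ...
    g.add((EX.PolicyDesign, RDF.type, AM.Activity))
    g.add((EX.PolicyDesign, AM.hasOutput, EX.DataManagementPolicy))

    # ... and the same policy is the Condition of a data replication Activity.
    g.add((EX.DataReplication, RDF.type, AM.Activity))
    g.add((EX.DataReplication, AM.hasCondition, EX.DataManagementPolicy))

    print(g.serialize(format="turtle"))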
The model flexibility, where any aspect of one Activity can be matched with any aspect of another Activity, is supported by the fact that aspects do not have to have types associated with them.

3.2 Proposed extensions of the Activity Model

In order to use the Activity Model for data policy modelling, we will need to make a profile of the model by specifying certain types of Activity as subclasses (in the case of an RDF serialization of the model – RDFS subclasses). The suggested extensions are presented in Table 3. Conceptually, Generic Data Management Activities should cover the needs of data engineering that are related to machine-interpretable policy implementations, Logical Switch Activities should cover the needs of data analysis and machine-assisted reasoning, and Control Activities should cover the needs of IT services deployment and operation.

Table 3 Additions to the core Activity Model required for data policy modelling

| Type to add | Comment / Description |
|---|---|
| Generic Data Management Activity | Subclass of Activity for data policy definition. It can be considered a semantic wrapper for a variety of data handling Activities, e.g. Activities for data characterization or data transformation. |
| Logical Switch Activity | Subclass of Activity for logical switches of all sorts. |
| Control Activity | Subclass of Activity for an interface with a particular software platform where policies are executed. This is a semantic wrapper for the actual call to a platform-specific script or function. |

Depending on a particular operational environment (software platform where policies are executed), other parts of the Activity Model, e.g. its Inputs, Outputs, or Conditions, may require additional semantically clear extensions. However, it is unclear at the moment whether these potentially required extensions should be a part of the universal Activity Model profile for data policies, or whether it is better to introduce them as necessary, as parts of policy execution engine implementations on particular software platforms.
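The core of such a profile – the Table 3 types declared as RDFS subclasses of the core Activity class – could look as follows; the namespace URIs are illustrative, and the class names are taken from Table 3:

    from rdflib import Graph, Namespace, RDFS

    AM = Namespace("http://example.org/am#")      # core Activity Model; URI illustrative
    AMPP = Namespace("http://example.org/ampp#")  # data policy profile; URI illustrative
    g = Graph()
    g.bind("am", AM)
    g.bind("ampp", AMPP)

    # The three Activity types of Table 3, declared as RDFS subclasses of Activity.
    for name in ("GenericDataManagementActivity",
                 "LogicalSwitchActivity",
                 "ControlActivity"):
        g.add((AMPP[name], RDFS.subClassOf, AM.Activity))

    print(g.serialize(format="turtle"))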
Compared to modelling data policies with workflows, the suggested approach based on the definition of policy-related Activities should allow more loosely coupled implementations of policy management IT solutions. As an example, the "data engineering" part of policy implementation represented by a Generic Data Management Activity can be performed on a software platform fully controlled by a specific user community or organization (e.g. a research institution), the operation (the actual execution of control statements) represented by a Control Activity can be performed by a collaborative data infrastructure (e.g. by EUDAT CDI [3]), and the logic of combining policy elements represented by a Logical Switch Activity can be performed by either the organization or the data infrastructure, or by a third-party service.

If the policy were modelled by an executable workflow, it would require the presence of all three aspects – data engineering, reasoning and execution – in the same workflow, likely operated by a single universal workflow engine. This would mean not only an operational limitation but a conceptual / modelling limitation, too, as all the participants (stakeholders) of policy implementation would have to adhere to the conceptual framework and the format required by the workflow engine. Modelling with interconnected Activities as semantic wrappers to particular implementations leaves more freedom to conceptualize and to operate data policies that are going to be executed by loosely coupled IT services.

3.3 Examples of the Activity Model data policies profile application

The role of the suggested model extensions will be clearer from an example of their application to the modelling of a particular policy. The example will be a policy with the two granular statements about data movements depending on data size and data format that were considered in Section 2.1.

We will need to define first a File Characterization Activity:

    @prefix am:   <http://example.org/am#> .    # Activity Model namespace; URI illustrative
    @prefix ampp: <http://example.org/ampp#> .  # policy profile namespace; URI illustrative
    @prefix :     <http://example.org/policy#> .

    :GDMA_FileChar a ampp:GenericDataPolicyActivity .
    :GDMA_FileChar am:hasInput :File .
    :GDMA_FileChar am:hasOutput :FileSize .
    :GDMA_FileChar am:hasOutput :FileFormat .
    :GDMA_FileChar am:hasOutput :File .
    :GDMA_FileChar am:hasScope :ImageFiles .
    :GDMA_FileChar am:hasCondition :ServiceInstance .
    :GDMA_FileChar am:hasActor :CertainScript .
    :GDMA_FileChar am:hasEffect :FileCharLog .

In short, the GDMA_FileChar activity takes a file as an input and produces values for the file size and file format (which can be semantically clearly defined as necessary – e.g. with measurement units and format IDs in a file type registry) as outputs; the initial file is passed over as another output. To derive the file size and format, the activity uses CertainScript (which again can be semantically clearly defined as necessary – e.g. with references to a software repository).

As an additional outcome (better defined not as Output but as Effect) of the file characterization activity, we get the FileCharLog log file. The scope of the activity is defined as ImageFiles (so that other kinds of files can be handled by differently defined Characterization Activities; what "ImageFiles" actually means can be clearly defined with e.g. a reference to a certain taxonomy entry). The Condition is defined as ServiceInstance (which means that the Actor CertainScript operates in some particular IT service environment).

Mapping of an Activity to a particular software implementation can be performed using the Activity ID and a reference to a repository with a clear software identity, e.g. a software versioning repository.

The graphic representation of this Characterization Activity (which, in an ideal world, can be designed in a certain authoring tool with a graphical user interface producing the above RDF as a serialization) is illustrated by Figure 2.

Figure 2 Definition of a Data Policy Activity for image files characterization
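One conceivable shape of the mapping to a software identity mentioned above – the property choice and the repository URL are illustrative, not prescribed by the model – is a reference from the Actor to a tagged revision in a versioning repository:

    from rdflib import Graph, Namespace, RDFS, URIRef

    AM = Namespace("http://example.org/am#")      # URI illustrative
    EX = Namespace("http://example.org/policy#")  # URI illustrative
    g = Graph()

    # The Actor of the characterization Activity, pinned to a tagged revision
    # in a versioning repository (hypothetical URL) for a clear software identity.
    g.add((EX.GDMA_FileChar, AM.hasActor, EX.CertainScript))
    g.add((EX.CertainScript, RDFS.seeAlso,
           URIRef("https://example.org/vcs/file-characterization/tags/v1.0")))

    print(g.serialize(format="turtle"))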
The problem of the policy composition out of two granular policies outlined in Section 2.1 can be addressed with the help of the other classes of activities that we introduced earlier: Logical Switch and Control. For the sake of simplicity (as we are just going to illustrate how the policy modelling can be done), we will not be defining all aspects for these activities; e.g. we can omit Scope or Effect, but they may be required in a real policy modelling situation.

The Logical Switch activity will take File, FileSize and FileFormat as Inputs; a particular logic of handling file moves to either storage A or B, as well as file conversion, will be its Condition. The Activity yields a list of particular control statements (like "move File to storage A" or "convert File into JPG format") as Output. The shape of the so-defined Logical Switch activity is illustrated by Figure 3.

Figure 3 Definition of a Logical Switch Activity for handling image files

The semantically clear definition of a Logical Switch Activity gives an idea of how we suggest addressing the problem of policy composition from granular policy statements. The hope is that if the logic of producing control statements is made explicit, as well as the control statements themselves, this will eliminate the ambiguity of a policy composed of granular policy statements.

A good question is what formalism, if any, will be adequate for the expression of logic in the Condition of the Logical Switch. The short answer is: it depends on the policy engine implementation. In an extreme case, this Condition can be just a mandatory textual explanation (commentary) of the logic implemented by the Actor (which is omitted in Figure 3), i.e. by an executable function or a procedure or a script for a particular IT platform. Alternatively, rules modelling languages or workflow templates (and appropriate engines for them) can be used – yet, in this case, the actual usage of these modelling languages or workflow templates would be limited to the policy logic enwrapped in the Logical Switch Activity, allowing freedom for different implementations of the other types of Activities involved in the policy definition.

How to express control statements in the Output is subject to particular implementations, too. The only consideration which is important for the moment – important both from the conceptual and from the implementation perspectives – is having the list of control statements as a clearly defined interface between the Logical Switch Activity and the Control Activity.

The Control Activity takes the list of control statements as Input and makes platform-specific function or procedure or script calls that implement the control statements. The Actors for the Control Activity are particular functions / procedures / scripts, and its Effects are log and error files or messages – whatever is used for traceability in a particular implementation. The Condition is, similarly to the file characterization activity definition, a particular software platform or IT service where the Actors operate. Figure 4 presents an example of a diagram for the Control Activity.

Figure 4 Definition of a Control Activity for policy execution
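Following the pattern of the File Characterization listing above, the two activities could be sketched as follows; the namespace URIs and instance names are illustrative, and the ControlStatements resource stands for the list that serves as the interface between them:

    from rdflib import Graph, Namespace, RDF

    AM = Namespace("http://example.org/am#")      # URI illustrative
    AMPP = Namespace("http://example.org/ampp#")  # URI illustrative
    EX = Namespace("http://example.org/policy#")  # URI illustrative
    g = Graph()

    # Logical Switch: file descriptions in, control statements out;
    # the switching logic itself is carried by the Condition.
    g.add((EX.LS_ImageFiles, RDF.type, AMPP.LogicalSwitchActivity))
    for inp in (EX.File, EX.FileSize, EX.FileFormat):
        g.add((EX.LS_ImageFiles, AM.hasInput, inp))
    g.add((EX.LS_ImageFiles, AM.hasCondition, EX.FileHandlingLogic))
    g.add((EX.LS_ImageFiles, AM.hasOutput, EX.ControlStatements))

    # Control Activity: consumes the control statements; its Actors are the
    # platform-specific scripts, and its Effect is whatever is used for traceability.
    g.add((EX.CA_ImageFiles, RDF.type, AMPP.ControlActivity))
    g.add((EX.CA_ImageFiles, AM.hasInput, EX.ControlStatements))
    g.add((EX.CA_ImageFiles, AM.hasActor, EX.PlatformScripts))
    g.add((EX.CA_ImageFiles, AM.hasCondition, EX.ServiceInstance))
    g.add((EX.CA_ImageFiles, AM.hasEffect, EX.ExecutionLog))

    print(g.serialize(format="turtle"))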
Generic Data Policy Activities (such as data characterization) can be combined with Logical Switch Activities and Control Activities in a chain or a network of activities. For our example, the resulting chain is illustrated by Figure 5. It represents the full model of a certain data policy expressed as a chain of semantically clear activities with interfaces between them, as well as interfaces to activity implementations in particular IT services or software platforms.

It is worth mentioning once again that every aspect in the Figure 5 diagram (such as File, Size, Format, Script or Log) should be thought of not as a particular artefact or a value but as a semantic wrapper of an artefact or a value. In a particular model serialization, these semantic wrappers can be RDF statements about artefacts or values.

Figure 5 Example of full policy definition

In real data policy modelling situations, it may be necessary to define more than one instance of each Activity type; as an example, there could be two Data Characterization Activities defined (one for the file size and another for the file format) in place of the one in our example. Nevertheless, even differently defined Activities could be combined in a semantically clear network representing the same data policy.

If the Activities in Figure 5 are clearly defined and sensibly combined in the Activity network, this eliminates any ambiguity in policy definition and execution of the kind exemplified by the two interfering granular policies discussed back in Section 2.1, so that the actual result of the policy implementation becomes predictable and can be formally validated.

One of the strengths of the suggested model is the combination of its reasonable expressivity with its high flexibility, as it is based on the idea of composition of activities that can be a) modelled differently, b) implemented differently, and c) operated (executed) differently. In the above example, scripts for file characterization and scripts for policy execution can be implemented using different software and operated by different components of the same service, or by different services, or even by different e-infrastructures.

The actual chain or network of activities, as well as the definition of each of them (i.e. the definition of all semantic wrappers), could be done in a certain authoring tool with a graphic user interface and RDF as a model serialization format. Development of such a tool has been beyond the resources available for this conceptual work; however, such a tool is worth mentioning as one of the elements of an IT architecture that can support data policy formulation, execution and validation.

4 IT architecture for activity-based data policy management

The proposed IT architecture is presented by Figure 6, with the most essential components and information flows (those that would constitute a core operational platform for data policy management) designated as filled-in boxes and arrows; more advanced components and flows are designated as dashed boxes and arrows with a blank background.

As already suggested, having policy Activities authoring tools with a GUI and a possibility to serialize Activity networks in a semantically explicit format such as RDF is essential for good levels of adoption of the suggested approach, and therefore such authoring tools should be a part of a sensible IT architecture for data policy management. In addition, what is required is a repository where policy designs can be stored and retrieved from.

Figure 6 IT architecture for activity-based policy management

The Activity network interpretation engine picks up Activity networks from the authoring tools or the repository and executes them. In order to execute activity networks in a particular IT environment (software platforms and services), a mapping engine is required that maps Activities and their aspects (such as Conditions or Outputs) to configuration files and executable scripts.

In addition to this generic mapping engine, specific engines for logical conditions and control statements can be implemented. The Effects repository stores the Effect aspects of each Activity; it is a generalization of a logging service and contains semantically clear tracks of Activities execution. A policy search interface can be designed for searching and sharing data policies.

For the purposes of data archive or data infrastructure audit, a policy validation engine is required that talks to the policy search interface and to the Effects repository. The actual validation can be based on matching the graphs of artefacts resulting from policy execution against the graphs of Activities in the policy design.
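A minimal sketch of this graph-matching idea, assuming both the policy design and the Effects repository are available as RDF graphs (all names below are illustrative):

    from rdflib import Graph, Namespace

    AM = Namespace("http://example.org/am#")      # URI illustrative
    EX = Namespace("http://example.org/policy#")  # URI illustrative

    design = Graph()   # Effects promised by the policy design
    design.add((EX.CA_ImageFiles, AM.hasEffect, EX.ExecutionLog))

    effects = Graph()  # what the Effects repository recorded after execution
    effects.add((EX.GDMA_FileChar, AM.hasEffect, EX.FileCharLog))
    effects.add((EX.CA_ImageFiles, AM.hasEffect, EX.ExecutionLog))

    # Validation passes if every Effect expected by the design was recorded.
    missing = [t for t in design if t not in effects]
    print("policy valid" if not missing else f"deviations: {missing}")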
5 Conclusion

The problem of data policy modelling with reasonable crosswalks between high-level (read: textual) policies and their machine-executable implementations has yet to find a satisfactory solution. The challenges of policy design and implementation are even bigger when collaborative data infrastructures are operated in combination with in-house software platforms.

The problem of semantically clear crosswalks and the problem of data policy implementation across organization-specific and external IT services can be addressed by the adoption of certain policy modelling techniques and tools. The Activity Model [11] can be a reasonable means for the design of such tools, with the idea that data policies can be represented as networks of Activities with interconnected aspects.

This work has introduced extensions to the Activity Model in order to make it fit for the task of data policy modelling. An example of using the Activity Model for the definition of a particular data policy has been given, and a possible IT architecture has been considered that can support data policy management based on Activity networks.

Acknowledgements

This work is supported by the EUDAT 2020 project that receives funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 654065. The views expressed are those of the author and not necessarily of the project.

References

[1] Giaretta, D.: Advanced Digital Preservation. Springer, Heidelberg (2011)
[2] Bunakov, V., Jones, C., Matthews, B., Wilson, M.: Data authenticity and data value in policy-driven digital collections. OCLC Systems & Services: International digital library perspectives, vol. 30, issue 4, pp. 212-231 (2014). doi: 10.1108/OCLC-07-2013-0025. Open Access version of the preprint: http://purl.org/net/epubs/work/12299882
[3] EUDAT Collaborative Data Infrastructure. https://www.eudat.eu/eudat-cdi
[4] EUDAT services. https://www.eudat.eu/services-support
[5] EUDAT B2SAFE service. https://www.eudat.eu/b2safe
[6] iRODS: Integrated Rule-Oriented Data System. https://irods.org/
[7] EUDAT Data Policy Manager. https://github.com/EUDAT-B2SAFE/B2SAFE-DPM
[8] RuleML Wiki pages. http://wiki.ruleml.org/index.php/RuleML_Home
[9] SWRL: A Semantic Web Rule Language Combining OWL and RuleML. W3C Member Submission. https://www.w3.org/Submission/SWRL/
[10] ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance. http://vcvcomputing.com/provone/provone.html
[11] Bunakov, V.: Core semantic model for generic research activity. In: 15th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL 2013), Yaroslavl, Russia, 14-17 Oct 2013. CEUR Workshop Proceedings (ISSN 1613-0073) 1108, 79-84 (2013). Persistent URL: http://purl.org/net/epubs/work/10938342
[12] SCAPE: Scalable Preservation Environments project. http://scape-project.eu/
[13] SCAPE Catalogue of Preservation Policy Elements. http://scape-project.eu/wp-content/uploads/2014/02/SCAPE_D13.2_KB_V1.0.pdf
[14] Practical Policy Implementations Report. http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC