=Paper=
{{Paper
|id=Vol-1442/paper_22
|storemode=property
|title=Using Ontology and Data Provenance to Improve Software Processes
|pdfUrl=https://ceur-ws.org/Vol-1442/paper_22.pdf
|volume=Vol-1442
|dblpUrl=https://dblp.org/rec/conf/ontobras/DalpraCSBCWD15
}}
==Using Ontology and Data Provenance to Improve Software Processes==
<pdf width="1500px">https://ceur-ws.org/Vol-1442/paper_22.pdf</pdf>
<pre>
         Using Ontology and Data Provenance to Improve
                       Software Processes
 Humberto L. O. Dalpra1, Gabriella C. B. Costa2, Tássio F. M. Sirqueira1, Regina
   Braga1, Cláudia M. L. Werner2, Fernanda Campos1, José Maria N. David1
    1 UFJF – Federal University of Juiz de Fora – Department of Computer Science, Juiz de
                                    Fora – MG – Brazil.
    2 UFRJ – Federal University of Rio de Janeiro – COPPE – Systems Engineering and
                 Computer Science Department, Rio de Janeiro – RJ – Brazil.
                  humbertodalpra@gmail.com, tassio@tassio.eti.br,
                         {gabriellacbc,werner}@cos.ufrj.br,
              {regina.braga, fernanda.campos, jose.david}@ufjf.edu.br
         Abstract. Provenance refers to the origin of a particular object. In
         computational terms, provenance is a historical record of the derivation of
         data that can help to understand the current record. In this context, this work
         presents a proposal for software process improvement using a provenance
         data model and an ontology. This improvement can be obtained by analyzing
         process execution data with an approach called PROV-Process, which uses a
         layer for storing process provenance and an ontology based on PROV-O.

1. Introduction
A process can be defined as a systematic approach to creating a product or performing
some task [Osterweil, 1987]. Currently, many organizations are investing in the
definition and improvement of their processes, aiming to improve product quality.
However, the growing volume of process data makes its analysis more complex. Suitable
techniques are required to analyze these data properly and to extract the records that
will, in fact, contribute to process improvement. One way of analyzing these data is to
use provenance techniques and models.
         Buneman et al. (2001) define data provenance as the description of the origins of
a piece of data and how it is stored in a database. Thus, to capture the origin of process
data, it is necessary to capture the process flow specification (prospective provenance)
and process execution data (retrospective provenance), in order to have the information
regarding the success, failure, delays and errors, during process execution.
        Lim et al. (2010) state that provenance can be captured prospectively and
retrospectively. Prospective provenance captures the abstract workflow (or process)
specification, enabling future data derivation. Retrospective provenance captures
process execution, i.e., data derivation records.
         To obtain the benefits of provenance, data have to be modeled, gathered, and
stored for further queries [Marinho et al., 2012]. Once process provenance data have
been captured and stored, they can be analyzed to enable process improvement (e.g.,
shorter execution time and more efficient results). One possible way to analyze process
provenance data is to use an ontology and the inference mechanisms it provides,
enabling the discovery of strategic information for software project managers. This
paper proposes a layer for storing software process provenance data and analyzing
these data using an ontology. The W3C provenance model PROV [Groth and Moreau, 2013]
was used both for storage and analysis of these data.
       The remainder of this paper is structured as follows: Section 2 presents related
work that deals with provenance and processes. Section 3 describes PROV-Process, the
approach to improving software processes using an ontology. Section 4 presents an
overview of the PROV-Process ontology, which is based on PROV-O, and describes the
extensions made to it. Section 5 discusses the analysis of an industry software process
using the PROV-Process approach and the possibilities for improving future executions
of this process through the information obtained from the PROV-Process ontology.
Finally, conclusions are presented in Section 6.

2. Related Work
Missier et al. (2013) present D-PROV, an extension of the PROV specification that
aims to represent process structure, i.e., to enable storage and querying of
prospective provenance. The article shows an example of using D-PROV in the context of
scientific workflows defined by DataONE scientists. This work was used as a basis for
capturing prospective provenance in the PROV-Process approach.
        Miles et al. (2011) propose a technique, called PrIMe, to adapt application
projects to interact with a provenance layer. The authors specify the steps involved in
applying PrIMe and analyze its effectiveness through two case studies.
        Wendel et al. (2010) present a solution to failures in software development
processes based on PrIMe, the Open Provenance Model and a SOA architecture. They
use Neo4j to store the data, Gremlin to query it and REST web services to connect
to the tools.
        Junaid et al. (2009) propose an approach in which a provenance system intercepts
users' actions, then processes and stores them in order to suggest possible future
actions for the workflow project. These suggestions are computed from the actions of
the current user and from the stored provenance information.
       Like the related work mentioned above, the PROV-Process approach aims to
improve future software process executions through provenance data. However, the
other approaches neither use ontologies to query provenance data nor apply any
inference mechanism, as the PROV-Process approach does. Through ontology inferences,
we derive strategic information to suggest software process improvements, as shown in
the next sections.

3. PROV-Process Overview
PROV-Process is an approach for storage and analysis of software process provenance
data in order to improve future process execution. The main objective of the approach is
to identify improvements for future software process instances by using a provenance
layer (comprising a database, an ontology and mechanisms to manipulate these
components).
       As shown in Figure 1, after process modeling, a process instance can be
created. Both the process model and the model of the generated instance are stored in
the PROV-Process Database through a prospective mechanism. After that, the process
instance can be executed, and the retrospective provenance data are stored through the
PROV-Process approach. This storage uses a relational database modeled according to
the PROV-DM specification [Moreau and Missier, 2013].


                          Figure 1: PROV-Process Approach


         PROV-DM types and relations are organized according to six components.
PROV-Process Database implements all these components using a relational database.
Figure 2 shows, for example, tables of the first component, which comprise entities,
activities and their interrelations: Used (Usage), WasGeneratedBy (Generation),
WasStartedBy (Start), WasEndedBy (End), WasInvalidatedBy (Invalidation), and
WasInformedBy (Communication).
     All the data stored in the PROV-Process relational database are exported to the
PROV-Process ontology. This ontology is described in detail in the next section.
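
A minimal sketch of how part of the first PROV-DM component could be laid out as relational tables is given here, using Python's built-in sqlite3 module. The table and column names are assumptions made for illustration; the actual PROV-Process schema is the one depicted in Figure 2 and may differ.

```python
import sqlite3

# Hypothetical, simplified schema: the real PROV-Process tables (Figure 2)
# may use different names and columns.
SCHEMA = """
CREATE TABLE Entity   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Activity (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       startTime TEXT, endTime TEXT);
CREATE TABLE Used (
    activity_id INTEGER NOT NULL REFERENCES Activity(id),
    entity_id   INTEGER NOT NULL REFERENCES Entity(id));
CREATE TABLE WasGeneratedBy (
    entity_id   INTEGER NOT NULL REFERENCES Entity(id),
    activity_id INTEGER NOT NULL REFERENCES Activity(id));
CREATE TABLE WasInformedBy (
    informed_id  INTEGER NOT NULL REFERENCES Activity(id),
    informant_id INTEGER NOT NULL REFERENCES Activity(id));
"""


def create_database():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def entities_used_by(conn, activity_id):
    """Return the names of the entities used by one activity, sorted by name."""
    rows = conn.execute(
        "SELECT e.name FROM Used u JOIN Entity e ON e.id = u.entity_id "
        "WHERE u.activity_id = ? ORDER BY e.name",
        (activity_id,),
    )
    return [name for (name,) in rows]


conn = create_database()
conn.execute("INSERT INTO Activity (id, name) VALUES (1, 'Solution Implementation')")
conn.executemany("INSERT INTO Entity (id, name) VALUES (?, ?)",
                 [(1, "Financial"), (2, "clsValidacao")])
conn.executemany("INSERT INTO Used (activity_id, entity_id) VALUES (?, ?)",
                 [(1, 1), (1, 2)])
print(entities_used_by(conn, 1))  # ['Financial', 'clsValidacao']
```

Each PROV-DM relation becomes a table with one foreign key per participant, so a Usage record is simply a row linking an activity to an entity.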

4. PROV-Process Ontology
Ontology research has become more widespread in the Computer Science community.
Although the term was confined to philosophy in the past, it has gained
specific roles in Artificial Intelligence, Computational Linguistics and Databases
[Guarino, 1998].
       The PROV-Process Ontology was developed from the PROV-O ontology
[Belhajjame et al., 2013], which was defined based on the PROV-DM data model.
PROV-O defines the vertices of PROV (Agent, Entity and Activity) as classes and uses
object properties to represent their interrelations. The core classes and properties
of PROV-O are shown in Figure 3.


                       Figure 2: Part of PROV-Process Database
        Figure 3: PROV-O: Core Classes and Properties [Belhajjame et al., 2013]


        Classes and properties in PROV-O can be used directly to represent provenance
information, or they can be specialized to model provenance information for specific
domains and applications. Based on this, we created some new properties on top of
PROV-O (generating the PROV-Process Ontology) in order to adapt it to the software
process domain and to allow the inference of new information to improve software
processes. Examples of these properties are presented below.
      A group of rules (using property chains) was added to the ‘wasAssociatedWith’
object property of PROV-O:
           1. used o wasAttributedTo
           2. wasStartedBy o wasAttributedTo
           3. wasEndedBy o wasAttributedTo
        These rules state that, as shown in Figure 4, if an activity used, was started by or
was ended by an entity and that entity was attributed to an agent, we can infer that the
activity is associated with that agent.
                        Figure 4: wasAssociatedWith property chains
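
The effect of the three property chains can be sketched in plain Python as relation composition over sets of pairs. This is only an illustration of what an OWL reasoner would derive from the chains, not the inference engine used in the approach; the function names and the individuals (activity11, clsValidacao, DotNet, VB6) are chosen here for the example.

```python
def compose(first, second):
    """Return pairs (a, c) such that (a, b) is in `first` and (b, c) is in `second`."""
    by_subject = {}
    for b, c in second:
        by_subject.setdefault(b, set()).add(c)
    return {(a, c) for a, b in first for c in by_subject.get(b, ())}


def infer_was_associated_with(used, was_started_by, was_ended_by, was_attributed_to):
    """Apply the three chains: each property composed with wasAttributedTo."""
    inferred = set()
    for prop in (used, was_started_by, was_ended_by):
        inferred |= compose(prop, was_attributed_to)
    return inferred


# Illustrative individuals (not taken from the exported ontology).
used = {("activity11", "clsValidacao")}
was_attributed_to = {("clsValidacao", "DotNet"), ("clsValidacao", "VB6")}

result = infer_was_associated_with(used, set(), set(), was_attributed_to)
print(sorted(result))  # [('activity11', 'DotNet'), ('activity11', 'VB6')]
```

Every agent to whom a used entity is attributed becomes a candidate association for the activity, which is the kind of suggestion the ontology is meant to surface.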


       A data property called processInstanceId, which holds the identifier of the
generated/executed instance of the main process, was also added to PROV-O.
        Finally, it should be noted that all records called Attributes in the PROV-Process
database must be exported to the PROV-Process Ontology as new data properties with
their respective values.

5. Evaluation
In order to evaluate the applicability of the PROV-Process ontology to software
processes, the approach was applied to a process from a Brazilian software development
company [Ceosoftware, 2015]. The flow model shown in Figure 5 was created based on
the specifications of this process.
        For this evaluation, real execution data of the process depicted in Figure 5
were analyzed. The retrospective provenance of these process instances was stored using
the PROV-Process relational database. It should be noted, however, that the company
provided execution data for only part of the process, not the whole of it.
       In this work, 10 fully completed process execution instances were analyzed.
From the obtained data, the following were used:
         • RDM¹ (change request) number;
         • Whether the RDM was created from a previous RDM;
         • Date and time of RDM opening;
         • Type of RDM;
         • Party responsible for opening the RDM (Origin);
         • Modules and components changed during the deployment task;
         • Team responsible for implementing the solution;
         • Status of the RDM;
         • Date and time of RDM completion.
      These process execution data were obtained from a spreadsheet sent by the
company responsible for the project².


               Figure 5: Process to manage requests and changes in software


¹ RDM is the Portuguese acronym for ‘change request’ used by the company that provided the
data for this research. It denotes a record opened by the support, client or commercial
department to request changes / adjustments to a software system.
² All the execution data used in this evaluation can be found at
http://gabriellacastro.com.br/dsc/ex1/ex1.xlsx . Each row of this table represents a distinct
execution of the process.
                    Table 1: Data execution example – Part 1
 RDM Number  Outspread  Opening Date  Opening Time  Type               Origin
 30006       0          10/03/2013    14:54:00      Module liberation  Client
 30006       1          06/11/2014    17:18:00      Module liberation  Client


                    Table 2: Data execution example – Part 2
 RDM Number  RDM Module  Module         Component     Team  Closed Date  Closed Time
 30006       Financial   DLL - ERP PDA  clsValidacao  VB6   10/03/2013   22:06:00
 30006       Financial   DLL - ERP PDA  clsValidacao  VB6   06/12/2014   10:41:00


        The obtained data (examples can be seen in Tables 1 and 2, where dates use the
 MM/DD/YYYY format) were imported into the PROV-Process database according to the
 following criteria:
         • For every execution analyzed, three records were created in the Activity
           table, named:
               o Opening the Request for Change;
               o Solution Implementation;
               o Change RDM to Complete.
         • The RDM number was inserted as an attribute of each of the above activities,
           using the Attribute and Activity_Attribute tables.
         • If a particular execution instance corresponds to the unfolding of a previous
           RDM, a record was created in the WasInformedBy table.
         • The date and time of RDM opening were stored in the startTime attribute of the
           Opening the Request for Change activity.
         • The RDM type was inserted as an attribute of the Opening the Request for Change
           activity, using the Attribute and Activity_Attribute tables.
         • The role responsible for the Opening the Request for Change activity was
           inserted in the Agent table, using the name field and the Person type.
         • The relationship between the Opening the Request for Change activity and the
           role responsible for it was inserted as a record in the
           WasAssociatedWith table.
         • Values such as module, RDM module and component were included as records
           in the Entity table and were associated with the Solution Implementation
           activity by creating records in the Used table.
         • These same values (module, RDM module and component) were associated with
           the agents who manipulated them by creating records in the WasAttributedTo
           table.
         • The role responsible for the Solution Implementation activity was inserted in
           the Agent table, using the name field and the Person type.
         • The relationship between the Solution Implementation activity and the role
           responsible for it was inserted as a record in the WasAssociatedWith table.
         • The date and time of RDM completion were stored in the endTime attribute
           of the Change RDM to Complete activity.
         • As in the flow model (Figure 5), the role responsible for the Change RDM to
           Complete activity is the Quality Team. This role was inserted as a record in
           the Agent table and was associated with this activity by inserting a record in
           the WasAssociatedWith table.
         • To identify which process execution instance a particular activity
           belongs to, an attribute called processInstanceId was added to
           all activities, using the Attribute and Activity_Attribute tables.
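
The import criteria above can be sketched as a function that turns one spreadsheet row into the three activity records. This is a simplified illustration, not the import code used in the study: the dictionary keys (rdm_number, opening_date and so on) and the output layout are assumptions, and plain dictionaries stand in for the relational tables. The sample values come from the first row of Tables 1 and 2.

```python
from datetime import datetime


def parse_timestamp(date_text, time_text):
    """Parse the spreadsheet's MM/DD/YYYY date and HH:MM:SS time."""
    return datetime.strptime(f"{date_text} {time_text}", "%m/%d/%Y %H:%M:%S")


def row_to_activities(row, instance_id):
    """Build the three activity records for one process execution."""
    opened = parse_timestamp(row["opening_date"], row["opening_time"])
    closed = parse_timestamp(row["closed_date"], row["closed_time"])
    # Attributes shared by all three activities of the same execution.
    common = {"rdmNumber": row["rdm_number"], "processInstanceId": instance_id}
    return [
        {"name": "Opening the Request for Change", "startTime": opened,
         "agent": row["origin"],
         "attributes": {**common, "rdmType": row["type"]}},
        {"name": "Solution Implementation", "agent": row["team"],
         "used": [row["rdm_module"], row["module"], row["component"]],
         "attributes": dict(common)},
        {"name": "Change RDM to Complete", "endTime": closed,
         "agent": "Quality Team", "attributes": dict(common)},
    ]


# First row of Tables 1 and 2; the key names are hypothetical.
row = {
    "rdm_number": "30006", "opening_date": "10/03/2013",
    "opening_time": "14:54:00", "type": "Module liberation",
    "origin": "Client", "rdm_module": "Financial",
    "module": "DLL - ERP PDA", "component": "clsValidacao",
    "team": "VB6", "closed_date": "10/03/2013", "closed_time": "22:06:00",
}

activities = row_to_activities(row, instance_id=1)
print(activities[2]["endTime"] - activities[0]["startTime"])  # 7:12:00
```

Parsing the dates explicitly with the MM/DD/YYYY pattern avoids misreading a value such as 10/03/2013 as 10 March.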


        After the process execution data were inserted into the PROV-Process relational
database, all the data were entered as individuals, together with their relationships,
in the PROV-Process Ontology³. From this point, the ontology inference engine made it
possible to derive strategic information. As examples of information inferred from the
retrospective provenance data of this process, we can highlight four types:


        1) Activities that influenced the generation of other activities. For example, as
highlighted in red in Figure 6, Opening the Request for Change (id = 1) influenced
Opening the Request for Change (id = 4). The same information was also inferred for
the activities of the same type with ids 7, 13 and 19.


           Figure 6: Activities that influenced the generation of other activities
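
One plausible way to arrive at this kind of result is sketched here, under the assumption that the inference relies on two PROV-O axioms: wasInformedBy is a subproperty of wasInfluencedBy, and influenced is the inverse of wasInfluencedBy. The paper does not state which axioms the reasoner actually applied, so this is an illustration only; the function name is hypothetical.

```python
def infer_influence(was_informed_by):
    """Derive wasInfluencedBy pairs and their inverse from wasInformedBy pairs."""
    # Subproperty entailment: every wasInformedBy pair is a wasInfluencedBy pair.
    was_influenced_by = set(was_informed_by)
    # Inverse property: swap subject and object.
    influenced = {(b, a) for a, b in was_influenced_by}
    return was_influenced_by, influenced


# Activity 4 was informed by activity 1 (a WasInformedBy record).
was_informed_by = {(4, 1)}
_, influenced = infer_influence(was_informed_by)
print(influenced)  # {(1, 4)}
```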


        2) Agents that could be associated with the Solution Implementation activity,
given that they have already handled the artifacts involved in this activity in another
execution of the process. Figure 7 shows, for example, that the Solution Implementation
activity (id = 11) was influenced by the DotNet agent (id = 5), since this agent handled
artifacts common to this activity in other instances of the process. The same type of
information was also inferred for the Solution Implementation activities with ids 8,
20, 23, 26 and 29.


³ The generated ontology with all the individuals can be found at
http://gabriellacastro.com.br/dsc/ex1/ex1-english.owl .
                       Figure 7: Agents that influenced an activity


        3) A list of all activities in which an agent was involved, as well as the artifacts
(entities) handled by her/him, as can be seen in Figure 8. Although this type of
information can be obtained through queries on PROV-Process relational database,
using the ontology and inference engine, this information can be obtained more easily
(with a simple SPARQL query).


                 Figure 8: Activities and artifacts handled by DotNet agent


       4) A list of all activities where an artifact (entity) was consumed, as can be seen
in Figure 9. Although this type of information can be obtained through queries on
PROV-Process relational database, using the ontology and inference engine, this
information may be obtained more easily (with a simple SPARQL query).
               Figure 9: Activities where an artifact (entity) was consumed
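
The two lookups described in items 3 and 4 amount to single triple patterns, which is why a short SPARQL query suffices once the inferred triples are available. The sketch below mimics such patterns in plain Python over a list of triples; it is not the SPARQL query used by the authors, and the individuals and helper names are illustrative.

```python
def match(triples, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard,
    much like a variable in a SPARQL triple pattern."""
    for ts, tp, to in triples:
        if s in (None, ts) and p in (None, tp) and o in (None, to):
            yield ts, tp, to


def activities_of_agent(triples, agent):
    """Pattern: ?activity prov:wasAssociatedWith <agent>"""
    return sorted({s for s, _, _ in
                   match(triples, p="prov:wasAssociatedWith", o=agent)})


def activities_using(triples, entity):
    """Pattern: ?activity prov:used <entity>"""
    return sorted({s for s, _, _ in match(triples, p="prov:used", o=entity)})


# Illustrative triples (asserted or inferred); not the exported ontology.
triples = [
    ("activity8", "prov:wasAssociatedWith", "DotNet"),
    ("activity8", "prov:used", "clsValidacao"),
    ("activity11", "prov:used", "clsValidacao"),
    ("activity11", "prov:wasAssociatedWith", "VB6"),
]

print(activities_of_agent(triples, "DotNet"))     # ['activity8']
print(activities_using(triples, "clsValidacao"))  # ['activity11', 'activity8']
```

The same answers from the relational database would need joins across the Activity, Agent, Entity, Used and WasAssociatedWith tables.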


       Information inferred through the ontology proposed by the PROV-Process
approach could help to improve process performance by providing the project manager
with this information when a new process is instantiated. It might suggest, for
example, the most appropriate agents, and the artifacts to be handled first,
according to the type of problem reported / reason for opening the RDM.

6. Conclusion
This paper presented an approach, called PROV-Process, that provides strategic
information to the project manager, enabling her/him to make decisions that can improve
process performance. The approach thus combines the advantages of data provenance
with those of an ontology. Through the use of the ontology, it is possible to detect:
(1) activities that influenced the generation of other activities; (2) agents that could be
associated with the Solution Implementation activity, given that they have already
handled the artifacts involved in this activity in another execution of the process; (3) a
list of activities in which an agent was involved, as well as the artifacts (entities)
handled by her/him. No metric was used to compare results.
        Process execution data (the retrospective provenance of the software process) are
stored in a relational database modeled on the PROV-DM specification. These data
then feed an ontology created from the PROV-O model. With this ontology and an
inference engine, one can infer new information about the process.
       To evaluate the PROV-Process approach, it was applied to a real industry
software process.
       As threats to validity, we can cite:
          • The partner company did not provide all process execution data; it only
            made available a spreadsheet with some of these data. This lack of detail
            directly limits the specificity of the results.
          • The data obtained from the partner company did not include information about
            the actors who actually performed each activity, only the team
            that performed it.
      Currently, we are working on the following improvements: (1) implementing
new rules indicating other actions that can help to improve processes; (2) applying the
approach to other real case studies.
References
Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao,
  J. (2013) “PROV-O: The PROV Ontology”. Available in
  <http://www.w3.org/TR/prov-o/>. Accessed in July 2015.
Buneman, P., Khanna, S. and Tan, W. C. (2001) “Why and where: A characterization of
  data provenance”. In: 8th International Conference on Database Theory, London. pp.
  4-6.
Ceosoftware. (2015) “Soluções criativas e inovadoras”. Available in
  <http://www.ceosoftware.com.br/>. Accessed in July 2015 (in Portuguese).
Groth, P., Moreau, L. (2013) “PROV - Overview”. Available in
  <http://www.w3.org/TR/2013/NOTE-prov-overview-20130430/>. Accessed in July
  2015.
Guarino, N. (1998) “Formal ontology in information systems” In: Proceedings of the
  first international conference (FOIS'98), Trento, Italy (Vol. 46). IOS press, pp. 3-15.
Junaid, M. M., Berger, M., Vitvar, T., Plankensteiner, K., Fahringer, T. (2009)
   “Workflow composition through design suggestions using design-time provenance
   information”. In: E-Science Workshops, 2009 5th IEEE International Conference on.
   IEEE. pp. 110-117.
Lim, C., Lu, S., Chebotko, A., Fotouhi, F. (2010) “Prospective and Retrospective
  Provenance Collection in Scientific Workflow Environments”. In Proceedings of the
  2010 IEEE International Conference on Services Computing (SCC '10). IEEE
  Computer Society, Washington, DC, USA, pp. 449-456.
Marinho, A., Murta, L., Werner, C., Braganholo, V., Cruz, S. M. S. D., Ogasawara, E.,
  Mattoso, M. (2012) “ProvManager: a provenance management system for scientific
  workflows”. Concurrency and Computation: Practice and Experience, v. 24, n. 13,
  pp. 1513-1530.
Miles, S., Groth, P., Munroe, S., Moreau, L. (2011) “PrIMe: A methodology for
  developing provenance-aware applications”. ACM Transactions on Software
  Engineering and Methodology (TOSEM), v.20 n.3, pp.1-42.
Missier, P., Dey, S. C., Belhajjame, K., Cuevas-Vicenttín, V., Ludäscher, B. (2013) “D-
  PROV: extending the PROV provenance model with workflow structure”.
  In: Proceedings of the 5th USENIX Workshop on the Theory and Practice of
  Provenance (TaPP). USENIX Association, Berkeley, CA, USA, Article 9, pp. 1-7.
Moreau, L., Missier, P. (2013). “Prov-dm: The prov data model”. Available in
  <http://www.w3.org/TR/2013/REC-prov-dm-20130430/>. Accessed in July 2015.
Osterweil, L. (1987) “Software processes are software too”. In Proceedings of the 9th
  international conference on Software Engineering. IEEE Computer Society Press, pp.
  2-13.
Wendel, H., Kunde, M., Schreiber, A. (2010). “Provenance of software development
  processes”. In Provenance and Annotation of Data and Processes, v. 6378 of Lecture
  Notes in Computer Science, pp. 59-63. Springer Berlin Heidelberg.

</pre>