-

On the interpretation of traces of low level events in business process logs

Bettina Fazzinga

fazzinga@icar.cnr.it 1

Sergio Flesca

flesca@dimes.unical.it 0

Filippo Furfaro

furfaro@dimes.unical.it 0

Elio Masciari

masciari@icar.cnr.it

Luigi Pontieri

pontieri@icar.cnr.it 0 DIMES, University of Calabria , Italy 1 ICAR-CNR , Italy

2018

24 27

We consider the scenario where the executions of different business processes are traced into a log, whose traces describe the process instances as sequences of low-level events (representing basic kinds of operations). In this context, we address the following problem: given a description of the processes' behaviors in terms of high-level activities (instead of low-level events), and in the presence of uncertainty in the mapping between events and activities, find all the interpretations of each trace Φ, in terms of the process model that Φ conforms to and of the sequence of activities that may have triggered the events in Φ. We describe the probabilistic framework presented in [7] supporting the extraction of a compact representation of Φ's interpretations, where each interpretation is associated with a probability score representing the probability of being the actual one.

Business processes Graph structure Probabilistic conditioning

Thanks to the increasing diffusion of automated tracing systems, the analysis of log data describing executions of business processes has gained momentum with the growth of the Process Mining research field, which addresses the “confrontation between event data and process models” [ 1 ] through new process-aware data analysis tasks, such as: inducing a process model [ 2 ], quantifying “how much” a log and a model conform one to the other [ 14 ], and supporting advanced querybased process analytics [ 13, 6, 8, 9 ].

However, all the approaches and tools developed in this field require that each log event can be mapped to well-defined activities, corresponding to some high-level view of the process. As a matter of fact, this assumption often does not hold in practice: in the logs of many processes, the events just represent low-level operations, with no clear reference to the business activities that were carried out through these operations, as shown in the following example.

processes activities

G D P N A

I F event types γ δ ξ α β φ &1/0"$#)% $% /,0$1.)""$% +$)"%)- &!'"(#")$#%*+ /213+"4"5$% -$%"4'("8/"144# +"++(1) &$'' ",$(' ",$(' 1) $ 67 activity activity label description A Send an alert to managers D Dispatch a contract proposal F Fix the issue G Get more information from the client I Retrieve data from the IS N Notify the request’s outcome P Define a service package

R Receive a request

Example 1 (Running example). Consider the case of a phone company, where two business processes are carried out: a process W1 for the activation of services, and an issue-management process W2.

An abstract description of the behaviors of these processes is available in the form of loose process models, that specify which high-level activities compose each process, and several temporal constraints over the execution of the activities. The activities of processes W1 and W2 are shown in Figure 1 via process-activity links. In addition, assume that the following constraints are known to be satisfied by the two processes’ instances: (i) every instance of either process starts and ends with an execution of activities R and N , respectively, and these activities cannot be executed multiple times in the instance; (ii) in every instance of W1, any execution of activity P (resp. D) must be followed (resp. cannot be followed) by an execution of activity D (resp. P ).

All these activities are performed by executing low-level operations (e.g., email exchanges, phone calls, database accesses). Figure 1 reports all the operations, regarded each as an event and denoted with a Greek letter, as well as their mapping to the high-level activities.

Notably, the event-activity mapping is many to many: for example, activity G can produce an instance of either event δ or event γ, while an instance of event δ can be generated by the execution of either activity G or N . Thus, any execution of a process activity generates an instance of one of the events associated with the activity, and the sequence of all the event instances that are produced for an entire process instance is stored in the log as the trace of the process instance. For example, an instance of W1 corresponding to the activity sequence R I G P N might be stored in the log in the form of a trace α β γ β ξ or of a trace α β δ β δ, where α (resp. β, γ, etc.) denotes an instance of event α (resp. β, γ, etc.).

Assume that we want to interpret a trace Φex = α β ξ ξ as an execution of either W1 or W2, that means replacing each event in Φex with one of the activities that could have generated it, while ensuring that the resulting activity sequence complies with one of the process models. Only three of such activity sequences exist for Φex: R I D N , R I A N , and R P D N . Indeed, the sequence R P A N too complies with the event-activity mapping, but it is not a valid interpretation, since it does not conform to either W1 (no occurrence of D appears in the sequence after that of P ) or W2 (P is not an activity of W2).

In the process management scenario like that illustrated above, we address the following problem, called interpretation problem: Given a log L, containing traces generated by an arbitrary number of business processes, and a set W of (partial) behavioral models for the processes, we want to interpret each trace Φ = e1 . . . em in L by establishing: 1) the process whose execution generated the sequence of events stored in Φ, and 2) for each low-level event ei in Φ, which activity generated ei as a result of its execution.

The need of bridging some abstraction gap between log events and process activities has been addressed by several recent works [ 11, 4, 3, 15, 12 ], However, these approaches cannot actually solve the interpretation problem stated above as: (i) they assume that all the given traces come from a single business process; and (ii) they return just one “optimal” interpretation for each trace. In fact, the problem stated here has been only considered in [ 5 ] and in [ 10 ] in simplified settings. Both these interpretation techniques have been considered as terms of comparison in the experimental evaluation of our approach performed in [ 7 ]. 2

Our framework

In our context, an instance w of a process W is the execution of a sequence A1 . . . Am of activities; in turn, the execution of each activity Ai yields an instance ei of an event Ei; hence, the trace describing w consists in the sequence e1 . . . em of event instances. For any event ei occurring in a trace, we assume that the starting time of its execution is stored in the log, and denote it as ei.ts. In the following, we assume given the sets W, A, E of (types of) processes, activities, and events, respectively.

We assume that every process W ∈ W is “regulated” by a “composition rule”, that restricts the sequences of activities that are allowed to be executed when W is enacted. The set of composition rules of the processes in W will be denoted as CR. More specifically, the composition rule associated with W is a tuple hActSet, StartAct, EndAct, ICi, where: - ActSet is the set of activities that are allowed to be executed in any instance of W ; - StartAct (resp. EndAct ) ⊆ActSet is the set of the starting (resp. ending) activities of W , which are the activities that are allowed to be executed at the beginning (resp. end) of any instance of W ; and - IC is a set of constraints of the form A ⇒T B (called must-constraint ) or A ⇒T ¬B (called not-constraint ), where A, B ∈ A and T is of the form ‘≤ c’, where c is a constant.

Here, A ⇒T B (resp. A ⇒T ¬B), that basically imposes that, within every instance of W , the beginning of an instance a of A always (resp. never) precedes the beginning of an instance b of B such that the width of the interval between the starting times of a and b satisfies T . Omitting T is the same as specifying T =‘≤ ∞’.

Example 2 (Running example (continued)). Continuing Example 1, it holds: W1.ActSet = {G, D, P, N, A, R, I}, W2.ActSet = {N, A, R, I, F }, W1.StartAct = W2.StartAct = {R}, W1.EndAct = W2.EndAct = {N }, W1.IC = {P ⇒ D, D ⇒ ¬P }, and W2.IC = ∅.

The solution of an instance Φ = e1 . . . em of the interpretation problem is called interpretation for Φ consistent with CR (or, simply, consistent or valid interpretation, when Φ and CR are understood), and is a pair hσ, W i, where: – σ is called sequence-interpretation and is a sequence A1 . . . Am of activities, where ∀j ∈ [1..m] Aj ∈ cand-act (ej ) (where cand-act (ej ) is the set of activities whose execution is known to possibly generate an instance of ej ), meaning that each ej is interpreted as the result of executing activity Aj ; – W is called process-interpretation and denotes a process, meaning that Φ is interpreted as generated by process W ; – σ conforms to the composition rule of W .

Example 3 (Running example (continued)). Consider trace Φ1 = αβγξ, where α, β, γ, and ξ are instances of α, β, γ and ξ, respectively. The correspondence depicted in Figure 1 between the types of the events occurring in Φ1 and the activities is encoded by: cand-act (α) = {R}, cand-act (β) = {I, P }, cand-act (γ) = {G} and cand-act (ξ) = {A, D, N }. On the basis of this cand-act (·), Φ1 can be interpreted as the result of executing one of the following sequences of activities: σ1 = R I G A, σ2 = R P G A, σ3 = R I G D, σ4 = R P G D, σ5 = R P G N , and σ6 = R I G N . By also looking at the composition rules of the process models, we have that: (i) σ1 is inconsistent with the composition rule of W1, as A ∈/ W1EndAct, and also with W2’s composition rule, since both G ∈/ W2.ActSet and A ∈/ W2.EndAct ; (ii) σ2 is inconsistent with the composition rule of W1, since it violates P → D and it is A ∈/ W1.EndAct, and it is inconsistent with also with the composition rule of W2, since G, P ∈/ActSet and A ∈/ W2.EndAct ; (iii) σ3 is inconsistent with the composition rules of both W1 and W2, since D ∈/ W1EndAct, D ∈/ W2.EndAct and G ∈/ W2.ActSet ; (iv) σ4 is inconsistent with the composition rules of both W1 and W2, since D ∈/ W1EndAct and D ∈/ W2.EndAct (also G, P ∈/ W2.ActSet ); (v) σ5 is inconsistent with the composition rule of W1, since it violates P → D, and with W2’s composition rule, since G ∈/ W2.ActSet ; and (vi) σ6 is consistent with the composition rule of W1, and inconsistent with W2’s composition rule, since G ∈/ W2.ActSet. Thus, we have only one consistent interpretation for Φ1, that is hσ6, W1i. Consider now the trace Φ2 = αβξξ, which coincides with the trace Φex of Example 1. As discussed in that example, the only consistent sequence-interpretations for Φ2 are σa = R I D N , σb = R P D N , and σc = R I A N , leading to the following interpretations: hσa, W1i, hσb, W1i, hσc, W1i and hσc, W2i.

As a matter of fact, an instance of the interpretation problem admitting multiple solutions. This situation is likely to happen frequently, for the following reasons: 1) the correspondence between events and activities is many to many, so that an event instance can be interpreted as the result of different activity executions; 2) the description of the process models provided by composition rules can be rather “loose”, and hence there may be process instances consistent with the composition rules of different processes: that is, the execution of a sequence of activities can be interpreted as the execution of different processes.

The presence of this uncertainty suggests to address the interpretation problem in probabilistic terms: we model the correspondence between events and activities probabilistically. In this direction, we assume that the many-tomany correspondence between events and activities is modeled by a probability distribution function (pdf) pa(A|E) returning the a-priori probability that an instance of event E is generated by an execution of A. For instance, pa(A1|E1) = 0.75 and pa(A2|E1) = 0.25 means that any instance of event E1 is generated by the execution of either A1 or A2, and that the former case is three times more probable than the latter. This pdf is said to be a-priori since it assigns probabilities to the activities, of which the event can be viewed as the low-level result, without looking at the context where the event instance happened. That is, it provides an interpretation for the events that occur in a trace without looking at the other events occurring in the same trace.

In the same spirit, we assume given a pdf pa(W ) over W, associating each W ∈ W with the a-priori probability that a trace in a log encodes an instance of W . For instance, given W = {W1, W2}, pa(W1) = pa(W2) = 0.5 means that, in the absence of further knowledge, W1 and W2 are equi-probable interpretations of any trace.

Given the a-priori pdfs pa(A|E) and pa(W ), we could employ a na¨ıve way to solve the interpretation problem, that is the following: first, assume that the event instances in Φ are independent of one another; then, use the a-priori pdf pa(A|E) to probabilistically interpret each ei in Φ = e1 . . . em as an instance of an activity Ai, and the pdf pa(W ) to probabilistically interpret the whole Φ as an instance of one of the processes, say Wj . As implied by the independence assumption, every sequence-interpretation σ = A1 . . . Am obtained this way will be associated with the probability pa(σ) = Πim=1pa(Ai|Ei) (that is, the product of the a-priori probabilities of the event-to-activity mappings at each step), while the process-interpretation Wj will have probability pa(Wj ). In turn, the interpretation hσ, Wj i will have probability pa(hσ, Wj i) = pa(σ) · pa(Wj ). Unfortunately, this na¨ıve approach is unlikely to provide reasonable results, as explained in the following example.

Example 4 (Running example (continued)). Assume that the a-priori probabilities of the activities are pa(R|α) = 1, pa(I|β) = pa(P |β) = 0.5, pa(G|γ) = 1, pa(G|δ) = 0.4, pa(N |δ) = 0.6, pa(A|ξ) = 0.2, pa(D|ξ) = 0.5, pa(N |ξ) = 0.3, and pa(F |φ) = 1, and those of the the processes are pa(W1) = 0.6, pa(W2) = 0.4. Consider trace Φ1. Every sequence-interpretation of Φ1 is associated with a probability, implied by pa and the independence assumption. In particular: pa(σ1) = pa(R|α) · pa(I|β) · pa(G|γ) · pa(A|ξ) = 0.1; pa(σ2) = 0.1; pa(σ3) = pa(σ4) = 0.25; pa(σ5) = pa(σ6) = 0.15. Moreover, owing to the values of pa(W1) and pa(W2), we can say that Φ1 encodes an execution of W1 with probability 0.6, and of W2 with probability 0.4. In turn, the probabilities of any interpretation hσ, W i for Φ1 (with W ∈ {W1, W2} and σ ∈ {σ1, . . . , σ6}) is pa(hσ, W i) = pa(σ) · pa(W ). In particular: pa(hσ1, W1i) = 0.1 · 0.6 = 0.06, pa(hσ1, W2i) = 0.1 · 0.4 = 0.04, and so on. However, we know from our running example that there is only one interpretation consistent with the composition rules, that is hσ6, W1i. Moreover, the sum of the probabilities of the consistent interpretations should be 1 (or, equivalently, every invalid interpretation should have zero probability), while, in this case, the a-priori probability of the only valid interpretation hσ6, W1i is pa(σ6) · pa(W1) = 0.09, and the other interpretations (that are all invalid) have non-zero probability.

Example 4 shows that the na¨ıve approach, relying only on the independence assumption and the a-priori pdfs for interpreting the traces, may yield wrong conclusions. In fact, to solve the interpretation problem, we resort to probabilistic conditioning, that is a well-known paradigm used in the general context of forcing integrity constraints over probabilistic databases where the assumption of independence between tuples is originally used. In our case, applying the conditioning paradigm means revising the a-priori pdf pa(hσ, W i) over the interpretations into pa(hσ, W i | CR). The latter returns either 0, if the interpretation hσ, W i is invalid, or the ratio of its a-priori probability to the sum of the a-priori probabilities of all the valid interpretations, otherwise.

The problem of conditioning a pdf is generally complex. In our scenario, we could revise pa(hσ, W i) into pa(hσ, W i | CR) by first enumerating all the interpretations of Φ, then discarding those not satisfying the composition rules, and finally revising the a-priori probabilities of the remaining ones. Unfortunately, this approach is often infeasible, as the interpretations to deal with are too many. For instance, if, for a trace of 50 event instances, two activities are on average compatible with each event of the trace, we have to consider 250 sequence-interpretations.

Our main contribution is a framework for efficiently constructing a compact representation (called cas-graph) of the interpretations of an input trace Φ that are consistent with CR, and of the conditioned pdf pa(hσ, W i | CR). 2.1

Cas-graph Figure 2 reports a simplified example of cas-graph, built over trace Φ2 = αβξξ of our running example. The nodes of the cas-graph are of the form hi, Act, W i, where i is the step of the trace over which the node has been built, Act is the activity of which ei has been interpreted to be an execution, and W is the process of which Φ2 has been interpreted to be an instance. The source nodes of the cas-graph are not built over trace steps, but they are only used as the nodes from which all the paths representing interpretations of Φ2 as instances of a same process W start. This cas-graph represents the three interpretations hσa, W1i, hσb, W1i, hσc, W1i (encoded by the three source-to-target paths originating from the source node over W1), and the interpretation hσc, W2i (encoded by the source-to-target path originating from the source node over W2). Moreover, each source node n is assigned the conditioned probability that Φ is an instance of the process in n, whereas each edge hni−1, nii is assigned the conditioned probability that event ei corresponds to an execution of the activity in ni given that the event ei−1 corresponds to an execution of the activity in ni−1. Basically, the probabilities of the two source nodes in Fig. 2 mean that Φ2 is an execution of W1 (resp. W2) with probability 90% (resp. 10%). Moreover, the probabilities assigned to the edges encode the conditioned probabilities of the interpretations: the conditioned probabilities that hσa, W1i, hσb, W1i, hσc, W1i and hσc, W2i actually are the pairs h activity sequence, process i that generated Φ2 can be obtained by multiplying the probabilities of the source nodes and of all the edges appearing along the paths that encode each of these interpretations (that is, 15%(= 90% × 1 × 58.3% × 28.6% × 1), 37.5%, 37.5% and 10%, respectively).

The nodes of a cas-graph are said to be activity nodes. Each activity node n is defined over an event ei of Φ, and represents an interpretation of ei as an instance of some activity A within the execution of some process W . In a casgraph, an activity node is the successor of activity nodes encoding alternative interpretations of the previous event ei−1, and its successors are activity nodes encoding alternative interpretations of the next event ei+1.

The construction of the cas-graph is performed by considering the steps of Φ in order and, at each step, by progressively building the activity nodes over the current step. Specifically, constructing an activity node n over the current step requires checking whether the interpretation that it provides of the current step is consistent (according to CR) with the interpretations of the past events encoded by the predecessors of n. Hence, in order to efficiently support this compatibility check, each activity node n also contains a compact description of what the interpretations of the previous steps impose on the interpretations of the future steps, due to the presence of CR (for the sake of simplicity, this 0.9 0,⊥,W1

0.1 0,⊥,W2 1 1

0.583 2,I,W1 1,R,W1 0.417 2,P,W1 Fig. 2. A (simplified) example of cas-graph built over trace Φ2 of our running example. compact description was not reported in the nodes of the “simplified” cas-graph in Fig. 2). For more detail, we kindly refer the reader to [ 7 ]. 3

Conclusion

We discussed the framework for efficiently interpreting business process logs composed of low-level events proposed in [ 7 ]. The proposed structure allows a compact representation of all the interpretations compatible with the constraints defined over the process models.

1. van der Aalst , W.M.P. : Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer ( 2011 )

2. van der Aalst , W.M.P. , Weijters , T. , Maruster , L. : Workflow mining: Discovering process models from event logs . TKDE 16 ( 9 ), 1128 - 1142 ( 2004 )

3. Baier , T. , Mendling , J. , Weske , M. : Bridging abstraction layers in process mining . Inf. Syst . 46 , 123 - 139 ( 2014 )

4. Bose , R.P.J.C. , van der Aalst , W.M.P.: Abstractions in process mining: A taxonomy of patterns . In: Proc. BPM . pp. 159 - 175 ( 2009 )

5. Fazzinga , B. , Flesca , S. , Furfaro , F. , Masciari , E. , Pontieri , L. : A probabilistic unified framework for event abstraction and process detection from log data . In: Proc. CoopIS , ODBASE, and C &TC. pp. 320 - 328 ( 2015 )

6. Fazzinga , B. , Flesca , S. , Furfaro , F. , Masciari , E. , Pontieri , L. : A compression-based framework for the efficient analysis of business process logs . In: Proc. SSDBM ( 2015 )

7. Fazzinga , B. , Flesca , S. , Furfaro , F. , Masciari , E. , Pontieri , L. : Efficiently interpreting traces of low level events in business process logs . Inf. Syst . 73 , 1 - 24 ( 2018 )

8. Fazzinga , B. , Flesca , S. , Furfaro , F. , Masciari , E. , Pontieri , L. , Pulice , C. : A framework supporting the analysis of process logs stored in either relational or nosql dbmss . In: Proc. ISMIS . pp. 52 - 58 ( 2015 )

9. Fazzinga , B. , Flesca , S. , Furfaro , F. , Masciari , E. , Pontieri , L. , Pulice , C. : How, who and when: Enhancing business process warehouses by graph based queries . In: Proc. IDEAS . pp. 242 - 247 ( 2016 )

10. Fazzinga , B. , Flesca , S. , Furfaro , F. , Pontieri , L. : Online and offline classification of traces of event logs on the basis of security risks . J. of Intell. Inf. Syst . ( 2017 )

11. Gu¨nther, C. , Rozinat , A., van der Aalst , W.M.P. : Activity mining by global trace segmentation . In: Proc. Workshops BPM . pp. 128 - 139 ( 2009 )

12. Mannhardt , F., de Leoni , M. , Reijers , H.A., van der Aalst , W.M.P. , Toussaint , P.J.: From low-level events to activities-a pattern-based approach . In: Proc. BPM . pp. 125 - 141 ( 2016 )

13. Polyvyanyy , A. , Ouyang , C. , Barros , A., van der Aalst , W.M.: Process querying: Enabling business intelligence through query-based process analytics . Decision Supp. Syst . 100 , 41 - 56 ( 2017 )

14. Rozinat , A., van der Aalst , W.M.P. : Conformance checking of processes based on monitoring real behavior . Inf. Syst . 33 ( 1 ), 64 - 95 ( 2008 )

15. Tax , N. , Sidorova , N. , Haakma , R., van der Aalst, W.M.P. : Event abstraction for process mining using supervised learning techniques . In: Proc. IntelliSys 2016 : Volume 1 . pp. 251 - 269