m3 – A Behavioral Similarity Metric for
                Business Processes

           Matthias Kunze, Matthias Weidlich, and Mathias Weske
    {matthias.kunze, matthias.weidlich, mathias.weske}@hpi.uni-potsdam.de

                Hasso Plattner Institute at the University of Potsdam
                   Prof.-Dr.-Helmert-Strasse 2-3, 14482 Potsdam


       Abstract. With the increasing uptake of business process management,
       companies maintain large scale process repositories consisting of hundreds
       or thousands of process models. So far, discovery within these reposi-
       tories is limited to free text search or folder navigation. In a separate
       stream of research, similarity measures were introduced to get a better
       understanding of the relationships between process models. Unfortunately,
       calculating such similarity is complex, so that these techniques cannot
       be used in large process model repositories, where they would be most
       valuable.
       To overcome this issue, we introduce the m3 -metric, which is based on
       behavioral profiles that provide an abstraction on the detailed behavior
       of processes. This metric can be computed efficiently and enables tree
       based similarity search within large process model repositories.


1    Introduction
In recent years we saw large business process model collections grow in many orga-
nizations, whereas the effective management of such repositories requires efficient
capabilities to find process models among hundreds or thousands of candidate
models. The question of similarity between process models has been thoroughly
studied. Still, existing approaches do not scale well in computation complexity,
nor do they address transitivity, which is essential for efficient similarity search.
    Similarity metrics provide such a property and significantly increase search
performance, as we showed for process model structures, i.e., the graph edit
distance, in [7]. In this paper we address behavioral aspects of processes and
present the m3 -metric: A metric based on behavioral profiles that provides a
similarity ranking of process models relative to a given query model and can be
employed in metric similarity search methods, cf. [14]. Behavioral profiles focus on
ordering relations between pairs of activities in a process model. While this notion
abstracts from the actual behavior of a process, it is computed efficiently [11].
Approaches that take the complete state space of a process into account, in turn,
suffer from exponential complexity due to the state space explosion problem.
    The remainder of this work is structured as follows: In Section 2 we present
previous work related to the topic of process model similarity and searching, while
Section 3 introduces formal concepts for the m3 -metric. In Section 4 we show
how the aforementioned metric is constructed from behavioral profile relations
and present its rationale by means of an illustrative example, before we conclude
this work and give an outlook on future studies in Section 5.


2   Related Work
Similarity of process models has been addressed from various angles. An overview
of linguistic, structural, and behavioral measures used for similarity search of
process models can be found in [4]. Measures for structural similarity, e.g., the
one based on the graph edit distance [3], neglect common behavior expressed
in a different syntax when comparing process models. Modeling a loop with a
loop activity in BPMN or with a control flow cycle would, therefore, impact
on structural similarity of process models in a negative manner. Measures for
behavioral similarity are insensitive to such syntactical differences. They may be
based directly on the sets of possible traces of process models, e.g., by computing
the intersection of traces of two models. In order to get a more fine-granular
measure, an n-gram representation of the sets of traces may be used to judge on
similarity [12]. Other approaches advocate the application of causal footprints to
approximate the behavior and to measure similarity of process models [10]. Still,
these approaches are computationally hard, so that recent techniques aim at a
multi-step approach that narrows the search space in a step-wise manner [13].
We avoid such problems as behavioral profiles are computed efficiently for a
broad class of process models. A behavioral abstraction close to the behavioral
profile has been applied for matching BPEL process definitions [5]. However, the
approach is restricted to BPEL processes and transitivity aspects of the proposed
measures are not discussed.
    In traditional databases, data is generally made up of simple structures and
attribute data—indexing techniques have been very successfully elaborated on and
implemented. However, for complex data, such as process models, these techniques
are not applicable, because no intrinsic ordering exists among data objects and
mapping them to simple values, i.e., hashing, is not meaningful. Similarity search
addresses this field where nothing but pairwise distances between data objects
can be measured [14]. This concept requires the distance—or dissimilarity—of
two objects to be a proper metric, and thus to provide transitivity. By that, it
becomes possible to predict or at least constrain the distance of a pair of data
objects, if one knows the respective pairwise distances of these data objects to a
third one. Several indexing techniques have been developed [2,6]. However, the
above process model similarity measures have not been shown to provide proper
metrics.


3   Background
This section introduces the background of our work in terms of the characteristics
of a distance metric, a formal model, and the concept of a behavioral profile.
3.1     Distance Metric

To efficiently1 search within a space of given objects, it is necessary to partition
that space and exclude some of the partitions from exhaustive search. Partitioning
is relatively easy for objects whose features can be mapped to vectors, i.e., in
coordinate spaces. However, such a representation cannot be generally assumed,
in particular for process behavior or graph structures, cf. [7]. However, in metric
spaces—a generalization of coordinate spaces—nothing but a distance with certain
properties is required to partition the space, the notion of such a distance is a
metric [14].

Definition 1 (Distance Metric). A metric space is a pair S = (D, d) where D
is the domain of objects and d : D × D → R is a metric, i.e., a distance function
with the following properties:
  – symmetry: ∀oi , oj ∈ D : d(oi , oj ) = d(oj , oi )      V
  – nonnegativity: ∀oi , oj ∈ D, oi 6= oj : d(oi , oj ) > 0 ∀oi ∈ D : d(oi , oi ) = 0
  – triangle inequality: ∀oi , oj , ok ∈ D : d(oi , ok ) ≤ d(oi , oj ) + d(oj , ok )

Particularly, the triangle inequality states that every pair of distances between
three objects is larger than the remaining. This allows deriving minimum and
maximum bounds for the distance of two points, if their respective distances to
another point are given, and thus partitioning the search space.


3.2     Process Models

A process model is based on a graph containing activity nodes and control nodes.
It captures the commonalities of most process description languages.

Definition 2 (Process Model). A process model is a tuple P = (A, s, C, N, F, T )
where:
 – A is a finite non-empty set of activity nodes,
 – C is a finite set of control nodes,
 – N = A ∪ C is a finite set of nodes with A ∩ C = ∅,
 – F ⊆ N × N is the flow relation, such that (N, F ) is a connected graph,
 – •n = {n0 ∈ N |(n0 , n) ∈ F } and n• = {n0 ∈ N |(n, n0 ) ∈ F } denote direct pre-
   decessors and successors, we require ∀ a ∈ A : | • a| ≤ 1 ∧ |a • | ≤ 1,
 – s ∈ A is the only start node, such that •s = ∅ and ∀ n ∈ N : s F ∗ a with F ∗
   as the reflexive transitive closure of F ,
 – T : C → {and, xor} associates each control node with a type.

We assume trace semantics for process models. The behavior of a process model
P = (A, s, C, N, F, T ) is a set of traces TP . It comprises a set of lists of the form
σ = hs, a1 , . . . , an i with n > 0, n ∈ N, ai ∈ A for all 0 < i ≤ n, which represent
the execution order of activities. These traces follow on common Petri net-based
formalizations [9].
1
    An efficient algorithm is one that avoids examining every point in the set.
3.3   Behavioral Profiles
A behavioral profile captures behavioral characteristics of a process model by
three relations between pairs of activity nodes. These relations are based on the
notion of weak order. Two activities of a process model are in weak order, if there
exists a trace in which one activity occurs after the other.
Definition 3 (Weak Order Relation). Let P = (A, s, C, N, F, T ) be a process
model and TP its set of traces. The weak order relation P ⊆ A × A contains all
pairs (x, y), such that there is a trace σ = n1 , . . . , nm in TP with j ∈ {1, . . . , m−1}
and j < k ≤ m for which holds nj = x and nk = y.
Based on the weak order relation, the behavioral profile is defined as follows.
Definition 4 (Behavioral Profile). Let P = (A, s, C, N, F, T ) be a process
model. A pair (x, y) ∈ A × A is in one of the following relations:
 – The strict order relation P , if x P y and y 6P x.
 – The exclusiveness relation +P , if x 6P y and y 6P x.
 – The interleaving order relation ||P , if x P y and y P x.
The set BP = { P , +P , ||P } of all three relations is the behavioral profile of P .
We illustrate the relations of the behav-
ioral profile for the BPMN model in Fig. 1.
It holds A      D as both activities are or-
dered if they occur together in a trace and
B||C due to the concurrent execution of
both activities. An activity is either exclu-
sive to itself (e.g., A + A in Fig. 1) or in
interleaving order to itself. Further details Fig. 1. Example BPMN model a
on behavioral profiles can be found in [11],
which also shows how a behavioral profile of a process model is computed in
polynomial time to the size of the model under the assumption of soundness.
Soundness is a correctness criterion that guarantees the absence of behavioral
anomalies [1].


4     Construction of the m3 -Metric
We assume two process models P and Q to be similar, if they expose a common
share of behavior, i.e., they have a common set of activities that yield equal
behavioral profiles: ( P ∩ Q ) ∪ (+P ∩ +Q ) ∪ (||P ∩ ||Q ) 6= ∅. The larger this
overlap of behavioral profiles is, the more similar two process models are. We
quantify this overlap by means of the established Jaccard similarity coefficient
for the similarity of two sets: sim(A, B) = |A∩B|
                                             |A∪B| . If two sets of behavioral profile
relations consist of the same pairs, they are equal, i.e., their similarity is 1. If two
behavioral profile relations have no common pairs, their similarity coefficient is 0.
From the relations of the behavioral profile we propose three individual similarity
coefficients:
Exclusiveness Similarity captures the amount of exclusiveness, i.e., pairs of
   activities that must not occur together, shared by the two models,
                |+ ∩+Q |
   s+ (P, Q) = |+PP ∪+Q  |.
Strict Order Similarity quantifies to which degree two processes expose an
   overlap in their order dependencies for pairs of activities,
                 |   ∩     |
   s (P, Q) = | PP ∪ Q   Q|
                             .
Interleaving Order Similarity accounts for the observation that parallel exe-
   cution of activities covers also sequential execution of the same activities in
   any order, i.e., activities that are executed in parallel can also be executed in
   a certain sequence
                     and the according traces are therefore considered similar.
                1    |(    ∪|| )∩   |   |   ∩(   ∪||Q )|
   s|| (P, Q) = 2 · | PP ∪||PP ∪ QQ| + | PP ∪ QQ ∪||Q |
                                                         .

A distance metric expresses a dissimilarity of two objects. Analogously, there
exists a set distance that is constructed from the Jaccard similarity coefficient
which has been proven to be a metric [8]: d(A, B) = 1 − sim(A, B). Thus, each
of the aforementioned similarity measures translates into a single distance metric.
Through weighted summation of these three single metrics, we can compose them
into one (thus the name m3 -metric). This composition preserves the properties
of a metric.

Definition 5 (m3 -metric). Let P and Q be two process models and s+ (P, Q),
s (P, Q), and s|| (P, Q) the similarity metrics based on behavioral profiles. Then,
the m3 -metric is defined as
    m3 (P, Q) = 1 − wi · si (P, Q)
                    P
                      i                                         P
with i ∈ {+, , ||} and weighting factors wi ∈ (0, 1) such that     wi = 1.
                                                                   i

To illustrate this metric consider the sample processes, Fig. 1-3. The relations of
the behavioral profile for these models are summarized in Table 1. We chose the
following weights to demonstrate the m3 -metric: w+ = 0.5, w = 0.3, w|| = 0.2.
Here, we understand exclusiveness as the strictest criterion and thus give it the
highest weight to penalize violations thereof. Interleaving order offers the greatest
flexibility and thus is considered the weakest criterion, which is why it receives
the smallest weight.


                                             Fig. 3. Example BPMN model c
    Fig. 2. Example BPMN model b
    Table 1. Relations of the behavioral profile for the example process models

           Model a, Fig. 1         Model b, Fig. 2           Model c, Fig. 3
 +         {(A,A), (B,B), (C,C),   {(A,A), (B,B), (B,C),     {(A,A), (B,B), (C,C),
           (D,D)}                  (C,C), (D,D)}             (D,D)}
           {(A,B), (A,C), (A,D),   {(A,B), (A,C), (A,D),     {(A,B), (A,C), (A,D),
           (B,D), (C,D)}           (B,D), (C,D)}             (B,C), (B,D), (C,D)}
    ||     {(B,C)}                 ∅                         ∅


Building the metric space of behavioral profiles M = (B, m3 ) for the three
example process models and computing the m3 -distances, we get m3 (a, b) = 0.117
and m3 (b, c) = 0.183. According to our metric, the behavioral distance between
models a and b is smaller than the one between models b and c.
    Since m3 is a metric, it features the triangle inequality, which allows us to
bound the distance of models a and c without actually computing it. Based
on Def. 1, |m3 (a, b) − m3 (b, c)| ≤ m3 (a, c) (lower boundary) and |m3 (a, b) +
m3 (b, c)| ≥ m3 (a, c) (upper boundary), i.e., 0.066 ≤ m3 (a, c) ≤ 0.3. This approxi-
mation is confirmed by the actual computed value, which is m3 (a, c) = 0.067. The
computed distances also comply with our perception of the behavioral similarity
of the sample process models. Since possible traces of a cover the traces of c, due
to the parallel branch, these two models are more similar (less distant) to each
other than a and b, and b and c respectively.


5        Conclusion
Efficient similarity search requires a distance notion that obeys to certain proper-
ties: It must be a proper metric. We proposed a metric that allows comparing
and searching process models with behavioral aspects in mind, based on the
concept of behavioral profiles. These profiles are computed efficiently for a broad
class of process models [11]. We explained that metric with a simple example.
    The presented metric is our first attempt to investigate similarity of process
models in terms of behavioral profiles. In future work, we shall address the
metric, identify and rank further similarity coefficients, and construct a more
sophisticated metric that is substantiated through exhaustive experiments, e.g.,
a regression analysis. The expressiveness of such a metric shall be compared to a
reference model collection that has been evaluated by business process experts.
Further, we will address the suitability of this improved metric in similarity
search, as it is vital for a metric to be well discriminating in order to enable
efficient searching with confident results.
References
 1. W.M.P. van der Aalst. Workflow verification: Finding control-flow errors using
    petri-net-based techniques. In BPM, volume 1806 of LNCS, pages 161–183, 2000.
 2. Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José Luis Marroquı́n.
    Searching in Metric Spaces. ACM Comput. Surv., 33(3):273–321, 2001.
 3. Remco M. Dijkman, Marlon Dumas, and Luciano Garcı́a-Bañuelos. Graph matching
    algorithms for business process model similarity search. In Umeshwar Dayal, Johann
    Eder, Jana Koehler, and Hajo A. Reijers, editors, BPM, volume 5701 of Lecture
    Notes in Computer Science, pages 48–63. Springer, 2009.
 4. Marlon Dumas, Luciano Garcı́a-Bañuelos, and Remco M. Dijkman. Similarity
    search of business process models. IEEE Data Eng. Bull., 32(3):23–28, 2009.
 5. Rik Eshuis and Paul W. P. J. Grefen. Structural matching of bpel processes. In
    ECOWS, pages 171–180. IEEE Computer Society, 2007.
 6. Gisli R. Hjaltason and Hanan Samet. Index-driven similarity search in metric
    spaces (survey article). ACM Trans. Database Syst., 28(4):517–580, 2003.
 7. Matthias Kunze and Mathias Weske. Metric Trees for Efficient Similarity Search in
    Process Model Repositories. In Proceedings of the 1st International Workshop on
    Process in the Large (IW-PL ’10), Hoboken, NJ, September 2010.
 8. Alan Lipkus. A Proof of the Triangle Inequality for the Tanimoto Distance. Journal
    of Mathematical Chemistry, 26:263–265, 1999. 10.1023/A:1019154432472.
 9. Niels Lohmann, Eric Verbeek, and Remco M. Dijkman. Petri net transformations
    for business processes - a survey. T. Petri Nets and Other Models of Concurrency,
    2:46–63, 2009.
10. Boudewijn van Dongen, Remco Dijkman, and Jan Mendling. Measuring Similarity
    between Business Process Models. In Advanced Information Systems Engineering,
    volume 5074 of Lecture Notes in Computer Science, pages 450–464. Springer Berlin
    / Heidelberg, 2008.
11. Matthias Weidlich, Jan Mendling, and Mathias Weske. Efficient consistency mea-
    surement based on behavioural profiles of process models. IEEE Transactions on
    Software Engineering, 2010. To appear.
12. Andreas Wombacher. Evaluation of technical measures for workflow similarity based
    on a pilot study. In Robert Meersman and Zahir Tari, editors, OTM Conferences
    (1), volume 4275 of Lecture Notes in Computer Science, pages 255–272. Springer,
    2006.
13. Zhiqiang Yan, Remco M. Dijkman, and Paul Grefen. Fast business process similarity
    search with feature-based similarity estimation. In Robert Meersman, Tharam S.
    Dillon, and Pilar Herrero, editors, OTM Conferences (1), volume 6426 of Lecture
    Notes in Computer Science, pages 60–77. Springer, 2010.
14. Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity
    Search: The Metric Space Approach. Springer-Verlag New York, Inc., Secaucus, NJ,
    USA, 2005.