=Paper= {{Paper |id=Vol-3310/paper6 |storemode=property |title=Summarizing Process Traces for Analysis Tasks: An Intuitive and User-controlled Approach |pdfUrl=https://ceur-ws.org/Vol-3310/paper6.pdf |volume=Vol-3310 |authors=Phuong Nguyen,Vatche Isahagian,Vinod Muthusamy,Aleksander Slominski |dblpUrl=https://dblp.org/rec/conf/ijcai/NguyenIMS22 }} ==Summarizing Process Traces for Analysis Tasks: An Intuitive and User-controlled Approach== https://ceur-ws.org/Vol-3310/paper6.pdf
Summarizing Process Traces for Analysis Tasks: An
Intuitive and User-controlled Approach
Phuong Nguyen1 , Vatche Isahagian2 , Vinod Muthusamy2 and Aleksander Slominski2
1
    Google
2
    IBM Research


                                         Abstract
                                         Domains such as business processes and workflows require working with multi-dimensional ordered
                                         objects. There is a need to analyze this data for operational insights. For example, in business processes,
                                         users are interested in clustering process traces to discover per-cluster process models that are less
                                         complex. Such applications require the ability to measure the similarity between data objects. However,
                                         measuring the similarity between sequence-based data is computationally expensive. We present an
                                         intuitive and user-controlled approach to summarize sequence-based multi-dimensional data. Our
                                         summarization schemes provide a trade-off between the quality and efficiency of analysis tasks. We also
                                         derive an error model for summary-based similarity under an edit-distance constraint. Evaluation results
                                         over real-world datasets show the effectiveness of our methods.




1. Introduction
Many application domains produce data in the form of multi-dimensional sequence of objects.
For example, in business processes, an underlying process model is represented as a directed
acyclic graph of activities, the traces generated from the execution of the model are regarded
as instances of the underlying model. Each trace consists of a sequence of activities sorted
by time, where each activity in the trace appears in the process model and may be repeated1 .
Figure 1 shows an example of a loan application process model, along with a sequence of
activities—with multi-dimensional attributes—that represent a possible execution trace of the
model. For example, an activity can contain information about the responsible person and
department, the person who performs the activity, and the group to which she belongs. As an
example from another domain, Figure 2 shows a sample trace of a semiconductor manufacturing
workflow, where activities are sequenced and have multi-dimensional attributes, such as the
sector where the activity is performed and the person responsible for it.
  There is a need to get operational insights from such datasets. For example, in business
process management, discovered models are often complex and difficult to comprehend [1],
so users cluster process traces and applying process discovery algorithms [2] on each cluster.
These latter models tend to be both less complex and more accurate since there is less diversity

PMAI@IJCAI22: International IJCAI Workshop on Process Management in the AI era, July 23, 2022, Vienna, Austria
$ pvnguye2@illinois.edu (P. Nguyen); vatchei@ibm.com (V. Isahagian); vmuthus@us.ibm.com (V. Muthusamy);
aslom@us.ibm.com (A. Slominski)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings           CEUR Workshop Proceedings (CEUR-WS.org)
                  http://ceur-ws.org
                  ISSN 1613-0073




                  1
                      Trace, process trace, and sequence interchangeably refer to an instance of a process.
                                               Verify                                                  Send
                                                                                                                                   Activity                Sector     Responsible
                                               employment                                              approval
                                                                                                                                 Pull wafers              CONTROL         A
                                                                                                                    Record Wafer IDs / Attach Wafer ID Map CONTROL        A
               Record loan           Request           Review                   Review loan
               application           credit report     credit report            application                                 Initiator Coordination        INIT ATTN       A
 Loan
 application                                                                                                               lamp degas in chamber E         METAL          B
 received
                                                                                                       Send
                                     Perform title     Review title                                    rejection          sputter etch in chamber D        METAL          B
                                     search            report
                                                                                                                      low stress 10 kW TaN - Rs monitor    METAL          B
               Timestamp,Responsible,Department,Trace,Resource,Activity,Group
                                                                                                                           lamp degas in chamber E         METAL          B
               09/22/15 10:16 AM,Resource21,General,Trace-11,Resource21,Record loan application,Group 1
               09/26/15 08:10 AM,Resource21,General,Trace-11,Resource10,Request credit report,Group 4                     sputter etch in chamber D        METAL          B
               10/01/15 03:05 PM,Resource21,General,Trace-11,Resource21,Review credit report,Group 1                        low stress 10 kW TaN           METAL          B
               10/15/15 10:00 AM,Resource21,General,Trace-11,Resource15,Verify employment,Group 2
               10/20/15 12:30 PM, Resource21,General,Trace-11,Resource15,Review loan application,Group 1                Wafer transport to 1-2 from 7-2     DIEL          C
               10/31/15 04:30 PM,Resource21,General,Trace-11,Resource21,Send approval,Group 4                            200C 1000W Lo OH oxide             DIEL          C
                                                                                                                           transport from 1-2 to 5-2        DIEL          C
Figure 1: Loan application process and a
sample trace.                                                                                                      Figure 2: Sample trace of a semiconductor
                                                                                                                   manufacturing workflow.

among the traces within a cluster. In another example, scientists are interested in querying the
provenance of workflow executions to look for executions similar to the one in their query.
   Analyzing multi-dimensional sequence data poses a number of challenges. The first is
computational complexity. For example, using edit-distance to capture the similarity between
sequences [3] is computationally expensive since edit-distance is quadratic to the sequence
length and business processes sequences in can consist of hundreds of items. This is especially
challenging when dealing with large datasets and in applications such as traces clustering,
where a lot of similarity computations need to be calculated. This complexity can lead to delays
that affect interactive applications, such as similarity search, where users interact directly with
the application and expect results in a timely manner. The second challenge is to combine
multi-dimensional attributes of data with the sequential structure between data objects into
a unified approach. Edit-distance, for example, only considers the number of operations to
transform one trace into another.
   We employ summarization schemes to enable efficient analysis of multi-dimensional data
under edit-distance constraints. We focus on analysis tasks that are based on edit-distance
because it is a widely used measure for similarity. Sections 2 and 3 introduce the key approach:
instead of performing the analysis on the original high-dimensional data, which is computa-
tionally expensive, we transform the data into a summary or embedding space that has fewer
dimensions, so that the same analysis can be computed more efficiently. Section 4 introduces
our topic-summarization schemes to incorporate the multi-dimensional attributes of data items
into the analysis and produce summaries that capture the semantics of process traces, while
enabling the flexible trade-off between quality and efficiency of analysis tasks on summaries.In
Section 5, we develop an error model for the edit-distance measure in the summary space to
provide some guarantees for the results of analysis tasks on summaries. Finally, Section 6 shows
the effectiveness of our summarization scheme on a number of datasets.


2. Trace Summarization Approach
We assume the existence of an original dataset that consists of a set of process traces or logs of
workflow executions. Running an analysis, which would typically be computationally expensive
        Original data
              …
                                                             “Ground-truth”
              …                  Analysis under edit-
             …                   distance constraint
                                                        Relate
                  Generate summaries                                                   Receive       Review        Inform
                                                        approximate                   application   application   decision
                                                        results to the
      Summaries of data                                 ground-truth

                                                            Approximate       Figure 4: Topic-based representation of the
                                 Analysis under edit-       results
             …                   distance constraint
                                                                              process in Figure 1.


 Figure 3: Overview of our approach.

due to the high-dimensionality of the data, provides results which are deemed as exact or
“ground truth" answer. As shown in Figure 3, our approach is to transform the original data into
a new summary space with fewer dimensions, thus avoiding the computationally expensive
analysis on original data. The output of any analysis in the summary space is an approximation
of the “ground truth". To show the practicality of our proposed approach, we need to address
the following challenges: (1) How to generate summaries of data in a controlled and intuitive
manner, and (2) Relate the approximate results on summaries to the results on original data?
   To address these two challenges, we define sequential-order-preserving summarization and
introduce a summarization scheme that is intuitive and give users more control over the resulting
summaries. We also present an error model for summary-based similarity measure under edit-
distance constraint and show that it provides guarantees over the results of clustering and
similarity search tasks.


3. Definitions
A multidimensional set O is a set of objects O and a set of associated attributes A = (𝒜1 , 𝒜2 , ...,
𝒜|A| ): O = ⟨O, A⟩, each object 𝑜 ∈ O is defined as a tuple: 𝑜 = (𝒜1 (𝑜), 𝒜2 (𝑜), ..., 𝒜|A| (𝑜)), in
which each 𝑖-th dimension corresponds to the value of attribute 𝒜𝑖 of 𝑜, denoted as 𝒜𝑖 (𝑜).
   A Multidimensional Sequence p of size 𝑚 on a multidimensional set O is defined as an ordered
set of 𝑚 objects in O: p = (𝑝1 , 𝑝2 , ..., 𝑝𝑚 ), 𝑝𝑖 ∈ O, 1 ≤ 𝑖 ≤ 𝑚. We denote 𝜄p (𝑝) as the index,
or position, of an object 𝑝 in a sequence p. In the above definition, 𝜄p (𝑝𝑖 ) = 𝑖, ∀1 ≤ 𝑖 ≤ 𝑚.
   For example, Figure 2 presents a sequence of objects defined on a multidimensional set with
three attributes: Activity, Sector, and Responsible.
   Our interest is in different forms of summarization of multidimensional sequences to improve
efficiency of sequence analysis. Before defining summarization of sequences, we define the
notion of many-to-one mapping of objects between multidimensional sets as an object mapping
function f from an original multidimensional set O to a summary set S , f : O → S , so that
for each 𝑝 ∈ O, ∃!𝑠 ∈ S : 𝑠 = f (𝑝).

Definition 1. A f -summarization of a sequence p on O is defined as a summary sequence s
on S , denoted as s = f (p), where each object 𝑝 ∈ p is replaced by its many-to-one mapping f :
𝑠 = f (𝑝), while retaining the same index 𝜄s (𝑠) := 𝜄p (𝑝).

  A summarization of a sequence is said to preserve the sequential relationship from the original
sequence if it satisfies the following definition:
Definition 2. A f -summarization of a sequence p, denoted as s = f (p), is a sequential preserving
summarization of p if: ∀𝑝, 𝑝′ ∈ p, if 𝜄p (𝑝) < 𝜄p (𝑝′ ), then 𝜄s (𝑠) ≤ 𝜄s (𝑠′ ), with 𝑠 = f (𝑝), 𝑠′ =
f (𝑝′ ).

   By retaining the indices of objects in the original sequence, f -summarization (c.f, definition 1)
preserves sequential relationships, which is vital in improving the efficiency of sequence analysis.
Therefore, we define the notion of reduced f -summarization, in which adjacent duplicate objects
in the summary sequence are collapsed to reduce the size of a summarized sequence.

Definition 3. A reduced f -summarization of a sequence p on O is defined as a sequence s on S ,
denoted as s = f * (p), where each object 𝑝 ∈ p is replaced by its f -based mapping 𝑠 = f (𝑝) in s
and, ∀𝑝𝑖 , 𝑝𝑖+1 ∈ p, 1 ≤ 𝑖 ≤ |p| − 1, if 𝑝𝑖 = 𝑝𝑖+1 , then 𝜄s (𝑝𝑖 ) = 𝜄p (𝑝𝑖+1 ).

Theorem 1. A reduced f -summarization is sequence preserving.

Proof. Omitted due to space constraints. Available in [4].


4. Topic-Based Summarization
To incorporate the multidimensional attributes of a sequence’s data items, we begin by outlining
Attribute-based summarization as an f -summarization2 where f is a mapping on O. This
scheme provides an intuitive way for users to choose attributes as a summarization criteria and
produces summaries that are easy to interpret. It does not give users control over the average
length of summarized sequences, which we refer to as resolution. This is because attribute
values are static and already defined with the original data.
   Longer summarized sequences are more expensive to analyze, but attribute-based summariza-
tion offers little control in the sequence length. We seek a way for users to trade-off between
efficiency and accuracy of data analysis. For example, for similarity search, users might tolerate
false positives (e.g., 0.9 false positive rate) for faster response (e.g., results within 5 seconds).
We observe that business processes can often be represented by higher-level process models of
fewer dimensions. Figure 4 shows an example of a more abstract version of the process model
in Figure 1, where each activity corresponds to multiple activities in Figure 1.
   We propose a topic-based summarization technique that captures the many-to-one mapping
from the original sequences to one with fewer dimensions, where each topic is an abstract
representation of a set of original dimensions. Since the topics are implicit from the original
sequences, we first perform dimensionality reduction on the original sequences to transform
the original dimensions to topics. Then, we define the notion of topic-based summarization
using the new representation.
   Algorithm 1 highlights the main steps in the topic-based summarization process. Before
applying dimension reduction techniques to the original sequences (Line 3), it is important
to have an appropriate data representation for sequences (Line 2). We begin by selecting an
attribute of the original sequences and transform multidimensional sequences to the appropriate
attribute-based summarization. It is often intuitive to pick the attribute with the most number

    2
        Unless explicitly stated, a summarization will refer to reduced summarization.
Algorithm 1 Topic summarization process steps
 1: procedure GenerateKTopicSummarization(S, 𝜅, O, 𝜆)
 2:    M = Generate_Vectors(S) (by Equation 1)
 3:    M′ , W = Dim_Reduction(M, 𝜅)
 4:    // Calculate pairwise similarities 𝜃(𝑎𝑖 , 𝑎𝑗 )
 5:    for each pair (𝑎𝑖 , 𝑎𝑗 ) ∈ O do
 6:         𝒮𝑖𝑗 = 𝐶𝑎𝑙𝑐𝑆𝑖𝑚(𝑎𝑖 , 𝑎𝑗 )
 7:    // Perform hierarchical clustering
 8:    ℋ = hierarchical_clustering(𝒮)
 9:    // Flatten the hierarchy ℋ
10:    C = flatten_hierarchy(ℋ, 𝜅)
11:    return C


of dimensions as this attribute likely captures the most essential information about the objects
in the original multidimensional set. For example, in Figure 2, Activity is the attribute with the
most number of dimensions and it is also the base attribute to represent sequences, while other
attributes, such as Sector and Responsible, provide supporting information for Activity.
   We then represent each sequence p as a numeric vector (𝜗1 , 𝜗2 , ..., 𝜗|𝒜* | ), where 𝒜* is the
base attribute set that sequences are transformed to in the first step and |𝒜* | is the number of
dimensions on 𝒜* . We measure 𝜗𝑖 for p in a way that captures both the local importance of
each dimension and its specificity to a sequence. To capture the local importance, we use the
frequency of the 𝑖-th dimension in p, denoted as tf𝑖p , that is defined by the number of items in
p whose values equal the 𝑖-th dimension of 𝒜* , denoted as 𝑎𝑖 . To capture the specificity, we
use the popularity of a dimension across all sequences: df𝑖 = |{p ∈ S|𝑎𝑖 ∈ p}|, where S is the
set of all sequences. Intuitively, the higher df𝑖 is, the more popular the 𝑖-th dimension is and
thus, the less specificity it is to a sequence. The formulation of 𝜗𝑖 is as follows:
                                   {︃
                                                                |S|
                                        (1 + 𝑙𝑜𝑔(tf𝑖p )) × 𝑙𝑜𝑔( df  ) if 𝑎𝑖 ∈ p
                            𝜗𝑖 =                                  𝑖                             (1)
                                    0                                otherwise
   After representing sequences as vectors, the set of sequences S can be represented as a matrix
M, whose size is |S|×|𝒜* | where each row corresponds to a vector representation of a sequence
in S. With this matrix representation, we can apply off-the-shelf dimension reduction techniques
on M, such as non-negative matrix factorization (NMF), principle component analysis (PCA),
or singular value decomposition (SVD), among others (Line 3). The results of these techniques
can be presented as two matrices M′ and W. M, whose size equals |S| × 𝑘 with 𝑘 being the
number of new dimensions (i.e., 𝑘 = |S |), represents the original sequences on the summary
space. W, whose size equals |O| × 𝑘, represents the original dimensions on the new dimensions,
or topics (i.e., each row is a vector representing the distribution of an original dimension over
the set of new dimensions).
   After dimensionality reduction, we produce a many-to-one mapping from the original dimen-
sions to topics (Line 6). Two dimensions 𝑎𝑖 , 𝑎𝑗 in the original space are likely to be in the same
topic if their corresponding vectors in W have high similarity (e.g., using Cosine similarity).
In addition, 𝑎𝑖 and 𝑎𝑗 are likely to be in the same topic if they frequently appear next to each
other in a sequence (i.e., they represent two closely related activities in the underlying process
model). From these insights, we model the problem of finding an optimal many-to-one mapping
from the original dimensions to topics as a constrained optimization problem:
                                       ∑︁                                       ∑︁
                 argmax       𝜆·                     𝜃(𝑎𝑖 , 𝑎𝑗 ) + (1 − 𝜆) ·               𝜔(𝑎𝑖 , 𝑎𝑗 )𝜃(𝑎𝑖 , 𝑎𝑗 )
                    f
                                   f (𝑎𝑖 )=f (𝑎𝑗 )                             (𝑎𝑖 ,𝑎𝑗 )

                 subject to   f :O→S                                                                                (2)
                              ∀𝑎𝑖 , 𝑎𝑗 ∈ O, if f (𝑎𝑖 ) ̸= f (𝑎𝑗 ), then 𝑎𝑖 ̸= 𝑎𝑗 .
                              |S | = 𝑘.
  where 𝜃(𝑎𝑖 , 𝑎𝑗 ) is the similarity between dimensions 𝑎𝑖 and 𝑎𝑗 based on their corresponding
representation in W, 𝜔(𝑎𝑖 , 𝑎𝑗 ) is the number of times 𝑎𝑖 and 𝑎𝑗 are adjacent in input sequence
set S, and 𝜆 is used to bias towards similarity between dimensions or the number of adjacent
appearances. We now can formally define the notion of topic summarization as follows:
Definition 4. (𝑘-Topic Summarization) A 𝑘-topic summarization of sequences from original
multidimensional set O to a summary set S is defined as a reduced 𝑓 -summarization, where the
mapping f is the solution of the optimization problem defined in (2).
   Finding an optimal k-topic summarization is NP-hard (a variant of the set partitioning
problem). We take a greedy heuristic approach similar to the agglomerative clustering algorithm
(Line 8). It starts by treating each original dimension as a singleton cluster, then merging nearby
pairs of dimensions until all clusters have been merged into a single cluster. This step creates
a hierarchy where each leaf node is a dimension and the root is the single cluster of the last
merge. Because we want a partition of disjoint 𝑘 clusters as the new dimensions, the next step
is to cut the hierarchy at some point to obtain the desirable number of clusters. To find the
cut (Line 10), we find the minimum similarity threshold so that the distance between any two
dimensions in the same cluster is no more than that threshold and there are at most 𝑘 clusters.


5. Error Model for Edit-Distance on Summaries
We seek to relate the approximate results of analysis tasks on the summary space to those on
the original space. Since a similarity measure underlies a lot of analysis tasks, such as similarity
search and traces clustering, we focus on the relationship between the similarity of sequences
on the summary space with that on the original space under edit-distance constraint: ed(p, q)
& ed(f (p), f (q)), where ed is the edit-distance function and f is a summarization function. We
select edit-distance as the similarity measure because it captures both the structural similarity
(i.e., whether two sequences consist of data items in similar order) and content-based similarity
(i.e., whether two sequences share similar set of data items) between sequences. Furthermore,
edit-distance’s results, presented as a chain of edit operators to transform a sequence to the
other, can be easily interpreted by users, which makes it widely popular in practice.
    In terms of the relationship between ed(p, q) and ed(f (p), f (q)), we are interested in the
contractive property.
Definition 5. Given a summarization f , we said that the edit-distance measure satisfies the
contractive property on f if ed(p, q) ≥ ed(f (p), f (q)), ∀p, q.
  The contractive property guarantees that performing edit-distance based similarity search
on the summary space using f will yield results with 100% recall [5]. Specifically, given
a query sequence p and an edit-distance threshold 𝜒, the similarity search task needs to
find all sequences in the sequence set S that have edit-distance with p smaller or equal than
𝜒: S* = {q ∈ S|ed(p, q) ≤ 𝜒}. If the contractive property holds for a summarization
f , it is sufficient to find all sequences q that satisfy the threshold 𝜒 on the summary space:
S̄ = {q ∈ S|ed(f (p), f (q)) ≤ 𝜒}. Because if ed(p, q) ≤ 𝜒, then ed(f (p), f (q)) ≤ 𝜒; we can
guarantee that if q ∈ S* , then q ∈ S̄ (i.e., 100% recall).
    While the contractive property does not hold in general for edit-distance between summarized
sequences, we show that it holds under certain circumstances. The first of which is when f is a
non-reduced many-to-one.

Theorem 2. If f is a non-reduced many-to-one summarization on O, as defined in definition 1,
then we have: ed(p, q) ≥ ed(f (p), f (q)), ∀p, q on O.

Proof. Omitted due to space constraints. Available in [4].

  For reduced many-to-one summarization f , we are able to derive rules to indicate whether the
contractive property holds for edit-distance of a particular pair of sequences p, q.

Theorem 3. Given two sequences p, q in the original space O, if f is a reduced many-to-one
summarization on O, as defined in definition 3, then:

    • If Γp,q ≥ Λf (p),f (q) , then we have ed(p, q) ≥ ed(f (p), f (q)); or edit-distance on summary
      space by f satisfies the contractive property.
    • If Γf (p),f (q) > Λp,q , then we have ed(p, q) < ed(f (p), f (q)); or edit-distance on summary
      space by f does not satisfy the contractive property.

where Λp,q = 𝑚𝑎𝑥(|p|, |q|) and Γp,q = ||p| − |q||, with |p| being the length of p.

Proof. Omitted due to space constraints. Available in [4].

  While Theorem 3 does not cover all cases, we empirically show that the number of sequence
pairs whose edit-distances on reduced many-to-one summarization that violate the contractive
property is very small. Thus, it has a high recall for similarity search task.


6. Evaluation
We evaluate the effectiveness and efficiency of our summarization schemes on two analysis
tasks: trace similarity search and traces clustering.
Datasets: We use datasets from multiple domains: the Lithography dataset (596 traces with
1066 types of activities, each having multi-dimensional attributes) is from a real semiconductor
manufacturing process, the BPIC 2015 dataset (1199 traces with 289 activity types) is from a
building permit application process, and the BANK dataset (2000 traces with 113 activity types)
consists of synthetically generated logs from a large bank transaction process. Evaluations were
conducted on a 2.7GHz quad-core Intel Core i7 machine with 16GB of RAM 3 .
    3
    Results for BPIC experiments are available at [4]. The Lithography dataset is production dataset provided by
IBM and is private. Other datasets is available at https://data.4tu.nl/repository/collection:all.
                                 k=2         k=5        k=10        k=20       k=50        k=100
                    Topic      0.000%      0.003%      0.006%      0.007%     0.010%      0.014%
                  Random       0.002%      0.010%      0.007%      0.021%     0.027%      0.033%

Figure 5: Similarity false negatives: percentage of sequence pairs in the Lithography dataset where
edit-distance in the summary space violates the contractive property.

Summarization schemes: We compare results of analysis tasks using our proposed summa-
rization schemes (i.e., Topic and Attribute), Random summarization, which randomly maps
an original dimension to a new dimension in the summary space, and with the analysis results
on the original space. Although Random-based summaries lack interpretability, as shown
in [6], a random summarization scheme on sequence graph can yield good results. We vary
the number of dimensions 𝑘 in the summary space used by Random and Topic and vary the
attributes used by Attribute.

6.1. Evaluation Results on Similarity Tasks
The contractive property holds for most of the cases, as seen in Figure 5 which shows the
percentage of sequence pairs in the Lithography dataset, out of over 177,000 pairs, whose
edit-distances violate the contractive property in the summary space using 𝑇 𝑜𝑝𝑖𝑐 and 𝑅𝑎𝑛𝑑𝑜𝑚
summarization over different number of summary dimensions 𝑘. Since the recall rate is high,
we focus on the false positive rate of the similarity search results.
Evaluation metrics: Given an edit distance threshold 𝜒, the false positive metric tells us that,
out of all sequence pairs that satisfy ed(f (p), f (q)) ≤ 𝜒 on the summary space, how many of
them actually satisfy the threshold in the original space: ed(p, q) ≤ 𝜒.
Effectiveness: Figure 6 shows the effectiveness of the summarization schemes on the similarity
search task for the Lithography, and BANK datasets4 . The y-axis reports the false positive
results, while the x-axis corresponds to different edit-distance thresholds. As expected (Figure 6a,
6b, 6d, 6e), the higher the number of dimensions in the summary space (denoted by 𝑘), the
better the result (i.e., lower false positive rates). That is because, with more dimensions in the
summary space, summaries of sequences more resemble the original sequences. Thus, there is
little difference between edit-distances on the summary space and in the original space.
   Comparing the summarization schemes on the same number of dimensions, Random out-
performs Topic (at the cost of interpretability and efficiency, as we will show later). For
Attribute (Figure 6c), since we cannot control the number of dimensions (as it depends on
the attribute data), the quality of the results also depend on the chosen attribute. Specifically,
the 𝑇 𝑟𝑎𝑐𝑘𝑒𝑑𝐵𝑦 5 attribute outperforms 𝑆𝑒𝑐𝑡𝑜𝑟 and 𝑇 𝑜𝑜𝑙. This is in part because there are
more dimensions on 𝑇 𝑟𝑎𝑐𝑘𝑒𝑑𝐵𝑦’s summary space, and thus the summaries on the 𝑇 𝑟𝑎𝑐𝑘𝑒𝑑𝐵𝑦
space more resemble the original sequences. 𝑆𝑒𝑐𝑡𝑜𝑟 and 𝑇 𝑜𝑜𝑙 produce similar results, since
similar 𝑇 𝑜𝑜𝑙s are often used in the same 𝑆𝑒𝑐𝑡𝑜𝑟.

    4
       We only evaluate Attribute summarization on the Lithography dataset because this dataset’s attributes
provide better semantics compared with BANK.
     5
       Three main activity attributes are used on the Lithography data: 𝑇 𝑟𝑎𝑐𝑘𝑒𝑑𝐵𝑦 represents the person in
charged of the activity; 𝑆𝑒𝑐𝑡𝑜𝑟 represents the area/department where the activity is taken, and 𝑇 𝑜𝑜𝑙 represents the
tool used to perform the activity.
                         1                                                                                                                            1                                                                                                                               1
                       0.9                                                                                                                          0.9                                                                                                                             0.9
                       0.8                                                                                                                          0.8                                                                                                                             0.8
 False positive rate




                                                                                                                              False positive rate




                                                                                                                                                                                                                                                              False positive rate
                       0.7                                                                                                                          0.7                                                                                                                             0.7
                       0.6                                                                                                                          0.6                                                                                                                             0.6
                       0.5                                                                                                                          0.5                                                                                                                             0.5
                       0.4                                                                                                                          0.4                                                                                                                             0.4
                       0.3                                                                                                                          0.3                                                                                                                             0.3
                       0.2                                                                                                                          0.2                        k=2                                            k=5          k=10                                     0.2
                                            k=2                                 k=5                          k=10                                                                                                                                                                                        Sector   Tracked By   Tool
                       0.1                                                                                                                          0.1                        k=20                                           k=50         k=100                                    0.1
                                            k=20                                k=50                         k=100
                         0                                                                                                                            0                                                                                                                               0
                             0   10 20 30 40 50 60 70 80 90 100                                                                                           0    10 20 30 40 50 60 70 80 90 100                                                                                               0    10 20 30 40 50 60 70 80 90 100
                                       Edit-distance threshold                                                                                                       Edit-distance threshold                                                                                                           Edit-distance threshold


                             (a) Random (Lithography)                                                                                                         (b) Topic (Lithography)                                                                                                     (c) Attribute (Lithography)
                                                                                                               1                                                                                                          1
                                                                                                             0.9                                                                                                        0.9
                                                                                                             0.8                                                                                                        0.8




                                                                                                                                                                                                  False positive rate
                                                                                       False positive rate



                                                                                                             0.7                                                                                                        0.7
                                                                                                             0.6                                                                                                        0.6
                                                                                                             0.5                                                                                                        0.5
                                                                                                             0.4                                                                                                        0.4
                                                                                                             0.3                                                                                                        0.3
                                                                                                             0.2                                                 k=2    k=5                                             0.2                           k=2       k=5                       k=10
                                                                                                             0.1                                                 k=10   k=20                                            0.1
                                                                                                                                                                 k=50                                                                                 k=20      k=50
                                                                                                               0                                                                                                          0
                                                                                                                   0     10    20      30     40     50                 60                                                        0       10       20        30                       40
                                                                                                                              Edit-distance threshold                                                                                      Edit-distance threshold


                                                                                                                   (d) Random (BANK)                                                                                              (e) Topic (BANK)
Figure 6: False positive rates by different summarization schemes on similarity search task using the
Lithography, and BANK datasets.

                                                                                                                                                                                                             14000
                                                                               60                   Topic               Random                                                                                                        Topic         Random
                                                                                                                                                                                                             12000
                                                   Processing time (seconds)




                                                                                                                                                                                 Processing time (seconds)




                                                                               50
                                                                                                                                                                                                             10000
                                                                               40
                                                                                                                                                                                                                        8000
                                                                               30                                                                                                                                       6000
                                                                               20                                                                                                                                       4000

                                                                               10                                                                                                                                       2000

                                                                               0                                                                                                                                              0
                                                                                       2                           5       10      20     50                            100                                                           2           5      10       20                               50
                                                                                                                       Number of dimensions                                                                                                      Number of dimensions


                                                                                                   (a) (Lithography)                                                                                                                        (b) (Bank)
Figure 7: Efficiency comparison of processing time between Random and Topic summarizations using
the Lithography, and Bank datasets.

Efficiency: To evaluate the efficiency of the summarization schemes, we vary the number of
dimensions 𝑘 in the summary space and measure the time to calculate the edit-distance between
all pairs of sequences. We see in Figure 7, that for both Random and Topic, larger 𝑘, which
leads to longer longer sequences in the summary space, results in longer processing time. For
similar values of 𝑘, Topic outperforms Random, which verifies Topic’s ability to capture the
semantic relationship between the original dimensions, and thus significantly reduces the size
of sequences in the summary space, as well as the processing time. More importantly, even at
different values of 𝑘 where we observed similar effectiveness of results by Random and Topic
(e.g., 𝑘 = 2 with Random and 𝑘 = 10 with Topic on the Lithography dataset in Figure 6),
Topic is still much more efficient than Random.

6.2. Evaluation Results on Traces Clustering
Evaluation metrics: We evaluate the clustering results using process-specific metrics [3]:
weighted average conformance fitness, and weighted average structure complexity. While
                       50
                       45                                                                   N=3                  N=4                N=5
                                                                           Approach
 Conformance fitness

                       40
                       35                                                             Arcs Places Trans.   Arcs Places Trans. Arcs Places Trans.
                       30                                                 Original ED 3930 1505 1964       3419 1328 1709 3508 1356 1753
                       25
                       20                                                 Topic       3855 1474 1927       3261 1269 1630 2685 1061 1342
                       15
                       10                                                 Random      3959 1500 1979       3733 1418 1866 3552 1351 1775
                        5                                                 Sector      3697 1429 1848       3043 1195 1521 2775 1094 1387
                        0
                                 N=3        N=4               N=5         Tool        3722 1441 1860       2792 1110 1396 2758 1104 1379
                                       Number of clusters                 Tracked By 3482 1357 1741        2827 1121 1413 2650 1051 1325
                       Original ED     Topic                Random
                       Sector          Tool                 Tracked By

Figure 8: Conformance fitness                                            Figure 9: Traces clustering results’ structural complexity com-
comparison.                                                              parison. (Green and red boxes denote best and worst results, res-
                                                                         pectively.)

the process model’s conformance fitness quantifies the extent to which the discovered model
can accurately reproduce the recorded traces, the structure complexity quantifies whether the
clustering results produce process models that are simple and compact. Given a summarization
scheme, we first transform all sequences to the summary space, and then perform traces
clustering (using hierarchical clustering) with edit-distance as the similarity measure. Then,
a process model is generated for each cluster using the Heuristic mining algorithm [7] and
then converted to the Petri-Net model for conformance analysis. Given the Petri-net model,
we use two publicly available plugins from the ProM framework [8] for fitness and structural
complexity analysis: The Conformance Checker Plugin is used to measure the fitness of the
generated process models and the Petri-Net Complexity Analysis Plugin is used to analyze the
structural complexity of the process models. After fitness and complexity scores are calculated
for each cluster, the final scores are calculated as the average score over all clusters, weighted
by the cluster size.
Effectiveness of summarization schemes: Figure 8 highlights the conformance fitness of
the clustering results in the summary space by different summarization schemes6 on the
Lithography dataset. Surprisingly, using summarization schemes not only helps improve
the efficiency of the clustering task (as we showed earlier in the efficiency evaluation), but also
helps produce clusters with process models of higher fitness, compared with the clustering
results in the original space. The trend is similar when varying the number of clusters 𝑁 . That
is because measuring trace similarity on the summary space helps remove noise that often
exists when measuring similarity using the original representation. Among summarization
schemes, Attribute helps produce clustering results of higher conformance fitness (especially
when using the 𝑇 𝑟𝑎𝑐𝑘𝑒𝑑𝐵𝑦 attribute). That is because Attribute summarizations capture
better the semantic relationship between traces (e.g., traces are similar if the corresponding
sequences of 𝑆𝑒𝑐𝑡𝑜𝑟, 𝑇 𝑜𝑜𝑙, or 𝑇 𝑟𝑎𝑐𝑘𝑒𝑑𝐵𝑦 are similar).
   In terms of the structural complexity (Figure 9), Attribute summarizations outperform
other summarization schemes, again due to its ability to capture semantic relationships between
traces, producing clusters whose process models capture traces with similar semantics, and
thereby having simpler model structures. On the other hand, Random , unable to capture the
semantic relationships between traces, is the worst performer.

                        6
                            We use 𝑘 = 2 for Random, and 𝑘 = 20 for Topic, as these are similarly effective for similarity search.
   In both conformance fitness and structural complexity tests, Topic summarization approaches
Attribute. Unlike Attribute summarization, which does not give users control over the
resolution of the summaries, Topic summarization provides a qualitative advantage in offering
a tunable parameter, 𝑘, to trade-off between the effectiveness and efficiency in the analysis task.


7. Related Work
Subsequence mapping and sequence retrieval is an active area of research. One common approach
is to summarize original sequences using q-grams [9, 10] and measure the similarity between
two sets of q-grams. DRESS [9] uses the most frequent codewords as references to identify a set
candidate matches of a query. MinSearch [10] partitions strings into a hierarchy of substrings
and builds an index comprised of a set of hash tables, so that strings having common substrings
and thus small edit distance are grouped into the same hash table. These methods do not
preserve the sequential relationship between data items from the original sequences, and do
not consider sequences of multi-dimensional attributes of each data item.
   Graph similarity and mining focuses on transforming the original graph – based on graph
substructures e.g. trees [11], branches [12] – to a compact representation before measuring sim-
ilarity. Recent techniques [13] make use of disjoint substructures of graphs to capture structural
differences between graphs. Theses graphs lose their representation and interpretability after
being transformed into substructure representation.
   Embedding methods [14][15][16] improve efficiency of similarity search on complex data. Few
of the embedding approaches guarantee properties of similarity measure on the embedding space,
such as contractive property. For example, it may require that the similarity measure between
data on the embedding space to be from a specific family of measure (e.g., Minkowski metric).
Furthermore, techniques that transform original sequences into vector-based representation do
not maintain the sequential relationship between data items on the new representation.
   There has been a significant amount of research on various topics related to graph summa-
rization. We refer the reader to the following surveys [17, 18]. OLAP [18] enables interpretable
summaries of original graph at various resolutions as aggregate graphs. Chen et al. [6] show
that random summaries are capable of mining frequent graph patterns and effectively reduce
the size of original graph. In this work, besides using explicit attributes, we leverage the implicit
topics as summarization criteria. We also show that, different from general graphs, random
summarization on sequences, although produces good effectiveness, suffers from efficiency.
   Efforts to address scalability issues in business process analysis focuses either on process
model discovery of complex traces [19], or the use of vector space-based dimensional reduction
to improve the performance of traces clustering [20]. Our focus is on improving efficiency of
traces clustering and similarity search under edit-distance constraint.


8. Conclusions
We introduce a method to perform efficient analysis on sequence-based multi-dimensional data
using intuitive and user-controlled summarizations. We define a topic summarization scheme
that offer flexible trade-off between quality and efficiency of analysis tasks and derive an error
model for summary-based similarity under an edit-distance constraint. The approach was found
to be both effective and efficient based on evaluations on real-world process datasets.


References
 [1] A. K. A. De Medeiros, et al., Process mining based on clustering: A quest for precision, in:
     BPM, 2007.
 [2] W. Van der Aalst, et al., Workflow mining: Discovering process models from event logs,
     TKDE (2004).
 [3] J. Bose, et al., Context aware trace clustering: Towards improving process mining results,
     in: SDM, 2009.
 [4] P. Nguyen, et al., Summarized: Efficient framework for analyzing multidimensional process
     traces under edit-distance constraint, arXiv:1905.00983 (2019).
 [5] P. Papapetrou, et al., Reference-based alignment in large sequence databases, VLDB (2009).
 [6] C. Chen, et al., Mining graph patterns efficiently via randomized summaries, VLDB (2009).
 [7] A. Weijters, W. M. van Der Aalst, A. A. De Medeiros, Process mining with the heuristics
     miner-algorithm, Technische Universiteit Eindhoven, Tech. Rep. WP (2006).
 [8] B. F. Van Dongen, et al., The prom framework: A new era in process mining tool support,
     in: Conference on Application and Theory of Petri Nets, 2005.
 [9] A. Kotsifakos, et al., Dress: dimensionality reduction for efficient sequence search, KDD
     (2015).
[10] H. Zhang, Q. Zhang, Minsearch: An efficient algorithm for similarity search under edit
     distance, in: KDD, 2020.
[11] W. Zheng, et al., Graph similarity search with edit distance constraint in large graph
     databases, in: CIKM, 2013.
[12] Z. Li, et al., An efficient probabilistic approach for graph similarity search, in: ICDE, 2018.
[13] J. Kim, D.-H. Choi, C. Li, Inves: Incremental partitioning-based verification for graph
     similarity search., in: EDBT, 2019, pp. 229–240.
[14] M. Espadoto, et al., Toward a quantitative survey of dimension reduction techniques, IEEE
     transactions on visualization and computer graphics (2019).
[15] C. Faloutsos, K.-I. Lin, FastMap: A fast algorithm for indexing, data-mining and visualiza-
     tion of traditional and multimedia datasets, 1995.
[16] J. T.-L. Wang, et al., Metricmap: an embedding technique for processing distance-based
     queries in metric spaces, IEEE Transactions on Systems, Man, and Cybernetics (2005).
[17] Y. Liu, et al., Graph summarization methods and applications: A survey, ACM (CSUR)
     (2018).
[18] a. o. Queiroz-Sousa, A review on olap technologies applied to information networks, ACM
     Transactions on Knowledge Discovery from Data (TKDD) 14 (2019) 1–25.
[19] S. J. Leemans, et al., Scalable process discovery with guarantees, in: Conference on
     Enterprise, Business-Process and Information Systems Modeling, 2015.
[20] M. Song, et al., A comparative study of dimensionality reduction techniques to enhance
     trace clustering performances, Expert Systems with Applications (2013).