<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Interactive Process Clustering with t-SNE</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Steffen</forename><surname>Schuhmann</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">German Research Center for Artificial Intelligence (DFKI)</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Saarland University</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Jana-Rebecca</forename><surname>Rehse</surname></persName>
							<email>rehse@uni-mannheim.de</email>
							<affiliation key="aff0">
								<orgName type="department">German Research Center for Artificial Intelligence (DFKI)</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">University of Mannheim</orgName>
								<address>
									<settlement>Mannheim</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sebastian</forename><surname>Baumann</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">German Research Center for Artificial Intelligence (DFKI)</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Saarland University</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Peter</forename><surname>Fettke</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">German Research Center for Artificial Intelligence (DFKI)</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Saarland University</orgName>
								<address>
									<settlement>Saarbrücken</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Interactive Process Clustering with t-SNE</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0759C92DF07A503285BF54AB68B27517</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:06+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Trace Clustering</term>
					<term>Process Discovery</term>
					<term>Process Analytics</term>
					<term>Interactive Data Analytics</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Process trace clustering is a well-studied and powerful technique to support the discovery of high-quality process models. It splits an event log into more cohesive sublogs, such that the discovered process models are easier to read and to understand. However, existing clustering approaches typically optimize measures like fitness or precision instead of focusing on model understandability and utility, as assessed by a process analyst. In addition, they offer no opportunity to influence or adapt the clustering result according to the analyst's use case or preferences. In this paper, we propose an interactive tool for trace clustering based on the t-SNE algorithm. Traces are represented in a two-dimensional graph, where they can be selected interactively for process discovery. We also offer the user some guidance with a predefined selection of possible clusters. Using this system, a process analyst is able to find a representative set of process models for each event log without any programming knowledge and with only a basic understanding of the discovery techniques used.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The main goal of process discovery is to visualize a real-life business process, as recorded in an event log, in a human-readable way <ref type="bibr" target="#b6">[7]</ref>. In practice, however, discovery approaches often produce spaghetti models, i.e., highly complex models that are difficult to read and to understand <ref type="bibr" target="#b1">[2]</ref>. Spaghetti models originate from overly complex process logs. For example, if the process contains multiple variants for handling different types of business objects, all of those variants need to be included in the discovered model <ref type="bibr" target="#b6">[7]</ref>. In this case, it makes sense to split the event log into multiple logs and discover a separate model for each of them <ref type="bibr" target="#b0">[1]</ref>. The challenge lies in determining the best way to split the log, such that we end up with a minimal number of maximally useful models. For this purpose, process trace clustering is a well-known and effective technique, which has been extensively studied and applied in many contributions, e.g., <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b5">6]</ref>.</p><p>However, those existing clustering approaches typically optimize measures like fitness or precision, so that model understandability and utility are considered only after the process models of the clusters have been generated. In addition, they offer the analyst no opportunity to influence or adapt the clustering result according to the concrete use case or preferences, and they are often not integrated with process discovery.</p><p>Therefore, we designed a novel interactive process clustering (IPC) tool based on the t-SNE algorithm <ref type="bibr" target="#b7">[8]</ref>. This algorithm is well suited for embedding high-dimensional data, such as a trace similarity matrix, into a lower-dimensional space. 
The embedding of such a similarity matrix can then be visualized in a two-dimensional graph, where highly similar traces are placed close together.</p><p>We integrated this clustering technique into an interactive web-based tool, where the analyst can influence the clustering parameters, select clusters of traces, discover models for those clusters, and compare their similarity. Moreover, we included process discovery algorithms to compute a set of process models that appropriately represent the event log. Compared to existing clustering tools, IPC is both interactive and visual, giving the process analyst a useful guidance tool with a high degree of freedom. The visual representation of the two-dimensional embedding makes the cohesion within the event log easier to understand. In addition, the free selection lets the user choose groups based on their utility for the concrete use case.</p></div>
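<div xmlns="http://www.tei-c.org/ns/1.0"><p>The embedding step described above can be illustrated with a minimal sketch. It assumes a normalized edit distance between activity sequences as the trace dissimilarity (one minus this value can be read as a similarity) and uses the TSNE class of scikit-learn with a precomputed metric. The distance measure, the toy log, and all function names are hypothetical and are not taken from the IPC implementation.</p><p>
```python
import numpy as np
from sklearn.manifold import TSNE


def edit_distance(trace_a, trace_b):
    """Levenshtein distance between two activity sequences."""
    previous = list(range(len(trace_b) + 1))
    for i, act_a in enumerate(trace_a, start=1):
        current = [i]
        for j, act_b in enumerate(trace_b, start=1):
            substitution = previous[j - 1] + (act_a != act_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def distance_matrix(traces):
    """Symmetric matrix of edit distances, normalized by the longer trace."""
    n = len(traces)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(traces[i]), len(traces[j]), 1)
            matrix[i, j] = matrix[j, i] = edit_distance(traces[i], traces[j]) / longest
    return matrix


def embed_traces(traces, perplexity=2.0, seed=0):
    """Place every trace at an (x, y) position based on pairwise distances."""
    tsne = TSNE(
        n_components=2,
        metric="precomputed",   # the input is already a pairwise distance matrix
        init="random",          # PCA initialization is not available for precomputed input
        perplexity=perplexity,  # should stay below the number of traces
        random_state=seed,
    )
    return tsne.fit_transform(distance_matrix(traces))


# Hypothetical toy log: each trace is a sequence of activity names.
toy_log = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "approve", "pay", "archive"],
    ["register", "check", "reject"],
    ["register", "check", "reject", "notify"],
    ["register", "escalate", "approve", "pay"],
    ["register", "escalate", "reject", "notify"],
]
coordinates = embed_traces(toy_log)
print(coordinates.shape)  # one (x, y) point per trace
```
</p><p>Since t-SNE is stochastic, fixing the random seed keeps the layout reproducible between runs. The resulting coordinates could then be drawn as a scatterplot in which similar traces appear close together.</p></div>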
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Main Characteristics and Innovation</head><p>The objective of the IPC tool is to provide process analysts with an easily understandable and interactive visualization of trace similarities, so that they can find an appropriate set of process models to represent the log at hand. As outlined in Fig. <ref type="figure">1</ref>, the underlying approach consists of four major steps, which are either backend computations or frontend interactions between the tool and the user. In the first step, we compute the pairwise similarity between all traces in the log. The resulting similarity matrix is used as the basis for the t-SNE embedding in the second step. In the third step, this embedding is visualized in a two-dimensional graph, which the process analyst can use to gain insights into the log and either manually select clusters for which to discover a process model or have the tool suggest clusters automatically. Finally, the set of discovered models is evaluated by comparing their similarity, following the idea that the less similar two models are, the more sense it makes to keep them as separate models.</p><p>To be easily accessible without a complex installation process, the IPC prototype<ref type="foot" target="#foot_0">4</ref> is implemented using web technologies and is therefore usable with any modern web browser. The user interface was designed to support process analysts by providing them with the options needed for the cluster analysis without cluttering the UI with too many features or complex options. It is split into two main screens. Users first see the process log upload prompt. It contains a file picker to upload an event log in the XES<ref type="foot" target="#foot_1">5</ref> format. The second screen, shown in Fig. <ref type="figure">2</ref>, contains all elements used for the clustering process and can be divided into five main groups. 
On the left, the parametrization section contains all buttons for parametrizing and starting the t-SNE visualization, the clustering guidance, and the process discovery for a selected cluster. Next to it, there is a two-dimensional scatterplot, where the t-SNE results are displayed. It gives the user the option to select a subset of process instances as a cluster by dragging a bounding box around it. Once at least one process model has been discovered for a selected cluster, the similarity matrix, based on the percentage of common activities, is displayed on the right side. It shows the similarity between all generated process models in a color-coded matrix, where green elements highlight a low similarity and red elements indicate a high similarity between the process models. This color scheme reflects the goal of finding process models that are as distinct as possible. Below the plot and the similarity matrix, there are three boxes containing descriptive data about the selected process instances, namely the number of instances, the average case duration, and the average case length. Initially, these boxes display information about all contained traces. The lowest section of the user interface shows the process model table. This table contains the generated models along with their metadata. These include the name used when generating the process model, an image of the plot highlighting the selected cluster used to generate the process model, the similarity metric used to generate the embedding, and a timestamp indicating the time of the model generation. The table also contains two interaction buttons, "Show Model" and "Delete". The former displays the generated process model in an overlay; the latter removes the process model from the table.</p><p>IPC is an easy-to-use tool for discovering process models from event log clusters by interactively selecting the clusters on a two-dimensional projection. 
This projection changes according to the chosen similarity metric and therefore represents different aspects of the event log, such as the similarity of the traces' structural composition. There are few other contributions in the process mining field that emphasize the visualization of trace clustering results, using, e.g., t-SNE. Schirmer et al. use it as a tool for event log preprocessing <ref type="bibr" target="#b4">[5]</ref>. Their approach is similar to ours in terms of visualization and similarity measures, but it is not integrated with process discovery, uses pre-labeled data, and focuses more on finding outliers as a preprocessing step than on interactive process discovery. Unlike our approach, it also does not implement a caching strategy, which may lead to computing times of several hours to days, depending on the size of the process log.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc>Fig. 2. The IPC user interface in the clustering process</figDesc></figure>
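<div xmlns="http://www.tei-c.org/ns/1.0"><p>The model similarity matrix based on the percentage of common activities can be sketched as follows. The sketch assumes that the percentage is computed relative to all activities occurring in either model; the exact normalization used by the tool is not stated here, and the model names and activity sets are hypothetical.</p><p>
```python
def common_activity_share(activities_a, activities_b):
    """Share of activities that both models contain, relative to all their activities."""
    set_a, set_b = set(activities_a), set(activities_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty models are treated as identical
    return len(set_a.intersection(set_b)) / len(union)


def model_similarity_matrix(models):
    """Pairwise similarity for a dict that maps model names to activity collections."""
    names = list(models)
    return {
        (a, b): round(common_activity_share(models[a], models[b]), 2)
        for a in names
        for b in names
    }


# Hypothetical activity sets of three discovered process models.
models = {
    "cluster_1": {"register", "check", "approve", "pay"},
    "cluster_2": {"register", "check", "reject", "notify"},
    "cluster_3": {"register", "check", "approve", "pay", "archive"},
}
matrix = model_similarity_matrix(models)
for (a, b), value in matrix.items():
    if a != b:
        print(a, b, value)
```
</p><p>In this toy example, cluster_1 and cluster_3 share four of five activities (0.8), which would be shown in red, whereas cluster_1 and cluster_2 share only two of six (0.33) and are therefore better candidates for being kept as separate models.</p></div>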
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Tool Maturity</head><p>The tool was implemented as a demonstration unit to evaluate the usability and effectiveness of the proposed clustering approach in a user experience study. To ensure the quality of the software and allow the participants to focus on utility and usability, the implementation is built upon well-known frameworks like Flask<ref type="foot" target="#foot_2">6</ref>, scikit-learn<ref type="foot" target="#foot_3">7</ref>, and PM4Py in the backend and D3.js and jQuery in the frontend. A video providing a brief overview of the work with the evaluation dataset can be found online.</p><p>The goal of our evaluation was to assess the utility of the IPC approach and the usability of the IPC tool. For this purpose, 16 participants were given a short introduction to the tool and provided with a publicly available real-life event log. Then, they were asked to identify a number of clusters that they considered appropriate for the given log, i.e., that adequately represented the log without producing overly complex process models. Afterwards, they were asked to assess the tool using the User Experience Questionnaire <ref type="bibr" target="#b3">[4]</ref>. Users in general appreciated the tool, rating it with a score of around 1.8 in attractiveness, efficiency, and stimulation and only a slightly lower score (1.66) for novelty. However, evaluation scores were lower (around 1.2) for perspicuity and dependability.</p><p>Since some of the operations, like calculating the similarity matrix and running the t-SNE algorithm, are computationally expensive, we implemented a caching system to reduce the number of these operations. This enables users to run multiple analyses on the same data in a reasonable time. To this end, we store the calculation results for the similarity matrix and the t-SNE embedding on the server. These cached results are accessed by using a salted hash of the original event log. 
Since the current implementation does not feature user management, the cached results are accessible to all users with access to the particular dataset. This way, all users benefit from the cached results after the initial calculation.</p></div>
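<div xmlns="http://www.tei-c.org/ns/1.0"><p>A caching scheme of this kind might look like the following minimal sketch. It assumes SHA-256 as the hash function, a fixed server-side salt, and JSON files as the storage format; none of these choices are stated for the actual tool, and all names are hypothetical.</p><p>
```python
import hashlib
import json
import tempfile
from pathlib import Path

SALT = b"ipc-demo-salt"  # hypothetical value; a deployment would keep its salt private


def cache_key(log_bytes, salt=SALT):
    """Salted SHA-256 digest of the raw event log, used as the cache file name."""
    return hashlib.sha256(salt + log_bytes).hexdigest()


def cached(log_bytes, cache_dir, compute):
    """Return the stored result for this log, computing and storing it on a miss."""
    path = Path(cache_dir) / (cache_key(log_bytes) + ".json")
    if path.exists():
        return json.loads(path.read_text())
    result = compute(log_bytes)
    path.write_text(json.dumps(result))
    return result


calls = []


def expensive(log_bytes):
    """Stand-in for the similarity matrix or t-SNE computation."""
    calls.append(1)  # record how often the computation really runs
    return {"n_bytes": len(log_bytes)}


with tempfile.TemporaryDirectory() as tmp:
    log = b"placeholder for the bytes of an uploaded XES log"
    first = cached(log, tmp, expensive)   # cache miss: computes and stores
    second = cached(log, tmp, expensive)  # cache hit: reads the stored result
print(first == second, len(calls))
```
</p><p>Because the key depends only on the salt and the content of the log, every user who uploads the same log reaches the same cache entry, which matches the shared-cache behavior described above.</p></div>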
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions and Future Work</head><p>This paper presents our tool for Interactive Process Clustering using t-SNE and manual as well as automatic cluster selection. The tool was developed to validate the usability of t-SNE in business process analysis and its relevance to trace clustering. The currently implemented similarity metrics focus on the structural composition of the process instances. In future work, we will extend those metrics to enable the user to focus on other aspects of the process traces, such as resources or other metadata. We will also include more advanced process discovery algorithms in a later release.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Fig. 1. Main Idea for our Interactive Process Clustering (IPC) Approach</figDesc><table><row><cell>Compare</cell><cell>Cluster</cell><cell>Select &amp; Mine</cell><cell>Compare</cell></row><row><cell>Process Log</cell><cell>Trace Similarity Matrix</cell><cell>t-SNE Results</cell><cell>Process Model Collection</cell><cell>Process Model Evaluation</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">http://ipc.sschuhmann.de/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">http://www.xes-standard.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">https://flask.palletsprojects.com/en/1.1.x/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">https://scikit-learn.org/stable/index.html</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Context aware trace clustering: Towards improving process mining results</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P J C</forename><surname>Bose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Van Der Aalst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 SIAM International Conference on Data Mining</title>
				<meeting>the 2009 SIAM International Conference on Data Mining</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="401" to="412" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Process mining based on clustering: A quest for precision</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K A</forename><surname>De Medeiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guzzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Greco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Van Der Aalst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Weijters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Van Dongen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Saccà</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Business Process Management Workshops</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="17" to="29" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Clustering traces using sequence alignment</title>
		<author>
			<persName><forename type="first">J</forename><surname>Evermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Thaler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fettke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Workshop on Business Process Intelligence (BPI-15), located at International Conference on Business Process Management</title>
				<meeting><address><addrLine>Innsbruck, Austria; Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015-08-03">July 31 - August 3, 2015</date>
		</imprint>
	</monogr>
	<note>Proceedings of the 11th International Workshop on Business Process Intelligence</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Construction and evaluation of a user experience questionnaire</title>
		<author>
			<persName><forename type="first">B</forename><surname>Laugwitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Held</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schrepp</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">USAB 2008: HCI and Usability for Education and Work</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Holzinger</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="63" to="76" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Visual support to filtering cases for process discovery</title>
		<author>
			<persName><forename type="first">L</forename><surname>Schirmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Campagnolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodrigues</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Schardong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>França</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Barbosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Poggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lopes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th International Conference on Enterprise Information Systems</title>
				<meeting>the 20th International Conference on Enterprise Information Systems</meeting>
		<imprint>
			<publisher>Scitepress</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="38" to="49" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A comparative analysis of process instance cluster techniques</title>
		<author>
			<persName><forename type="first">T</forename><surname>Thaler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ternis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fettke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Loos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Wirtschaftsinformatik</title>
				<editor>
			<persName><forename type="first">O</forename><surname>Thomas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Teuteberg</surname></persName>
		</editor>
		<meeting>the 12th International Conference on Wirtschaftsinformatik<address><addrLine>Osnabrück</addrLine></address></meeting>
		<imprint>
			<publisher>Universität</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Process Mining: Data Science in Action</title>
		<author>
			<persName><forename type="first">W</forename><surname>Van Der Aalst</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
	<note>2nd edn.</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Visualizing data using t-SNE</title>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="2579" to="2605" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
