A Framework for Trace Clustering and Concept-drift Detection in Event Streams Sylvio Barbon Junior1 , Gabriel Marques Tavares1 , Paolo Ceravolo2 , and Ernesto Damiani3 1 Londrina State University barbon@uel.br | gtavares@uel.br 2 Università degli Studi di Milano paolo.ceravolo@unimi.it 3 Khalifa University ernesto.damiani@kustar.ac.ae 1 Introduction Concept-drift is a well-known problem that affects data streams where the underlying relations between a recorded tuple x and a system response y change over time [1]. Ignoring concept-drift can lead to a deterioration in the quality of predictive analytics and its capacity to represent the most recent concepts. Nevertheless, implementing a concept-drift adaptation strategy is not a trivial task due to different types of concept-drift and different adaptations in response to them. In this work, we discuss the Concept-Drift in Event Stream Framework (CDESF) that addresses some of these challenges for Trace Clustering (TC) [2] in data streams. Instead of creating an additional level of complexity that isolates the final user from the deep behaviour of the system, our goal is to offer a simple instrument to supervise concept-drift and tracking the evolution of clusters over time. 2 Trace Clustering with Event Streams The learning task addressed is TC in event streams to run analytics on Business Processes. Traces represent sequences of activities or tasks executed to achieve a goal. Unlike other stream-related problems, in TC the single tuple imputing the learning procedure is obtained by grouping multiple events through time, with past events affecting future ones. Model Update. Since ingestion of event data is asynchronous, we cannot rely on a training data set. For this reason, we introduced the idea of a Grace Period (GP) where there is no reference model and new events are used to feed the model construction. When the GP is declared over, the model creation is triggered. Subsequently, the model is updated based on the Nyquist sampling theorem, where the sampling frequency during data acquisition is forced to be at least twice the highest frequency observed before an update. A Time Horizon (TH), inputted by the user, specify how frequently the system has to check for a model update. However, if the conditions required by the Nyquist sampling theorem are not satisfied this TH is shifted over. Evolution of Clusters. One of our motivating goals is to construct an human-friendly visual and inference system. For this reason, to control and observe the system behaviour over time, we selected two simple metrics, based on the histogram of events in a trace and on the histogram of events’ timestamps. 153 When a new event is acquired it is attached to the respective trace that can be read as a string and compared to the histogram strings describing clusters. The histograms of events are compared using a Edit Weighted Distance (EWD). The histogram of timestamps is built from the same set of tuples but with some additional steps. First, for each tuple, a list of the differences between the events timestamps is created. Then, the list serves as input for quartiles calculation. Ultimately, the list values are placed into the quartile bins. Given a new event, its case timestamps are retrieved and binned. The bin is normalised and subtracted from the (also normalised) histogram. The result is the time-weighted distance (TWD) related to the interval between the events in a trace. We can now use EWD and TWD, together with a global time, as parameters describing the evolution of traces and clusters in the feature space. Concept-drift is detected in the feature space (EWD, TWD and time) by looking at distribution sparsity. When a new sample falls outside the boundary of any existing cluster, it is marked as an anomaly and its density is monitored. An increase in the number of samples within the radius of an anomalous example indicates a concept-drift. (b) Evolution of Clusters (a) Model Update 3 Expected Results We have observed the ability of our framework to effectively detect the anomalies and drift detection over the stream with different TH. Even more, an unfinished process case could be observed closely from the beginning leading to early identification of anomalous pattern. In this way, it is possible to mitigate a costly error, resist an attack before it reaches its goal, halt or migrate the fraudulent execution to a honeypot. References 1. Gama, J., Rodrigues, P.P., Spinosa, E., Carvalho, A.: Knowledge Discovery from Data Streams. Web Intelligence and Security - Advances in Data and Text Mining Techniques for Detecting and Preventing Terrorist Activities on the Web pp. 125–138 (2010), http://www.booksonline. iospress.nl/Content/View.aspx?piid=18418 2. Song, M., Günther, C.W., Van der Aalst, W.M.: Trace clustering in process mining. In: International Conference on Business Process Management. pp. 109–120. Springer (2008) 154