A Security System Event Log Analysis

Dmitry Ju. Chalyy1, Nikolai I. Ovchenkov2, Ekaterina G. Lazareva2, and Rafael R. Yaikov1

1 P.G. Demidov Yaroslavl State University, Yaroslavl, Russia, chaly@uniyar.ac.ru, yaikovrr@yandex.ru
2 Electronika, PSC, LLC, Yaroslavl, Russia, ovchenkov, lazareva@elektronika.ru

Abstract. In our work we consider event data from a security system which manages access control. Using business process mining and statistical analysis, we process these data in order to gain insights from them and to obtain useful process models. This includes techniques for identifying cases of processes in the log and using the ProM tool for creating process models. The work-in-progress results show what can be achieved with useful process models, which were built from scratch using several process instance identification techniques.

Keywords: business process mining, security, event log, analysis

1 Introduction

Security is an important property of today's information systems. This poses the challenge of understanding security properties as precisely as possible, because security violations may lead to serious incidents. Security systems check many events in order to respond to threats and ensure the security of the enterprise. On the other hand, such systems must not impose excessive restrictions, since every restriction gradually degrades the usefulness of the system as a whole. This justifies using intelligent techniques that can track the functioning of the system. In our work we apply business process mining techniques [1,2]. They allow us to create executable models from event logs, which are simple and natural sources of data. We consider event logs from a security system developed by a local company. These logs represent time series of events logged during the operation of a facility.
Since the raw data do not contain any process description, our task was to identify processes that could be mined from the data granted to us by the company. We used Jupyter Notebook and Python for data preprocessing, statistical analysis, and visualization. The well-known open-source ProM tool was used to create process mining models.

The paper presents work-in-progress results and is organized as follows. Section 2 describes the raw data. Section 3 describes the statistical analysis of the data. Section 4 contains the results of the process mining analysis.

2 Dataset Description

The raw data acquired from a facility security system are originally stored in a PostgreSQL database. It contains event logs, personal information about workers, and information on interactions with security devices. The event logs span over a year of real time; however, we limit ourselves to one month. This was done under the assumption that interactions with devices are routine operations which run periodically.

The original database does not conform to the Process Mining Manifesto of the IEEE CIS Task Force [3]; we can rate the logs as two-star logs. Their main shortcoming is the absence of information on business processes and cases. This makes process mining more challenging and forces us to make realistic hypotheses about how to identify cases.

The preprocessing uses the psycopg2 Python library to connect to the original database and select the data needed for our purposes.
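The preprocessing step can be sketched as follows. The table name, the connection parameters, and the assumption that all listed columns live in one table are hypothetical, since the real schema is not public; only the column names come from the dataset description.

```python
# Sketch of the preprocessing: export event records from the PostgreSQL
# database into a flat CSV file. Table name and query period are assumptions.
import csv

COLUMNS = ["id", "evusercode", "evregtime", "subjectobj_devequip",
           "subjectobj_devtype", "description", "subjectobj_value",
           "pass_type", "channel"]

def build_query(table="events", limit=None):
    """Build the SELECT statement used to pull one month of events."""
    q = (f"SELECT {', '.join(COLUMNS)} FROM {table} "
         "WHERE evregtime >= %s AND evregtime < %s")
    if limit:
        q += f" LIMIT {limit}"
    return q

def export_events(conn, path, period):
    """Stream query results into a CSV file via a server-side cursor."""
    with conn.cursor(name="events_cur") as cur, \
            open(path, "w", newline="") as f:
        cur.execute(build_query(), period)
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for row in cur:
            writer.writerow(row)

# Usage (requires psycopg2 and access to the actual database):
# import psycopg2
# conn = psycopg2.connect(dbname="security", user="analyst")
# export_events(conn, "events.csv", ("2017-01-01", "2017-02-01"))
```

A server-side (named) cursor keeps memory usage flat while streaming the month of records into the CSV file.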
The obtained clean data are saved to a CSV file (146 MB) that contains the following columns:

– id, unique identification number of the event;
– evusercode, a numeric code of the occurred event;
– evregtime, event registration time;
– subjectobj_devequip, event recorder;
– subjectobj_devtype, event generation source;
– description, detailed description of the event object;
– subjectobj_value, object identifier;
– pass_type, type of passage (out, in, empty);
– channel, event comment (e.g. vertolet, notif, kdp, avk).

Thus, there is no information on business processes or process cases in the log. However, it is still possible to analyze the log statistically.

3 Statistical Analysis

We use statistical methods to understand the data and to make sure that they do not represent a white-noise process. There are 1 191 019 records in our log files. The log contains 40 unique activity codes (field evusercode) and 2449 persons (field subjectobj_value) involved in events. Figure 1 shows the ranking of activities by frequency. We see that there are a few frequent activities and a long tail of rare ones. This leads us to the conclusion that cases of the business process should be simple enough, consisting of several recurring activities with occasional inclusions of rare activities.

Fig. 1. Ranking of all activities in the log by frequency

4 Process Mining

Process mining techniques and tools help us extract a formal model of a real process from its event log. We can use such a model to improve our understanding of the process, to analyze its properties, and to propose modifications that enhance and optimize it. In our work we use the ProM Lite 1.2 tool for discovering models from data. The quality of the model improves with the quality of the input event log. There are no descriptions of the processes captured by the log, so we assume that the given security system log contains data belonging to a single process.
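The frequency ranking behind Fig. 1 in Section 3 reduces to a simple counting pass over the log. A minimal sketch, with invented sample records and field names taken from the CSV description:

```python
# Rank activity codes (evusercode) by frequency and count distinct persons.
# The sample events below are invented for illustration.
from collections import Counter

def activity_ranking(events):
    """Return (activity code, count) pairs, most frequent first."""
    return Counter(e["evusercode"] for e in events).most_common()

events = [
    {"evusercode": 10, "subjectobj_value": 1},
    {"evusercode": 10, "subjectobj_value": 1},
    {"evusercode": 42, "subjectobj_value": 2},
    {"evusercode": 10, "subjectobj_value": 3},
]
ranking = activity_ranking(events)
print(ranking[0])                                   # → (10, 3)
print(len({e["subjectobj_value"] for e in events})) # → 3 distinct persons
```

On the real log the same pass yields the 40 activity codes and 2449 persons reported above.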
This single process is a complex one, describing the security system as a whole. In the context of business process mining the log consists of events, cases, and resources. We identify an event with a record in the log. A case is defined as a single process instance. There is no notion of a case in the log, so we must elaborate what a case is. Each event relates to some activity. As stated in [2], the definition of case and activity represents a minimum for process mining. Thus, we have one of the two necessary components for the analysis, and to conduct it, we model each case as a trace of consecutive activities.

In our work we tried three approaches for trace identification:

1. Each process instance is represented in the log by a single day of records. This means that the process under consideration is the daily functioning of the security system as a whole, under the proposition that the security system operates routinely on a day-by-day basis. Thus, it is possible to replay the current operation of the system using the mined model and check conformance.
2. The next approach is to cut the log into traces, starting a new trace whenever the person that triggered the event changes. This allows us to capture simple processes. However, this approach has an obvious deficiency: process instances can occur in parallel in the system, so we cannot capture process instances that overlap in the log.
3. We start a new trace when there is a substantial delay between events in the log. Thus, we treat the log as a sequence of process instances that follow one after the other, with the hypothesis that the security system switches from one mode to another. However, we cannot capture processes that have significant delays between their own events.

Having defined process instances in the log in this way, we can mark cases and use process mining to obtain a formal model of the process.
An example of a model built from the identified traces by the inductive miner is shown in Fig. 2.

Fig. 2. A process model constructed from the event log using ProM

Process mining should recover a general model that replays the traces given in the log and also allows other reasonable traces, i.e., it produces a generalized model. We can use, for example, heuristic mining, which produces causal nets, to control the generalization level of the model. However, here we concentrate on techniques that allow us to get various interpretations of the log.

What if the log contains events from several processes? In that case we must partition the traces of the log into equivalence classes, each of which represents a single process. The natural approach to such a partitioning is clustering; we treat each cluster as a different process. To use a clustering algorithm we must introduce a notion of distance. In our work we encoded each trace as a string, so we can use the Levenshtein distance, a metric measuring the number of edit operations that transform one string into the other [4]. For the clustering we take the most frequent trace in the log and calculate the Levenshtein distances from it to all other traces. This maps each trace to a point in a one-dimensional space. Fig. 3 shows the distances between the most frequent trace and the other traces. We can see that most traces are located near the most frequent trace, while some (about 150 traces starting from rank 500) lie at a significant distance.

Fig. 3. Log-log plot of trace ranks by the value of the Levenshtein distance from the most frequent trace of the log

The number of clusters is a parameter representing the number of processes we want to recover from the log. Fig. 3 gives a hint on the value of this parameter: it corresponds to the approximate number of big steps in the graph, plus a few clusters for unusual traces. The last cluster contains the traces that are most different from the most frequent trace of the log.
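The one-dimensional embedding used for clustering can be sketched as follows: each trace is encoded as a string of activity symbols and mapped to its Levenshtein distance from the most frequent trace. The pure-Python edit distance and the sample traces are illustrative.

```python
# Map each trace to its Levenshtein distance from the most frequent trace.
from collections import Counter

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    transforming string a into string b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def distances_from_mode(traces):
    """Return the most frequent trace and each trace's distance from it."""
    mode = Counter(traces).most_common(1)[0][0]
    return mode, [levenshtein(mode, t) for t in traces]

traces = ["abab", "abab", "abb", "xyz"]
mode, dists = distances_from_mode(traces)
print(mode, dists)  # → abab [0, 0, 1, 4]
```

Sorting these distances and plotting them against trace rank gives the stepped curve of Fig. 3; the big steps suggest cluster boundaries.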
We may interpret the traces of this last cluster as examples of unusual behavior that became known during operation of the security system. Traces belonging to one cluster constitute one process. We have used the ProM tool to construct models of such clusters with the inductive miner (Fig. 4). This model uses the trace identification method that starts a new trace when there is a delay between two consecutive events, and depicts cluster number 8.

Fig. 4. A fragment of the model of the traces that represent one cluster

5 Conclusion

The results of our analysis show possible ways to recover adequate models from two-star event logs. We have used a facility security system log that is not annotated. We were able to make automatic annotations and discover observable models. However, the work is still at an early stage, so the models must be evaluated by experts of the company that developed and implemented the security system. Another possible direction of research is to apply different process mining methods [5,6,7] to the logs of security systems to discover useful models.

References

1. van der Aalst, W.M.P., Weijters, A.J.M.M., Maruster, L.: Workflow Mining: Discovering Process Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering, 16, 1128–1142 (2004)
2. van der Aalst, W.M.P.: Process Mining: Data Science in Action, 2nd edn. Springer (2016)
3. van der Aalst, W. et al.: Process Mining Manifesto. In: Daniel, F., Barkaoui, K., Dustdar, S. (eds.) Business Process Management Workshops. BPM 2011. Lecture Notes in Business Information Processing, 99. Springer, Berlin, Heidelberg, 169–194 (2012)
4. Navarro, G.: A Guided Tour to Approximate String Matching. ACM Computing Surveys, 33 (1), 31–88 (2001)
5. Schönig, S., Rogge-Solti, A., Cabanillas, C., Jablonski, S., Mendling, J.: Efficient and Customisable Declarative Process Mining with SQL (2016)
6. Hompes, B.F.A., Buijs, J.C.A.M., van der Aalst, W.M.P., Dixit, P.M., Buurman, J.: Discovering Deviating Cases and Process Variants Using Trace Clustering (2015)
7.
Bose, R.P.J.C., van der Aalst, W.M.P.: Context Aware Trace Clustering: Towards Improving Process Mining Results (2009)