=Paper=
{{Paper
|id=Vol-1930/paper-7
|storemode=property
|title=Activity Duration Prediction of Workflows by using a Data Science Approach: Unveiling the Advantage of Semantics
|pdfUrl=https://ceur-ws.org/Vol-1930/paper-7.pdf
|volume=Vol-1930
|authors=Tobias Weller,Maria Maleshkova
|dblpUrl=https://dblp.org/rec/conf/semweb/WellerM17
}}
==Activity Duration Prediction of Workflows by using a Data Science Approach: Unveiling the Advantage of Semantics==
Tobias Weller and Maria Maleshkova
AIFB Institute (KIT), Kaiserstr. 89, 76131 Karlsruhe, Germany
{tobias.weller,maria.maleshkova}@kit.edu, http://www.aifb.kit.edu

Abstract. Organizations often have to face a dynamic market environment. Processes must be adapted frequently in order to stay competitive and to allow an efficient workflow. Data science approaches are currently often used in analysis methods to identify influential indicators of processes and to learn predictive models that estimate the duration of an activity. However, current methods make no or only partial use of semantic information in process analysis. The results are imprecise or incomplete, because not all influential indicators have been unveiled and therefore used in the predictive models. We want to make use of semantics and show their advantage by applying them to existing data science methods for predicting the duration of an activity in a process. Therefore, we 1) enrich process data with meta-information and background knowledge, 2) extend existing data science methods so that they include semantic information in their analysis, and 3) apply data science methods for predicting values and compare the results with methods which do not use semantics.

Keywords: Data Science, Workflow Analysis, Semantic Annotations, Activity Duration Prediction

1 Introduction

Processes and data are key elements for value creation in companies. A large number of heterogeneous processes contribute to the added value, and large amounts of data are created during the execution of these processes. This process-generated data is increasingly an essential part of the respective business models. Therefore, information systems such as ERP (Enterprise Resource Planning), WFM (Workflow Management) and SCM (Supply Chain Management) systems record events occurring in workflows in a structured way as they take place, in order to comprehend past workflows but also to use the information in later analyses. In practice, however, it is often not possible for the employees involved in the processes to have all the relevant processes within the organization in mind. This makes it hard to include all the resulting data in analyses for detecting weaknesses in a process and optimizing it, as well as to use it in predictive models. However, customers and logistics are interested in knowing the duration of an activity. To improve processes and to satisfy the requirements of customers and logistics, process analysis is a continuous and important task of organizational development. The purpose of process analysis is to evaluate organization-specific processes, to identify errors and improvement possibilities, and to identify deviations from predefined standards, guidelines and existing processes in a system [11].

Process Mining [17] is a technique in the rapidly growing data science discipline for analyzing complex processes. It combines traditional model-based process analysis and data mining techniques. It is used in various domains such as health-care [17, 18] and industry [1]. This well-known technique is often used because it allows making implicit, and thus hidden, knowledge about a process transparent. Among other things, such techniques are used to estimate the time of a task by using regression models [6] and descriptive statistical methods [2].
At the moment, however, no semantics are used, and no background and expert knowledge is taken into consideration. Previous research has shown that the inclusion of background knowledge improves analysis in clustering [29] and in similarity analysis of processes [8]. We want to strengthen this finding by including semantics in process data and by applying well-established process mining methods both to process data that is not enriched with semantic information and to process data that is enriched with semantics. We draw a comparison between the results of both approaches. Our goal in this work is to unveil the advantages of including semantics in process mining methods. We believe that by clearly demonstrating these advantages, the use of semantic information in data science and process mining approaches will grow, because the analysis results improve.

In order to achieve our goals, based on previous work we i) represent a perioperative process in sBPMN and extend our annotation tool to ii) capture meta-information as well as background knowledge about the process, iii) apply data science methods to the process for analyzing influencing factors on performance indicators of the process, and iv) predict the duration of an activity by using different approaches, among others one based on semantics and the analysis of influencing factors.

The used materials and methods are shown in section 2. We explain the approach, as well as the metrics used in the experiment section for comparing the results. The concrete implementation and evaluation of our approach is shown in section 3. We model a perioperative process and enrich it with meta-information and background knowledge. The evaluation of our approach includes showing the application and comparing the results of the different approaches by using well-known metrics, including Root Mean Squared Error and Accuracy. In particular, we go into detail in comparing other approaches with our approach, which exploits semantic information. The semantic information is modeled according to the Linked Data principles [3]. Related work is described in section 4. A short discussion, outlook and lessons learned are given in section 5.

2 Material and Methods

The main focus of this work is supporting process analysis by predicting the duration of an activity. We include background knowledge and exploit semantic descriptions of the process in order to make the predictions more precise compared to existing approaches. As an additional benefit of this work, the advantage of semantics is revealed. In the following we provide details on our approach. First we state our basic assumptions, as well as the modeling approach for the workflows. After the workflows are modeled, we explain the enrichment with additional information, including background information from experts. Finally, we explain the methods used to predict the duration of an activity and describe the evaluation metrics that are used to compare the results. Throughout, we are keen to make our approach as comparable as possible and therefore reuse well-known data science methods, as well as leading evaluation metrics for comparing the results. We depict our approach in figure 1.

Fig. 1. Performed approach in this work to predict the duration of activities.

We assume a given set of process instances, stored in log files. This assumption is depicted as Data Export in figure 1.
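For concreteness, the following is a minimal sketch of how such an exported log could be loaded and prepared with pandas. The file name and the column names (case_id, activity, start, end) are hypothetical assumptions for illustration only, not the actual export schema; the details of the export and refinement are described next, and the later snippets in this section reuse this illustrative format.

```python
import pandas as pd

# Hypothetical flat-file export: one row per executed activity of a process instance.
# File name and column names are illustrative assumptions, not the actual ERP export schema.
log = pd.read_csv("process_instances.csv", parse_dates=["start", "end"])

# Derive the activity duration (in minutes) from the recorded timestamps.
log["duration_min"] = (log["end"] - log["start"]).dt.total_seconds() / 60.0

# Simple refinement step: negative durations are physically impossible and are dropped,
# mirroring the data refinement described below.
log = log[log["duration_min"] >= 0]

print(log[["case_id", "activity", "duration_min"]].head())
```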
Usually the data of a workflow is recorded in an ERP system, which allows exporting it as a flat file. Due to incorrect and contradictory entries in the datasets, we also assume a data refinement step. Within this step, we delete false entries or replace them with the correct entry, if the entry can be computed. For analyzing and enriching the process instances and each of the corresponding activities with background knowledge, we first have to model the processes in an appropriate way, so that they are available in a structured format which can be queried.

A key fact in knowledge management is sharing and creating information collaboratively. Various users might bring in different kinds of background knowledge. Thus, we rely on a collaborative design and usage of our methods. Therefore, we apply a collaborative system that allows for a controlled environment and for tracing the provenance of information. Semantic MediaWiki (SMW, https://semantic-mediawiki.org, accessed 2017-07-27) [28] serves as the collaborative platform to capture and store annotations on documents. We have already tackled the capturing of processes in a collaborative environment such as Semantic MediaWiki as part of previous work [31, 32]. This work builds upon these previous results. We used a BPMN ontology from the Data & Knowledge Management (DKM) research unit [23] to structure processes. This ontology is a very detailed formalization of the BPMN 2.0 specification in OWL 2 DL. Further ontologies that can be applied in order to describe the data in more detail are FOAF (http://www.foaf-project.org/, accessed 2017-07-27), Dublin Core (http://dublincore.org, accessed 2017-07-27) and SKOS (https://www.w3.org/TR/2008/WD-skos-reference-20080829/skos.html, accessed 2017-07-27). These ontologies are used to structure and describe the process data in detail. Background knowledge from experts can now be applied to have their knowledge available in a structured and reusable format. The captured knowledge can be used for analysis, as well as for predicting the duration of an activity.

In addition to background knowledge from experts, meta-information about the workflow of a process is added to the activities. The captured meta-information is linked to the activities which are influenced by or influence the information. For instance, if the information about the death of a patient during a surgical process is available, it is attached to the activity that most likely caused this circumstance. The advantage of this approach is to have the information in a structured and standardized format that allows for querying and using it in analyses. Other meta-information includes the persons involved in the task, the runtime of an activity, and several parameters about the patient, such as whether the patient is a smoker and the number of stays in the hospital. The background knowledge and meta-information are used for correlation analysis to identify and quantify the influence of variables on the process [33]. Depending on the level of measurement of the variables, we used different correlation analyses (Pearson correlation [21], Spearman correlation [26], Kendall correlation [12] and the Chi-squared test [22]). We build on these results in our models for predicting the duration of an activity.

The main contribution of this work is the prediction of the duration of an activity. Therefore, we use various approaches, among others one which exploits the semantics, in order to show the advantage of semantic information and a well-considered data science approach. We are keen to present our methods in a comparable way and reuse well-known methods and evaluation metrics.
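To illustrate how a correlation test can be chosen according to the level of measurement, the following sketch uses SciPy. The measurement-level labels and the example attribute names are assumptions for illustration only; they are not part of the original analysis [33].

```python
import pandas as pd
from scipy import stats

def association(df: pd.DataFrame, a: str, b: str, level_a: str, level_b: str):
    """Choose a test by measurement level ("metric", "ordinal" or "nominal")
    and return (test statistic, p-value)."""
    x, y = df[a], df[b]
    if level_a == "metric" and level_b == "metric":
        return stats.pearsonr(x, y)            # Pearson correlation [21]
    if "nominal" not in (level_a, level_b):
        # At least one ordinal attribute: rank-based correlation,
        # Spearman [26] here; stats.kendalltau would be the Kendall variant [12].
        return stats.spearmanr(x, y)
    # At least one nominal attribute: chi-squared test on the contingency table [22].
    chi2, p, _, _ = stats.chi2_contingency(pd.crosstab(x, y))
    return chi2, p

# Hypothetical usage: does the relapse flag relate to intraoperative radiotherapy?
# stat, p = association(log, "relapse", "intraoperative_radiotherapy", "nominal", "nominal")
```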
To keep the results comparable, we apply methods on the process instances which do and which do not exploit the semantics. Previous methods [2] use the average value (and thus the expected value) of the duration of an activity as prediction. For comparison purposes we apply this approach in our scenario as well and call it the Average Value Approach in the following. However, this approach is very basic. In our opinion, a process is more complex and has multiple key factors that influence the duration of an activity. Therefore, we analyzed the process data with respect to certain dependencies and detected coherences between certain variables. The analysis exploits semantic background knowledge [33]. Based on the correlations, we can use this knowledge to select variables that influence the duration of an activity. Instead of using the average of all durations of an activity as estimator, we shrink the set of activities over which the average duration is computed to those with similar values of the variables which influence the duration. We call this approach the Semantic Average Value Approach in the following.

Besides these two approaches, we also use a regression model. Regression models are usually used to quantify the relationship between values [10]; however, such a model can also be used to predict the outcome based on input variables. Thus we learn a regression model based on the training data and try to predict the test data based on the input parameter x. However, as a drawback compared to the previous approaches, regression models usually require a lot of available data; otherwise, the detected relationships between variables are not statistically verified. We call this last approach the Regression Model Approach.

As with the methods, we use existing and well-known metrics to evaluate the predictions of the duration of activities. Well-known metrics to quantify the quality of predictions are the Mean Absolute Error (MAE), the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE). Especially the latter is commonly used in forecasting models to quantify the result of predictions. The RMSE measures the typical deviation between the actually observed values and the predicted values. MSE and RMSE are very similar; both punish outliers more strongly than the MAE, i.e. they are more sensitive to outliers. Another metric that we use in our evaluation is the accuracy, i.e. the number of correct predictions divided by the total number of predictions. Since the duration of activities is measured on a ratio scale, we count a prediction as correct if it deviates by at most +/- 10% from the observed value. An overview of the metrics and their formulas is given in table 1.

Table 1. Metrics for evaluating predictions (with $y_i$ and $x_i$ the observed and predicted values):
- Mean Absolute Error (MAE): $\frac{1}{n}\sum_{i=1}^{n} |y_i - x_i|$
- Mean Squared Error (MSE): $\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i)^2$
- Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i)^2}$

For evaluating our model, we use k-fold cross-validation [27]. This validation technique is commonly used in statistical analysis. It splits the available data into k folds, trains one estimator for each fold and evaluates it on the other k − 1 folds. This allows for testing the estimators on an independent data set. k-fold cross-validation is exemplarily depicted in figure 2 with k = 4. For each estimator the metrics given in table 1 are computed.
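A minimal sketch of these metrics could look as follows; the ±10% band is applied relative to the observed duration, which is our reading of the tolerance described above.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))

def mse(y_true, y_pred):
    """Mean Squared Error."""
    return np.mean((np.asarray(y_true, float) - np.asarray(y_pred, float)) ** 2)

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(mse(y_true, y_pred))

def accuracy_within(y_true, y_pred, tol=0.10):
    """Share of predictions deviating by at most +/- tol from the observed value."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return np.mean(np.abs(y_pred - y_true) <= tol * np.abs(y_true))
```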
In addition to these per-fold metrics, we compute the standard deviation of the RMSE. A value of this standard deviation close to zero indicates a stable estimator over each independent fold; thus, an even distribution of the durations over the folds can be assumed. The advantage of the cross-validation is the usage of each observation for both training and validation, whereby each observation is used for training exactly once.

Fig. 2. k-fold cross-validation example using k = 4.

By using well-known metrics and validation techniques, we are able to compare our results with other approaches and allow for an enhanced comprehensibility of the results. The approach, as well as the used metrics, abstracts from the use-case scenario and the domain. Thus, the approach can be applied to any domain. In the next section we implement our approach in a health-care use-case scenario and evaluate the different approaches. We compare the results by using the above metrics.

3 Experiments

We evaluate our approach by using real-life data from the University Hospital Heidelberg. The process we considered is a perioperative process which describes the workflow of preparing the operating room, bringing a patient into the operating room, the incision and suture of the surgery, and bringing the patient out of the operating room. The considered process is depicted in figure 3 using BPMN. Only records of rectum resection surgeries were available, so we did not look at other surgeries. Therefore, we did not include the type of surgery, because it is the same for all considered data sets.

Fig. 3. Considered perioperative process in the evaluation.

The data we received was stored in spreadsheet format. It contained information about the timestamps of every activity, basic information about the patient, e.g. age, height and weight, as well as information about former diseases and the progress of the surgery. In total, 65 attributes were available. Not all attributes were available for each data set; some were missing or contained false entries, such as a negative blood loss. Therefore, we first refined the data. The refinement of the data was done manually. We calculated the duration of each activity from the timestamps. However, some durations were negative; we ignored these in our analysis, because negative durations are not possible. In total, 1,690 data sets were available, so the refinement of the data was a substantial workload.

In the following, we consider three approaches. The first approach is the basic approach of using the average value of the historic data as prediction of the duration of an activity [2] (Average Value Approach). The second approach uses our semantic analysis in combination with the existing approach to show the advantage of semantics (Semantic Average Value Approach). We performed the correlation analysis on the semantically enriched data in a previous paper [33], so that we can build upon those results in order to use them for predicting the duration of an activity. The third approach uses regression analysis to predict the duration of an activity (Regression Model Approach). We use k = 10 for the cross-validation, because it is commonly used and the size of the folds is appropriate. We use MAE, MSE and RMSE as evaluation metrics for each fold and compute the overall accuracy and the standard deviation of the RMSE to evaluate the results.
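The following sketch illustrates this evaluation loop for the simple Average Value Approach described next: following the fold assignment described above, the mean duration of one fold serves as the prediction for the remaining folds, and the per-fold RMSE and its standard deviation are reported. KFold is taken from scikit-learn; the function layout and the column names reused from the earlier snippet are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_average_value(durations, k=10, seed=0):
    """10-fold evaluation of the Average Value Approach for one activity."""
    durations = np.asarray(durations, float)
    fold_rmse = []
    # KFold yields (rest, fold); as described above, the average of one fold
    # serves as the prediction for the remaining k-1 folds.
    for rest_idx, fold_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(durations):
        prediction = durations[fold_idx].mean()
        errors = durations[rest_idx] - prediction
        fold_rmse.append(np.sqrt(np.mean(errors ** 2)))
    return np.mean(fold_rmse), np.std(fold_rmse)   # average RMSE and its spread over folds

# Hypothetical usage for the surgery activity of the log sketched earlier:
# mean_rmse, sd_rmse = cv_average_value(log.loc[log["activity"] == "Surgery", "duration_min"])
```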
Average Value Approach: We split the data set into 10 uniform folds and computed on each fold the average value as the predictive value for the other folds. We computed the MAE, MSE and RMSE for each fold. Afterwards, we computed the standard deviation of the RMSE to show how the values vary; a small deviation shows that the root mean square error is uniform over the folds. We computed the duration for each activity, so in total we performed our approach on 11 activities. The standard deviation of the RMSE for the duration of surgery is 15.2793. This shows that the predicted values of the folds are close to each other and do not vary much. An average accuracy of 19.2683% is accomplished by using this method. Similarly underperforming results are achieved when predicting other durations, e.g. the preparation of the operating room (accuracy: 13.4146%) and the operating room time (accuracy: 22.0122%), which is the time between incision and suture.

Semantic Average Value Approach: There were 65 attributes available. However, due to incomplete and inconsistent data, we could not use every data point in our analysis [33]. Based on the performed correlation tests, we analyzed the data and found out which attributes correlate with each other, among others with the duration of activities. Whether a patient has a relapse, the amount of red cell concentrate, the received fresh frozen plasma, and whether a patient received intraoperative radiotherapy influence the duration of a surgery. We focused on meta-information which was known for an activity in advance. We calculated the average value over similar initial situations of the considered process, according to the correlated attributes, and used this as the predictive value. The standard deviation of the RMSE for the duration of surgery is 16.3042 and thus a bit higher than in the previous approach. However, this is caused by the adapted predictions based on the correlated attributes. The average accuracy of predicting the duration of a surgery is 24.6104% and thus better than in the previous approach, where no semantics and no correlations were considered. Predicting the other durations also worked better than with the previous approach: the preparation of the operating room achieved an overall accuracy of 19.5092% and the operating room time an accuracy of 25.6173%. This clearly shows that our approach performs better than the basic procedure in all three presented predictions of the duration of an activity. We also achieved improved results in the other nine cases of duration prediction. Nevertheless, we want to point out that accuracies of 20%–25% are still not satisfying and need to be improved.
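A rough sketch of this Semantic Average Value Approach, assuming the correlated attributes are available as columns of the training data; the attribute names used here are illustrative, and the fallback to the overall mean for unseen attribute combinations is our own assumption rather than a detail from the paper.

```python
import pandas as pd

def semantic_average_predictor(train: pd.DataFrame, correlated, target="duration_min"):
    """Predict a duration as the mean over training instances with the same values
    for the correlated attributes; fall back to the overall mean otherwise."""
    overall = train[target].mean()                              # plain Average Value Approach
    group_means = train.groupby(correlated)[target].mean().to_dict()

    def predict(row):
        key = tuple(row[c] for c in correlated)
        if len(correlated) == 1:
            key = key[0]                                        # single-attribute groupby uses scalar keys
        return float(group_means.get(key, overall))

    return predict

# Hypothetical usage with attributes found to correlate with the surgery duration:
# predict = semantic_average_predictor(train_df, ["relapse", "intraoperative_radiotherapy"])
# test_df["prediction"] = test_df.apply(predict, axis=1)
```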
Regression Model Approach: As the last part of the evaluation, we compared our method with a regression analysis. Regression analysis has been used in the past to predict the duration of an activity (see [6]); therefore, for comparison reasons, we applied a regression analysis too. We used a polynomial regression analysis [14]. The best results were achieved with a degree of 6 and the age of the patient as input parameter for the duration of a surgery. The standard deviation of the RMSE over all folds is 7.0497, which clearly points out the stability of the estimators. However, the results are not as good as those of the previous approach, in which we exploited the semantics. The accuracy of the Regression Model Approach for predicting the operating time is 18.7662%. Similarly underperforming results are achieved when predicting other durations, e.g. the preparation of the operating room (accuracy: 14.1830%). As in the previous results, the operating room time had the highest accuracy of the three considered durations, 23.0263%. The coefficient of determination of the regression analysis is $R^2 = 0.0168$, which shows that this model cannot explain the data sufficiently to predict values. Table 2 summarizes the overall accuracy of the evaluated approaches and shows that exploiting meta-information and semantics achieves the best results. Exploiting this information, if available, and using it in data science methods is essential for achieving better results.

Table 2. Overview of the overall accuracy of the different approaches.
| Approach | Operating Time | Preparation of the Operating Room | Operating Room Time |
| Average Value Approach | 19.2683% | 13.4146% | 22.0122% |
| Semantic Average Value Approach | 24.6104% | 19.5092% | 25.6173% |
| Regression Model Approach | 18.7662% | 14.1830% | 23.0263% |
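For reference, a minimal sketch of the Regression Model Approach evaluated above: a degree-6 polynomial of the patient age fitted with NumPy, plus the coefficient of determination. This is an illustrative reconstruction under assumptions (hypothetical column names, NumPy's scaled-domain polynomial fit), not the exact model configuration used in the evaluation.

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_age_polynomial(age, duration, degree=6):
    """Fit duration ~ polynomial(age); returns a callable model."""
    return Polynomial.fit(np.asarray(age, float), np.asarray(duration, float), degree)

def r_squared(model, age, duration):
    """Coefficient of determination R^2 of the fitted model on the given data."""
    duration = np.asarray(duration, float)
    residuals = duration - model(np.asarray(age, float))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((duration - duration.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical usage with the training/test split of one fold:
# model = fit_age_polynomial(train_df["age"], train_df["duration_min"])
# print(r_squared(model, test_df["age"], test_df["duration_min"]))
```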
4 Related Work

Our approach is related to roughly three kinds of work: 1) matching BPMN processes to sBPMN, 2) annotating business processes with meta-information, and 3) using data science methods for predicting the activity duration.

BPMN [20] is a de-facto standard for representing processes in a very expressive graphical notation. BPMN partly defines semantics, which means that the symbols have meanings. However, the semantics carry little weight and not much attention is paid to the formal definitions of the symbols. One advantage of BPMN is its executability [5]. Its current version, since 2011, is BPMN 2.0 [20]. Existing work has already addressed the transformation of BPMN into other languages such as BPEL [7] and Petri nets [16]. sBPMN (Semantic BPMN) was developed to tackle the disadvantage of the lack of formal definitions and to provide unambiguous and consistent semantics [13]. sBPMN extends BPMN elements with additional information and background knowledge to enhance analysis [?].

The second aspect tackled in this paper is the annotation of business processes with meta-information. This issue has been addressed, among others, by us in previous works by setting out a meta-model for processes [34], as well as approaches for annotating decision trees and processes with meta-information [31]. Existing work has also pointed out the advantage of using meta-information in process mining [24]. A published survey summarizes existing approaches for business process annotations [15]; one aspect considered in the survey was the possibility to capture semantic annotations. The meta-information in processes is used for reasoning purposes, which in turn supports the analysis and optimization of processes [19].

The last issue addressed in our work is using data science methods for predicting activity durations. Previous approaches have mostly focused on control-flow discovery [30, 9]. However, when event logs contain time information, the discovered models can be extended with timing information. Most related to our work is the time prediction by van der Aalst et al. [2]. They took the expected value in order to estimate the time of a future activity [2] and only took the structure of the workflow into account, but no other meta-information. However, there are factors other than the structure of the workflow which influence the duration of an activity. In health-care, these factors might be the age of a patient, the patient's condition and the drugs that a patient has used. Therefore, this work is not sufficient for a more precise prediction. Another related work is the estimation of workflow execution times [4]. In this work, a more advanced method to estimate the execution time of workflows was used: the execution time of workflows was computed based on stochastic estimates of the tasks' execution times. For calculating a task's execution time, they split it up into multiple variables, e.g. resource preparation time, queuing time and data transfer time. We do not go into this level of detail, because we do not even have this information in our time logs. For calculating the execution times, they performed, like we did, statistical hypothesis tests for checking dependent variables. Based on the results of these tests, they proposed a combination of Chebyshev-like distribution-free inequalities and distribution-based approaches to compute a task's runtime.

5 Conclusions

We presented an approach for exploiting semantic information in process data. Therefore, we first captured the process data and refined it in the case of inconsistent and incomplete data. Afterwards we transformed the data into sBPMN by using a BPMN ontology from the Data & Knowledge Management (DKM) research unit [23]. This enabled us to enrich it with further information in a structured way and to use the semantic information about the process. We used the results from the enrichment with meta-information in our prediction models and compared them to existing standards. We used an existing approach [2], based on the average value of historic data as estimator. We compared this approach with ours, which exploits the semantics by computing the average value of similar data sets from historic data, according to the performed correlation tests, as estimator. Thus, we did not use all attributes for determining similar data sets, but selected ones, which were determined by using correlation tests. As a last approach, we used a polynomial regression analysis to predict the duration of an activity.

For evaluating the three approaches, we used k-fold cross-validation with k = 10, so that every data set was used as test and training data. Besides this fact, we used it because it provides more accurate results for the models [25]. We computed the MAE, MSE and RMSE for every fold and reported the number of hits when allowing a deviation of +/- 10%. These well-known metrics are commonly used in data science to report the results of a prediction model and allow for comparing the results. We showed that our approach performs best compared to the other ones and achieved a better overall accuracy. In this work we focused on three out of 11 activity durations for the sake of clarity. However, similarly to the presented results, the Semantic Average Value Approach performed best for predicting the other activity durations. The standard deviation of the RMSE over all folds was higher for the Semantic Average Value Approach than for the other ones. However, this is due to the individual predictions based on the correlated attributes. Even though our approach outperforms the other two approaches, there is still a lot of work to be done in this area.
Computing the duration of an activity is still a challenging task, and the results also show that, although we improved the accuracy, it is still not satisfying. Therefore, we have to improve our model. One way we would like to improve it is to include more attributes and data sets. Due to the increasing amount of data that is stored nowadays, we see this as a chance to collect more data and use it in our analysis and prediction models. Another future work that we would like to tackle is to transfer our approach to other domains and show its applicability. We built the approach in such a way that it abstracts from the underlying data and domain. Therefore, we expect that the approach is also usable in other domains and not limited to a certain scenario.

In conclusion, we have taken a first step towards exploiting semantic information of processes, stored according to the Linked Data principles, for predicting the duration of activities. We showed that our approach outperforms existing ones. The semantics in the data can be used to improve existing methods.

References

1. van der Aalst, W.M.P., Reijers, H.A., Weijters, A.J.M.M., van Dongen, B.F., Alves de Medeiros, A.K., Song, M., Verbeek, H.M.W.: Business process mining: An industrial application. Inf. Syst. 32(5), 713–732 (Jul 2007)
2. van der Aalst, W., Schonenberg, M., Song, M.: Time prediction based on process mining. Information Systems 36(2), 450–475 (2011)
3. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts, pp. 205–227 (2009)
4. Chirkin, A.M., Kovalchuk, S.V.: Towards better workflow execution time estimation. IERI Procedia 10, 216–223 (2014)
5. Dijkman, R., Van Gorp, P.: BPMN 2.0 Execution Semantics Formalized as Graph Rewrite Rules, pp. 16–30. Springer Berlin (2010)
6. van Dongen, B.F., Crooy, R.A., van der Aalst, W.M.P.: Cycle time prediction: When will this case finally be finished? In: Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008. pp. 319–336. Springer-Verlag (2008)
7. Doux, G., Jouault, F., Bézivin, J.: Transforming BPMN process models to BPEL process definitions with ATL. In: GraBaTs 2009: 5th International Workshop on Graph-Based Tools (2009)
8. Ehrig, M., Koschmider, A., Oberweis, A.: Measuring similarity between semantic business process models. In: Proceedings of the Fourth Asia-Pacific Conference on Conceptual Modelling - Volume 67. pp. 71–80. Australian Computer Society, Inc. (2007)
9. Günther, C.W., van der Aalst, W.M.P.: Fuzzy mining: Adaptive process simplification based on multi-perspective metrics. In: Proceedings of the 5th International Conference on Business Process Management. pp. 328–343. Springer-Verlag (2007)
10. Hosmer, D.W., Lemeshow, S.: Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods 9(10), 1043–1069 (1980)
11. Huang, S.M., Yen, D.C., Hung, Y.C., Zhou, Y.J., Hua, J.S.: A business process gap detecting mechanism between information system process flow and internal control flow. Decision Support Systems 47(4), 436–454 (2009)
12. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93 (1938)
13. Kossak, F., Illibauer, C., Geist, V., Kubovy, J., Natschläger, C., Ziebermayr, T., Kopetzky, T., Freudenthaler, B., Schewe, K.D.: A rigorous semantics for BPMN 2.0 process diagrams. In: A Rigorous Semantics for BPMN 2.0 Process Diagrams, pp. 29–152. Springer (2014)
14. Lai, T., Robbins, H., Wei, C.: Strong consistency of least squares estimates in multiple regression II. Journal of Multivariate Analysis 9(3), 343–361 (1979)
15. Lautenbacher, F., Bauer, B.: A survey on workflow annotation & composition approaches. In: Proceedings of the Workshop on Semantic Business Process and Product Lifecycle Management (SemBPM) in the context of the European Semantic Web Conference (ESWC). pp. 12–23 (2007)
16. Lohmann, N., Verbeek, E., Dijkman, R.: Petri Net Transformations for Business Processes – A Survey, pp. 46–63. Springer Berlin Heidelberg (2009)
17. Mans, R.S., van der Aalst, W.M.P., Vanwersch, R.J.B.: Process Mining, pp. 17–26. Springer International Publishing (2015)
18. Mans, R.S., Schonenberg, M., Song, M., van der Aalst, W., Bakker, P.J.: Application of process mining in healthcare - a case study in a Dutch hospital. Biomedical Engineering Systems and Technologies, pp. 425–438 (2009)
19. Niedermann, F., Radeschütz, S., Mitschang, B.: Business Process Optimization Using Formalized Optimization Patterns, pp. 123–135 (2011)
20. OMG: Business Process Model and Notation (BPMN), Version 2.0 (January 2011)
21. Pearson, K.: Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58(347-352), 240–242 (1895)
22. Pearson, K.: On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine 50(302), 157–175 (1900)
23. Rospocher, M., Ghidini, C., Serafini, L.: An ontology for the Business Process Modelling Notation. In: Formal Ontology in Information Systems – Proceedings of the Eighth International Conference. pp. 133–146. IOS Press BV (Sep 2014)
24. Saylam, R., Sahingoz, O.K.: Process mining in business process management: Concepts and challenges. In: 2013 International Conference on Electronics, Computer and Computation (ICECCO). pp. 131–134 (Nov 2013)
25. Seni, G., Elder, J.: Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan and Claypool Publishers (2010)
26. Spearman, C.: The proof and measurement of association between two things. The American Journal of Psychology 15(1), 72–101 (1904)
27. Stone, M.: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), pp. 111–147 (1974)
28. Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., Studer, R.: Semantic Wikipedia. In: Proceedings of the 15th International Conference on World Wide Web. pp. 585–594. ACM (2006)
29. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with background knowledge. In: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 577–584. Morgan Kaufmann Publishers Inc. (2001)
30. Weijters, A.J.M.M., van der Aalst, W.M.P.: Rediscovering workflow models from event-based data using Little Thumb. Integr. Comput.-Aided Eng. 10(2), 151–162 (Apr 2003)
31. Weller, T., Maleshkova, M.: Capturing and annotating processes using a collaborative platform. In: Proceedings of the 25th International Conference Companion on World Wide Web. pp. 283–284. International World Wide Web Conferences Steering Committee (2016)
32. Weller, T., Maleshkova, M.: Towards a collaborative process platform: Publishing processes according to the Linked Data principles. In: Proceedings of the Workshop on Linked Data on the Web, LDOW 2016 (2016)
33. Weller, T., Maleshkova, M., Wagner, M., Ternes, L.M., Kenngott, H.: Analysis of semantically enriched process data for identifying process-biomarkers. In: Proceedings INTELLI. p. 6. IARIA XPS Press (November 2016)
34. Weller, T., Maleshkova, M.: Towards a process meta-model. In: Proceedings of the Second Karlsruhe Service Summit Workshop – Advances in Service Research, Karlsruhe, Germany (2016)