=Paper=
{{Paper
|id=Vol-3380/short2
|storemode=property
|title=Fuzzy-based Process Mining to Discover the Coding Behavior: Challenges and Future Works
|pdfUrl=https://ceur-ws.org/Vol-3380/short2.pdf
|volume=Vol-3380
|authors=Pasquale Ardimento,Lerina Aversano,Mario Luca Bernardi,Marta Cimitile
|dblpUrl=https://dblp.org/rec/conf/wcci/ArdimentoABC22
}}
==Fuzzy-based Process Mining to Discover the Coding Behavior: Challenges and Future Works==
Fuzzy-based process mining to discover the coding behavior: challenges and future works⋆ Pasquale Ardimento1,∗,† , Lerina Aversano2,† , Mario Luca Bernardi2,† and Marta Cimitile3,† 1 University of Bari Aldo Moro, 4 Orabona Street, Bari, Italy 2 Unisannio University, Department of Engineering, Palazzo ex Poste, Benevento, Italy 3 UnitelmaSapienza University, Rome Abstract Discovering the coding behavior of programmers is an emerging application domain in the process mining field. Comprehension of how programmers head the coding of software have a strong potential to better support the coding workflow. In our previous work, we introduced and evaluated an environment, called CodingMiner, to generate event logs from IDE usage enabling the adoption of fuzzy-based process mining techniques to model the programmers’ coding process. The mined processes have shown different IDE usage patterns for programmers with different skills and performances. Our approach, currently, is not able to represent the behavior concerning the usage of programming core constructs such as, in the case of Object-Oriented paradigm, abstraction, object state and behavior. In this paper, we are interested to discuss the main research challenges and sketch possible actions to adopt for improving the realization of the proposed environment to also represent the behavior in using the core constructs. Keywords Process Mining, Fuzzy Miner, Coding Behavior, Programmer Activities, 1. Introduction The comprehension of software coding processes is not a simple task. By nature, these pro- cesses are profoundly iterative and characterized by a very loose ordering of their activities [1]. Programmers, starting from a model of the problem, write the source code by applying best practices of programming, and core structures and principles of the programming language paradigm [2]. This involves the hierarchical breakdown of the source code into smaller com- ponents, and, in turn, the choice or application of core structures and principles to implement such components [3]. This is a complex process mainly based on human creativity where OLUD 2022: First Workshop on Online Learning from Uncertain Data Streams, July 18, 2022, Padua, Italy. ∗ Corresponding author: Pasquale Ardimento † These authors contributed equally. Envelope-Open pasquale.ardimento@uniba.it (P. Ardimento); aversano@unisannio.it (L. Aversano); bernardi@unisannio.it (M. L. Bernardi); marta.cimitile@unitelmasapienza.it (M. Cimitile) GLOBE https://www.uniba.it/it/docenti/ardimento-pasquale/ (P. Ardimento); https://www.unisannio.it/user/596/ (L. Aversano); https://www.unisannio.it/user/12387 (M. L. Bernardi); https://www.unitelmasapienza.it/it/contenuti/personale/marta-cimitile (M. Cimitile) Orcid 0000-0001-6134-2993 (P. Ardimento); 0000-0003-2436-6835 (L. Aversano); 0000-0002-3223-7032 (M. L. Bernardi); 0000-0003-2403-8313 (M. Cimitile) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) the attitude, knowledge, and experience of programmers have a great impact on the resulting behavior during coding. Moreover, differences on coding behavior are not strictly related to the level of attitude, knowledge, and experience of programmers. Even among programmers of very similar experience levels, differences of as much as 100 to 1 were found across programmers in the time taken to write a given program. Additionally, across problems constructed to be of similar difficulty, an individual programmer often displayed a six-fold difference in writing time[4]. The awareness of the above-discussed critical issues suggests the study of process mining techniques to understand the behavior of the software coding, as many studies on the subject testify [5, 6, 7, 8, 9, 10]. Furthermore, when the programmers are students, understanding the coding behavior is valuable to predict their outcomes [11]. In a previous work [12], we mainly focused on using a fuzzy-model process mining approach to understand the behavior of the programmers in writing the software source code. In this regard, in [12] we defined an environment, called CodingMiner, to generate event logs from IDE usage enabling the adoption of fuzzy-based process mining techniques to study the programmers’ coding process. We executed an empirical evaluation using logs from the coding sessions of students attending the 2nd-year of a BSc degree in computer science. By using the CodingMiner, we highlighted emergent and interesting programmers’ behaviors during coding. The mined processes show different IDE usage patterns for programmers with different skills and performances. This environment, however, is not able to represent the different usage of language constructs, such as the ones above mentioned. In this paper, we are interested to discuss the main research challenges and sketch possible actions to adopt to improve the realization of the proposed environment and, in general, of a fuzz-based process mining approach to discover the coding behavior. Specifically, this paper briefly describes how the CodingMiner environment is defined, then discusses research challenges to improve the discovery of the coding behavior and, finally, points out possible approaches to address these challenges. The rest of the paper is structured as follows. In Section 2 the CodingMiner environment is introduced whereas in Section 3 challenges and guidelines to improve the realization of this framework. Section 4 provides paper conclusions. 2. The CodingMiner Environment Figure 1 gives an overview of the workflow mining approach based on IDE instrumentation. The core component is the Log Processor, a plugin for the Eclipse IDE that extends logging and monitoring capabilities of the development environment. As shown in the figure 1, while programmers interact with the development environment to implement a software system, they generate a stream of human-computer interaction (HCI) events. The Log Processor is triggered to capture these interactions. CodingMiner collects all these interactions and stores them as event logs in a central repository. The collected logs, when stored, need to be refined to apply process mining. Specifically, the following steps are executed: 1. interactions that are unrelated with projects in the programmer workspace are removed; Figure 1: The CodingMiner environment architecture. 2. inconsistencies in the data are detected and corrected; 3. the low-level logs are converted to the XES1 event log model; 4. low-level and high-level event streams are integrated. The refined and combined events stream can be analyzed by the Fuzzy Miner [13] to mine the coding process executed by programmers. These mined processes are shown, within the CodingMiner environment, to provide feedback to programmers about their behaviors during coding. In the following sub-sections, the Log Processor (Section 2.1) and the Fuzzy Miner (Section 2.2) components are briefly discussed. More details about CodingMiner environment are in [12]. 2.1. The Log Processor The Log Processor is an extension of the Fluorite plugin, an open-source instrumentation plugin, capable of recording low-level programmers’ interactions with the IDE without interrupting the coding activities. Log Process tracks both low-level events (e.g. mainly keyboard key presses, shortcuts, mouse movements and gestures) adding contextual information and high-level events (e.g., all the commands issued at IDE level, like create or open a file, close a project, open a view, reset perspective, etc.). For each single coding session performed by a single programmer a synthetic id to group events is recorded. Moreover, all the collected events have related to involved resources (including file resources). In this way, coding events are linked to the programmer’s actions on resources in the IDE. The Log Processor is also able to capture IDE commands (i.e., an action or command issued by the programmer to the IDE). It is also able to model a specific kind of interaction among the programmers and contains contextual information (e.g., the involved resources and their possible state changes). The events captured can be categorized as follows: • IDE Commands: all global commands issued to IDE (e.g. open a view, switch a perspec- tive or accessing a resource like a file); 1 http://www.xes-standard.org • Editor Commands: all activities happening in the editor (e.g., cut and paste commands, text selection and modification and all actions regarding code writing); • Debugging Events: all debugging activities (e.g., breakpoints definition, whatches on variables and their inspection at runtime, debug profiles definitions, etc.); • Refactoring Events: all the commands related to refactoring activities (e.g., selecting a refactoring among the one available, istructing the IDE on how to perform it and launch its execution). 2.2. Fuzzy Miner Component The collected events logs are finally mined by the Fuzzy Miner included in the ProM toolkit [13]. The Fuzzy Miner takes as input the logs of the programmer activities captured by the Log Processor and creates an appropriate representation of the development processes expressed in the mined log. Fuzzy Miner is particularly suitable for mining less-structured processes exhibiting unstructured and flexible behavior, like development process tend to be. The Fuzzy Miner is based on the main idea that some kinds of processes are better represented using adaptive techniques providing explicit flexibility [13]. It represents the mined process using a fuzzy model that is deliberately imprecise to omit behavior that has low significance or is not correlated with interesting patterns. 3. Challenges and Guidelines Each step of the CodingMiner environment presented in Fig. 1 gives rise to research challenges. In the following, we give an overview of some of these challenges and propose approaches to tackle them. • Recording. The main challenge in this step is to identify what actions must be recorded. The same action (e.g., InsertString command) can either be important or irrelevant in a given context. For example, typing text for adding a new method is an important event while typing text for an inline comment is an irrelevant event. For this reason, when a programmer makes a change it is necessary to know both the element type (e.g. class, interface, field, subclass, etc) involved and the change type made. Examples of change type for a class are: change of accessibility, add/remove/change inheritance, ad- d/remove/change attribute, add/remove/change comment, add/remove/change attribute, etc. Furthermore, capturing information about element and change type could help to construct a process model able to represent if and when the object-oriented language core topics have been used. Object-Oriented core topics are design, abstraction, hierarchy, typing, and encapsulation [14]. For example, Object-oriented design is meant as decomposition into objects carrying state and having behavior. A process model representing the coding behavior should represent when and how many properties of an object have been defined as well as when and how many methods have been defined. Furthermore, without information on element and change type, the process model will only reflect the way the programmers use the IDE to write source code but not the way they code. Existing coding event recording plugins, also including CodingMiner, are not actually able to capture this information. For example, the Eclipse plugin developed by Caldeira et al. [5], capable of listening to the actions programmers executes, aims to support the discovery of the coding processes and compare them in terms of efficiency and effectiveness. The authors evaluated the proposed approach on subjects attending the 3rd year of a BSc degree on computer science. The results obtained give some evidence that teams’ proficiency can be inferred by analyzing mined process models representing their behavior. However, the plugin captures only events within a project context and generic events at the Eclipse global context. This implies that the behavior observed concerns the usage of Eclipse IDE and not properly the developers’ coding behavior. In [8], the authors declare to detect both element and change type. The authors developed a constructor that classifies a source code history by fine grained changes and constructs an event log file. They used the Process Mining approach, the Inductive Miner process discovery algorithm, to understand the way programmers perform code production activities. A preliminary evaluation has been performed involving only three students in developing a program made up of one class. Unfortunately, at this time it is not possible consider the process model constructed as significant because it only represents the coding workflow of a single method defined in a class. In a real world coding process, instead, the coding workflow has to represent a more complicated reality made up of classes, sub-classes, abstraction, accessibility and so on. In [15] the authors developed a library, called Entry, for generating log of programmer’s behavior in the process of block programming and defined required common items in creating block log process. This library [16] is able to capture several information such as, for example, ”the number of times blocks / scenes / objects are created and used”, the number of times modifying conditional expression/internal variable of block”, ”operation time taken to finish the goal resolve the problem”, etc. This library could be used to construct tables of frequencies, durations and other statistics and also process models, by applying process discovery techniques but, also in this case, the models constructed would be not able to distinguish relevant from irrelevant situations. • Noise filtering. One of the main challenges of this stage is to separate noise from events that contribute to tasks. In a coding session noises can be represented by activities that can occur spontaneously at any point in the execution. Such activities, called chaotic activities [17], impacts the quality of the resulting process models obtained with process discovery techniques. For example, a searching activity can occur at any point for any task in the execution. To filter out such chaotic events, in [17] four novel promising techniques, rooted in information theory and Bayesian statistics, are described. The authors have shown, through experiments on seventeen real-life datasets, that all four proposed activity filtering techniques outperform frequency-based filtering on real data and that in all cases the performance is highly dependent on the characteristics of the event log. This means, however, that the ultimate decision on which activities to include has always to be supervised by the final user. • Segmentation. A coding session log, in its raw form, consists of one single sequence of events recorded during a session. During this session, a programmer may have performed several executions of one or multiple tasks. In other words, a coding session log may contain information about several tasks, whose actions and events are mixed in some order that reflects the particular order of their execution by the programmer. Moreover, the same task can be ”spread” across multiple logs, for example the creation of a class can be performed on several logs. A possible solution could consist in segmenting a coding session log into traces, such that each trace to one execution of a task (e.g. the definition of a class, the definition of an interface, etc). • Simplification. Coding process is not executed within rigid, inflexible workflow manage- ment systems and the like, which enforce correct, predictive behavior. Programmers write their source code mainly based on their knowledge, skills, and experience. Such a process does not enforce any defined behavior at all, in a somewhat it describes a more ”loose” manner that does not strictly define a specific path of execution. It is obvious that execut- ing such a process within such less restrictive environments will lead to more diverse and less-structured behavior. This abundance of less-structured observed behavior leads to construct “spaghetti” process models. In this cases a possible solution is represented by using the fuzzy miner approach. The Fuzzy Miner controls such imprecision by means of two metrics: significance and correlation. In particular, significance can be determined both for events and precedence relations over them: it provides us with a measure of the relative importance of behavior. It specifies the level of interest we have in events occurring in well-defined control flow conditions (e.g. precedence relationships, chain of events, and other relationships). For instance, frequency measurement or precedence constraints are a way to measure significance. Correlation is important when studying precedence constraints over the event stream. It measures how two events, following one another, are closely related. Based on these two metrics, which have been defined specially for this purpose, we can sketch the approach, proposed in the CodingMiner environment, for process simplification as follows: – more significant behaviors are preserved; – low significant but highly correlated behaviors are aggregated; – Less significant and lowly correlated behaviors are abstracted. Anyway, the Fuzzy Miner is not sufficient. Even if an event belongs to a task, it may still be redundant. For example, when a programmer defines the name of a new method with a mistake, and then he immediately renames it. In this case, the events that belong to the second time of naming the method are redundant. Depending on the context, the same event may be integral part of a routine or it may be redundant. Thus, classical frequency-based filtering approaches, like [18], cannot be applied to address this problem. One of the possible solutions is to use sequential pattern mining techniques to distinguish between events that are part of mainstream behavior and outlier events [19]. However, in case some events are rarely seen during a task execution they can be mistakenly treated as outliers. The outlined problem creates a need for semantic filtering. Groups of events can be combined into actions of a higher semantic meaning. The challenge here is to identify the semantic boundaries of an action and the attributes to form its payload. 4. Conclusions In previous work, we have exposed an environment, called CodingMiner, capable of analyzing coding logs of fine-grained programmer interactions with Eclipse IDE system, for Java appli- cation, to represent the coding behavior. We have already applied the CodingMiner in a CS2 course obtaining encouraging but not completely satisfactory results. As a preliminary step to improve this environment, here we sketched challenges that need to be overcome to improve the CodingMiner’s components and, in general, the fuzzy-based process mining approach to discover the coding behavior. We also provided some guidelines to tackle these challenges. One of the key challenges consists in how to discover the object-oriented language core topics in coding activities. Each action has to be associated to the element involved and the change type applied. We belief that, with this capability, our approach will become more meaningful, and applicable for teachers and programmers. References [1] R. Guindon, B. Curtis, Control of cognitive processes during software design: What tools are needed?, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’88, ACM, New York, NY, USA, 1988, pp. 263–268. URL: http: //doi.acm.org/10.1145/57167.57211. doi:1 0 . 1 1 4 5 / 5 7 1 6 7 . 5 7 2 1 1 . [2] R. Brooks, Towards a theory of the cognitive processes in computer programming, Int. J. Hum.-Comput. Stud. 51 (1999) 197–211. URL: http://dx.doi.org/10.1006/ijhc.1977.0306. doi:1 0 . 1 0 0 6 / i j h c . 1 9 7 7 . 0 3 0 6 . [3] N. Pennington, A. Y. Lee, B. Rehder, Cognitive activities and levels of abstraction in procedural and object-oriented design, Human-Computer Interaction 10 (1995) 171–226. doi:1 0 . 1 0 8 0 / 0 7 3 7 0 0 2 4 . 1 9 9 5 . 9 6 6 7 2 1 7 . [4] R. Brooks, Towards a theory of the cognitive processes in computer programming, Interna- tional Journal of Man-Machine Studies 9 (1977) 737–751. URL: https://www.sciencedirect. com/science/article/pii/S0020737377800394. doi:h t t p s : / / d o i . o r g / 1 0 . 1 0 1 6 / S 0 0 2 0 - 7 3 7 3 ( 7 7 ) 80039- 4. [5] J. Caldeira, F. Brito e Abreu, J. Reis, J. Cardoso, Assessing software development teams’ efficiency using process mining, in: 2019 International Conference on Process Mining (ICPM), 2019, pp. 65–72. doi:1 0 . 1 1 0 9 / I C P M . 2 0 1 9 . 0 0 0 2 0 . [6] J. Caldeira, F. B. e Abreu, J. Cardoso, R. Ribeiro, C. M. L. Werner, Profiling software developers with process mining and n-gram language models, CoRR abs/2101.06733 (2021). URL: https://arxiv.org/abs/2101.06733. a r X i v : 2 1 0 1 . 0 6 7 3 3 . [7] V. Shynkarenko, O. Zhevago, Visualization of program development process, in: 14th IEEE International Conference on Computer Sciences and Information Technologies, CSIT 2019, Lviv, Ukraine, September 17-20, 2019, Volume 2, IEEE, 2019, pp. 142–145. URL: https://doi.org/10.1109/STC-CSIT.2019.8929774. doi:1 0 . 1 1 0 9 / S T C - C S I T . 2 0 1 9 . 8 9 2 9 7 7 4 . [8] V. Shynkarenko, O. Zhevaho, Application of constructive modeling and process mining approaches to the study of source code development in software engineering courses, Journal of Communications Software and Systems 17 (2021) 342–349. URL: https://doi.org/ 10.24138/jcomss-2021-0046. doi:1 0 . 2 4 1 3 8 / j c o m s s - 2 0 2 1 - 0 0 4 6 . [9] M. Leemans, W. M. van der Aalst, Process mining in software systems: Discovering real-life business transactions and process models from distributed systems, in: 2015 ACM/IEEE 18th International Conference on Model Driven Engineering Languages and Systems (MODELS), volume 00, 2015, pp. 44–53. URL: doi.ieeecomputersociety.org/10. 1109/MODELS.2015.7338234. doi:1 0 . 1 1 0 9 / M O D E L S . 2 0 1 5 . 7 3 3 8 2 3 4 . [10] C. Liu, B. F. van Dongen, N. Assy, W. M. P. van der Aalst, Component behavior dis- covery from software execution data, in: 2016 IEEE Symposium Series on Compu- tational Intelligence, SSCI 2016, Athens, Greece, December 6-9, 2016, 2016, pp. 1–8. doi:1 0 . 1 1 0 9 / S S C I . 2 0 1 6 . 7 8 4 9 9 4 7 . [11] G. Casalino, G. Castellano, G. Zaza, Neuro-fuzzy systems for learning analytics, in: A. Abraham, N. Gandhi, T. Hanne, T.-P. Hong, T. Nogueira Rios, W. Ding (Eds.), Intelligent Systems Design and Applications, Springer International Publishing, Cham, 2022, pp. 1341–1350. [12] P. Ardimento, M. L. Bernardi, M. Cimitile, G. D. Ruvo, Learning analytics to improve coding abilities: a fuzzy-based process mining approach, in: 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2019, pp. 1–7. doi:1 0 . 1 1 0 9 / F U Z Z - I E E E . 2 0 1 9 . 8 8 5 9 0 0 9 . [13] C. W. Günther, W. M. P. van der Aalst, Fuzzy mining – adaptive process simplification based on multi-perspective metrics, in: G. Alonso, P. Dadam, M. Rosemann (Eds.), Business Process Management, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 328–343. [14] A. f. C. M. A. Joint Task Force on Computing Curricula, I. C. Society, Computer Science Curricula 2013: Curriculum Guidelines for Undergraduate Degree Programs in Computer Science, Association for Computing Machinery, New York, NY, USA, 2013. [15] R.-J. Moon, K.-M. Shim, H.-Y. Lee, H.-J. Kim, Log generation for coding behavior analysis: For focusing on how kids are coding not what they are coding, in: 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2017, pp. 575–576. [16] entryjs library, https://github.com/entrylabs/entryjs/, 2016. [Online; accessed 14-June- 2022]. [17] N. Tax, N. Sidorova, W. M. van der Aalst, Discovering more precise process models from event logs by filtering out chaotic activities, Journal of Intelligent Information Systems 52 (2019) 107–139. [18] R. Conforti, M. L. Rosa, A. H. t. Hofstede, Filtering out infrequent behavior from business process event logs, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 300–314. doi:1 0 . 1 1 0 9 / T K D E . 2 0 1 6 . 2 6 1 4 6 8 0 . [19] M. F. Sani, S. J. v. Zelst, W. M. van der Aalst, Improving process discovery results by filtering outliers using conditional behavioural probabilities, in: International Conference on Business Process Management, Springer, 2017, pp. 216–229.