=Paper=
{{Paper
|id=Vol-2978/saerocon-paper6
|storemode=property
|title=Hard Cases in Source Code to Architecture Mapping using Naive Bayes
|pdfUrl=https://ceur-ws.org/Vol-2978/saerocon-paper6.pdf
|volume=Vol-2978
|authors=Tobias Olsson,Morgan Ericsson,Anna Wingkvist
|dblpUrl=https://dblp.org/rec/conf/ecsa/OlssonEW21b
}}
==Hard Cases in Source Code to Architecture Mapping using Naive Bayes==
Tobias Olsson, Morgan Ericsson and Anna Wingkvist

Department of Computer Science and Media Technology, Linnaeus University, Kalmar/Växjö, Sweden

ECSA2021 Companion Volume. Contact: tobias.olsson@lnu.se (T. Olsson); morgan.ericsson@lnu.se (M. Ericsson); anna.wingkvist@lnu.se (A. Wingkvist). ORCID: 0000-0003-1154-5308 (T. Olsson); 0000-0003-1173-5187 (M. Ericsson); 0000-0002-0835-823X (A. Wingkvist). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pages 1–10.

===Abstract===
The automatic mapping of source code entities to architectural modules is a challenging problem that is necessary to solve if we want to increase the use of Static Architecture Conformance Checking in the industry. We apply the state-of-the-art automatic mapping technique to eight open-source systems and find that there are systematic problems in the automatically created mappings. All of these eight systems have small modules that are very hard to map correctly since only a few source code entities are mapped to these. All systems seem to use some naming strategy, mapping source code to modules; however, naming is often ambiguous. We also find differences in ground truth mappings performed by experts, which affect mappings based on these, and that architectural refactoring also affects the mapping performance.

Keywords: Orphan Adoption, Software Architecture, Source Code Clustering, Naive Bayes

===1. Introduction===
Our previous studies [1, 2] of automated techniques to map source code entities to high-level software architectural modules suggest that some entities are much harder to map correctly than others. Even using the best algorithm and different parameters, certain entities always seem to fail to map correctly. We conduct an exploratory study to determine whether our intuition is correct, i.e., that these hard cases exist, and if they do, what their properties are, and what makes them hard to map correctly.

The software architecture of a system captures major design decisions at a high level of abstraction and enables internal and external qualities such as performance, portability, reusability, and maintainability [3]. It serves as a guide for the many decisions that are made during the implementation of a system. As the system evolves, the source code must continue to conform to the architecture or risk accumulating technical debt and no longer possess the desired qualities.

Static Architecture Conformance Checking (SACC) is a collection of methods, such as Reflexion modeling [4], that statically analyze source code to ensure that it does not introduce architectural violations [5, 6]. These methods require an architecture model, with modules and dependencies, and a source code model, with entities and concrete dependencies, e.g., due to inheritance or method invocations. They also require a mapping from the source code model to the architecture model to determine whether the source code dependencies are convergent, absent, or divergent compared to the allowed dependencies specified in the architecture model.

The need for a mapping between the source code and architecture models is a significant reason why SACC has not reached widespread use in the software industry [3, 5, 7, 8]; the tools and methods exist, but the mappings do not exist or are outdated. Many tools address this by combining manual mapping and regular expressions to filter file, module, and package names. Still, such approaches have proven to be time-consuming and error-prone [5, 7, 8].

If we want to automate the mapping process using, e.g., machine learning, it is vital to understand the hard cases. If there is a class of entities that our approach cannot map automatically or always maps to the wrong modules, we need to ensure that these are part of the initial set that a human expert maps. We perform an exploratory study using eight systems with ground truth mappings to determine whether such a class exists. Once we have established that it exists, we determine its properties to identify its members automatically. We then investigate why these properties make the entities difficult to map to ensure that they will not reduce the effectiveness of the machine learning approach; we do not want it to learn the wrong things from the hard cases.

We hypothesize that at least some hard cases would be difficult for a human to map and that different human experts would disagree on how they should be mapped. This can, for example, be due to poor structuring or the evolution of the system. We rely on different ground-truth mappings of the same system and metrics to identify such cases and study how well these correlate to the hard cases.

===2. Automated Mapping===
Figure 1: An example mapping that shows the initial sets of the GUI and Logic modules of JabRef 3.7. A new orphan StringChange is about to be mapped.

To reason about how well an implementation conforms to the intended architecture using, e.g., Reflexion modeling, we need a mapping from the source code to the architecture. In this section, we discuss how such a mapping can be created semi-automatically, starting from an initial set of mapped source code entities.

The source code model consists of Entities (E) and Dependencies (ED). The entities are, e.g., classes defined in a programming language, and the ED are due to, e.g., method calls and inheritance, see StringChange, ChangeScanner, etc., in Figure 1.

The architecture model consists of Modules (M) and Dependencies (MD) between these. The modules represent the major parts of the architecture; see, e.g., GUI and Logic in Figure 1. The directed MD indicates how these modules are allowed to interact and depend on each other. If there, for example, is an MD from GUI to Logic, then entities mapped to GUI are expected to call entities mapped to Logic.

An automated mapping algorithm aims to map each entity to the correct module without human assistance. For example, classes in the implementation that deal with the application's business rules should be mapped to the module Logic. Once this mapping exists, we can compare the ED of the implementation to the MD allowed by the architecture and determine whether they are convergent, absent, or divergent [4].

We rely on orphan adoption [9] to map entities to modules automatically. An unmapped entity is considered an orphan that should be adopted by one of the modules, e.g., StringChange in Figure 1. Tzerpos and Holt identify four criteria that can affect the mapping. ''Naming'': naming standards can reveal what module is suitable. ''Structure'': dependencies between an orphan and already mapped entities can be used as a mapping criterion. ''Style'': modules are often created using different design principles (e.g., high cohesion or not). Classifying the orphan based on style can give hints on how to use, for example, the structure criteria. ''Semantics'': the source code itself can be analyzed to determine its purpose and its similarity to the purpose of the modules.

A sub-problem of orphan adoption is orphan kidnapping, where software evolution causes a need for remapping an entity to a new module, or in other words, corrective clustering. Tzerpos and Holt identify a fifth criterion related to orphan kidnapping, ''Interface minimization''; it is not a good idea to reassign an entity to another module if the removal of the entity will cause the module to get a larger public interface, i.e., the entity is an entry point/facade to the module.

HuGMe [10, 8] relies on orphan adoption to map from the source code to the architecture model. It starts from an initial set of entities that are manually mapped to the correct module. The remaining entities are considered orphans. HuGMe is applied iteratively, and as the set of mapped entities can grow for each iteration, more orphans have the potential to be automatically mapped. In each iteration, there is also the possibility for human intervention using the result of the failed automatic mapping attempts as a guideline. The automatic mapping is done by calculating the attraction between the orphan and the mapped entities for each module. Christl et al. present two attraction functions, CountAttract and MQAttract, based on dependencies, i.e., the structure criterion. Bittencourt et al. evaluate two new attraction functions based on information retrieval techniques. They use the names of modules and entities and the names of identifiers in the entities to form vocabulary documents for modules and entities, i.e., the naming and semantic criteria. They then use a cosine similarity function, IRAttract, and latent semantic indexing, LSIAttract, to calculate the attraction values.

Our attraction function, NBAttract, combines ideas from the previous two and considers the structure, naming, and semantic criteria [2]. The approach is similar to that of Bittencourt et al., but we instead use a Naive Bayes classifier to determine similarity to other entities. To include the structure criterion, NBAttract uses a novel approach, Concrete Dependency Abstraction (CDA), to encode dependencies as text [2]. NBAttract has outperformed CountAttract in our previous study [2], and CountAttract was not clearly outperformed in [7]. We, therefore, only use NBAttract in the remainder of this paper.

===3. Method===
Based on our experiences with different attraction functions, we hypothesize that no matter how well the function performs, there is a specific set of entities that are always misclassified. We seek to investigate this further to determine whether our hypothesis is correct or if the misclassifications happen by chance due to randomness in the composition and size of the initial set.

We have previously implemented a tool to evaluate different mapping approaches, including reporting detailed mapping results [11]. We use this tool to create a new dataset over the mapping results for each source code entity.

We run NBAttract with the following settings. We use an initial set of mapped entities of random size and composition. We extract package names, filenames (these correspond to the outer class names in Java), attribute identifier names, and variable identifier names from the source code entities in the initial set and tokenize these based on Camel-case and the characters - and _. The tokens are then stemmed using a Porter stemmer. Tokens that are shorter than three characters are removed. We use our CDA technique to represent dependencies as text strings. We use a binary token frequency (present or not) and 0.9 as the threshold for automatic classification.

These settings correspond to the settings used in [2] with one exception; we do not require the initial set to contain at least one source code entity from each module in this study. We are interested in how individual files are mapped to find possible flaws in the technique, which is why we allow for a module to be empty initially.

As we run several experiments with random initial sets, we get a dataset that shows the correct mapping of each entity and the number of mappings for each entity and module. Based on this information, we can compute an error rate for each entity according to Equation 1. If the attraction function were completely stochastic, the error rate for each entity would converge to the stochastic error rate, defined in Equation 2.

<math>err_{nba} = \frac{|\text{erroneous mappings}|}{|\text{mappings}|}</math> (1)

<math>err_{sto} = \frac{|\text{modules}| - 1}{|\text{modules}|}</math> (2)

As NBAttract is not a stochastic function, the <math>err_{nba}</math> for an entity should converge to something less than <math>err_{sto}</math> if there are no systematic problems, i.e., it should systematically produce better mappings than a random mapping. Hence, we can conclude that there are systematic problems if we do not find such a convergence for a source code entity after several iterations. If we find systematic errors in a majority of the systems, we will further analyze all problematic entities to find common, possible causes for the misclassification. An entity is considered problematic if its <math>err_{nba} \geq 0.5</math>, i.e., it is misclassified in 50% or more of the mappings. The motivation for this limit is that a non-problematic attraction function should, on average, produce a correct mapping in at least 50% of the cases for each entity. This part of the research is highly exploratory. We investigate the possible reasons for misclassification based on our own experience and the advice from related work, and present exciting findings from the data. The ultimate goal is to construct strategies to detect entities with a high risk of being misclassified so that a human can intervene and classify these manually. More specifically, we will investigate:

''Is the set of problematic entities a good candidate for the initial set?'' This set needs human intervention for automatic mapping to perform well, effectively removing the problem from the automatic mapping. This can be assessed by computing the F1 score of the precision and recall, as we did in [2]. We will compare the F1 scores across the entire range of initial set sizes visually.

''Is the set of problematic entities related to small modules?'' In general, machine learning techniques need good data to perform. In particular, there is a need for a balanced dataset where there is approximately the same amount of data to learn from in each class. If the dataset is imbalanced, there is a high chance that smaller classes will not be properly handled. An architectural module should contain a fair amount of source code entities. Still, there may exist modules that hold source code entities that do not fit well in other modules, or the system may be under evolution, and intended source code has not been created yet, etc. We need to know if such small modules exist and whether they are common or problematic.

''Is the set of problematic entities related to entities with poor naming?'' Tzerpos and Holt [9] define naming as one of the key criteria that influence the mapping. In our experience, it is also a common strategy for developers to create folders, packages, and filenames that reflect the modular architecture to some degree. It would thus be interesting to know if the naming of source code entities includes the name of the module it is mapped to. It is also interesting to know if there are ambiguities in the naming, i.e., if several module names match the name of a source code entity.

''Is the set of problematic entities related to entities on the border of a module?'' Bibi et al. [12], Tzerpos and Holt [9], and Bittencourt et al. [7] state that dependencies have an impact on the mappings. We use a textual representation of dependencies in NBAttract, but this may not be good enough. We will investigate the ratio of external dependencies, e.g., an entity with many external dependencies would likely be an entity that lies on the border of a module. If we find a correlation between the external dependency ratio and the error rate, this could suggest that border entities are problematic.

There are several metrics based on dependencies. We use coupling (the count of all dependencies to or from all other entities) and fan (the existence of a dependency to or from all other entities). The coupling may be very high between two entities, but the fan can at most be one between two entities, i.e., fan is a subset of coupling. While coupling captures the absolute number of dependencies, fan focuses on the diversity of different entities, i.e., a high fan value captures that an entity has many dependencies to other different entities.

''Is the set of problematic entities related to problems in the ground truth mapping?'' We have access to two versions of the JabRef system in which the modules and relations between them are the same (same intended architecture), but the mappings are not the same for all entities. This provides an opportunity to study discrepancies in the ground truth mappings and how these affect the automatic mapping performance. One complicating factor in this analysis is that JabRef underwent an architectural evolution between these two versions. Therefore, we limit our analysis to entities that remain the same (no changes to the source code) but are mapped to different modules.

''Is the set of problematic entities related to files that are being refactored due to architectural evolution?'' The two versions of JabRef provide an opportunity to study entities that have changed packages and mapping (a sign of architectural evolution), have changes to the source code (a sign of refactoring), or were recently added.

We study eight open-source systems implemented in Java. Ant (https://ant.apache.org) is an API and command-line tool for process automation. ArgoUML (http://argouml.tigris.org) is a desktop application for UML modeling. JabRef (https://jabref.org) is a desktop application for managing bibliographical references, and we use the 3.5 and 3.7 versions. Lucene (https://lucene.apache.org) is an indexing and search library. ProM (http://www.promtools.org) is an extensible framework that supports a variety of process mining techniques. Sweet Home 3D (http://www.sweethome3d.com) is an interior design application. TeamMates (https://teammatesv4.appspot.com) is a web application for handling student peer reviews and feedback.

Table 1 presents the sizes of the systems in lines of code, number of entities, and number of modules. There exists a documented software architecture as well as a mapping from the implementation to this architecture for each system. JabRef 3.7, TeamMates, and ProM have been the subjects of study at the Software Architecture Erosion and Architectural Consistency Workshop (SAEroCon) 2016, 2017, and 2019 respectively, where a system expert has provided both the architecture and the mapping. The architecture documentation and mappings are available in the SAEroCon repository (https://github.com/sebastianherold/SAEroConRepo). ArgoUML, Ant, and Lucene were studied by Brunet et al. and Lenhard et al.; their architectures and mappings, as well as those of Sweet Home 3D, were extracted from the replication package of Brunet et al. JabRef 3.5 was extracted from Lenhard et al.

{| class="wikitable"
|+ Table 1: Mapping Data Overview.
! System !! Lines !! # Mod !! # Ent !! err ≥ 0.5 (count) !! err ≥ 0.5 (%) !! err ≥ err_sto (count) !! err ≥ err_sto (%)
|-
| Ant || 36 699 || 16 || 468 || 187 || 39.96% || 72 || 15.38%
|-
| A.UML || 62 392 || 19 || 767 || 165 || 21.51% || 74 || 9.65%
|-
| JR 3.7 || 59 235 || 6 || 1 017 || 107 || 10.52% || 40 || 3.93%
|-
| JR 3.5 || 51 840 || 6 || 733 || 96 || 13.1% || 51 || 6.96%
|-
| Lucene || 35 812 || 7 || 514 || 60 || 11.67% || 17 || 3.31%
|-
| ProM || 9 947 || 4 || 261 || 18 || 6.9% || 9 || 3.45%
|-
| S.H 3D || 34 964 || 9 || 167 || 39 || 23.35% || 19 || 11.38%
|-
| T.Mates || 54 904 || 12 || 450 || 115 || 25.56% || 49 || 10.89%
|}

===4. Results and Analysis===
We performed the experiment and collected mapping data per entity for each system. All systems show several entities always being misclassified (an error rate of 1.0) (cf. Figure 2). Table 1 shows an overview of the data collected. Note that each entity has a random chance to be included in the initial set and not be an orphan in that particular run of the experiment. There is also a chance an entity will not be mapped (e.g., due to variations in the initial set). However, each entity has been mapped at least 500 times.

Figure 2: The entity error rates for each project. (Plot title: Entity Error Rates per Project; one distribution per system, error rate from 0.0 to 1.0.)

We now construct the initial set using entities with <math>err_{nba} \geq 0.5</math>, i.e., only entities with <math>err_{nba} < 0.5</math> are considered orphans, and all the troublesome entities are included in the initial set. We compare this with an initial set selected randomly from all entities. We collected 14 849 and 13 754 data points from the respective groups. Figure 3 shows the running median (±100 data points) and limits of the running 75th and 25th percentiles of the F1 scores, respectively, for JabRef 3.7. Since the other systems show similar trends, we focus on JabRef. We find that our idea is promising overall, especially when the initial set size increases.

Figure 3: The running median F1 score with limits of the running 75th and 25th percentiles for JabRef 3.7 and initial set type over the whole interval of initial set sizes. (Plot title: JabRef 3.7 f1 Scores; x-axis: Initial Set Size; y-axis: f1; series: All Entities, Error Rate < 0.5.)

However, there is also an interval between the initial set sizes of 0.1 to 0.18, marked with vertical lines in Figure 3, where the F1 score is considerably lower than using all entities. This indicates that entities with <math>err_{nba} \geq 0.5</math> are not good representatives of modules. Upon further inspection of the actual modules and entities in JabRef 3.7, we find that there is a set of modules with very few entities in the ground truth mapping, and entities mapped to these all have a high error rate.

The high error rate makes sense in general, as machine learning techniques produce better results if there is more data. More specifically, for Naive Bayes, the probability of finding an entity in such a module is very low, so it does not make sense to map entities to it. We, therefore, investigate if all systems have such small modules prone to misclassification. If entities are equally distributed among the modules of a system, a share of <math>1/|\text{modules}|</math> of the entities would be in each module. We regard a module as small if it has less than half of this share of the entities. Thus we define the limit for a small module as <math>0.5/|\text{modules}|</math> (reported as a percentage in Table 2). It could be argued that the lines of code should be used as a more fine-grained measure of size, i.e., mapping one huge entity in terms of lines of code. However, for example, path and file name information is per entity, and for effectively learning a pattern based on entity names, more entities are needed.

Table 2 shows the limit, the number of small modules, and the rate of misclassification of entities in these modules. A surprising result is that all systems have such small modules, and all systems have small modules where all entities are misclassified. There are 30 (out of a total of 73) modules where all entities are misclassified. Figure 4 shows how the relative number of misclassifications and relative module size are related. Note the cloud of points in the upper left corner. These are the 30 modules where all entities are misclassified. Also, note that there are small modules with a relatively low number of misclassified entities. Another factor to consider is that only 321 entities of 4377 (7.33%) are mapped to these small modules.

{| class="wikitable"
|+ Table 2: The number of entities for small modules (Limit), the number of small modules (SM), their rate of misclassification (SMM), the rate of entities with a naming strategy (NS), ambiguous naming strategy (ANS), and rate of entities with ambiguous naming that are misclassified (ANSM).
! System !! Limit !! SM !! SMM !! NS !! ANS !! ANSM
|-
| Ant || 3.57 || 9 || 92.75 || 100.00 || 85.90 || 29.41
|-
| A.UML || 3.33 || 7 || 70.59 || 84.62 || 62.45 || 53.94
|-
| JR v3.5 || 8.33 || 4 || 93.75 || 71.62 || 30.29 || 42.71
|-
| JR v3.7 || 8.33 || 3 || 100.00 || 95.58 || 7.18 || 82.24
|-
| Lucene || 7.14 || 3 || 57.45 || 99.22 || 2.53 || 90.00
|-
| ProM || 12.50 || 1 || 100.00 || 100.00 || 0.77 || 88.89
|-
| S.H 3D || 5.56 || 5 || 100.00 || 89.82 || 12.57 || 64.10
|-
| T.Mates || 4.17 || 5 || 72.50 || 68.22 || 34.67 || 93.91
|}

Figure 4: The relative rate of misclassified entities vs. the relative number of entities for each module. Coordinates are slightly jittered to show data points more clearly. (x-axis: Relative Module Size; y-axis: Relative Miss-Classifications.)

To measure the extent of using a naming strategy (NS) when naming concrete entities in each system, we check if the words in the package or class name for an entity contain the module name. Table 2 shows that most systems use a naming strategy (column NS) to a rather high degree. Lower values (e.g., TeamMates) are often due to a module naming discrepancy, e.g., TeamMates defines a module view with a corresponding path word named ui; however, there are also cases where there is no clear naming strategy for an entity. We consider an entity to have ambiguous naming if its path or filename contains several different module name words. For example, net.sf.jabref.logic.net.ProxyPreferences, from JabRef v3.7, contains both the module names logic and preferences. Ambiguity in entity naming strategy (ANS) seems to be quite common in some systems (Ant, ArgoUML, JabRef 3.5, Sweet Home 3D, and TeamMates) and not at all in others (JabRef v3.7, ProM, and Lucene). In some systems, the ambiguity is caused by having a parent-level package that is also a module. For example, Ant uses ant as both a high-level package and a module. The misclassification rate in the ambiguously named entities (ANSM) seems to follow the inverse pattern of the ANS; the lower the ANS, the higher the ANSM. This makes sense since a higher ANS means there is more data to learn the pattern of the ambiguous naming from (if there is one).

We now turn our attention to whether entities that lie on the border of a module, i.e., have relatively many dependencies to entities in other modules, are problematic. We use the common coupling and fan metrics. Results for coupling were very similar to the results for fan. However, the fan metric seems less noisy, so we opt only to report these values. We use a scatter plot to check whether there is a correlation between these metrics and the error rate. We do this for entities that are part of large modules since small modules are a confounding factor. Figure 5 shows that there is no clear relation between the external fan ratio and the rate of relative misclassification of an entity. It would, therefore, not make sense to look for a correlation between the two variables.

Figure 5: The error rate versus the external fan ratio of each entity in large modules. Coordinates are jittered for clarity. The box plot shows the difference in the external fan ratio of non-problematic (error rate < 0.5) and problematic entities. (x-axis: Error Rate; y-axis: External Fan Ratio.)

Yet, when we investigate the difference in external coupling for problematic versus non-problematic entities, we find a clear difference in the distribution of the external fan ratio. Problematic entities have a higher external fan ratio in general. This indicates that we need to investigate further how to correctly classify entities that lie on the border of a module. In total, there are 3 502 entities with a low error rate (<math>err_{nba} < 0.5</math>) and 497 entities with a high error rate. The number of entities with a low error rate is also higher throughout the distribution of the external fan ratio. This makes the probability of finding a problematic entity using the external fan ratio very small.

To investigate possible cases of disagreement in mappings, we study the entities that have a change in their mapping between versions 3.5 and 3.7 of JabRef. We first specifically look at entities that have only changed their mapping and not moved in the package hierarchy. We consider such entities as having an ambiguous mapping. We find 17 such entities, 5 of which are mapped to a small module in one or both versions, which will make the error rate unrepresentative. We are left with 12 entities.

We investigate the change of source code for these entities using cloc (https://github.com/AlDanial/cloc) and find five entities without any code changes and seven with varying degrees of change. Finally, we investigate the difference in error between the versions. All entities move to a lower error rate in JabRef v3.7, and 11 entities have a problematic mapping in JabRef 3.5 (cf. Figure 6). JabRef 3.5 has 36 entities mapped to non-small modules with problematic error rates. Between 4 and 11 of these seem to be due to problems in the ground truth mappings, i.e., between 11.1% and 30.6%. Optimally, these are cases where a technique would alert and spark a discussion regarding the ground truth mappings among the developers.

Figure 6: The change in the error of 12 entities from large modules in JabRef that have changed mapping but not changed package. The first five (blue) entities have had no change in source code and the last seven (orange) have changed source code. (Plot title: Change in Error for Entities with Changed Mapping; x-axis: Entity 1 to 12; y-axis: Error; series: JabRef v3.5, JabRef v3.7.)

It should be noted that JabRef underwent a refactoring towards a new modular architecture at this point in development. Therefore, we do not think that these relatively high percentages are representative of all software systems. The developers likely have a higher degree of conformance in a more stable architecture.

Lastly, we look at architecturally refactored entities between JabRef 3.5 and JabRef 3.7. We define an architecturally refactored entity as an entity that has changed mapping and package. We view the conscious choice to change the package of an entity as a sign that the change in mapping is not a mistake or disagreement but a part of architectural evolution. We find 61 such entities, 5 of which are mapped to a small module in one or both versions. We find that refactored entities have a significantly higher error rate if we compare the error rate of these entities with both new and normal entities from the majority of modules in JabRef 3.7 (cf. Figure 7).

Figure 7: The error rates of entities in large modules that are undergoing refactoring, are new, or normal in JabRef 3.7. (Plot title: JabRef 3.7 Large Module Error Rates; groups: refactored, new, normal.)

Such refactored entities could still be in a state of transition, and it seems likely to be a practice to make the change of package and mapping before changing major parts of the implementation. An architectural refactoring can also change the purpose of a module itself, though this will be a slower process for an automatic mapper to detect. The risk is that a module can be quite chaotic during a transition phase with multiple entities in different stages of the refactoring process.

Another interesting observation is that new files tend to have a lower error rate, indicating that the developers have understood the new architecture and that normal code changes could slowly make an entity harder to classify. This could be due to some form of design erosion, where changes are introduced that make the entity less cohesive over time.

===5. Related Work===
There is previous work in the area of orphan adoption [9, 10, 8, 7, 12, 15, 16, 17]. The focus is to evaluate and compare the performance of different approaches, but not to specifically analyze problematic cases. We highlight the conclusions of prior work made regarding what may explain the performance.

The orphan adoption criteria naming, structure, style, and interface minimization are used in an algorithm evaluated in three case studies [9]. Of these, we find the most interesting to be an evolving industrial system whose architecture was created by researchers with the help of developers. 939 entities were assigned to modules, and in 46 cases (4.9%), the algorithm suggested a different mapping than the developers. In 33 of these cases, the developers agreed with the algorithm's mapping, i.e., the algorithm was able to find developer mistakes. In some of the 13 cases where the suggested module was not accepted, the developers mentioned that (code) changes to the entity were needed for it to conform to the developer mapping.

Bibi et al. compared the structural criteria part of the algorithm proposed by Tzerpos and Holt with supervised machine-learning approaches: Bayesian classification, k-nearest-neighbor, and neural networks. Their study focuses on using dependencies as features (i.e., structural criteria) for incremental clustering. They evaluate the approaches using two versions of six open-source software systems and find that dependencies between entities within the same module are important to avoid misclassifications, especially when there are few dependencies between entities in different modules.

We previously constructed a structure-based heuristic for automatic mapping of source code to Model-View-Controller-based architectures [15]. We evaluated the approach on four products in a product line of games, all using the same game engine. We compared the automatic mapping to the manual mapping, and if they disagreed, then the type was flagged as containing an architectural problem. We compared the mappings of 653 entities and were able to correctly identify 76 out of 101 architectural problems as well as 18 false positives. The heuristic suggested a different mapping in 96 (14.7%) of 653 cases.

Furthermore, two of the projects were refactored to be fully conformant. This refactoring removed 33 true positives and six false positives. The true positives were remedied by refactoring the source code. In the context of evaluating the performance of a method for automatic mapping using the manual mappings as ground truth, these true positives would be regarded as erroneous mappings when they, in fact, are pointing to source code with architectural problems that need to be refactored.

The CountAttract and MQAttract attraction functions of HuGMe have been evaluated in four case studies [10, 8]. The focus is on evaluating the influence of two configuration parameters and comparing the performance of the attraction functions. Both attraction functions assume a modular design based on the high cohesion low coupling style, and mapping would become problematic for modules designed specifically to not use this style. Christl et al. suggest incorporating a detection step to better handle such modules, which would correspond to handling the style criteria. Furthermore, Chen et al. improve on CountAttract in an evolutionary case, i.e., a pre-existing mapping is used.

Bittencourt et al. present two new attraction functions based on information retrieval techniques. They use the semantic information in the source code and calculate attractions based on cosine similarity (IRAttract) and latent semantic indexing (LSIAttract). They make a quantitative comparison between the performance of their attraction functions and that of CountAttract and MQAttract in an evolutionary setting (where a few new files are to be assigned a mapping). They find that a combination of attraction functions (e.g., if CountAttract fails, then try IRAttract) performs best. This is explained by their qualitative analysis, where they find that CountAttract usually misplaces entities on module borders, MQAttract performs better when mapping entities with dependencies to many different modules, IRAttract and LSIAttract perform better when mapping entities in libraries or entities on module

sented a fairly good correlation, and in one system, they could find a repeating pattern of directories. Possibly the ground truth architectures recovered in their study are more low-level than the modular architectures that we study. Still, it is likely that there is variation in which dimension of an architecture is expressed in the package structure. This is further supported by Buckley et al. where one system of five studied did not have any clear correlation between packages and modules [19], presenting clear difficulties and significant effort when performing the manual mapping.

===6. Discussion and Validity===
Our results clearly show that there is a set of entities in the systems that are systematically hard for the state-of-the-art automatic mapping techniques to map. One reason for this is the surprising result that all studied systems exhibit some very small modules. An automated technique would have very little data to use for these modules, lowering the chance for successful mapping. In general, unbalanced data is problematic for machine learning techniques, and in particular, the distribution of probabilities is important in Naive Bayes.
30% (237 out of borders, but perform less well if there are modules that 784) problematic entities are mapped to such small mod- share vocabulary but are not related. ules in the ground truth mappings. This is a significant Sinkala and Herold present InMap, which is not an problem that needs to be solved. automated approach to mapping per se; instead, InMap In essence, small modules need to be flagged (either suggests mappings to the end-user, who can choose to automatically or manually) and handled separately. One accept the suggested mapping (or not). It is an iterative idea in the context of Naive Bayes would be to manipulate approach where a number of mappings are presented, the probability distribution appropriately to not wholly and the accepted mappings are used to improve the sug- disregard small modules in the mapping. Schemes that gested mappings further. The suggested mappings are could be tested are a uniform distribution or different produced with the help of information retrieval informa- fixed settings (large, medium, small). These should be tion similar to Bittencourt et al. with the addition of a reasonably easy for an end-user to assign to a module. descriptive text for each architectural module. The enti- Still, there is a risk that overall performance will drop ties are treated as a database of documents, and InMap as potentially more entities will be hard to map. Another uses Lucene to search this database using module infor- approach is to investigate why a few of the small modules mation as a query. As InMap is highly interactive, it will do not contain many problematic entities. We suspect also use negative evidence to some degree, i.e., a rejected that these modules possibly exhibit a unique design, e.g., mapping suggestion will not be suggested again. 
The being very cohesive or having very clear naming, which data from [16] suggest that using only the module names is perhaps not easy to address directly in a technique as as a search criterion often results in high precision at the it may simply be a way a module is designed. expense of the recall. This is most likely due to the fact Using the naming strategy and possible ambiguity in that module names often reflect package names to some naming is an attractive approach to create an initial set us- degree. Adding more and more module information in ing a specialized mapper. It should be possible to prompt the query tends to lower precision, but increase the re- an end-user with, e.g., keywords from the package or call, e.g., source code comments increase recall but lower class name asking for a mapping of the keyword. This precision in the mapping suggestions. could significantly reduce the effort of creating an initial Garcia et al. discuss the use of package and naming set that could then be used as a basis for other map- information in software architecture recovery [18]. In ping techniques. However, a complete approach must be general, they found that their ground truth components prepared to handle subject systems where the naming often spanned or shared several packages. They could information does not reflect the modular architecture. not find a correlation between components and single The data on finding problematic entities among entities package or directory names. One of their four cases pre- 8 Tobias Olsson et al. CEUR Workshop Proceedings 1–10 that lie on the borders of modules is conflicting. On will become less semantically cohesive as the vocabulary the one hand, we cannot see any correlation between becomes a mix of words from the previous architecture. the external fan ratio and the error rate. 
On the other The error rate of entities could then be used as a metric hand, we observe a higher median external fan ratio in to know if an entity is properly aligned to other entities problematic entities. We observe very high error rates in in the module. combination with very low external fan ratios and vice Comparing a human-made mapping to the mapping versa. This indicates that the external fan ratio is not a made by an automatic technique seems to be a useful useful metric, and a more refined metric could give better piece of information. The related work [9, 15] shows that answers. There is possibly a difference between incoming this often points to cases where (further) refactoring or and outgoing dependencies that could be a factor. In [9], discussion is needed and that the automatic technique is these entities were specifically detected and only used not necessarily wrong per se. However, if no human map- when suggesting a new module (orphan kidnapping). ping exists, is it important for an automated technique to Such an approach could also be investigated. notify a human user of such issues and not automatically We studied two different mappings in two versions assign the entity a mapping. of JabRef and found six cases where only the mapping Comparing mappings using several different techniques had changed (no change of source code), of which five could be a way forward, similar to what is done in [7] but mapped to large modules. We found eleven entities with a different intent. This also points to a problematic where the mapping and source code had changed (though situation as we cannot fully trust the ground truth map- the entity had not changed package), of which seven were pings; a perfect mapping technique would thus be flawed. in large modules. For these entities, there was a signifi- There is also a general lack of ground truth mappings cant difference in error rate between the two mappings. 
made by human experts and even fewer mappings made We are relatively confident that the difference in the six by different experts on the same system. Four of the entities with no change is due to disagreement among systems (JabRef v3.5, JabRef v3.7, ProM, and TeamMates) the developers; in the other eleven, it could also be due have mappings done by experts. The others (ArgoUML, to the actual change of the entities’ source code. This Ant, Lucene, and Sweet Home 3D) have mappings created would indicate that between 0.8% and 2.3% of entities are by researchers studying the systems’ documentation and hard to map correctly, even for JabRef experts. implementations [13]. The architects or developers of It should also be noted that JabRef is only one case and these systems would likely not agree to all of these map- that it was undergoing architectural refactoring during pings even if it is likely that large parts of the mappings this time in development. We are reasonably confident are correct. that this affects the results. We can argue that there may Two limiting factors in this study are that all systems be more confusion among the developers during refactor- are implemented in Java and that we have only studied ing, which should increase the chance of disagreements. one set of parameters of the attraction function, i.e., the There is also the possibility that the process of refactor- one from [2] giving the best mapping performance. An- ing has brought the architecture to everyone’s attention, other set of parameters would likely give different error possibly lowering the chance of disagreements. The low rates; however, we think the main points of the paper error rate of new entities suggests the latter as more would still hold. likely. The two mappings and versions of JabRef allow us to study entities under refactoring and new entities. We 7. Conclusions and Future Work find 61 entities under refactoring and 348 new entities. 
If We investigate the flaws in the automatic mapping of we remove entities from small modules (with confound- source code to modules in eight open-source software ing error rates), we find that entities under refactoring systems. We show that the state of the art technique has are considerably harder to map correctly. This is likely systematic flaws in its suggested mappings that need to because architectural refactoring is a process that can be addressed. We find that a major contributing factor is take some time to complete. The functional aspects of the that all investigated systems have modules with very few entities are likely fixed first, possibly with the removal of ground truth mappings. We also find that all systems use unwanted dependencies (especially as JabRef has some a naming strategy, but this strategy is often ambiguous. tests for this). We found no clear evidence that entities that have many There is, however, a risk that the semantic information dependencies to or from entities in other modules are (e.g., variable names) will not be changed and correctly systematically problematic. Our data indicate that such reflect the vocabulary of the module. It would be interest- dependencies can be a factor, but the metrics used are ing to see if this happens to these entities in future ver- likely not well suited to clearly show such problems. sions of JabRef or if the current state is considered good We studied differences in expert mappings in one of enough. If so, there is a considerable risk that modules 9 Tobias Olsson et al. CEUR Workshop Proceedings 1–10 the systems, where we had two different versions and two techniques, in: IEEE Working Conference on Re- different ground truths. We found that disagreements verse Engineering, 2010, pp. 163–172. exist and that such entities are likely to have a high error [8] A. Christl, R. Koschke, M. A. 
Storey, Automated rate in the mappings, although there are not many such clustering to support the reflexion method, Infor- entities. We also studied refactored files and new entities. mation and Software Technology 49 (2007) 255–274. Refactored entities tend to have a significantly higher [9] V. Tzerpos, R. C. Holt, The orphan adoption prob- error rate compared to both new entities and normal lem in architecture maintenance, in: IEEE Work- entities. There is a risk that refactoring is considered ing Conference on Reverse Engineering, 1997, pp. done when the entity is moved and the functional aspects 76–82. are fixed. Automatic mapping could indicate when the [10] A. Christl, R. Koschke, M. A. Storey, Equipping the entity is properly aligned to other entities in the module reflexion method with automated clustering, in: or noticeably different. IEEE Working Conference on Reverse Engineering, Our priority for the future is to address the small mod- 2005, pp. 98–108. ules. We will try different approaches to manipulating [11] T. Olsson, M. Ericsson, A. Wingkvist, s4rdm3x: A the probability distribution of the modules and find the tool suite to explore code to architecture mapping effect on overall mapping performance. Another area of techniques, Journal of Open Source Software 6 interest is the use of naming information to create an (2021) 2791. doi:1 0 . 2 1 1 0 5 / j o s s . 0 2 7 9 1 . initial set, as this could significantly reduce the mapping [12] M. Bibi, O. Maqbool, J. Kanwal, Supervised learn- effort. ing for orphan adoption problem in software archi- tecture recovery, Malaysian Journal of Computer Science 29 (2016) 287–313. Acknowledgments [13] J. Brunet, R. A. Bittencourt, D. Serey, J. Figueiredo, On the evolutionary nature of architectural viola- The research was supported by the Centre for Data Inten- tions, in: IEEE Working Conference on Reverse sive Sciences and Applications at Linnaeus University. Engineering, 2012, pp. 257–266. [14] J. Lenhard, M. 
Blom, S. Herold, Exploring the suit- References ability of source code metrics for indicating archi- tectural inconsistencies, Software Quality Journal [1] T. Olsson, M. Ericsson, A. Wingkvist, Towards im- (2018). proved initial mapping in semi automatic clustering, [15] T. Olsson, D. Toll, A. Wingkvist, M. Ericsson, Evalu- in: Proceedings of the 12th European Conference ation of a static architectural conformance checking on Software Architecture: Companion Proceedings, method in a line of computer games, in: 10th in- ECSA ’18, 2018, pp. 51:1–51:7. ternational ACM Sigsoft conference on Quality of [2] T. Olsson, M. Ericsson, A. Wingkvist, Semi- software architectures, ACM, 2014, pp. 113–118. automatic mapping of source code using naive [16] Z. T. Sinkala, S. Herold, Inmap: Automated inter- bayes, in: 13th European Conference on Software active code-to-architecture mapping recommenda- Architecture - Volume 2, 2019, p. 209–216. tions, in: IEEE 18th International Conference on [3] L. De Silva, D. Balasubramaniam, Controlling soft- Software Architecture (ICSA), 2021, pp. 173–183. ware architecture erosion: A survey, Journal of [17] F. Chen, L. Zhang, X. Lian, An improved mapping Systems and Software 85 (2012) 132–151. method for automated consistency check between [4] G. C. Murphy, D. Notkin, K. Sullivan, Software software architecture and source code, in: IEEE reflexion models: Bridging the gap between source 20th International Conference on Software Quality, and high-level models, ACM SIGSOFT Software Reliability and Security (QRS), 2020, pp. 60–71. Engineering Notes 20 (1995) 18–28. [18] J. Garcia, I. Krka, C. Mattmann, N. Medvidovic, Ob- [5] N. Ali, S. Baker, R. O’Crowley, S. Herold, J. Buck- taining ground-truth software architectures, in: ley, Architecture consistency: State of the practice, 35th International Conference on Software Engi- challenges and requirements, Empirical Software neering (ICSE), 2013, pp. 901–910. Engineering 23 (2017) 1–35. [19] J. 
Buckley, N. Ali, M. English, J. Rosik, S. Herold, [6] J. Knodel, D. Popescu, A comparison of static archi- Real-time reflexion modelling in architecture rec- tecture compliance checking approaches, in: The onciliation: A multi case study, Information and IEEE/IFIP Working Conference on Software Archi- Software Technology 61 (2015) 107–123. tecture, 2007, pp. 12–21. [7] R. A. Bittencourt, G. Jansen de Souza Santos, D. D. S. Guerrero, G. C. Murphy, Improving automated map- ping in reflexion models using information retrieval 10
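To make the idea from the discussion of small modules concrete (replacing the data-driven module probabilities in Naive Bayes with a uniform distribution or with fixed per-module settings), the following is a minimal sketch in Python. It is not the implementation used in the study: the function names, the toy tokens, the module names, and the fixed weights are all assumptions made for illustration, and a plain multinomial Naive Bayes with add-one smoothing stands in for the actual attraction function.

```python
import math
from collections import Counter, defaultdict

def train(mapped):
    """Collect per-module token counts and entity counts.

    mapped: list of (tokens, module) pairs forming the initial set.
    """
    token_counts = defaultdict(Counter)
    entity_counts = Counter()
    for tokens, module in mapped:
        token_counts[module].update(tokens)
        entity_counts[module] += 1
    return token_counts, entity_counts

def priors(entity_counts, scheme="empirical", weights=None):
    """Return P(module) under one of three hypothetical schemes."""
    if scheme == "uniform":
        raw = {m: 1.0 for m in entity_counts}
    elif scheme == "fixed":
        # weights: user-assigned size setting per module (assumption)
        raw = {m: float(weights[m]) for m in entity_counts}
    else:
        # empirical: proportional to the number of mapped entities
        raw = {m: float(n) for m, n in entity_counts.items()}
    total = sum(raw.values())
    return {m: v / total for m, v in raw.items()}

def classify(tokens, token_counts, prior):
    """Pick the module with the highest log posterior (add-one smoothing)."""
    vocab = {t for counts in token_counts.values() for t in counts}
    best, best_score = None, -math.inf
    for module, p in prior.items():
        counts = token_counts[module]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(p)
        score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
        if score > best_score:
            best, best_score = module, score
    return best

# Toy data (hypothetical): a large "gui" module and a small "cli" module.
TRAINING = [(["gui", "panel"], "gui")] * 9 + [(["cli", "option"], "cli")]

if __name__ == "__main__":
    tc, ec = train(TRAINING)
    orphan = ["option"]  # a token seen only in the small module
    print(classify(orphan, tc, priors(ec, "empirical")))  # gui
    print(classify(orphan, tc, priors(ec, "uniform")))    # cli
    print(classify(orphan, tc, priors(ec, "fixed", {"gui": 3, "cli": 1})))  # cli
```

In this toy setting, the empirical prior (0.9 versus 0.1) outweighs the token evidence and the orphan is assigned to the large module, whereas the uniform and fixed schemes let the evidence for the small module decide. Whether such schemes improve or reduce overall mapping performance on real systems remains an open question.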