=Paper=
{{Paper
|id=Vol-2361/short13
|storemode=property
|title=Actionable Measurements – Improving The Actionability of Architecture Level Software Quality Violations
|pdfUrl=https://ceur-ws.org/Vol-2361/short13.pdf
|volume=Vol-2361
|authors=Wojciech Czabański,Magiel Bruntink,Paul Martin
|dblpUrl=https://dblp.org/rec/conf/benevol/CzabanskiBM18
}}
==Actionable Measurements – Improving The Actionability of Architecture Level Software Quality Violations==
Actionable Measurements – Improving The Actionability of Architecture Level Software Quality Violations

Wojciech Czabański, Institute for Informatics, University of Amsterdam, Amsterdam, the Netherlands. Email: wojciech.czabanski@gmail.com
Magiel Bruntink, Software Improvement Group, Amsterdam, the Netherlands. Email: m.bruntink@sig.eu
Paul Martin, Institute for Informatics, University of Amsterdam, Amsterdam, the Netherlands. Email: p.w.martin@uva.nl

Abstract—When system components become more coupled over time, more effort must be dedicated to software architecture refactoring. Tightly coupled components present higher risk—they make the system more difficult to understand, test and modify. In order to allocate the refactoring effort effectively, it is crucial to identify how severely components are coupled and which areas of the system involve the most risk to modify. In this paper we apply the concept of architecture hotspots together with the Software Improvement Group Maintainability Model to identify violations in architecture design. We propose a prototype tool that can identify and present architecture smells to developers as refactoring recommendations. We then apply the tool to open-source projects, validating our approach through interviews with developers. Developers found the hotspots comprehensible and relevant, but there is room for improvement with regards to actionability.

I. INTRODUCTION

Software maintainability is an internal quality of a software product that describes the effort required to maintain a software system. Low maintainability is connected with low comprehensibility. Glass argues that the most challenging part of maintaining software is understanding the existing product [1]. It follows that code which is hard to understand is also difficult to modify in a controlled manner and to test for defects. If changes are difficult to introduce and code is hard to understand, the probability of bugs being introduced is very high, which raises the cost of further developing and maintaining the system.

We focus on the Maintainability Model developed by the Software Improvement Group (SIG) [2]. From this model, 10 guidelines have been derived to help developers quickly evaluate the maintainability of their code and provide actionable suggestions to increase its quality, such as keeping the complexity of units low and interfaces small. The guidelines are implemented in the Better Code Hub tool [3], which applies them to provide feedback to developers. In particular we look at component independence, because it is considered the most challenging to improve, based on user feedback. Our goal is to provide developers with more actionable feedback in addition to the diagnosis provided by Better Code Hub, so that they can improve the maintainability of the code.

Currently, Better Code Hub provides an overview of components and the interactions between them, such as incoming and outgoing calls to and from modules in other components. It does not, however, provide specific guidance as to how the developer can reduce the component coupling and improve their independence. Attempts have been made to generate suggestions for improving modularity by suggesting move module refactoring operations, framing the problem as a multi-objective search [4]. Such refactoring operations, however, may either not improve modularity or make the codebase less semantically consistent. Identifying patterns in poorly modularized code can be a starting point for devising better recommendations as to how the components can be decoupled. Thus we apply the architecture hotspot patterns described by Mo et al. [5] to conduct a study on open source projects, in order to evaluate whether hotspots can be found in them and used to provide refactoring recommendations. Furthermore, we investigate whether presenting quality violations based on hotspots helps developers decrease coupling between components. In order to validate our approach, we construct a hotspot detector, integrate it with the Better Code Hub analysis tool and visualise the hotspots. Based on initial feedback from developers indicating that the suggestions are comprehensible and relevant, we finally consider how to build upon our work in future. We look to improve the tool by adding more detailed information about what triggers the hotspot detection.

II. BACKGROUND

Program analysis is the process of analysing the behaviour of a software program with regard to certain properties such as correctness, robustness or maintainability [6]. There exist a number of means of program analysis already defined in the research literature, including both static and dynamic analysis, maintainability guidelines and detection of 'code smells'. We survey a few of these approaches below.

A. Static and dynamic program analysis

Source code is commonly used as input for static analysis tools. In certain cases other inputs, such as revision history, are used as well. Syntactic analysis and software metrics computation involve analysing the source code of a system, often represented as a graph. Examples of tools for obtaining the source code graph and metrics include Understand (https://scitools.com/), the Software Analysis Toolkit from SIG (https://www.sig.eu/) and SonarQube [7]. Our intention was to improve the actionability of measurements. In this respect, SonarQube is aimed at integration within a CI/CD pipeline, which made it difficult to use in a research setting: the existing pipeline and the time limitations of the project made it unfeasible to modify. Understand exported the dependency graph to a commonly used format and supported a variety of programming languages, but was challenging to integrate with the SIG infrastructure, which left us choosing the Software Analysis Toolkit to pre-process source code for further analysis.

We also investigated dynamic analysis methods, reviewing tools such as Valgrind [8], Google Sanitizers [9] and Daikon [10]. We chose, however, to focus on analysing source code only—to use dynamic analysis, the executable for every project would need to be built locally. In addition to that, the reviewed tools detect possible faults in the code as opposed to analysing maintainability.

B. SIG Maintainability Model

SIG developed a maintainability model for software based on empirical software engineering research [11]. They use an in-house tool, the Software Analysis Toolkit, to conduct static analyses of software systems. The Maintainability Model is also accessible for GitHub developers through the Better Code Hub tool, which conducts an analysis of a repository and evaluates it against the ten SIG guidelines [3]. In our paper we focus on the 'Couple Architecture Components Loosely' guideline, which advises minimising the number of throughput components. These have high fan-in/fan-out values [12]. Similarly to modules, components that act as intermediaries are more tightly coupled and more difficult to maintain in isolation.
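To make the notion of a throughput component concrete, the following is a minimal sketch in Python that flags components with both high fan-in and high fan-out in a component-level dependency graph. The component names, edges and threshold are illustrative assumptions; this is not the implementation used by Better Code Hub or the Software Analysis Toolkit.

    from collections import defaultdict

    # Component-level dependency edges: (caller component, callee component).
    # The component names and edges are illustrative only.
    edges = [
        ("web", "core"), ("cli", "core"), ("batch", "core"),        # fan-in on "core"
        ("core", "storage"), ("core", "auth"), ("core", "report"),  # fan-out from "core"
    ]

    fan_in, fan_out = defaultdict(int), defaultdict(int)
    for caller, callee in edges:
        if caller != callee:              # ignore dependencies within a component
            fan_out[caller] += 1
            fan_in[callee] += 1

    THRESHOLD = 3  # illustrative cut-off for "high" fan-in/fan-out

    components = set(fan_in) | set(fan_out)
    throughput = [c for c in components
                  if fan_in[c] >= THRESHOLD and fan_out[c] >= THRESHOLD]
    print(throughput)  # ['core'] -- the only candidate throughput component here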
C. Software design defect classifications

Low maintainability can manifest itself by the presence of certain antipatterns, called 'code smells'. The concept of 'code smells' as software design defects was popularised by Fowler [13]. We looked into both architecture and design smells. Suryanarayana and Sharma proposed that architecture smells represent design violations impacting both component and system levels [14]. Sharma provided the definition of a design smell [15]; Fontana et al. investigated Java projects for recurring patterns of architecturally relevant code anomalies [16]. Architecture hotspot smells are code anomalies introduced in the paper from Mo et al. which are related to an increased software defect density [5]. Macia et al. designed a DSL for describing architectural concerns and code anomalies [17]. In addition to the source code and metrics, they use the metadata defined using the DSL to detect code anomalies in a tool (SCOOP). Martin framed component design guidelines using 3 principles; violating those principles constitutes an architectural smell [18]. Garcia et al. define four architecture smells [19].

We believe that connecting maintainability measurements with architecture smells will allow us to provide more actionable and relevant refactoring recommendations for developers using Better Code Hub compared with relying on metrics alone. It will also make it possible to offer advice on how to deal with the detected type of smell. Only the classification from Mo et al. draws a clear connection between the architectural smells and maintainability, which is why we chose to use it to enhance the refactoring recommendations generated by Better Code Hub [5].

III. EXPERIMENTS

A. Data sources

We selected a number of GitHub repositories that are both available as open source projects and contain the majority of code in languages that are supported by Better Code Hub as data sources to validate our approach. The projects we targeted needed to have between 50k and 200k lines of code, be at least three years old and be written in a strongly and statically typed language supporting inheritance (e.g. Java, C# or C++).

B. Hotspot distribution

We used the Understand code analyser to generate source graphs, which we then fed into an existing tool called Titan [20], which identifies architecture hotspots. We aggregated the hotspot quantities and types per analysed system in Table I. The file count indicates how many distinct files contain hotspots. A file can be a part of multiple hotspots, but we count the files only once.

Table I. HOTSPOT INSTANCES OVERVIEW IN SELECTED SYSTEMS

System | Language | LOC (k) | Unhealthy Inheritances (files) | Cross-module cycles (files) | Package cycles (files)
Bitcoin | C++ | 120 | 16 (75) | 31 (108) | 49 (117)
Jenkins | Java | 100 | 80 (170) | 10 (403) | 513 (372)
jME | Java | 240 | 69 (436) | 59 (402) | 335 (410)
JustDecompileEngine | C# | 115 | 79 (290) | 8 (205) | 92 (89)
nunit | C# | 59 | 24 (94) | 6 (62) | 62 (74)
openHistorian | C# | 72 | 12 (37) | 31 (89) | 63 (114)
OpenRA | C# | 110 | 19 (150) | 35 (273) | 202 (206)
pdfbox | Java | 150 | 64 (252) | 23 (379) | 261 (301)
Pinta | C# | 54 | 17 (57) | 12 (112) | 109 (91)
ShareX | C# | 95 | 11 (76) | 38 (205) | 189 (248)

In order to reason about the impact of hotspots on the overall maintainability of projects, we compare the number of files affected by hotspots with the number of code files in the project. Kazman et al. show that files containing hotspots are more bug-prone and exhibit maintenance problems, from which we infer that a higher percentage of files affected by hotspots makes a codebase less maintainable [5]. The percentage of files affected by hotspots is then juxtaposed in Table II with the component independence metric (CI – the percentage of source lines of code which are interface code, or code which is called from outside of the component in which it resides and also calls code outside of the component) measured by Better Code Hub (BCH).

Table II. HOTSPOT IMPACT ON SELECTED SYSTEMS

System | Files analyzed by BCH | Files affected by hotspots | % affected by hotspots | CI
Bitcoin | 675 | 117 | 17.33% | 0.9894
Jenkins | 1112 | 403 | 36.24% | 0.9868
jME | 2077 | 436 | 20.99% | 0.6812
JustDecompileEngine | 814 | 290 | 35.62% | 0.8311
nunit | 781 | 94 | 12.03% | 0.6329
openHistorian | 726 | 114 | 15.70% | 0.9572
OpenRA | 1157 | 273 | 23.60% | 0.8362
pdfbox | 1279 | 379 | 29.63% | 0.6283
Pinta | 400 | 112 | 28.00% | 0.7421
ShareX | 677 | 248 | 36.63% | 0.6842

Discussion: We expected the percentage of files affected by hotspots to be negatively correlated with component independence (CI) (see Table II). The correlation coefficient is -0.0162, indicating no correlation. Based on the above analysis, this indicates that the overall impact of hotspots on the codebase may not be measurable using Better Code Hub's Component Independence metric.
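The reported coefficient can be checked directly from the figures in Table II. Below is a minimal sketch in plain Python that computes the Pearson correlation between the percentage of files affected by hotspots and the CI value; it uses only the data from the table.

    from math import sqrt

    # (% of files affected by hotspots, CI) per system, taken from Table II.
    rows = [(17.33, 0.9894), (36.24, 0.9868), (20.99, 0.6812), (35.62, 0.8311),
            (12.03, 0.6329), (15.70, 0.9572), (23.60, 0.8362), (29.63, 0.6283),
            (28.00, 0.7421), (36.63, 0.6842)]

    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)

    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)

    r = cov / sqrt(var_x * var_y)
    print(round(r, 4))  # ~ -0.0162: effectively no correlation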
C. Prototype

The research environment defined limitations on the inputs and tools that we could use; therefore we decided to implement a detector for Better Code Hub based on the state-of-the-art hotspot approach described in [5]. However, we used the source code graph created by the Software Analysis Toolkit as opposed to the Understand source code analyser.

Overview: The prototype consists of detector and visualisation parts. The visualisation is a part of the Edge front-end component and only consumes the hotspot data produced by the detector. The detector itself is a part of the GuidelineChecker component. In addition, the Scheduler is an orchestrating component which starts a Jenkins task and notifies the Edge component that the analysis is finished. The Jenkins component clones the source code repository and invokes the Software Analysis Toolkit, which outputs a source code graph generated from the repository; the GuidelineChecker then checks the source graph against the guidelines. Our detector is invoked as a part of the guideline check. Finally, the analysis result is stored in a MongoDB database, where it can be reached by the Edge component and presented by the visualisation part of the prototype.

Detector: The control flow of the detector is as follows: first, the class hierarchies are extracted from the source code graph as separate subgraphs; second, each hierarchy is checked for the presence of two types of the Unhealthy Inheritance hotspot: internal and external. An internal Unhealthy Inheritance hotspot is a class hierarchy in which at least one base class depends on or refers to a derived class. An external Unhealthy Inheritance hotspot is a class hierarchy which has client classes that refer to both base and derived classes of the hierarchy at the same time. While detecting internal hotspots we investigate the classes and edges that belong to the hierarchy. For external hotspots we also check the neighbourhood of the class hierarchy—its clients, being classes which have a dependency on any of the classes in the analysed hierarchy.
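The two detection rules can be illustrated on a small class-level dependency graph. The sketch below is our own simplification, not the prototype's actual data model: classes are plain strings, edges are (source, target, kind) tuples with kind either "inherits" or "depends", and the internal check only considers dependencies from the root base class to its descendants.

    # Edge encoding: (source class, target class, kind), where kind is
    # "inherits" (source extends target) or "depends" (source refers to target).

    def hierarchy_of(base, edges):
        """All classes that transitively inherit from `base` (excluding `base`)."""
        derived = {s for s, t, k in edges if k == "inherits" and t == base}
        frontier = set(derived)
        while frontier:
            frontier = {s for s, t, k in edges
                        if k == "inherits" and t in frontier and s not in derived}
            derived |= frontier
        return derived

    def internal_unhealthy(base, edges):
        """Simplified internal check: the base class refers to one of its descendants."""
        derived = hierarchy_of(base, edges)
        return any(k == "depends" and s == base and t in derived
                   for s, t, k in edges)

    def external_unhealthy(base, edges):
        """A client outside the hierarchy refers to both the base and a derived class."""
        derived = hierarchy_of(base, edges)
        members = derived | {base}
        clients = {s for s, t, k in edges
                   if k == "depends" and t in members and s not in members}

        def refs(client, targets):
            return any(k == "depends" and s == client and t in targets
                       for s, t, k in edges)

        return any(refs(c, {base}) and refs(c, derived) for c in clients)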
IV. PROTOTYPE EVALUATION

The prototype evaluation involved integration with Better Code Hub and the gathering of feedback via structured interviews with developers who used the prototype. We intended to evaluate the comprehensibility, relevance and actionability of the refactoring recommendations by asking scaled questions on a Likert scale [21]. Furthermore, we asked developers what they would need to make the feedback more actionable and how they would address the problem. The integration of the hotspot detection into the existing system involved two steps: generating refactoring recommendations and visualisation.

Refactoring recommendations are generated in two stages. First, the detector part identifies hotspots and generates a recommendation for every source node that is a part of a hotspot. Secondly, recommendations are filtered and ordered.
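One possible shape for this two-stage pipeline is sketched below. The record fields and the ordering criterion (number of violating dependencies) are illustrative assumptions; the paper does not detail the prototype's actual filtering and ordering rules.

    # Stage 1: one recommendation per source node that is part of a hotspot.
    # Stage 2: filter and order the recommendations before presenting them.
    # Field names and the ordering key are assumptions made for illustration.

    def generate_recommendations(hotspots):
        recommendations = []
        for hotspot in hotspots:  # hotspot: {"kind": ..., "nodes": [...], "violations": [...]}
            for node in hotspot["nodes"]:
                recommendations.append({
                    "node": node,
                    "kind": hotspot["kind"],
                    "violations": len(hotspot["violations"]),
                })
        return recommendations

    def filter_and_order(recommendations, min_violations=1):
        kept = [r for r in recommendations if r["violations"] >= min_violations]
        # Show the candidates with the most violating dependencies first.
        return sorted(kept, key=lambda r: r["violations"], reverse=True)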
The visualisation for the user contains three additions to Better Code Hub: information about the number of hotspots in the refactoring candidate, a visualisation of the hotspot in the refactoring candidate, and contextual information about a specific hotspot: what causes it, what its consequences are and suggested resolution strategies. We chose to visualise the hotspot as edges and vertices. This allows the user to manipulate the graph by rearranging the nodes. Edges and vertices make it easier to convey more information visually, such as the type of dependency (inheritance, dependency, violating dependency) or the type of source node (parent, child, client); a sketch of such a payload is given after the list below. Thus the user process is as follows:

1) As the user logs into Better Code Hub, a list of repositories is revealed.
2) Once the user enters the repository analysis screen, a list of guidelines is shown with a tick or cross beside each indicating if the code in the repository is compliant.
3) As the user reviews the details of a specific guideline, a list of refactoring guidelines is provided for review.
4) In the hotspot visualisation screen the user can see the graph representing the hotspot visualised as a dynamic force simulation (https://github.com/d3/d3-force) which can then be manipulated.
5) Finally, we present the hotspot description which our prototype provides upon the user pressing the question mark button in the upper left corner of the visualisation.
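To show what such a vertex-and-edge view needs as input, the sketch below serialises a hotspot into a node/edge payload carrying the role and dependency-type labels mentioned above, in a shape a force-directed front end such as d3-force could consume. The class names and the JSON structure are assumptions for illustration, not the prototype's actual interface.

    import json

    # Nodes carry their role in the hierarchy; edges carry the dependency type.
    # Roles and dependency types follow the categories named in the text;
    # the class names and the payload structure itself are illustrative.
    hotspot = {
        "kind": "unhealthy-inheritance",
        "nodes": [
            {"id": "BaseWidget", "role": "parent"},
            {"id": "ButtonWidget", "role": "child"},
            {"id": "Toolbar", "role": "client"},
        ],
        "edges": [
            {"source": "ButtonWidget", "target": "BaseWidget", "type": "inheritance"},
            {"source": "Toolbar", "target": "BaseWidget", "type": "violating dependency"},
            {"source": "Toolbar", "target": "ButtonWidget", "type": "violating dependency"},
        ],
    }

    print(json.dumps(hotspot, indent=2))  # data a force-directed view could render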
Figure 1. An example of the visualisation of an external Unhealthy Hierarchy hotspot.

In Figure 1 we present an example of an external Unhealthy Inheritance hotspot, a violation where a client class of the hierarchy refers to the base class and all of its children. In this case the client class is UriConsoleAnnotator, the base class is AbstractMarkupText and the child classes are MarkupText and SubText. The violations in this case are references from UriConsoleAnnotator to all the classes in the hierarchy.
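For reference, this concrete hotspot can be written down in the edge encoding used by the detector sketch above; assuming the `hierarchy_of`, `internal_unhealthy` and `external_unhealthy` helpers from that sketch are in scope, the external check flags it because the client refers to the base class as well as to its children.

    # The Figure 1 hotspot in the (source, target, kind) encoding of the
    # detector sketch above; the helper functions from that sketch are
    # assumed to be in scope.
    edges = [
        ("MarkupText", "AbstractMarkupText", "inherits"),
        ("SubText", "AbstractMarkupText", "inherits"),
        ("UriConsoleAnnotator", "AbstractMarkupText", "depends"),  # violating reference
        ("UriConsoleAnnotator", "MarkupText", "depends"),          # violating reference
        ("UriConsoleAnnotator", "SubText", "depends"),             # violating reference
    ]

    print(external_unhealthy("AbstractMarkupText", edges))  # True: external hotspot present
    print(internal_unhealthy("AbstractMarkupText", edges))  # False: the base class does not refer to its children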
V. DISCUSSION

For the evaluation we interviewed experienced developers. They had no prior experience with Better Code Hub or with the codebase on which they were assigned to evaluate the prototype. Our aim was to devise recommendations which can be useful to a user who does not yet have intimate knowledge of the system architecture and implementation.

We only had time to evaluate the approach on a few systems. We made an attempt to choose systems representing different domains, architectures and languages; a broader test would be necessary to make sure that the conclusions do not stem from selection bias. We hypothesise that the findings should be applicable to any strongly typed language that supports packages, modules and inheritance.

Even though, before applying the method, we found a strong correlation between hotspot density and the number of interface lines of code in a component, we did not find a causal link between removing hotspots and a decreased value of the component interface code as measured by the Software Analysis Toolkit. However, Mo et al. did show that the presence of hotspots indicates areas of code which are especially prone to introducing bugs; therefore, even if the removal of hotspots is not reflected in the measurement, it would still improve the maintainability of the system [5].

VI. CONCLUSION

To improve the actionability of architecture level code quality violations we created a prototype tool that identifies structural problems in the codebase which impact modularity. We then provided refactoring recommendations based on the identified problems and interviewed developers to gather their feedback on the comprehensibility, actionability and relevance of the presented recommendations. The prototype refactoring tool provides the following contributions:

• Detection of architecture smells in source code graphs.
• Refactoring recommendations to the user based on the presence of hotspots.
• Visualisation of hotspots, emphasising those dependencies negatively impacting modularity.
• Guidance for the users regarding the impact and structure of hotspots occurring in the analysed codebase.

As part of our analysis of repositories, we performed:

• A study of the reliability of hotspot detection on statically typed languages (Java, C# and C++).
• An analysis of the overall impact of code containing hotspots on the system's modularity.

A number of areas of future work have been identified:

a) More structural problems: We limited our detector to one kind of hotspot. We also chose to use our own detector as opposed to Titan, with a different source code analyzer, which means that there may be a mismatch between the results [20].

b) More detailed information about the violation: We only outline violating classes and dependencies. Using the same data, the feedback can be improved by providing the exact calls along with the code snippets that trigger the violation.

c) Co-evolutionary coupling reasons: Co-evolutionary coupling is a term used to refer to classes which change together over time. It is much more difficult to address co-evolutionary hotspots. Firstly, co-evolutionary relationship data contains more noise; for example, a copyright header update will create a coupling between all the files in the project. Also, co-evolutionary relationships stay in the history of the project. Secondly, it is more challenging to reason about the intention of the developer when files change together without a structural coupling. Nevertheless, it would be interesting to identify whether there are common reasons for co-evolutionary hotspot pattern occurrences.

d) Hotspot prioritization: We did not explicitly prioritise hotspots. However, it could be useful, as the budget to address technical debt (e.g. architecture smells) is usually limited and decisions need to be made as to which issues should be addressed. A prioritisation could be used to suggest fixing first those hotspots which exhibit a balance between the effort needed to fix them and the impact on maintainability.

Based on a preliminary evaluation we conducted through interviews with a panel of experts and the analysis of open source repositories, we can say that users see the complementary information as a promising starting point for further investigations, but additional work will be needed to make the recommendations actionable.
REFERENCES

[1] Robert L. Glass. Facts and fallacies of software engineering. Addison-Wesley Professional, 2002.
[2] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A practical model for measuring maintainability. In Quality of Information and Communications Technology (QUATIC 2007), 6th International Conference on, pages 30–39. IEEE, 2007.
[3] Joost Visser, Sylvan Rigal, Rob van der Leek, Pascal van Eck, and Gijs Wijnholds. Building Maintainable Software, Java Edition: Ten Guidelines for Future-Proof Code. O'Reilly Media, Inc., 2016.
[4] Teodor Kurtev. Extending actionability in Better Code Hub, suggesting move module refactorings. Master's thesis, University of Amsterdam, July 2017.
[5] Ran Mo, Yuanfang Cai, Rick Kazman, and Lu Xiao. Hotspot patterns: The formal definition and automatic detection of architecture smells. In Software Architecture (WICSA), 2015 12th Working IEEE/IFIP Conference on, pages 51–60. IEEE, 2015.
[6] Flemming Nielson, Hanne R. Nielson, and Chris Hankin. Principles of program analysis. Springer, 2015.
[7] Daniel Guaman, P. A. Sarmiento, L. Barba-Guamán, P. Cabrera, and L. Enciso. SonarQube as a tool to identify software metrics and technical debt in the source code through static analysis. In 7th International Workshop on Computer Science and Engineering, WCSE, pages 171–175, 2017.
[8] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In ACM SIGPLAN Notices, volume 42, pages 89–100. ACM, 2007.
[9] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. AddressSanitizer: A fast address sanity checker. In USENIX Annual Technical Conference, pages 309–318, 2012.
[10] Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. The Daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1-3):35–45, 2007.
[11] T. Kuipers, I. Heitlager, and J. Visser. A practical model for measuring maintainability. In 6th International Conference on the Quality of Information and Communications Technology (QUATIC 2007), volume 00, pages 30–39, 09 2007.
[12] Eric Bouwers, Arie van Deursen, and Joost Visser. Quantifying the encapsulation of implemented software architectures. In Software Maintenance and Evolution (ICSME), 2014 IEEE International Conference on, pages 211–220. IEEE, 2014.
[13] Martin Fowler and Kent Beck. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.
[14] Tushar Sharma. Does your architecture smell?, 2017. Last accessed: 2018-06-03.
[15] Tushar Sharma. Designite: A customizable tool for smell mining in C# repositories. In 10th Seminar on Advanced Techniques and Tools for Software Evolution, Madrid, Spain, 2017.
[16] Francesca Arcelli Fontana, Ilaria Pigazzini, Riccardo Roveda, and Marco Zanoni. Automatic detection of instability architectural smells. In Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on, pages 433–437. IEEE, 2016.
[17] Isela Macia, Roberta Arcoverde, Elder Cirilo, Alessandro Garcia, and Arndt von Staa. Supporting the identification of architecturally-relevant code anomalies. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on, pages 662–665. IEEE, 2012.
[18] Robert C. Martin. Clean architecture: a craftsman's guide to software structure and design. Prentice Hall Press, 2017.
[19] Joshua Garcia, Daniel Popescu, George Edwards, and Nenad Medvidovic. Toward a catalogue of architectural bad smells. In International Conference on the Quality of Software Architectures, pages 146–162. Springer, 2009.
[20] Lu Xiao, Yuanfang Cai, and Rick Kazman. Titan: A toolset that connects software architecture with quality analysis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 763–766. ACM, 2014.
[21] Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 1932.