A Comparison of Machine Learning-Based Text Classifiers for Mapping Source Code to Architectural Modules

Alexander Florean¹, Laoa Jalal¹, Zipani Tom Sinkala¹ and Sebastian Herold¹
¹ Department of Mathematics and Computer Science, Karlstad University, Sweden

Abstract
A mapping between a system's implementation and its software architecture is mandatory in many architecture consistency checking techniques. Creating such a mapping manually is a non-trivial task for most complex software systems. Machine learning-based text classification may be a highly effective tool for automating this task. How to make use of this tool most effectively has not been thoroughly investigated yet. This article presents a comparative analysis of three classifiers applied to map the implementations of five open-source systems to their architectures. The performance of the classifiers is evaluated for different extraction and preprocessing settings as well as different training set sizes. The results suggest that Logistic Regression and Support Vector Machines both outperform Naive Bayes unless information about coarse-grained implementation structures cannot be exploited. Moreover, initial manual mappings of more than 15% of all source code files, or 10 files per module, do not seem to lead to a significantly better classification.

Keywords
software architecture consistency, code-to-architecture mapping, text classification, machine learning

ECSA 2021 Companion Volume
florean.alexander@gmail.com (A. Florean); laoa99@outlook.com (L. Jalal); zipani.sinkala@kau.se (Z. T. Sinkala); sebastian.herold@kau.se (S. Herold)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Motivation

Software architecture degradation is the phenomenon of the implementation of a software system diverging from the intended software architecture [1]. The potential consequences of this divergence include a decay of maintainability as well as the decreased ability of the system to meet other desired quality properties. Expensive system re-engineering or discontinuations of software products can be the consequences [2, 3, 4, 5].

One approach to combat software architecture degradation is software architecture consistency checking [6]. The core idea of these techniques is to implement frequent checks for inconsistencies between the intended software architecture and the current implementation of a system to detect degradation early. The individual techniques differ in the variety of consistency constraints, or types of divergence, that they can detect. They range from dependency-focused and source code analysis-based techniques [7] to logical query-based techniques for checking architecturally induced constraints far beyond dependencies [8, 9, 10, 11].

Most approaches have in common that some kind of mapping between architectural units, e.g. modules, and implementation units, such as source code files, is required. Reflexion modelling, for example, exploits this mapping to detect source code dependencies that are not covered by dependencies in an architectural model and which are hence discouraged [7].

In some cases, architectural documentation describing the relationship between architecture and implementation can help create this mapping. More often than not though, architectural documentation is missing or outdated, such that the architecture and the mapping towards code need to be recovered from a system's implementation [12]. Performed manually, this constitutes a challenging and labour-intensive task even for system experts. As expressed by professional software architects and designers in a study by Ali et al., creating the mapping is one of the major obstacles to adopting architectural consistency techniques in industrial practice [13].

Researchers have thus put some attention into developing techniques that support software engineers in this task by creating mappings partially automatically or by recommending mappings [14, 15, 16, 17, 18, 19]. Most recently, text classification based on machine learning has been applied to automatically categorize units of source code according to the architectural concern or module they implement and should be mapped to [17, 20]. These approaches show promising results.
The question arises though whether the full potential of machine learning for text classification in this context has already been tapped. Several text classification algorithms that perform well in different contexts have not yet been investigated. The question of how to optimally extract and preprocess source code for classification has not been exhaustively explored either.

The goal of this article is to shed some light on the performance, i.e. the predictive capability, and other properties of several machine learning-based text classifiers for the described task. We present a comparative analysis of three classifiers that were applied to map the code of five different systems to their specified architectures.

The contribution of the paper is a set of findings that may guide further research and use of these classifiers for the task of interest. These guidelines, on the one hand, give advice for the selection of an appropriate classifier based on assumptions regarding the alignment between architecture and modular implementation elements like packages. On the other hand, they provide rules of thumb for the recommended size of an initial, manual mapping required to train the classifiers.

The remaining article is organized as follows. The following section describes relevant technical background as well as related work. Section 3 explains the experimental setup of the comparative analysis. In Sec. 4, we summarize the results, which are discussed in Sec. 5. The article is concluded in Sec. 6.

2. Background

2.1. Architecture Consistency Checking

Techniques for checking the consistency between the software architecture of a system and its implementation come in various forms. Several authors provide exhaustive overviews of available approaches and tools [6, 9]. The approaches differ in the way architectures are represented, in the formalism on which the actual checking mechanism relies and, hence, in the type of architectural constraints that can be expressed and checked.

Many techniques have in common that they require an association between elements of the architecture and elements of the implementation for many typical consistency constraints. The most fundamental consistency constraints are related to dependencies. The intended architecture of a software system often defines allowed / prohibited, or expected / discouraged, dependencies between architectural modules. In order to check whether the dependencies present in source code conform with those, i.e. to compare code dependencies with architectural dependencies, the architectural modules to which sources and targets of code dependencies are mapped need to be known.

Fig. 1 shows a cut-out of a so-called reflexion model of one of the systems (Lucene) used in the experiments presented in this paper. It depicts two architecture modules as boxes. Dashed lines (as opposed to solid lines) indicate that dependencies between these modules are architecturally discouraged in either direction; they are, however, present in source code: for example, there is a call, located in file InstantiatedIndex.java that is being mapped to store, to a method called utf8ToString(), located in file BytesRef.java, which is being mapped to util. Through simple code analysis and tracing the mapping, this can be identified as an architecturally discouraged dependency between the modules store and util.

Figure 1: Exemplary cutout of a reflexion model and a source code dependency contributing to an undesired architectural dependency.

Consequently, creating and maintaining such a mapping is crucial for applying architectural consistency checking effectively. As pointed out by Ali et al., the effort needed for manual mapping constitutes a serious concern in industrial practice [13]. Several approaches exist to support software engineers in this time-consuming task. The most relevant ones will be discussed in Sec. 2.3.
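The check just described can be sketched in a few lines. This is a hypothetical illustration: the file names and module names come from the Lucene example above, but the dependency list and the set of allowed module dependencies are made-up stand-ins for what a real dependency extractor and architecture model would provide.

```python
# Hypothetical sketch of a dependency-based consistency check.
# File-to-module mapping, as discussed in the text:
mapping = {
    "InstantiatedIndex.java": "store",
    "BytesRef.java": "util",
}
# Code-level dependencies extracted by some static analysis: (source, target).
code_deps = [("InstantiatedIndex.java", "BytesRef.java")]
# Architecturally allowed module dependencies; every other pair is discouraged.
allowed = {("store", "store"), ("util", "util")}

# Trace each code dependency through the mapping and flag discouraged ones.
violations = [
    (src, tgt)
    for src, tgt in code_deps
    if (mapping[src], mapping[tgt]) not in allowed
]
print(violations)  # [('InstantiatedIndex.java', 'BytesRef.java')]
```

The call from InstantiatedIndex.java to BytesRef.java maps to the discouraged module pair (store, util) and is therefore reported.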
2.2. Text Classification with Machine Learning

Text classification is one of the fundamental activities in Natural Language Processing [21]. The goal of text classification is to assign a text written in natural language to one or more predefined categories. Applications of this activity include, for example, sentiment analysis or spam detection.

The central idea of applying machine learning to the task of text classification is to train a classification model based upon text samples for which the assigned categories are already known. Fig. 2 depicts the typical steps in training a text classification model and using it to predict the label/category of new text, i.e. to classify it. For learning, a set of text documents for which the labels, i.e. categories, are known is required. These documents are often preprocessed, e.g., to remove stop words or to stem words. A feature extractor transforms the preprocessed documents into a numerical vector. Finally, the classification model is trained according to the machine learning algorithm that is applied. It can then be used to predict the label of a new text that was preprocessed and brought into its numerical representation.

Figure 2: Schematic process of training and using a text classification model through machine learning.

The overall, general procedure can be transferred to the specific context of mapping code to architecture quite easily. The documents to be classified are the aforementioned source code entities, like source code files. Architectural modules are represented by labels—for yet unlabelled source code entities, a classification model should propose the correct module. For training, we require a sufficiently large set of source code entities for which their labels—the modules they are mapped to—are known.
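Transferred to code-to-architecture mapping, the procedure of Fig. 2 can be sketched as follows. This is a minimal sketch with made-up token documents, not the setup used in the paper; scikit-learn's CountVectorizer (a bag-of-words feature extractor) and MultinomialNB stand in for the pipeline components.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy "documents": preprocessed token streams of source files (hypothetical).
train_docs = [
    "org apache store index instantiated",
    "org apache store directory file",
    "org apache util bytes ref string",
    "org apache util array sort",
]
train_labels = ["store", "store", "util", "util"]  # modules serve as class labels

vectorizer = CountVectorizer()               # bag-of-words feature extraction
X_train = vectorizer.fit_transform(train_docs)
model = MultinomialNB().fit(X_train, train_labels)

# Predict the module of a yet unmapped file from its extracted text.
X_new = vectorizer.transform(["org apache util string helper"])
print(model.predict(X_new))  # ['util']
```

The unseen document shares the discriminative tokens "util" and "string" with the util training documents, so the classifier proposes that module.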
2.3. Related Work

The studies by Christl et al. were among the first to investigate techniques for automating the mapping step needed in architecture consistency checking [14, 15]. They developed a technique, called HuGMe, for interactive, human-guided mapping and compared two different attraction functions, CountAttract and MQAttract, measuring how well a code entity will map to an architectural module based on structural properties.

Bittencourt et al. presented a technique based on information retrieval, thus addressing the mapping problem from a textual analysis angle instead [16]. They developed an attraction function based on Latent Semantic Indexing and evaluated it separately as well as in a hybrid approach with both CountAttract and MQAttract. The best results were achieved by integrating their novel attraction function and CountAttract.

Both approaches require a set of manually mapped source code entities as a foothold for the applied techniques. Sinkala and Herold instead exploit textual descriptions of the modules of intended architectures to provide their information retrieval-based technique, called InMap, with initial information for recommending mappings [18, 19].

Olsson et al. developed and analysed a technique based on machine learning [17]. Taking an initial, manually created mapping of a portion of the source code, a Naive Bayes classifier is trained and then used to predict the mapping for the remaining source code entities. The information used for classification is extracted from the compiled source code and consists of package names, file/class names, and attribute and variable identifiers. Compound words, as indicated through camel-casing, are split, and the resulting texts are stemmed. The resulting documents are complemented by terms reflecting dependencies. This way, structural information can be considered in the classifier without the need to integrate a separate dependency analysis approach. The authors show that this approach outperforms HuGMe significantly; if module descriptions are available, though, InMap performs slightly better [19].
The focus of the work by Link et al. is slightly different yet related [20]. In their approach, called RELAX, code entities are not mapped to architectural modules but to concerns which are potentially reusable across systems. Any document of a system considered being part of an architectural concern can be fed into the training process of a Naive Bayes classifier to categorize new documents according to their textual content. The approach is compared with two other clustering approaches for architecture recovery, as this is the main scenario that the authors target. For five out of eight case study systems, RELAX is shown to perform best in comparison. The study includes neither details of the preprocessing of documents nor a replication package, such that technical details of how information is extracted from source code remain unclear.

3. Experimental Design

3.1. Research Questions

The overarching motivating question for this study is how well different machine learning-based classification models perform in mapping code entities to architectural modules. As Sec. 2.3 shows, the focus of related approaches so far has been a single classification algorithm and less a comparison of classifiers or an investigation of their performance properties when applied in the context of interest.

The envisaged scenario for the usage of machine learning techniques in this context is that a classifier is first trained with an initial set of manually created mappings based on the textual content of source code files. After that, the trained classification model is used to predict the mappings for the remaining source code files.

We therefore break the main motivating question down into two research questions:

• RQ1: How does the selection of source code elements during preprocessing affect the performance of these classifiers?
• RQ2: How is the performance of different classifiers affected by the size of the training set, i.e. the number of code entities that need to be mapped manually initially?

For each of the questions, we define a separate experiment based on the same set of systems and classifiers.

3.2. Subject Systems and Classifiers

For training and evaluating text classifiers for the task at hand, a set of systems is required for each of which a) the source code is accessible and b) the mapping between an intended architecture and the source code is known. We explored two data sources for identifying systems that fulfil these prerequisites: the SAEroCon repository¹ and the repository of the s4rdm3x tool [22]. Five open source software systems from these two repositories, as listed in Table 1, were selected for this study. They are all written in Java.

Table 1
Descriptive statistics of the subject systems.

System      #files   lines of   lines of     #modules   #files/
                     code       comments                module (sd)
Ant         713      86,685     76,987       15         47.5 (65.1)
JabRef      845      88,562     17,187       6          140.7 (161.2)
Lucene      508      60,345     33,342       7          83.0 (64.7)
ProM        867      69,492     22,763       15         123.3 (55.4)
TeamMates   812      102,072    12,514       6          135.3 (119.2)

Three commonly used machine learning-based classifiers were selected for the study. Naive Bayes for text classification was selected as the most relevant related work is built on it (see Sec. 2.3) and because of its good performance with even little training data [23]. Support Vector Machines (SVM) were selected for the same reason and for their good accuracy, outperforming Naive Bayes in comparative studies [24]. Logistic Regression as the final classifier has been shown to perform at similar performance levels as SVM and was hence selected for comparison, too [25].

¹ https://github.com/sebastianherold/SAEroConRepo/wiki
3.3. Experiment 1: Comparing Extraction and Preprocessing Variants

The goal of the first experiment is to address RQ1. In a first step, we developed a list of elements in (Java) source code that we believed to potentially carry architecturally relevant information w.r.t. the required mapping. We judged the following elements to be potentially relevant:

• Package declarations: indicate containment relationships that might match coarse-grained architectural structures.
• Import declarations: elements of the same architectural module often share the same dependencies.
• Class declarations: types defined in the same module might share the same (part of the) domain vocabulary as expressed in their names.
• Public methods: same rationale as for class declarations.
• Comments: may refer to architectural aspects and decisions, parts of the domain vocabulary, etc., beyond what is being expressed in code.

We furthermore identified seven different preprocessing steps that could be activated or deactivated for each of the above elements in a source code file:

1. Splitting of compound words, e.g. in camel-case notation: getCustomerId becomes get Customer Id.
2. Stemming: reducing inflected words to their stem, e.g. notification or notify become notif.
3. Transformation to lower case.
4. Removal of single characters.
5. Removal of stop words, such as the, and, or of.
6. Removal of Java keywords, such as class or public.
7. Tokenization: chopping the stream of characters that the document consists of into actual tokens based on separators such as spaces, colons, etc.
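A rough sketch of how steps 1 and 3 to 7 could be implemented is given below (step 2, stemming, is omitted, as it typically relies on a stemming library such as a Porter stemmer). The stop word and keyword sets are small illustrative excerpts, not the lists used in the study.

```python
import re

STOP_WORDS = {"the", "and", "of"}             # step 5 (tiny illustrative set)
JAVA_KEYWORDS = {"class", "public", "void"}   # step 6 (excerpt only)

def preprocess(code_text):
    # Step 7: tokenize on any non-letter separator (spaces, colons, braces, ...).
    tokens = re.split(r"[^A-Za-z]+", code_text)
    # Step 1: split compound words written in camel case.
    tokens = [t for tok in tokens
              for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", tok)]
    # Step 3: transform to lower case.
    tokens = [t.lower() for t in tokens]
    # Step 4: remove single characters.
    tokens = [t for t in tokens if len(t) > 1]
    # Steps 5 and 6: remove stop words and Java keywords.
    return [t for t in tokens
            if t not in STOP_WORDS and t not in JAVA_KEYWORDS]

print(preprocess("public void getCustomerId() { return the id; }"))
# ['get', 'customer', 'id', 'return', 'id']
```

Each step is an independent filter, which matches the experimental design of switching individual steps on and off per extracted code element.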
A complete investigation of all combinations of preprocessing steps in a fixed order would lead to 2⁷ options per extracted source code element. For extracting all of the above elements alone, this would lead to 2³⁵ combinations, which we considered infeasible. Instead, we experimented with several settings in an exploratory pre-study, from which we concluded to activate preprocessing steps 3 to 7 by default for all code elements, as deactivating them led to decreased performance in the explored alternatives.

In the same pre-study, two different feature representation techniques were compared: bag-of-words and tf-idf [26, 27]. We noted that bag-of-words outperformed tf-idf on average and hence chose the former for the experiments.

For each combination of subject system, classifier, and combination of code extraction and active preprocessing steps, we trained and validated ten models following a Monte Carlo cross-validation scheme [28]. The training set ratio was kept constant at 0.2, and stratified sampling was applied. The latter ensures that the proportion of the classes (i.e., modules) in the overall dataset is kept in both training and testing sets during cross-validation. This ensures that both sets are representative of the overall dataset.

The performance of the models was evaluated in terms of accuracy, i.e. the relative frequency of correct classifications, and averaged over all subject systems.
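The validation scheme can be illustrated as follows; the dataset is hypothetical, and scikit-learn's StratifiedShuffleSplit is one possible way (an assumption, not necessarily the study's actual implementation) to realise Monte Carlo cross-validation with stratified sampling at a training ratio of 0.2.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical dataset: 100 files with imbalanced module labels.
y = np.array(["core"] * 80 + ["util"] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# Ten random train/test splits with a fixed training ratio of 0.2,
# stratified by module label (Monte Carlo cross-validation).
splitter = StratifiedShuffleSplit(
    n_splits=10, train_size=0.2, test_size=0.8, random_state=0
)
for train_idx, _ in splitter.split(X, y):
    # Each training set preserves the 80/20 class proportions: 16 + 4 files.
    counts = dict(zip(*np.unique(y[train_idx], return_counts=True)))
    assert counts == {"core": 16, "util": 4}
```

Unlike k-fold cross-validation, the splits are drawn independently at random, so the same file may appear in several training sets across repetitions.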
3.4. Experiment 2: Measuring the Effect of Training Set Sizes

The goal of the second experiment is to address RQ2. Based on the results of the first experiment, one of the best performing combinations of extraction and preprocessing settings was selected for each classification algorithm. The code files for each system were extracted and preprocessed accordingly and represented as bags-of-words.

We then trained models for each of the three classification algorithms at different training set sizes expressed as fractions of the overall datasets, i.e. the relative number of available mappings between source code files and architectural modules. Per combination of system, training set size of interest, and classifier, we trained and evaluated according to a Monte Carlo cross-validation with 100 splits and stratified sampling.

In order to evaluate the resulting models, we computed several precision and recall averages per system, training set size, and classifier. The average precision/recall is defined as the sum of the precision/recall scores per class (module) divided by the number of classes. The weighted average precision/recall takes the proportions of classes into account and weights the individual precision/recall scores accordingly. The weighted average recall is equal to the accuracy of a classification model².

Practically speaking, this experiment corresponds roughly to a situation in which a software architect/designer can estimate the number of code entities that should be mapped to each of the architectural modules. The experiment could offer advice regarding the relative number of entities she should map per module in order to get a sufficiently accurate automated mapping for the rest of the system.

This scenario, however, is not always realistic, as module sizes may be unknown or estimations may be wrong. For that reason, we repeated the experiment described above with different absolute training set sizes, expressed as the absolute number of files per module that should enter the training set. Obviously, this way of sampling is not stratified; the number of splits and the metrics for evaluation remain the same as for the comparison based upon relative training set sizes.

² The recall of each individual class c_i is weighted by |c_i|/n, with n being the total number of data points. Since |c_i| = TP_{c_i} + FN_{c_i}, each term of the weighted average recall turns into TP_{c_i}/n, which, summed up over all classes, is equivalent to the definition of accuracy.

3.5. Replication Package

The replication package, including the scripts for preprocessing the data and for training and evaluating the classifiers, is available at https://github.com/sebastianherold/ml-for-architecture-mapping.
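The identity stated in Sec. 3.4, that the weighted average recall equals accuracy, can be checked numerically. The labels and predictions below are made up purely for illustration.

```python
# Hypothetical ground-truth modules and predictions for ten files.
truth = ["store"] * 5 + ["util"] * 3 + ["index"] * 2
pred  = ["store", "store", "store", "util", "store",
         "util", "util", "index", "index", "index"]

n = len(truth)
weighted_recall = 0.0
for m in sorted(set(truth)):
    size = truth.count(m)                                # |c_i|
    tp = sum(t == p == m for t, p in zip(truth, pred))   # TP_{c_i}
    weighted_recall += (size / n) * (tp / size)          # weight * recall_i

accuracy = sum(t == p for t, p in zip(truth, pred)) / n
assert abs(weighted_recall - accuracy) < 1e-9  # identical up to rounding
print(round(weighted_recall, 3), round(accuracy, 3))
```

Each class contributes (|c_i|/n) * (TP_{c_i}/|c_i|) = TP_{c_i}/n, so the class sizes cancel and the sum is exactly the fraction of correct classifications.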
4. Results

As described in Sec. 3.3, we explored the accuracy of all classification algorithms for different data extraction and preprocessing settings in the first experiment. Fig. 3 summarises the findings per combination of extracted source code elements. All three classification algorithms scored best when the data extracted from the code files was limited to package declarations and class declarations. Logistic Regression and SVM achieved accuracies of 0.93 each, outperforming Naive Bayes by 0.07. SVM's and Logistic Regression's accuracies drop significantly, to at most 0.68 and 0.73, respectively, when package declarations are not included in the data. Naive Bayes drops to 0.75 when extracting everything else but package declarations, performing more accurately than SVM and Logistic Regression in this scenario.

Figure 3: Accuracy for extraction of different parts of source code. Values are averages over all investigated combinations of preprocessing steps for the extracted parts.

The role of comments also changed with the inclusion of package declarations. With package declarations included, adding comments, while keeping the inclusion/exclusion of the other code elements unchanged, seems to rather decrease the accuracy of the classifiers. Without package declarations, including comments led to accuracy improvements of up to 0.07.

It should be noted that the mapping onto modules aligned quite well with the package structure in all five systems, which might explain the impact of including the package declarations. We excluded the results for only extracting package declarations, as we believe that the very good scores (beyond 0.98) of those models were overfitting and heavily biased towards the selected systems.

Figure 4: Standard deviation of accuracy averages for each combination of extracted source code parts as depicted in Fig. 3.

Fig. 4 illustrates the standard deviation for each extraction setting and classifier. The standard deviation is below 0.02 in 80% of the cases, exceeding 0.05 slightly in only one case. These results show that the variable preprocessing settings, stemming of words and splitting of compound words, affect accuracy only slightly.

The results related to classification performance over training set size, as a relative fraction of the overall number of source code files, are visualised in Fig. 5. They confirm that SVM and Logistic Regression perform better than Naive Bayes in accuracy, precision, and recall in almost all settings³. The improvement in accuracy decreased for all systems and classifiers beyond a relative training set size of 0.15, in particular for JabRef, ProM, and TeamMates. The curves show similar behaviour for the weighted average precision. Unweighted averages keep a steeper slope in comparison, even beyond training set sizes of 0.15, which shows that the performance for smaller modules benefits from increasing the training set size.

Figure 5: Performance metrics of classifiers over relative training set size.

In Fig. 6, the results of evaluating classification performance for different numbers of files per module in the training set are shown. Most curves across all metrics show a sharp increase in performance that slows down at 10 files per module. This is less pronounced, sometimes hardly visible, for Naive Bayes as compared to SVM and Logistic Regression. The results for the weighted averages seem more similar to their unweighted counterparts in this experiment. SVM and Logistic Regression outperform Naive Bayes in almost all settings and systems in this experiment, too.

Figure 6: Performance metrics of classifiers over absolute training set size.

³ These experiments were performed extracting package declarations and class declarations.
5. Discussion

5.1. Findings regarding RQ1

In the following, we discuss and summarize the findings related to the question of how different ways of data extraction and preprocessing affect the performance of the tested classifiers.

The experiment results show that package declarations constitute a significant piece of information for the tested classifiers. A decrease of 0.3 in accuracy for settings in which the only difference is to not consider package declarations is common across the results.

This seems quite natural, as the mappings for the systems used for training largely aggregate source code elements along several subtrees of the package hierarchy instead of individual classes from unrelated packages. Only in the mapping of JabRef exist cases of packages whose contained classes/interfaces, i.e. corresponding files, are mapped to different modules such that these mappings do not align with the subpackage/subdirectory structure. It is hence not surprising that settings including package declarations and only few other pieces of information score best. In our experiments, class declarations seem to complement package declarations best. Since Naive Bayes does not perform as well as the other classification algorithms, we formulate our first finding as:

Finding 1. In settings in which the architectural module structure can be assumed to align well with macro-structures declared in the system's implementation, these declarations and type information should be extracted. SVM and Logistic Regression provide more accurate results than Naive Bayes.

Note that a straight-forward alignment does not necessarily imply that a mapping can easily be constructed manually without the need for automation in the first place. In large-scale systems, structures of hundreds of packages are not uncommon. If architectural modules are well-aligned but mapped to more than one package in such systems, identifying the relevant packages for a module can still be tedious.

An interesting question in the light of the first finding is whether the approaches by Olsson et al. and Link et al. could benefit from using a different classifier than Naive Bayes [17, 20]. While the alignment with source code structures is largely unclear for Link et al., Olsson et al. applied their approach to the same, well-aligned systems used in this study. This finding also suggests that their approach could be further tuned to only use package and type information as compared to also including variable identifiers. In use cases in which the slightly slower training of SVM and Logistic Regression is an issue, Naive Bayes might be the better alternative.

It is common that mappings are not that well-aligned and straight-forward [29]. Furthermore, some programming languages do not declare any containment relationships equivalent to package declarations. The tested systems do not represent this scenario properly. We therefore looked at the performance of the classifiers without considering package declarations as an approximation of their behaviour if we did not have that information or considered it useless. In this setting, Naive Bayes exploiting import declarations, class declarations, and comments showed the best accuracy (on a par with additionally including declarations of public methods).

Finding 2. If alignment with any macro-structures declared in or derived from source code cannot or should not be assumed, Naive Bayes trained on declarations of types, imports, and comments should be used.

The standard deviation within groups of identical extractions regarding different preprocessing settings is very low. This indicates that stemming and splitting of compound words do not have a significant impact on the resulting accuracy of any of the tested classification algorithms.

Finding 3. The selection of parts to be extracted for classifier training and mapping prediction appears to be more important than the selection of the preprocessing steps considered optional in this study.

Further investigation may be necessary to explore the potentially larger impact of other preprocessing steps in the individual scenarios described above.

5.2. Findings regarding RQ2

In this subsection, we summarize the findings related to the question of how the training set size, corresponding to the number of initially, manually mapped files, affects classifier performance.

The results suggest that in many cases the additional gain in accuracy, precision, and recall slackens at around 15% of the overall dataset (equal to the total number of source code files) and above. Enhancing the initial mapping beyond this point may therefore turn out infeasible. Even in relatively small sample systems of this study like JabRef, increasing this mapping by 5% of the overall number of code files means mapping more than 40 additional files. This may possibly not pay off, in particular for larger systems, if the gain in classification performance is minimal. We therefore state:

Finding 4. If the number of files supposed to be mapped to each module can be estimated, mapping around 15% of that number in the initial mapping may be a good rule-of-thumb for training an efficient classifier.
The results of experimenting with absolute training set sizes complement these results for scenarios in which it is not possible or desirable to estimate the number of files mapped to each module. For the tested systems, the gain in accuracy, precision, and recall flattens out at 10 files per module in the initial mapping, which leads us to our final finding:

Finding 5. An initial mapping of at least 10 files per module may lead to a satisfactorily performing classifier.

Again, these findings can only properly be compared to Olsson et al., as Link et al. do not report details about training set sizes [17, 20]. The results reported by Olsson et al. suggest that an initial mapping of ca. 20% leads to satisfying performance. Those results, however, are measured as averages over even imbalanced initial mappings that do not take the proportions of modules into account. We therefore think that our findings, which recommend a slightly smaller relative size for the stratified initial mapping, are in line with those results.

5.3. Validity

Several factors limit the external validity of this study. Firstly, although the subject systems are anything but trivial, they certainly do not represent large-scale software systems. Further research, in particular to confirm or refine the findings regarding training set sizes, is required. Moreover, the results may not accurately reflect the behaviour of the classifiers if package or equivalent declarations are considered but the architecture does not align with them. A further threat to external validity is the scoping to systems written in Java. This is due to the limited availability of systems for which the architecture as well as the source code is available. The systems identified appeared all to be Java-based. Moreover, we did not tune the hyperparameters but only touched upon this non-exhaustively in the before-mentioned exploratory pre-study. This might be considered a limitation as well as a threat to external validity, as the results might differ for classification models with different hyperparameters.

The experiments aim at identifying causal relationships between independent variables (extraction/preprocessing settings and training set sizes, respectively) and dependent variables (performance measures). We are quite confident that the internal validity is high, as all other identified parameters were kept constant throughout the experiments. A potential threat are, of course, bugs in the scripts and software used to extract and preprocess data as well as to train and evaluate the classifiers. We consider this risk to be low, though, as established software libraries were used for this purpose and any self-written code (largely produced by the first and second author) was carefully reviewed by the third and fourth author.

We consider the selected method for cross-validation the main threat to construct validity. Inappropriate train/test splits may lead to biased classification models that might not reflect a classifier's performance properly. We believe, though, that the chosen number of repetitions for the cross-validation in the experiments was sufficient to mitigate this risk.
5.3. Validity

Several factors limit the external validity of this study. Firstly, although the subject systems are anything but trivial, they certainly do not represent large-scale software systems. Further research, in particular to confirm or refine the findings regarding training set sizes, is required. Moreover, the results may not accurately reflect the behaviour of classifiers if package or equivalent declarations are considered but the architecture does not align with them. A further threat to external validity is the scoping to systems written in Java. This is due to the limited availability of systems for which the architecture as well as the source code is available; the systems identified all appeared to be Java-based. Moreover, we did not tune the hyperparameters but only touched upon this non-exhaustively in the aforementioned exploratory pre-study. This might be considered a limitation as well as a threat to external validity, as the results might differ for classification models with different hyperparameters.

The experiments aim at identifying causal relationships between independent variables (extraction/preprocessing settings and training set sizes, respectively) and dependent variables (performance measures). We are confident that the internal validity is high, as all other identified parameters were kept constant throughout the experiments. A potential threat are, of course, bugs in the scripts and software used to extract and preprocess data as well as to train and evaluate the classifiers. We consider this risk to be low, though, as established software libraries were used for this purpose and any self-written code (largely produced by the first and second author) was carefully reviewed by the third and fourth author.

We consider the selected method for cross-validation the main threat to construct validity. Inappropriate train/test splits may lead to biased classification models that might not reflect a classifier's performance properly. We believe, though, that the chosen number of repetitions for the cross-validation in the experiments was sufficient to mitigate this risk.

6. Conclusion

The results of the presented study indicate that there is no silver bullet classifier. The choice of an optimal classifier and of the elements to be extracted from source code is influenced by system characteristics like the alignment of macro-structural elements with the assumed architecture. To identify more such distinguishing characteristics or scenarios seems to be an interesting objective of future research. It will be particularly relevant to investigate whether the recommendations regarding the size of initial mappings hold in practice and whether they also apply to larger systems.

Last but not least, more classifiers wait to be tested for their ability to automate code-to-architecture mapping. For these, as well as for those tested in this study, different preprocessing techniques should be investigated more deeply, and the improvement that hyperparameter tuning might achieve should be explored. Such a more exhaustive comparative study might also need to take the performance and the resource demands of the training process into account.

References

[1] D. E. Perry, A. L. Wolf, Foundations for the study of software architecture, SIGSOFT Softw. Eng. Notes 17 (1992) 40–52. doi:10.1145/141874.141884.
[2] M. W. Godfrey, E. H. S. Lee, Secrets from the monster: Extracting mozilla's software architecture, in: Proc. of 2000 Intl. Symposium on Constructing Software Engineering Tools (CoSET 2000), 2000, pp. 15–23.
[3] C. Deiters, P. Dohrmann, S. Herold, A. Rausch, Rule-based architectural compliance checks for enterprise architecture management, in: 2009 IEEE International Enterprise Distributed Object Computing Conference, 2009, pp. 183–192. doi:10.1109/EDOC.2009.15.
[4] J. van Gurp, J. Bosch, Design erosion: problems and causes, Journal of Systems and Software 61 (2002) 105–119. doi:10.1016/S0164-1212(01)00152-2.
[5] S. Sarkar, S. Ramachandran, G. S. Kumar, M. K. Iyengar, K. Rangarajan, S. Sivagnanam, Modularization of a large-scale business application: A case study, IEEE Software 26 (2009) 28–35. doi:10.1109/MS.2009.42.
[6] L. Passos, R. Terra, M. T. Valente, R. Diniz, N. Mendonça, Static architecture-conformance checking: An illustrative overview, IEEE Software 27 (2010) 82–89. doi:10.1109/MS.2009.117.
[7] G. Murphy, D. Notkin, K. Sullivan, Software reflexion models: bridging the gap between design and implementation, IEEE Transactions on Software Engineering 27 (2001) 364–380. doi:10.1109/32.917525.
[8] O. de Moor, D. Sereni, M. Verbaere, E. Hajiyev, P. Avgustinov, T. Ekman, N. Ongkingco, J. Tibble, .QL: Object-Oriented Queries Made Easy, Springer Berlin Heidelberg, 2008, pp. 78–133. doi:10.1007/978-3-540-88643-3_3.
[9] S. Herold, Architectural compliance in component-based systems, Ph.D. thesis, Clausthal University of Technology, 2011.
[10] S. Herold, A. Rausch, Complementing model-driven development for the detection of software architecture erosion, in: Proceedings of the 5th International Workshop on Modeling in Software Engineering, MiSE '13, IEEE Press, 2013, pp. 24–30.
[11] S. Schröder, G. Buchgeher, Formalizing architectural rules with ontologies - an industrial evaluation, in: 2019 26th Asia-Pacific Software Engineering Conference (APSEC), 2019, pp. 55–62. doi:10.1109/APSEC48747.2019.00017.
[12] W. Ding, P. Liang, A. Tang, H. Van Vliet, M. Shahin, How do open source communities document software architecture: An exploratory survey, in: 2014 19th International Conference on Engineering of Complex Computer Systems, 2014, pp. 136–145. doi:10.1109/ICECCS.2014.26.
[13] N. Ali, S. Baker, R. O'Crowley, S. Herold, J. Buckley, Architecture consistency: State of the practice, challenges and requirements, Emp. Softw. Eng. 23 (2018) 224–258. doi:10.1007/s10664-017-9515-3.
[14] A. Christl, R. Koschke, M.-A. Storey, Equipping the reflexion method with automated clustering, in: 12th Working Conference on Reverse Engineering (WCRE'05), 2005, pp. 10 pp.–98. doi:10.1109/WCRE.2005.17.
[15] A. Christl, R. Koschke, M.-A. Storey, Automated clustering to support the reflexion method, Information and Software Technology 49 (2007) 255–274. URL: https://www.sciencedirect.com/science/article/pii/S095058490600187X. doi:10.1016/j.infsof.2006.10.015.
[16] R. A. Bittencourt, G. J. d. Santos, D. D. S. Guerrero, G. C. Murphy, Improving automated mapping in reflexion models using information retrieval techniques, in: 2010 17th Working Conference on Reverse Engineering, 2010, pp. 163–172. doi:10.1109/WCRE.2010.26.
[17] T. Olsson, M. Ericsson, A. Wingkvist, Semi-automatic mapping of source code using naive bayes, in: Proceedings of the 13th European Conference on Software Architecture - Volume 2, ECSA '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 209–216. doi:10.1145/3344948.3344984.
[18] Z. T. Sinkala, S. Herold, InMap: Automated interactive code-to-architecture mapping, in: Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1439–1442. doi:10.1145/3412841.3442124.
[19] Z. T. Sinkala, S. Herold, InMap: Automated interactive code-to-architecture mapping recommendations, in: 2021 IEEE 18th International Conference on Software Architecture (ICSA), 2021. doi:10.1109/ICSA51549.2021.00024.
[20] D. Link, P. Behnamghader, R. Moazeni, B. Boehm, Recover and relax: Concern-oriented software architecture recovery for systems development and maintenance, in: Proceedings of the International Conference on Software and System Processes, ICSSP '19, IEEE Press, 2019, pp. 64–73. doi:10.1109/ICSSP.2019.00018.
[21] C. C. Aggarwal, C. Zhai, A survey of text classification algorithms, in: Mining Text Data, Springer, 2012, pp. 163–222.
[22] T. Olsson, M. Ericsson, A. Wingkvist, s4rdm3x: A tool suite to explore code to architecture mapping techniques, Journal of Open Source Software 6 (2021) 2791. doi:10.21105/joss.02791.
[23] I. H. Witten, E. Frank, M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3 ed., Morgan Kaufmann, Amsterdam, 2011.
[24] A. Sheshasaayee, G. Thailambal, Comparison of classification algorithms in text mining, International Journal of Pure and Applied Mathematics 116 (2017) 425–433.
[25] K. Shah, H. Patel, D. Sanghvi, M. Shah, A comparative analysis of logistic regression, random forest and knn models for the text classification, Augmented Human Research 5 (2020) 1–16.
[26] Z. S. Harris, Distributional structure, WORD 10 (1954) 146–162. doi:10.1080/00437956.1954.11659520.
[27] K. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 28 (1972). doi:10.1108/eb026526.
[28] R. R. Picard, R. D. Cook, Cross-validation of regression models, Journal of the American Statistical Association 79 (1984) 575–583. doi:10.1080/01621459.1984.10478083.
[29] J. Buckley, N. Ali, M. English, J. Rosik, S. Herold, Real-time reflexion modelling in architecture reconciliation: A multi case study, Information and Software Technology 61 (2015) 107–123. URL: https://www.sciencedirect.com/science/article/pii/S0950584915000270. doi:10.1016/j.infsof.2015.01.011.