A Comparison of Machine Learning-Based Text Classifiers for Mapping Source Code to Architectural Modules

Alexander Florean¹, Laoa Jalal¹, Zipani Tom Sinkala¹ and Sebastian Herold¹
¹ Department of Mathematics and Computer Science, Karlstad University, Sweden

Abstract
A mapping between a system's implementation and its software architecture is mandatory in many architecture consistency checking techniques. Creating such a mapping manually is a non-trivial task for most complex software systems. Machine learning-based text classification may be a highly effective tool for automating this task. How to make use of this tool most effectively has not been thoroughly investigated yet. This article presents a comparative analysis of three classifiers applied to map the implementations of five open-source systems to their architectures. The performance of the classifiers is evaluated for different extraction and preprocessing settings as well as different training set sizes. The results suggest that Logistic Regression and Support Vector Machines both outperform Naive Bayes unless information about coarse-grained implementation structures cannot be exploited. Moreover, initial manual mappings of more than 15% of all source code files, or 10 files per module, do not seem to lead to a significantly better classification.

Keywords
software architecture consistency, code-to-architecture mapping, text classification, machine learning

ECSA 2021 Companion Volume
florean.alexander@gmail.com (A. Florean); laoa99@outlook.com (L. Jalal); zipani.sinkala@kau.se (Z. T. Sinkala); sebastian.herold@kau.se (S. Herold)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Motivation

Software architecture degradation is the phenomenon of the implementation of a software system diverging from the intended software architecture [1]. The potential consequences of this divergence include a decay of maintainability as well as the decreased ability of the system to meet other desired quality properties. Expensive system re-engineering or discontinuations of software products can be the consequences [2, 3, 4, 5].

One approach to combat software architecture degradation is software architecture consistency checking [6]. The core idea of these techniques is to implement frequent checks for inconsistencies between the intended software architecture and the current implementation of a system to detect degradation early. The individual techniques differ in the variety of consistency constraints, or types of divergence, that they can detect. They range from dependency-focused and source code analysis-based techniques [7] to logical query-based techniques for checking architecturally induced constraints far beyond dependencies [8, 9, 10, 11].

Most approaches have in common that some kind of mapping between architectural units, e.g. modules, and implementation units, such as source code files, is required. Reflexion modelling, for example, exploits this mapping to detect source code dependencies that are not covered by dependencies in an architectural model and which are hence discouraged [7].

In some cases, architectural documentation describing the relationship between architecture and implementation can help create this mapping. More often than not though, architectural documentation is missing or outdated, such that the architecture and the mapping towards code need to be recovered from a system's implementation [12]. Performed manually, this constitutes a challenging and labour-intensive task even for system experts. As expressed by professional software architects and designers in a study by Ali et al., creating the mapping is one of the major obstacles to adopting architectural consistency techniques in industrial practice [13].

Researchers have thus put some attention into developing techniques that support software engineers in this task by creating mappings partially automatically or by recommending mappings [14, 15, 16, 17, 18, 19]. Most recently, text classification based on machine learning has been applied to automatically categorize units of source code according to the architectural concern or module they implement and should be mapped to [17, 20]. These approaches show promising results.
The question arises though whether the full potential of machine learning for text classification in this context has already been tapped. Several text classification algorithms that perform well in different contexts have not yet been investigated. The question of how to optimally extract and preprocess source code for classification has not been exhaustively explored either.

The goal of this article is to shed some light on the performance, i.e. the predictive capability, and other properties of several machine learning-based text classifiers for the described task. We present a comparative analysis of three classifiers that were applied to map the code of five different systems to their specified architectures.

The contribution of the paper is a set of findings that may guide further research and use of these classifiers for the task of interest. These guidelines, on the one hand, give advice for the selection of an appropriate classifier based on assumptions regarding the alignment between architecture and modular implementation elements like packages. On the other hand, they provide rules of thumb for the recommended size of an initial, manual mapping required to train the classifiers.

The remaining article is organized as follows. The following section describes relevant technical background as well as related work. Section 3 explains the experimental setup of the comparative analysis. In Sec. 4, we summarize the results, which are discussed in Sec. 5. The article is concluded in Sec. 6.

2. Background

2.1. Architecture Consistency Checking

Techniques for checking the consistency between the software architecture of a system and its implementation come in various forms. Several authors provide exhaustive overviews of available approaches and tools [6, 9]. The approaches differ in the way architectures are represented, in the formalism on which the actual checking mechanism relies and, hence, in the type of architectural constraints that can be expressed and checked.

Many techniques have in common that they require an association between elements of the architecture and elements of the implementation for many typical consistency constraints. The most fundamental consistency constraints are related to dependencies. The intended architecture of a software system often defines allowed / prohibited, or expected / discouraged, dependencies between architectural modules. In order to check whether the dependencies present in source code conform with those, i.e. to compare code dependencies with architectural dependencies, the architectural modules to which sources and targets of code dependencies are mapped need to be known.

Fig. 1 shows a cut-out of a so-called reflexion model of one of the systems (Lucene) used in the experiments presented in this paper. It depicts two architecture modules as boxes. Dashed lines (as opposed to solid lines) indicate that dependencies between these modules are architecturally discouraged in either direction; they are, however, present in source code: for example, there is a call, located in file InstantiatedIndex.java that is being mapped to store, to a method called utf8ToString(), located in file BytesRef.java, which is being mapped to util. Through simple code analysis and tracing the mapping, this can be identified as an architecturally discouraged dependency between the modules store and util.

Figure 1: Exemplary cutout of a reflexion model and a source code dependency contributing to an undesired architectural dependency.

Consequently, creating and maintaining such a mapping is crucial for applying architectural consistency checking effectively. As pointed out by Ali et al., the effort needed for manual mapping constitutes a serious concern in industrial practice [13]. Several approaches exist to support software engineers in this time-consuming task. The most relevant ones will be discussed in Sec. 2.3.
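The check just described can be sketched in a few lines. This is a hypothetical illustration: the file names and module names come from the Lucene example above, but the dependency list and the set of allowed module dependencies are made-up stand-ins for what a real dependency extractor and architecture model would provide.

```python
# Hypothetical sketch of a dependency-based consistency check.
# File-to-module mapping, as discussed in the text:
mapping = {
    "InstantiatedIndex.java": "store",
    "BytesRef.java": "util",
}
# Code-level dependencies extracted by some static analysis: (source, target).
code_deps = [("InstantiatedIndex.java", "BytesRef.java")]
# Architecturally allowed module dependencies; every other pair is discouraged.
allowed = {("store", "store"), ("util", "util")}

# Trace each code dependency through the mapping and flag discouraged ones.
violations = [
    (src, tgt)
    for src, tgt in code_deps
    if (mapping[src], mapping[tgt]) not in allowed
]
print(violations)  # [('InstantiatedIndex.java', 'BytesRef.java')]
```

The call from InstantiatedIndex.java to BytesRef.java maps to the discouraged module pair (store, util) and is therefore reported.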
2.2. Text Classification with Machine Learning

Text classification is one of the fundamental activities in Natural Language Processing [21]. The goal of text classification is to assign a text written in natural language to one or more predefined categories. Applications of this activity include, for example, sentiment analysis or spam detection.

The central idea of applying machine learning to the task of text classification is to train a classification model based upon text samples for which the assigned categories are already known. Fig. 2 depicts the typical steps in training a text classification model and using it to predict the label/category of new text, i.e. to classify it. For learning, a set of text documents for which the labels, i.e. categories, are known is required. These documents are often preprocessed, e.g., to remove stop words or to stem words. A feature extractor transforms the preprocessed documents into a numerical vector. Finally, the classification model is trained according to the machine learning algorithm that is applied. It can then be used to predict the label of a new text that was preprocessed and brought into its numerical representation.

Figure 2: Schematic process of training and using a text classification model through machine learning.

The overall, general procedure can be transferred to the specific context of mapping code to architecture quite easily. The documents to be classified are the aforementioned source code entities, like source code files. Architectural modules are represented by labels—for yet unlabelled source code entities, a classification model should propose the correct module. For training, we require a sufficiently large set of source code entities for which their labels—the modules they are mapped to—are known.
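Transferred to code-to-architecture mapping, the procedure of Fig. 2 can be sketched as follows. This is a minimal sketch with made-up token documents, not the setup used in the paper; scikit-learn's CountVectorizer (a bag-of-words feature extractor) and MultinomialNB stand in for the pipeline components.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy "documents": preprocessed token streams of source files (hypothetical).
train_docs = [
    "org apache store index instantiated",
    "org apache store directory file",
    "org apache util bytes ref string",
    "org apache util array sort",
]
train_labels = ["store", "store", "util", "util"]  # modules serve as class labels

vectorizer = CountVectorizer()               # bag-of-words feature extraction
X_train = vectorizer.fit_transform(train_docs)
model = MultinomialNB().fit(X_train, train_labels)

# Predict the module of a yet unmapped file from its extracted text.
X_new = vectorizer.transform(["org apache util string helper"])
print(model.predict(X_new))  # ['util']
```

The unseen document shares the discriminative tokens "util" and "string" with the util training documents, so the classifier proposes that module.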
2.3. Related Work

The studies by Christl et al. were among the first to investigate techniques for automating the mapping step needed in architecture consistency checking [14, 15]. They developed a technique, called HuGMe, for interactive, human-guided mapping and compared two different attraction functions, CountAttract and MQAttract, measuring how well a code entity will map to an architectural module based on structural properties.

Bittencourt et al. presented a technique based on information retrieval, thus addressing the mapping problem from a textual analysis angle instead [16]. They developed an attraction function based on Latent Semantic Indexing and evaluated it separately as well as in a hybrid approach with both CountAttract and MQAttract. The best results were achieved by integrating their novel attraction function and CountAttract.

Both approaches require a set of manually mapped source code entities as a foothold for the applied techniques. Sinkala and Herold instead exploit textual descriptions of the modules of intended architectures to provide their information retrieval-based technique, called InMap, with initial information for recommending mappings [18, 19].

Olsson et al. developed and analysed a technique based on machine learning [17]. Taking an initial, manually created mapping of a portion of the source code, a Naive Bayes classifier is trained and then used to predict the mapping for the remaining source code entities. The information used for classification is extracted from the compiled source code and consists of package names, file/class names, and attribute and variable identifiers. Compound words, as indicated through camel-casing, are split, and the resulting texts are stemmed. The resulting documents are complemented by terms reflecting dependencies. This way, structural information can be considered in the classifier without the need to integrate a separate dependency analysis approach. The authors show that this approach outperforms HuGMe significantly; if module descriptions are available, though, InMap performs slightly better [19].
The focus of the work by Link et al. is slightly different yet related [20]. In their approach, called RELAX, code entities are not mapped to architectural modules but to concerns which are potentially reusable across systems. Any document of a system considered being part of an architectural concern can be fed into the training process of a Naive Bayes classifier to categorize new documents according to their textual content. The approach is compared with two other clustering approaches for architecture recovery, as this is the main scenario that the authors target. For five out of eight case study systems, RELAX is shown to perform best in comparison. The study includes neither details of the preprocessing of documents nor a replication package, such that technical details of how information is extracted from source code remain unclear.

3. Experimental Design

3.1. Research Questions

The overarching motivating question for this study is how well different machine learning-based classification models perform in mapping code entities to architectural modules. As Sec. 2.3 shows, the focus of related approaches so far has been a single classification algorithm and less a comparison of classifiers or an investigation of their performance properties when applied in the context of interest.

The envisaged scenario for the usage of machine learning techniques in this context is that a classifier is first trained with an initial set of manually created mappings based on the textual content of source code files. After that, the trained classification model is used to predict the mappings for the remaining source code files.

We therefore break the main motivating question down into two research questions:

• RQ1: How does the selection of source code elements during preprocessing affect the performance of these classifiers?
• RQ2: How is the performance of different classifiers affected by the size of the training set, i.e. the number of code entities that need to be mapped manually initially?

For each of the questions, we define a separate experiment based on the same set of systems and classifiers.

3.2. Subject Systems and Classifiers

For training and evaluating text classifiers for the task at hand, a set of systems is required for each of which a) the source code is accessible and b) the mapping between an intended architecture and the source code is known. We explored two data sources for identifying systems that fulfil these prerequisites: the SAEroCon repository¹ and the repository of the s4rdm3x tool [22]. Five open source software systems from these two repositories, as listed in Table 1, were selected for this study. They are all written in Java.

Table 1
Descriptive statistics of the subject systems.

System      #files   lines of   lines of     #modules   #files/
                     code       comments                module (sd)
Ant         713      86,685     76,987       15         47.5 (65.1)
JabRef      845      88,562     17,187       6          140.7 (161.2)
Lucene      508      60,345     33,342       7          83.0 (64.7)
ProM        867      69,492     22,763       15         123.3 (55.4)
TeamMates   812      102,072    12,514       6          135.3 (119.2)

Three commonly used machine learning-based classifiers were selected for the study. Naive Bayes for text classification was selected as the most relevant related work is built on it (see Sec. 2.3) and because of its good performance with even little training data [23]. Support Vector Machines (SVM) were selected for the same reason and for their good accuracy, outperforming Naive Bayes in comparative studies [24]. Logistic Regression as the final classifier has been shown to perform at similar performance levels as SVM and was hence selected for comparison, too [25].

¹ https://github.com/sebastianherold/SAEroConRepo/wiki
3.3. Experiment 1: Comparing Extraction and Preprocessing Variants

The goal of the first experiment is to address RQ1. In a first step, we developed a list of elements in (Java) source code that we believed to potentially carry architecturally relevant information w.r.t. the required mapping. We judged the following elements to be potentially relevant:

• Package declarations: indicate containment relationships that might match coarse-grained architectural structures.
• Import declarations: elements of the same architectural module often share the same dependencies.
• Class declarations: types defined in the same module might share the same (part of the) domain vocabulary as expressed in their names.
• Public methods: same rationale as for class declarations.
• Comments: may refer to architectural aspects and decisions, parts of the domain vocabulary, etc., beyond what is being expressed in code.

We furthermore identified seven different preprocessing steps that could be activated or deactivated for each of the above elements in a source code file:

1. Splitting of compound words, e.g. in camel-case notation: getCustomerId becomes get Customer Id.
2. Stemming: reducing inflected words to their stem, e.g. notification or notify become notif.
3. Transformation to lower case.
4. Removal of single characters.
5. Removal of stop words, such as the, and, or of.
6. Removal of Java keywords, such as class or public.
7. Tokenization: chopping the stream of characters that the document consists of into actual tokens based on separators such as spaces, colons, etc.
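A rough sketch of how steps 1 and 3 to 7 could be implemented is given below (step 2, stemming, is omitted, as it typically relies on a stemming library such as a Porter stemmer). The stop word and keyword sets are small illustrative excerpts, not the lists used in the study.

```python
import re

STOP_WORDS = {"the", "and", "of"}             # step 5 (tiny illustrative set)
JAVA_KEYWORDS = {"class", "public", "void"}   # step 6 (excerpt only)

def preprocess(code_text):
    # Step 7: tokenize on any non-letter separator (spaces, colons, braces, ...).
    tokens = re.split(r"[^A-Za-z]+", code_text)
    # Step 1: split compound words written in camel case.
    tokens = [t for tok in tokens
              for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", tok)]
    # Step 3: transform to lower case.
    tokens = [t.lower() for t in tokens]
    # Step 4: remove single characters.
    tokens = [t for t in tokens if len(t) > 1]
    # Steps 5 and 6: remove stop words and Java keywords.
    return [t for t in tokens
            if t not in STOP_WORDS and t not in JAVA_KEYWORDS]

print(preprocess("public void getCustomerId() { return the id; }"))
# ['get', 'customer', 'id', 'return', 'id']
```

Each step is an independent filter, which matches the experimental design of switching individual steps on and off per extracted code element.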
A complete investigation of all combinations of preprocessing steps in a fixed order would lead to 2⁷ options per extracted source code element. For extracting all of the above elements alone, this would lead to 2³⁵ combinations, which we considered infeasible. Instead, we experimented with several settings in an exploratory pre-study, from which we concluded to activate preprocessing steps 3 to 7 by default for all code elements, as deactivating them led to decreased performance in the explored alternatives.

In the same pre-study, two different feature representation techniques were compared: bag-of-words and tf-idf [26, 27]. We noted that bag-of-words outperformed tf-idf on average and hence chose the former for the experiments.

For each combination of subject system, classifier, and combination of code extraction and active preprocessing steps, we trained and validated ten models following a Monte Carlo cross-validation scheme [28]. The training set ratio was kept constant at 0.2, and stratified sampling was applied. The latter ensures that the proportion of the classes (i.e., modules) in the overall dataset is kept in both training and testing sets during cross-validation. This ensures that both sets are representative of the overall dataset.

The performance of the models was evaluated in terms of accuracy, i.e. the relative frequency of correct classifications, and averaged over all subject systems.
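The validation scheme can be illustrated as follows; the dataset is hypothetical, and scikit-learn's StratifiedShuffleSplit is one possible way (an assumption, not necessarily the study's actual implementation) to realise Monte Carlo cross-validation with stratified sampling at a training ratio of 0.2.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical dataset: 100 files with imbalanced module labels.
y = np.array(["core"] * 80 + ["util"] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# Ten random train/test splits with a fixed training ratio of 0.2,
# stratified by module label (Monte Carlo cross-validation).
splitter = StratifiedShuffleSplit(
    n_splits=10, train_size=0.2, test_size=0.8, random_state=0
)
for train_idx, _ in splitter.split(X, y):
    # Each training set preserves the 80/20 class proportions: 16 + 4 files.
    counts = dict(zip(*np.unique(y[train_idx], return_counts=True)))
    assert counts == {"core": 16, "util": 4}
```

Unlike k-fold cross-validation, the splits are drawn independently at random, so the same file may appear in several training sets across repetitions.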
3.4. Experiment 2: Measuring the Effect of Training Set Sizes

The goal of the second experiment is to address RQ2. Based on the results of the first experiment, one of the best performing combinations of extraction and preprocessing settings was selected for each classification algorithm. The code files for each system were extracted and preprocessed accordingly and represented as bags-of-words.

We then trained models for each of the three classification algorithms at different training set sizes expressed as fractions of the overall datasets, i.e. the relative number of available mappings between source code files and architectural modules. Per combination of system, training set size of interest, and classifier, we trained and evaluated according to a Monte Carlo cross-validation with 100 splits and stratified sampling.

In order to evaluate the resulting models, we computed several precision and recall averages per system, training set size, and classifier. The average precision/recall is defined as the sum of the precision/recall scores per class (module) divided by the number of classes. The weighted average precision/recall takes the proportions of classes into account and weights the individual precision/recall scores accordingly. The weighted average recall is equal to the accuracy of a classification model².

Practically speaking, this experiment corresponds roughly to a situation in which a software architect/designer can estimate the number of code entities that should be mapped to each of the architectural modules. The experiment could offer advice regarding the relative number of entities she should map per module in order to get a sufficiently accurate automated mapping for the rest of the system.

This scenario, however, is not always realistic, as module sizes may be unknown or estimations may be wrong. For that reason, we repeated the experiment described above with different absolute training set sizes, expressed as the absolute number of files per module that should enter the training set. Obviously, this way of sampling is not stratified; the number of splits and the metrics for evaluation remain the same as for the comparison based upon relative training set sizes.

² The recall of each individual class c_i is weighted by |c_i|/n, with n being the total number of data points. Since |c_i| = TP_{c_i} + FN_{c_i}, each term of the weighted average recall turns into TP_{c_i}/n, which, summed up over all classes, is equivalent to the definition of accuracy.

3.5. Replication Package

The replication package, including the scripts for preprocessing the data and for training and evaluating the classifiers, is available at https://github.com/sebastianherold/ml-for-architecture-mapping.
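The identity stated in Sec. 3.4, that the weighted average recall equals accuracy, can be checked numerically. The labels and predictions below are made up purely for illustration.

```python
# Hypothetical ground-truth modules and predictions for ten files.
truth = ["store"] * 5 + ["util"] * 3 + ["index"] * 2
pred  = ["store", "store", "store", "util", "store",
         "util", "util", "index", "index", "index"]

n = len(truth)
weighted_recall = 0.0
for m in sorted(set(truth)):
    size = truth.count(m)                                # |c_i|
    tp = sum(t == p == m for t, p in zip(truth, pred))   # TP_{c_i}
    weighted_recall += (size / n) * (tp / size)          # weight * recall_i

accuracy = sum(t == p for t, p in zip(truth, pred)) / n
assert abs(weighted_recall - accuracy) < 1e-9  # identical up to rounding
print(round(weighted_recall, 3), round(accuracy, 3))
```

Each class contributes (|c_i|/n) * (TP_{c_i}/|c_i|) = TP_{c_i}/n, so the class sizes cancel and the sum is exactly the fraction of correct classifications.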
4. Results

As described in Sec. 3.3, we explored the accuracy of all classification algorithms for different data extraction and preprocessing settings in the first experiment. Fig. 3 summarises the findings per combination of extracted source code elements. All three classification algorithms scored best when the data extracted from the code files was limited to package declarations and class declarations. Logistic Regression and SVM achieved accuracies of 0.93 each, outperforming Naive Bayes by 0.07. SVM's and Logistic Regression's accuracies drop significantly, to at most 0.68 and 0.73, respectively, when package declarations are not included in the data. Naive Bayes drops to 0.75 when extracting everything else but package declarations, performing more accurately than SVM and Logistic Regression in this scenario.

Figure 3: Accuracy for extraction of different parts of source code. Values are averages over all investigated combinations of preprocessing steps for the extracted parts.

The role of comments also changed with the inclusion of package declarations. With package declarations included, adding comments, while keeping the inclusion/exclusion of the other code elements unchanged, seems to rather decrease the accuracy of the classifiers. Without package declarations, including comments led to accuracy improvements of up to 0.07.

It should be noted that the mapping onto modules aligned quite well with the package structure in all five systems, which might explain the impact of including the package declarations. We excluded the results for only extracting package declarations, as we believe that the very good scores (beyond 0.98) of those models were overfitting and heavily biased towards the selected systems.

Figure 4: Standard deviation of accuracy averages for each combination of extracted source code parts as depicted in Fig. 3.

Fig. 4 illustrates the standard deviation for each extraction setting and classifier. The standard deviation is below 0.02 in 80% of the cases, exceeding 0.05 slightly in only one case. These results show that the variable preprocessing settings, stemming of words and splitting of compound words, affect accuracy only slightly.

The results related to classification performance over training set size, as a relative fraction of the overall number of source code files, are visualised in Fig. 5. They confirm that SVM and Logistic Regression perform better than Naive Bayes in accuracy, precision, and recall in almost all settings³. The improvement in accuracy decreased for all systems and classifiers beyond a relative training set size of 0.15, in particular for JabRef, ProM, and TeamMates. The curves show similar behaviour for the weighted average precision. Unweighted averages keep a steeper slope in comparison, even beyond training set sizes of 0.15, which shows that the performance for smaller modules benefits from increasing the training set size.

Figure 5: Performance metrics of classifiers over relative training set size.

In Fig. 6, the results of evaluating classification performance for different numbers of files per module in the training set are shown. Most curves across all metrics show a sharp increase in performance that slows down at 10 files per module. This is less pronounced, sometimes hardly visible, for Naive Bayes as compared to SVM and Logistic Regression. The results for the weighted averages seem more similar to their unweighted counterparts in this experiment. SVM and Logistic Regression outperform Naive Bayes in almost all settings and systems in this experiment, too.

Figure 6: Performance metrics of classifiers over absolute training set size.

³ These experiments were performed extracting package declarations and class declarations.
5. Discussion

5.1. Findings regarding RQ1

In the following, we discuss and summarize the findings related to the question of how different ways of data extraction and preprocessing affect the performance of the tested classifiers.

The experiment results show that package declarations constitute a significant piece of information for the tested classifiers. A decrease of 0.3 in accuracy for settings in which the only difference is to not consider package declarations is common across the results.

This seems quite natural, as the mappings for the systems used for training largely aggregate source code elements along several subtrees of the package hierarchy instead of individual classes from unrelated packages. Only in the mapping of JabRef exist cases of packages whose contained classes/interfaces, i.e. corresponding files, are mapped to different modules such that these mappings do not align with the subpackage/subdirectory structure. It is hence not surprising that settings including package declarations and only few other pieces of information score best. In our experiments, class declarations seem to complement package declarations best. Since Naive Bayes does not perform as well as the other classification algorithms, we formulate our first finding as:

Finding 1. In settings in which the architectural module structure can be assumed to align well with macro-structures declared in the system's implementation, these declarations and type information should be extracted. SVM and Logistic Regression provide more accurate results than Naive Bayes.

Note that a straight-forward alignment does not necessarily imply that a mapping can easily be constructed manually without the need for automation in the first place. In large-scale systems, structures of hundreds of packages are not uncommon. If architectural modules are well-aligned but mapped to more than one package in such systems, identifying the relevant packages for a module can still be tedious.

An interesting question in the light of the first finding is whether the approaches by Olsson et al. and Link et al. could benefit from using a different classifier than Naive Bayes [17, 20]. While the alignment with source code structures is largely unclear for Link et al., Olsson et al. applied their approach to the same, well-aligned systems used in this study. This finding also suggests that their approach could be further tuned to only use package and type information as compared to also including variable identifiers. In use cases in which the slightly slower training of SVM and Logistic Regression is an issue, Naive Bayes might be the better alternative.

It is common that mappings are not that well-aligned and straight-forward [29]. Furthermore, some programming languages do not declare any containment relationships equivalent to package declarations. The tested systems do not represent this scenario properly. We therefore looked at the performance of the classifiers without considering package declarations as an approximation of their behaviour if we did not have that information or considered it useless. In this setting, Naive Bayes exploiting import declarations, class declarations, and comments showed the best accuracy (on a par with additionally including declarations of public methods).

Finding 2. If alignment with any macro-structures declared in or derived from source code cannot or should not be assumed, Naive Bayes trained on declarations of types, imports, and comments should be used.

The standard deviation within groups of identical extractions regarding different preprocessing settings is very low. This indicates that stemming and splitting of compound words do not have a significant impact on the resulting accuracy of any of the tested classification algorithms.

Finding 3. The selection of parts to be extracted for classifier training and mapping prediction appears to be more important than the selection of the preprocessing steps considered optional in this study.

Further investigation may be necessary to explore the potentially larger impact of other preprocessing steps in the individual scenarios described above.

5.2. Findings regarding RQ2

In this subsection, we summarize the findings related to the question of how the training set size, corresponding to the number of initially, manually mapped files, affects classifier performance.

The results suggest that in many cases the additional gain in accuracy, precision, and recall slackens at around 15% of the overall dataset (equal to the total number of source code files) and above. Enhancing the initial mapping beyond this point may therefore turn out infeasible. Even in relatively small sample systems of this study like JabRef, increasing this mapping by 5% of the overall number of code files means mapping more than 40 additional files. This may possibly not pay off, in particular for larger systems, if the gain in classification performance is minimal. We therefore state:

Finding 4. If the number of files supposed to be mapped to each module can be estimated, mapping around 15% of that number in the initial mapping may be a good rule-of-thumb for training an efficient classifier.
The results of experimenting with absolute training set sizes complement these results for scenarios in which it is not possible or desirable to estimate the number of files mapped to each module. For the tested systems, the gain in accuracy, precision, and recall flattens out at 10 files per module in the initial mapping, which leads us to our final finding:

Finding 5. An initial mapping of at least 10 files per module may lead to a satisfactorily performing classifier.

Again, these findings can only properly be compared to Olsson et al., as Link et al. do not report details about training set sizes [17, 20]. The results reported by Olsson et al. suggest that an initial mapping of ca. 20% leads to satisfying performance. Those results, however, are measured as averages over even imbalanced initial mappings that do not take the proportions of modules into account. We therefore think that our findings, which recommend a slightly smaller relative size for the stratified initial mapping, are in line with those results.

5.3. Validity

Several factors limit the external validity of this study. Firstly, although the subject systems are anything but trivial, they certainly do not represent large-scale software systems. Further research, in particular to confirm or refine the findings regarding training set sizes, is required. Moreover, the results may not accurately reflect the behaviour of the classifiers if package or equivalent declarations are considered but the architecture does not align with them. A further threat to external validity is the scoping to systems written in Java. This is due to the limited availability of systems for which the architecture as well as the source code is available. The systems identified appeared all to be Java-based. Moreover, we did not tune the hyperparameters but only touched upon this non-exhaustively in the before-mentioned exploratory pre-study. This might be considered a limitation as well as a threat to external validity, as the results might differ for classification models with different hyperparameters.

The experiments aim at identifying causal relationships between independent variables (extraction/preprocessing settings and training set sizes, respectively) and dependent variables (performance measures). We are quite confident that the internal validity is high, as all other identified parameters were kept constant throughout the experiments. A potential threat are, of course, bugs in the scripts and software used to extract and preprocess data as well as to train and evaluate the classifiers. We consider this risk to be low, though, as established software libraries were used for this purpose and any self-written code (largely produced by the first and second author) was carefully reviewed by the third and fourth author.

We consider the selected method for cross-validation the main threat to construct validity. Inappropriate train/test splits may lead to biased classification models that might not reflect a classifier's performance properly. We believe, though, that the chosen number of repetitions for the cross-validation in the experiments was sufficient to mitigate this risk.
5.3. Validity

Several factors limit the external validity of this study. Firstly, although the subject systems are anything but trivial, they certainly do not represent large-scale software systems. Further research, in particular to confirm or refine the findings regarding training set sizes, is required. Moreover, the results may not accurately reflect the behaviour of classifiers if package or equivalent declarations are considered but the architecture does not align with them. A further threat to external validity is the scoping to systems written in Java. This is due to the limited availability of systems for which the architecture as well as the source code is available; the systems identified all appeared to be Java-based. Moreover, we did not tune the hyperparameters but only touched upon this non-exhaustively in the aforementioned exploratory pre-study. This might be considered a limitation as well as a threat to external validity, as the results might differ for classification models with different hyperparameters.

The experiments aim at identifying causal relationships between independent variables (extraction/preprocessing settings and training set sizes, respectively) and dependent variables (performance measures). We are confident that the internal validity is high, as all other identified parameters were kept constant throughout the experiments. A potential threat are, of course, bugs in the scripts and software used to extract and preprocess data as well as to train and evaluate the classifiers. We consider this risk to be low, though, as established software libraries were used for this purpose and any self-written code (largely produced by the first and second author) was carefully reviewed by the third and fourth author.

We consider the selected method for cross-validation the main threat to construct validity. Inappropriate train/test splits may lead to biased classification models that might not reflect a classifier's performance properly. We believe, though, that the chosen number of repetitions for the cross-validation in the experiments was sufficient to mitigate this risk.

6. Conclusion

The results of the presented study indicate that there is no silver bullet classifier. The choice of an optimal classifier and of the elements to be extracted from source code is influenced by system characteristics like the alignment of macro-structural elements with the assumed architecture. To identify more such distinguishing characteristics or scenarios seems to be an interesting objective of future research. It will be particularly relevant to investigate whether the recommendations regarding the size of initial mappings hold in practice and whether they also apply to larger systems.

Last but not least, more classifiers wait to be tested for their ability to automate code-to-architecture mapping. For these, as well as for those tested in this study, different preprocessing techniques should be investigated more deeply, and the improvement that hyperparameter tuning might achieve should be explored. Such a more exhaustive comparative study might also need to take the performance and the resource demands of the training process into account.

References

[1] D. E. Perry, A. L. Wolf, Foundations for the study of software architecture, SIGSOFT Softw. Eng. Notes 17 (1992) 40–52. doi:10.1145/141874.141884.
[2] M. W. Godfrey, E. H. S. Lee, Secrets from the monster: Extracting mozilla's software architecture, in: Proc. of 2000 Intl. Symposium on Constructing Software Engineering Tools (CoSET 2000), 2000, pp. 15–23.
[3] C. Deiters, P. Dohrmann, S. Herold, A. Rausch, Rule-based architectural compliance checks for enterprise architecture management, in: 2009 IEEE International Enterprise Distributed Object Computing Conference, 2009, pp. 183–192. doi:10.1109/EDOC.2009.15.
[4] J. van Gurp, J. Bosch, Design erosion: problems and causes, Journal of Systems and Software 61 (2002) 105–119. doi:10.1016/S0164-1212(01)00152-2.
[5] S. Sarkar, S. Ramachandran, G. S. Kumar, M. K. Iyengar, K. Rangarajan, S. Sivagnanam, Modularization of a large-scale business application: A case study, IEEE Software 26 (2009) 28–35. doi:10.1109/MS.2009.42.
[6] L. Passos, R. Terra, M. T. Valente, R. Diniz, N. Mendonça, Static architecture-conformance checking: An illustrative overview, IEEE Software 27 (2010) 82–89. doi:10.1109/MS.2009.117.
[7] G. Murphy, D. Notkin, K. Sullivan, Software reflexion models: bridging the gap between design and implementation, IEEE Transactions on Software Engineering 27 (2001) 364–380. doi:10.1109/32.917525.
[8] O. de Moor, D. Sereni, M. Verbaere, E. Hajiyev, P. Avgustinov, T. Ekman, N. Ongkingco, J. Tibble, .QL: Object-Oriented Queries Made Easy, Springer Berlin Heidelberg, 2008, pp. 78–133. doi:10.1007/978-3-540-88643-3_3.
[9] S. Herold, Architectural compliance in component-based systems, Ph.D. thesis, Clausthal University of Technology, 2011.
[10] S. Herold, A. Rausch, Complementing model-driven development for the detection of software architecture erosion, in: Proceedings of the 5th International Workshop on Modeling in Software Engineering, MiSE '13, IEEE Press, 2013, pp. 24–30.
[11] S. Schröder, G. Buchgeher, Formalizing architectural rules with ontologies - an industrial evaluation, in: 2019 26th Asia-Pacific Software Engineering Conference (APSEC), 2019, pp. 55–62. doi:10.1109/APSEC48747.2019.00017.
[12] W. Ding, P. Liang, A. Tang, H. Van Vliet, M. Shahin, How do open source communities document software architecture: An exploratory survey, in: 2014 19th International Conference on Engineering of Complex Computer Systems, 2014, pp. 136–145. doi:10.1109/ICECCS.2014.26.
[13] N. Ali, S. Baker, R. O'Crowley, S. Herold, J. Buckley, Architecture consistency: State of the practice, challenges and requirements, Emp. Softw. Eng. 23 (2018) 224–258. doi:10.1007/s10664-017-9515-3.
[14] A. Christl, R. Koschke, M.-A. Storey, Equipping the reflexion method with automated clustering, in: 12th Working Conference on Reverse Engineering (WCRE'05), 2005, pp. 10 pp.–98. doi:10.1109/WCRE.2005.17.
[15] A. Christl, R. Koschke, M.-A. Storey, Automated clustering to support the reflexion method, Information and Software Technology 49 (2007) 255–274. URL: https://www.sciencedirect.com/science/article/pii/S095058490600187X. doi:10.1016/j.infsof.2006.10.015.
[16] R. A. Bittencourt, G. J. d. Santos, D. D. S. Guerrero, G. C. Murphy, Improving automated mapping in reflexion models using information retrieval techniques, in: 2010 17th Working Conference on Reverse Engineering, 2010, pp. 163–172. doi:10.1109/WCRE.2010.26.
[17] T. Olsson, M. Ericsson, A. Wingkvist, Semi-automatic mapping of source code using naive bayes, in: Proceedings of the 13th European Conference on Software Architecture - Volume 2, ECSA '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 209–216. doi:10.1145/3344948.3344984.
[18] Z. T. Sinkala, S. Herold, InMap: Automated interactive code-to-architecture mapping, in: Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1439–1442. doi:10.1145/3412841.3442124.
[19] Z. T. Sinkala, S. Herold, InMap: Automated interactive code-to-architecture mapping recommendations, in: 2021 IEEE 18th International Conference on Software Architecture (ICSA), 2021. doi:10.1109/ICSA51549.2021.00024.
[20] D. Link, P. Behnamghader, R. Moazeni, B. Boehm, Recover and relax: Concern-oriented software architecture recovery for systems development and maintenance, in: Proceedings of the International Conference on Software and System Processes, ICSSP '19, IEEE Press, 2019, pp. 64–73. doi:10.1109/ICSSP.2019.00018.
[21] C. C. Aggarwal, C. Zhai, A survey of text classification algorithms, in: Mining Text Data, Springer, 2012, pp. 163–222.
[22] T. Olsson, M. Ericsson, A. Wingkvist, s4rdm3x: A tool suite to explore code to architecture mapping techniques, Journal of Open Source Software 6 (2021) 2791. doi:10.21105/joss.02791.
[23] I. H. Witten, E. Frank, M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3 ed., Morgan Kaufmann, Amsterdam, 2011.
[24] A. Sheshasaayee, G. Thailambal, Comparison of classification algorithms in text mining, International Journal of Pure and Applied Mathematics 116 (2017) 425–433.
[25] K. Shah, H. Patel, D. Sanghvi, M. Shah, A comparative analysis of logistic regression, random forest and knn models for the text classification, Augmented Human Research 5 (2020) 1–16.
[26] Z. S. Harris, Distributional structure, WORD 10 (1954) 146–162. doi:10.1080/00437956.1954.11659520.
[27] K. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 28 (1972). doi:10.1108/eb026526.
[28] R. R. Picard, R. D. Cook, Cross-validation of regression models, Journal of the American Statistical Association 79 (1984) 575–583. doi:10.1080/01621459.1984.10478083.
[29] J. Buckley, N. Ali, M. English, J. Rosik, S. Herold, Real-time reflexion modelling in architecture reconciliation: A multi case study, Information and Software Technology 61 (2015) 107–123. URL: https://www.sciencedirect.com/science/article/pii/S0950584915000270. doi:10.1016/j.infsof.2015.01.011.