=Paper= {{Paper |id=Vol-2978/saerocon-paper6 |storemode=property |title=Hard Cases in Source Code to Architecture Mapping using Naive Bayes |pdfUrl=https://ceur-ws.org/Vol-2978/saerocon-paper6.pdf |volume=Vol-2978 |authors=Tobias Olsson,Morgan Ericsson,Anna Wingkvist |dblpUrl=https://dblp.org/rec/conf/ecsa/OlssonEW21b }} ==Hard Cases in Source Code to Architecture Mapping using Naive Bayes== https://ceur-ws.org/Vol-2978/saerocon-paper6.pdf
Hard Cases in Source Code to Architecture Mapping using
Naive Bayes
Tobias Olsson, Morgan Ericsson and Anna Wingkvist
Department of Computer Science and Media Technology, Linnaeus University, Kalmar/Växjö, Sweden


Abstract
The automatic mapping of source code entities to architectural modules is a challenging problem that
is necessary to solve if we want to increase the use of Static Architecture Conformance Checking in
the industry. We apply the state-of-the-art automatic mapping technique to eight open-source systems
and find that there are systematic problems in the automatically created mappings. All of these eight
systems have small modules that are very hard to map correctly since only a few source code entities
are mapped to these. All systems seem to use some naming strategy, mapping source code to modules;
however, naming is often ambiguous. We also find differences in ground truth mappings performed by
experts, which affect mappings based on these, and that architectural refactoring also affects the
mapping performance.

Keywords
Orphan Adoption, Software Architecture, Source Code Clustering, Naive Bayes



1. Introduction

Our previous studies [1, 2] of automated techniques to map source code entities to high-level
software architectural modules suggest that some entities are much harder to map correctly than
others. Even using the best algorithm and different parameters, certain entities always seem to fail
to map correctly. We conduct an exploratory study to determine whether our intuition is correct,
i.e., that these hard cases exist, and if they do, what their properties are, and what makes them
hard to map correctly.

The software architecture of a system captures major design decisions at a high level of abstraction
and enables internal and external qualities such as performance, portability, reusability, and
maintainability [3]. It serves as a guide for the many decisions that are made during the
implementation of a system. As the system evolves, the source code must continue to conform to the
architecture or risk accumulating technical debt and no longer possess the desired qualities.

Static Architecture Conformance Checking (SACC) is a collection of methods, such as Reflexion
modeling [4], that statically analyze source code to ensure that it does not introduce architectural
violations [5, 6]. These methods require an architecture model, with modules and dependencies, and a
source code model, with entities and concrete dependencies, e.g., due to inheritance or method
invocations. They also require a mapping from the source code model to the architecture model to
determine whether the source code dependencies are convergent, absent, or divergent compared to the
allowed dependencies specified in the architecture model.

The need for a mapping between the source code and architecture models is a significant reason why
SACC has not reached widespread use in the software industry [3, 5, 7, 8]; the tools and methods
exist, but the mappings do not or are outdated. Many tools address this by combining manual mapping
and regular expressions to filter file, module, and package names. Still, such approaches have proven
to be time-consuming and error-prone [5, 7, 8].

If we want to automate the mapping process using, e.g., machine learning, it is vital to understand
the hard cases. If there is a class of entities that our approach cannot map automatically or always
maps to the wrong modules, we need to ensure that these are part of the initial set that a human
expert maps. We perform an exploratory study using eight systems with ground truth mappings to
determine whether such a class exists. Once we have established that it exists, we determine its
properties to identify its members automatically. We then investigate why these properties make the
entities difficult to map to ensure that they will not reduce the effectiveness of the machine
learning approach; we do not want it to learn the wrong things from the hard cases.

We hypothesize that at least some hard cases would be difficult for a human to map and that different
human experts would disagree on how they should be mapped. This can, for example, be due to poor
structuring or the evolution of the system. We rely on different ground-truth mappings of the same
system and metrics to identify such cases and study how well these correlate to the hard cases.

ECSA2021 Companion Volume
Email: tobias.olsson@lnu.se (T. Olsson); morgan.ericsson@lnu.se (M. Ericsson); anna.wingkvist@lnu.se (A. Wingkvist)
ORCID: 0000-0003-1154-5308 (T. Olsson); 0000-0003-1173-5187 (M. Ericsson); 0000-0002-0835-823X (A. Wingkvist)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative
                                    Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073




Tobias Olsson et al. CEUR Workshop Proceedings                                                                                                                     1–10



2. Automated Mapping

To reason about how well an implementation conforms to the intended architecture using, e.g.,
Reflexion modeling, we need a mapping from the source code to the architecture. In this section, we
discuss how such a mapping can be created semi-automatically, starting from an initial set of mapped
source code entities.

The source code model consists of Entities (E) and Dependencies (ED). The entities are, e.g., classes
defined in a programming language, and the ED are due to, e.g., method calls and inheritance, see
StringChange, ChangeScanner, etc., in Figure 1.

The architecture model consists of Modules (M) and Dependencies (MD) between these. The modules
represent the major parts of the architecture; see, e.g., GUI and Logic in Figure 1. The directed MD
indicates how these modules are allowed to interact and depend on each other. If there, for example,
is an MD from GUI to Logic, then entities mapped to GUI are expected to call entities mapped to Logic.

[Figure 1 diagram: the orphan entity StringChange has structural relations to the initially mapped
sets of the architectural modules GUI (ChangeScanner, AttachFileAction) and Logic (DOIChek, XMLUtil,
DataBank); an allowed module dependency connects GUI and Logic, and the automated mapping of the
orphan is marked with a question mark.]
Figure 1: An example mapping that shows the initial sets of the GUI and Logic modules of JabRef 3.7.
A new orphan StringChange is about to be mapped.

An automated mapping algorithm aims to map each entity to the correct module without human
assistance. For example, classes in the implementation that deal with the application's business
rules should be mapped to the module Logic. Once this mapping exists, we can compare the ED of the
implementation to the MD allowed by the architecture and determine whether they are convergent,
absent, or divergent [4].

We rely on orphan adoption [9] to map entities to modules automatically. An unmapped entity is
considered an orphan that should be adopted by one of the modules, e.g., StringChange in Figure 1.
Tzerpos and Holt identify four criteria that can affect the mapping. Naming: naming standards can
reveal what module is suitable. Structure: dependencies between an orphan and already mapped entities
can be used as a mapping criterion. Style: modules are often created using different design
principles (e.g., high cohesion or not). Classifying the orphan based on style can give hints on how
to use, for example, the structure criteria. Semantics: the source code itself can be analyzed to
determine its purpose and its similarity to the purpose of the modules.

A sub-problem of orphan adoption is orphan kidnapping, where software evolution causes a need for
remapping an entity to a new module, or in other words, corrective clustering. Tzerpos and Holt
identify a fifth criterion related to orphan kidnapping, Interface minimization; it is not a good
idea to reassign an entity to another module if the removal of the entity will cause the module to
get a larger public interface, i.e., the entity is an entry point/facade to the module.

HuGMe [10, 8] relies on orphan adoption to map from the source code to the architecture model. It
starts from an initial set of entities that are manually mapped to the correct module. The remaining
entities are considered orphans. HuGMe is applied iteratively, and as the set of mapped entities can
grow for each iteration, more orphans have the potential to be automatically mapped. In each
iteration, there is also the possibility for human intervention using the result of the failed
automatic mapping attempts as a guideline. The automatic mapping is done by calculating the
attraction between the orphan and the mapped entities for each module. Christl et al. present two
attraction functions, CountAttract and MQAttract, based on dependencies, i.e., the structure
criterion.

Bittencourt et al. evaluate two new attraction functions based on information retrieval techniques.
They use the names of modules and entities and the names of identifiers in the entities to form
vocabulary documents for modules and entities, i.e., the naming and semantic criteria. They then use
a cosine similarity function, IRAttract, and latent semantic indexing, LSIAttract, to calculate the
attraction values.

Our attraction function, NBAttract, combines ideas from the previous two and considers the structure,
naming, and semantic criteria [2]. The approach is similar to that of Bittencourt et al., but we
instead use a Naive Bayes classifier to determine similarity to other entities. To include the
structure criterion, NBAttract uses a novel approach, Concrete Dependency Abstraction (CDA), to
encode dependencies as text [2]. NBAttract has outperformed CountAttract in our previous study [2],
and CountAttract was not clearly outperformed in [7]. We, therefore, only use NBAttract in the
remainder of this paper.

3. Method

Based on our experiences with different attraction functions, we hypothesize that no matter how well
the function performs, there is a specific set of entities that are always misclassified. We seek to
investigate this further to determine whether our hypothesis is correct or if the misclassifications
happen by chance due to randomness in the composition and size of the initial set.
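
To make the distinction between systematic and chance misclassification concrete, the following minimal Python sketch computes a per-entity error rate and compares it with the error rate of a mapper that picks a module uniformly at random (the two rates are given as Equations 1 and 2 in this section, and 0.5 is the limit used there). It is an illustration only, not the tool used in the study: the function names, the seed, and the example counts are hypothetical.

```python
import random


def entity_error_rate(erroneous_mappings: int, mappings: int) -> float:
    """Share of an entity's mappings that ended up in the wrong module (Equation 1)."""
    if mappings <= 0:
        raise ValueError("the entity must have been mapped at least once")
    return erroneous_mappings / mappings


def stochastic_error_rate(modules: int) -> float:
    """Expected error rate of a uniformly random choice among the modules (Equation 2)."""
    if modules < 1:
        raise ValueError("at least one module is required")
    return (modules - 1) / modules


def simulate_random_mapper(modules: int, runs: int, seed: int = 0) -> float:
    """Error rate of a purely random mapper for one entity whose correct module is 0."""
    rng = random.Random(seed)
    errors = sum(1 for _ in range(runs) if rng.randrange(modules) != 0)
    return entity_error_rate(errors, runs)


def is_problematic(error_rate: float, limit: float = 0.5) -> bool:
    """An entity counts as problematic when at least half of its mappings are wrong."""
    return error_rate >= limit


if __name__ == "__main__":
    # Hypothetical entity: mapped 500 times in a six-module system, 380 times wrongly.
    observed = entity_error_rate(380, 500)
    print(observed, stochastic_error_rate(6), is_problematic(observed))
    # A random mapper drifts towards (modules - 1) / modules as the number of runs grows.
    print(simulate_random_mapper(modules=6, runs=20_000))
```

With these hypothetical counts the entity is flagged as problematic even though it still does slightly better than the random baseline for six modules, which is why both limits are worth tracking separately.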







   We have previously implemented a tool to evaluate                for misclassification based on our own experience and the
different mapping approaches, including reporting de-               advice from related work, and present exciting findings
tailed mapping results [11]. We use this tool to create             from the data. The ultimate goal is to construct strategies
a new dataset over the mapping results for each source              to detect entities with a high risk of being misclassified so
code entity.                                                        that a human can intervene and classify these manually.
   We run NBAttract, with the following settings. We                More specifically, we will investigate:
use an initial set of mapped entities of random size and               Is the set of problematic entities a good candidate for
composition. We extract package names, filenames (these             the initial set? This set needs human intervention for
correspond to the outer class names in Java), attribute             automatic mapping to perform well, effectively removing
identifier names, and variable identifier names from the            the problem from the automatic mapping. This can be
source code entities in the initial set and tokenize these          assessed by computing the F1 score of the precision and
based on Camel-case and the characters - and _ . The                recall, as we did in [2]. We will compare the F1 scores
tokens are then stemmed using a Porter stemmer. Tokens              across the entire range of initial set sizes visually.
that are shorter than three characters are removed. We                 Is the set of problematic entities related to small modules?
use our CDA technique to represent dependencies as text             In general, machine learning techniques need good data
strings. We use a binary token frequency (present or not)           to perform. In particular, there is a need for a balanced
and 0.9 as the threshold for automatic classification.              dataset where there is approximately the same amount
   These settings correspond to the settings used in [2]            of data to learn from in each class. If the dataset is im-
with one exception; we do not require the initial set to            balanced, there is a high chance that smaller classes will
contain at least one source code entity from each module            not be properly handled. An architectural module should
in this study. We are interested in how individual files            contain a fair amount of source code entities. Still, there
are mapped to find possible flaws in the technique, which           may exist modules that hold source code entities that do
is why we allow for a module to be empty initially.                 not fit well in other modules, or the system may be under
   As we run several experiments with random initial sets,          evolution, and intended source code has not been created
we get a dataset that shows the correct mapping of each             yet, etc. We need to know if such small modules exist
entity and the number of mappings for each entity and               and whether they are common or problematic.
module. Based on this information, we can compute an                   Is the set of problematic entities related to entities with
error rate for each entity according to Equation 1. If the          poor naming? Tzerpos and Holt [9] define naming as one
attraction function was completely stochastic, the error            of the key criteria that influence the mapping. In our
rate for each entity would converge to the stochastic               experience, it is also a common strategy for developers
error rate, defined in Equation 2.                                  to create folders, packages, and filenames that reflect the
                                                                    modular architecture to some degree. It would thus be
                                                                    interesting to know if the naming of source code entities
                       |erroneous mappings|
            errnba =                                     (1)        includes the module’s name it is mapped to. It is also
                            |mappings|                              interesting to know if there are ambiguities in the naming,
                                                                    i.e., if several module names match the name of a source
                            |modules| − 1                           code entity.
                 errsto =                                (2)           Is the set of problematic entities related to entities on the
                              |modules|
                                                                    border of a module? Bibi et al. [12], Tzerpos and Holt [9],
   As NBAttract is not a stochastic function, the 𝑒𝑟𝑟𝑛𝑏𝑎 for        and Bittencourt et al. [7] state that dependencies have
an entity should converge to something less than 𝑒𝑟𝑟𝑠𝑡𝑜 if          an impact on the mappings. We use a textual representa-
there are no systematic problems, i.e., it should systemat-         tion of dependencies in NBAttract, but this may not be
ically produce better mappings than a random mapping.               good enough. We will investigate the ratio of external
Hence, we can conclude that there are systematic prob-              dependencies, e.g., an entity with many external depen-
lems if we do not find such a convergence for a source              dencies would likely be an entity that lies on the border
code entity after several iterations. If we find systematic         of a module. If we find a correlation between the external
errors in a majority of the systems, we will further an-            dependency ratio and the error rate, this could suggest
alyze all problematic entities to find common, possible             that border entities are problematic.
causes for the misclassification. An entity is considered              There are several metrics based on dependencies. We
problematic if its 𝑒𝑟𝑟𝑛𝑏𝑎 ≥ 0.5, i.e., it is misclassified in       use coupling (the count of all dependencies to or from
50% or more of the mappings. The motivation for this                all other entities) and fan (the existence of a dependency
limit is that a non-problematic attraction function should,         to or from all other entities). The coupling may be very
on average, produce a correct mapping in at least 50%               high between two entities, but the fan can at most be
of the cases for each entity. This part of the research is          one between two entities, i.e., fan is a subset of coupling.
highly exploratory. We investigate the possible reasons             While coupling captures the absolute number of depen-



                                                                3
Tobias Olsson et al. CEUR Workshop Proceedings                                                                                                                 1–10



dencies fan focuses on the diversity of different entities,            Table 1
i.e., a high fan value captures that an entity has many                Mapping Data Overview.
dependencies to other different entities.
                                                                       System Lines # Mod # Ent                             err ≥ 0.5         err ≥ errsto
   Is the set of problematic entities related to problems in the
ground truth mapping? We have access to two versions                   Ant       36 699              16          468      187      39.96%     72     15.38%
of the JabRef system in which the modules and relations                A.UML     62 392              19          767      165      21.51%     74      9.65%
between them are the same (same intended architecture),                JR 3.7    59 235               6        1 017      107      10.52%     40      3.93%
but the mappings are not the same for all entities. This               JR 3.5    51 840               6          733       96       13.1%     51      6.96%
                                                                       Lucene    35 812               7          514       60      11.67%     17      3.31%
provides an opportunity to study discrepancies in the
                                                                       ProM       9 947               4          261       18        6.9%      9      3.45%
ground truth mappings and how these affect the auto-                   S.H 3D    34 964               9          167       39      23.35%     19     11.38%
matic mapping performance. One complicating factor in                  T.Mates   54 904              12          450      115      25.56%     49     10.89%
this analysis is that JabRef underwent an architectural
evolution between these two versions. Therefore, we                                               Entity Error Rates per Project

limit our analysis to entities that remain the same (no                 1.0
changes to the source code) but are mapped to different
modules.
                                                                        0.8
   Is the set of problematic entities related to files that are
being refactored due to architectural evolution? The two
versions of JabRef provide an opportunity to study enti-                0.6


ties that have changed packages and mapping (a sign of
architectural evolution), have changes to the source code               0.4


(a sign of refactoring), or were recently added.
   We study eight open-source systems implemented in                    0.2


Java. Ant1 is an API and command-line tool for process
automation. ArgoUML2 is a desktop application for UML                   0.0


modeling. Jabref3 is a desktop application for managing
                                                                                 Ant



                                                                                          A.UML


                                                                                                     Jr v3.5



                                                                                                                Jr v3.7



                                                                                                                          Lucene


                                                                                                                                    ProM



                                                                                                                                            S.H 3D



                                                                                                                                                     T.Mates
bibliographical references, and we use the 3.5 and 3.7 versions. Lucene⁴ is an indexing and search library. ProM⁵ is an extensible framework that supports a variety of process mining techniques. Sweet Home 3D⁶ is an interior design application. TeamMates⁷ is a web application for handling student peer reviews and feedback.
   Table 1 presents the sizes of the systems in lines of code, number of entities, and number of modules. There exists a documented software architecture as well as a mapping from the implementation to this architecture for each system. JabRef 3.7, TeamMates, and ProM have been the subjects of study at the Software Architecture Erosion and Architectural Consistency Workshop (SAEroCon) 2016, 2017, and 2019 respectively, where a system expert has provided both the architecture and the mapping. The architecture documentation and mappings are available in the SAEroCon repository⁸. ArgoUML, Ant, and Lucene were studied by Brunet et al. and Lenhard et al., and the architectures and mappings were extracted from the replication package of Brunet et al., as were those for Sweet Home 3D. JabRef 3.5 was extracted from Lenhard et al.

    ¹ https://ant.apache.org
    ² http://argouml.tigris.org
    ³ https://jabref.org
    ⁴ https://lucene.apache.org
    ⁵ http://www.promtools.org
    ⁶ http://www.sweethome3d.com
    ⁷ https://teammatesv4.appspot.com
    ⁸ https://github.com/sebastianherold/SAEroConRepo

Figure 2: The entity error rates for each project.

4. Results and Analysis

We performed the experiment and collected mapping data per entity for each system. All systems show several entities always being misclassified (an error rate of 1.0) (cf. Figure 2). Table 1 shows an overview of the data collected. Note that each entity has a random chance to be included in the initial set and not be an orphan in that particular run of the experiment. There is also a chance an entity will not be mapped (e.g., due to variations in the initial set). However, each entity has been mapped at least 500 times.
   We now construct the initial set using entities with 𝑒𝑟𝑟𝑛𝑏𝑎 ≥ 0.5, i.e., only entities with 𝑒𝑟𝑟𝑛𝑏𝑎 < 0.5 are considered orphans, and all the troublesome entities are included in the initial set. We compare this with randomly selecting the initial set from all entities. We collected 14 849 and 13 754 data points from the respective groups. Figure 3 shows the running median (±100 data points) and limits of the running 75th and 25th percentiles of the F1 scores, respectively, for JabRef 3.7. Since the other systems show similar trends, we focus on JabRef. We find that our idea is promising overall, especially when the initial set size increases.
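
The smoothing behind Figure 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the data points are (initial set size, F1 score) pairs, that the window of ±100 data points is simply truncated at both ends of the series, and that quartiles are computed with the standard library's inclusive method. The function name `running_quartiles` is hypothetical.

```python
from statistics import quantiles


def running_quartiles(points, half_window=100):
    """Return (size, q25, median, q75) for each data point.

    `points` is a list of (initial_set_size, f1) pairs. The window around
    point i covers up to `half_window` neighbours on each side and is
    truncated at the ends of the series (an assumption of this sketch).
    """
    ordered = sorted(points)  # order by initial set size
    f1_scores = [f1 for _, f1 in ordered]
    result = []
    for i, (size, _) in enumerate(ordered):
        lo = max(0, i - half_window)
        hi = min(len(ordered), i + half_window + 1)
        window = f1_scores[lo:hi]
        if len(window) < 2:
            # A single value is its own median and quartiles.
            q25 = med = q75 = window[0]
        else:
            q25, med, q75 = quantiles(window, n=4, method="inclusive")
        result.append((size, q25, med, q75))
    return result
```

Running this once per initial set type (all entities versus entities with an error rate below 0.5) would give the two bands that the figure compares.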






[Figure 3 plot: "JabRef 3.7 f1 Scores"; x-axis: Initial Set Size (0.0 to 1.0); y-axis: f1 (0.0 to 1.0); series: All Entities, Error Rate < 0.5]
Figure 3: The running median F1 score with limits of the running 75th and 25th percentiles for JabRef 3.7 and initial set type over the whole interval of initial set sizes.

[Figure 4 plot: "Relative Miss-Classifications vs Relative Module Size"; x-axis: Relative Module Size (0.0 to 0.5); y-axis: Relative Miss-Classifications (0.0 to 1.0)]
Figure 4: The relative rate of misclassified entities vs. the relative number of entities for each module. Coordinates are slightly jittered to show data points more clearly.
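
The two coordinates plotted in Figure 4 can be derived per module as sketched below. The sketch is illustrative only: the input format (a ground truth mapping from entity to module plus a set of entities regarded as misclassified) and the function name are assumptions, not details given in the paper. The small-module limit of 0.5/|modules| used in this section is included as a flag for reference.

```python
from collections import Counter


def module_statistics(ground_truth, misclassified):
    """Compute per-module relative size and relative misclassification.

    ground_truth:  dict mapping entity name -> module name
    misclassified: set of entity names regarded as misclassified
    Returns {module: (relative_size, relative_misclassification, is_small)}.
    """
    sizes = Counter(ground_truth.values())
    errors = Counter(ground_truth[e] for e in misclassified if e in ground_truth)
    total = len(ground_truth)
    # A module is small if it holds fewer than half of an equal share.
    # Assumption: only modules with at least one mapped entity are counted.
    limit = 0.5 / len(sizes)
    stats = {}
    for module, size in sizes.items():
        relative_size = size / total
        stats[module] = (relative_size, errors[module] / size, relative_size < limit)
    return stats
```

A module in the upper left corner of the figure would show up here with a small relative size and a relative misclassification of 1.0.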



   However, there is also an interval of initial set sizes between 0.1 and 0.18, marked with vertical lines in Figure 3, where the F1 score is considerably lower than when using all entities. This indicates that entities with 𝑒𝑟𝑟𝑛𝑏𝑎 ≥ 0.5 are not good representatives of modules. Upon further inspection of the actual modules and entities in JabRef 3.7, we find that there is a set of modules with very few entities in the ground truth mapping, and entities mapped to these all have a high error rate.
   The high error rate makes sense in general, as machine learning techniques produce better results if there is more data. More specifically, for Naive Bayes, the probability of finding an entity in such a module is very low, so the classifier has little reason to map entities to it. We, therefore, investigate if all systems have such small modules prone to misclassification. If entities are equally distributed among the modules of a system, each module would hold a share of 1/|modules| of the entities. We regard a module as small if it holds less than half of this share. Thus, we define the limit for a small module as 0.5/|modules| of all entities, reported as a percentage. For example, a system with 14 modules has a limit of 0.5/14 ≈ 3.57%. It could be argued that lines of code should be used as a more fine-grained measure of size, e.g., when a module consists of one huge entity in terms of lines of code. However, for example, path and file name information is per entity, and for effectively learning a pattern based on entity names, more entities are needed.
   Table 2 shows the limit, the number of small modules, and the rate of misclassification of entities in these modules. A surprising result is that all systems have such small modules, and all systems have small modules where all entities are misclassified. There are 30 (out of a total of 73) modules where all entities are misclassified. Figure 4 shows how the relative number of misclassifications and relative module size are related. Note the cloud of points in the upper left corner. These are the 30 modules where all entities are misclassified. Also, note that there are small modules with a relatively low number of misclassified entities. Another factor to consider is that only 321 entities of 4377 (7.33%) are mapped to these small modules.
   To measure the extent of using a naming strategy (NS) when naming concrete entities in each system, we check if the words in the package or class name for an entity contain the module name. Table 2 shows that most systems use a naming strategy (column NS) to a rather high degree. Lower values (e.g., TeamMates) are often due to a module naming discrepancy, e.g., TeamMates defines a module view with a corresponding path word named ui; however, there are also cases where there is no clear naming strategy for an entity. We consider an entity to have ambiguous naming if its path or filename contains several different module name words. For example, net.sf.jabref.logic.net.ProxyPreferences, from JabRef v3.7, contains both the module names logic and preferences. Ambiguity in entity naming strategy (ANS) seems to be quite common in some systems (Ant, ArgoUML, JabRef 3.5, Sweet Home 3D, and TeamMates) and rare in others (JabRef v3.7, ProM, and Lucene). In some systems, the ambiguity is caused by having a parent-level package that is also a module. For example, Ant uses ant as both a high-level package and a module. The misclassification rate in the ambiguously named entities (ANSM) seems to follow the inverse pattern of the ANS; the lower the ANS, the higher the ANSM. This makes sense since a higher ANS means there is more data to learn the pattern of the ambiguous naming from (if there is one).
   We now turn our attention to whether entities that lie on the border of a module, i.e., have relatively many dependencies to entities in other modules, are problematic. We use the common coupling and fan metrics. Results for








Table 2
The number of entities for small modules (Limit), the number of small modules (SM), their rate of misclassification (SMM), the rate of entities with a naming strategy (NS), ambiguous naming strategy (ANS), and rate of entities with ambiguous naming that are misclassified (ANSM).

  System     Limit (%)   SM   SMM (%)   NS (%)   ANS (%)   ANSM (%)
  Ant             3.57    9     92.75   100.00     85.90      29.41
  A.UML           3.33    7     70.59    84.62     62.45      53.94
  JR v3.5         8.33    4     93.75    71.62     30.29      42.71
  JR v3.7         8.33    3    100.00    95.58      7.18      82.24
  Lucene          7.14    3     57.45    99.22      2.53      90.00
  ProM           12.50    1    100.00   100.00      0.77      88.89
  S.H 3D          5.56    5    100.00    89.82     12.57      64.10
  T.Mates         4.17    5     72.50    68.22     34.67      93.91
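
The NS and ANS columns of Table 2 could be computed along the following lines. This is a hedged sketch rather than the authors' procedure: splitting qualified names on dots and camel-case boundaries, lower-casing, matching module names by exact word equality, and taking both rates over all entities are assumptions, and the function names are hypothetical.

```python
import re


def name_words(qualified_name):
    """Split a qualified name into lower-case words (dots and camel case)."""
    words = []
    for part in qualified_name.split("."):
        # "ProxyPreferences" -> ["Proxy", "Preferences"]
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def matching_modules(qualified_name, module_names):
    """Return the set of module names that occur as words in the entity name."""
    words = set(name_words(qualified_name))
    return {m for m in module_names if m.lower() in words}


def naming_rates(entities, module_names):
    """Return (NS, ANS) in percent for a list of qualified entity names."""
    matches = [matching_modules(e, module_names) for e in entities]
    ns = sum(1 for m in matches if m) / len(entities)
    ans = sum(1 for m in matches if len(m) > 1) / len(entities)
    return 100 * ns, 100 * ans
```

With the module names logic and preferences, `matching_modules("net.sf.jabref.logic.net.ProxyPreferences", ...)` returns both names, so the entity counts as ambiguously named. A discrepancy such as the TeamMates module view versus the path word ui would not be detected by this exact-match sketch.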
coupling were very similar to the results for fan. However, the fan metric seems less noisy, so we opt to report only these values. We use a scatter plot to check whether there is a correlation between these metrics and the error rate. We do this for entities that are part of large modules since small modules are a confounding factor. Figure 5 shows that there is no clear relation between the external fan ratio and the rate of relative misclassification of an entity. It would, therefore, not make sense to compute a correlation between the two variables.

Figure 5: The error rate versus the external fan ratio of each entity in large modules. Coordinates are jittered for clarity. The box plot shows the difference in the external fan ratio of non-problematic (error rate < 0.5) and problematic entities.

   Yet, when we investigate the difference in external coupling for problematic versus non-problematic entities, we find a clear difference in the distribution of the external fan ratio. Problematic entities have a higher external fan ratio in general. This indicates that we need to investigate further how to correctly classify entities that lie on the border of a module. In total, there are 3 502 entities with a low error rate (𝑒𝑟𝑟𝑛𝑏𝑎 < 0.5) and 497 entities with a high error rate. The number of entities with a low error rate is also higher throughout the distribution of the external fan ratio. This makes the probability of finding a problematic entity using the external fan ratio very small.
   To investigate possible cases of disagreement in mappings, we study the entities that have a change in their mapping between versions 3.5 and 3.7 of JabRef. We first specifically look at entities that have only changed their mapping and not moved in the package hierarchy. We consider such entities as having an ambiguous mapping. We find 17 such entities, 5 of which are mapped to a small module in one or both versions, which will make the error rate unrepresentative. We are left with 12 entities.
   We investigate the change of source code for these entities using cloc⁹ and find five entities without any code changes and seven with varying degrees of change. Finally, we investigate the difference in error between the versions. All entities move to a lower error rate in JabRef v3.7, and 11 entities have a problematic mapping in JabRef 3.5 (cf. Figure 6). JabRef 3.5 has 36 entities mapped to non-small modules with problematic error rates. Between 4 and 11 of these seem to be due to problems in the ground truth mappings, i.e., between 11.1% and 30.6%. Ideally, these are cases where a technique would raise an alert and spark a discussion regarding the ground truth mappings among the developers.

    ⁹ https://github.com/AlDanial/cloc

   It should be noted that JabRef underwent a refactoring towards a new modular architecture at this point in development. Therefore, we do not think that these relatively high percentages are representative of all software systems. The developers likely have a higher degree of conformance in a more stable architecture.
   Lastly, we look at architecturally refactored entities between JabRef 3.5 and JabRef 3.7. We define an architecturally refactored entity as an entity that has changed mapping and package. We view the conscious choice to change the package of an entity as a sign that the change in mapping is not a mistake or disagreement but a part of architectural evolution. We find 61 such entities, 5 of which are mapped to a small module in one or both versions. We find that refactored entities have a significantly higher error rate if we compare the error rate of these entities with both new and normal entities from the majority of modules in JabRef 3.7 (cf. Figure 7).
   Such refactored entities could still be in a state of transition, and it seems likely that the common practice is to change the package and mapping before changing major parts of the implementation. An architectural refactoring can also change the purpose of a module itself, though this will be a slower process for an automatic mapper to






detect. The risk is that a module can be quite chaotic during a transition phase with multiple entities in different stages of the refactoring process.
   Another interesting observation is that new files tend to have a lower error rate, indicating that the developers have understood the new architecture and that normal code changes could slowly make an entity harder to classify. This could be due to some form of design erosion, where changes are introduced that make the entity less cohesive over time.

[Figure 6 plot: "Change in Error for Entities with Changed Mapping"; x-axis: Entity (1 to 12); y-axis: Error (0.0 to 1.0); legend: JabRef v3.5, JabRef v3.7, no code change, code changed]
Figure 6: The change in the error of 12 entities from large modules in JabRef that have changed mapping but not changed package. The first five (blue) entities have had no change in source code and the last seven (orange) have changed source code.

[Figure 7 plot: "JabRef 3.7 Large Module Error Rates"; categories: refactored, new, normal; y-axis: error rate (0.0 to 1.0)]
Figure 7: The error rates of entities in large modules that are undergoing refactoring, are new, or normal in JabRef 3.7.

5. Related Work

There is previous work in the area of orphan adoption [9, 10, 8, 7, 12, 15, 16, 17]. The focus of this work is to evaluate and compare the performance of different approaches, but not to specifically analyze problematic cases. We highlight the conclusions that prior work draws regarding what may explain the performance.
   The orphan adoption criteria naming, structure, style, and interface minimization are used in an algorithm evaluated in three case studies [9]. The most interesting of these is an evolving industrial system where the architecture was created by researchers with the help of developers. In total, 939 entities were assigned to modules, and in 46 cases (4.9%), the algorithm suggested a different mapping than the developers. In 33 of these cases, the developers agreed with the algorithm's mapping, i.e., the algorithm was able to find developer mistakes. In some of the 13 cases where the suggested module was not accepted, the developers mentioned that (code) changes to the entity were needed for it to conform to the developer mapping.
   Bibi et al. compared the structural criteria part of the algorithm proposed by Tzerpos and Holt with supervised machine-learning approaches: Bayesian classification, k-nearest-neighbor, and neural networks. Their study focuses on using dependencies as features (i.e., structural criteria) for incremental clustering. They evaluate the approaches using two versions of six open-source software systems and find that dependencies between entities within the same module are important to avoid misclassifications, especially when there are few dependencies between entities in different modules.
   We previously constructed a structure-based heuristic for automatic mapping of source code to Model-View-Controller-based architectures [15]. We evaluated the approach on four products in a product line of games, all using the same game engine. We compared the automatic mapping to the manual mapping, and if they disagreed, then the type was flagged as containing an architectural problem. We compared the mappings of 653 entities and correctly identified 76 out of 101 architectural problems, along with 18 false positives. The heuristic suggested a different mapping in 96 (14.7%) of 653 cases.
   Furthermore, two of the projects were refactored to be fully conformant. This refactoring removed 33 true positives and six false positives. The true positives were remedied by refactoring the source code. In the context of evaluating the performance of a method for automatic mapping using the manual mappings as ground truth, these true positives would be regarded as erroneous mappings when they, in fact, are pointing to source code with architectural problems that need to be refactored.
   The CountAttract and MQAttract attraction functions of HuGMe have been evaluated in four case studies [10, 8]. The focus is on evaluating the influence of two configuration parameters and comparing the performance of the attraction functions. Both attraction functions as-






sume a modular design based on the high-cohesion, low-coupling style, and mapping would become problematic for modules designed specifically to not use this style. Christl et al. suggest incorporating a detection step to better handle such modules, which would correspond to handling the style criteria. Furthermore, Chen et al. improve on CountAttract in an evolutionary case, i.e., where a pre-existing mapping is used.
   Bittencourt et al. present two new attraction functions based on information retrieval techniques. They use the semantic information in the source code and calculate attractions based on cosine similarity (IRAttract) and latent semantic indexing (LSIAttract). They make a quantitative comparison between the performance of their attraction functions and that of CountAttract and MQAttract in an evolutionary setting (where a few new files are to be assigned a mapping). They find that a combination of attraction functions (e.g., if CountAttract fails, then try IRAttract) performs best. This is explained by their qualitative analysis, where they find that CountAttract usually misplaces entities on module borders, MQAttract performs better when mapping entities with dependencies to many different modules, and IRAttract and LSIAttract perform better when mapping entities in libraries or entities on module borders, but perform less well if there are modules that share vocabulary but are not related.
   Sinkala and Herold present InMap, which is not an automated approach to mapping per se; instead, InMap suggests mappings to the end-user, who can choose to accept the suggested mapping (or not). It is an iterative approach where a number of mappings are presented, and the accepted mappings are used to improve the suggested mappings further. The suggested mappings are produced using information retrieval techniques similar to those of Bittencourt et al., with the addition of a descriptive text for each architectural module. The entities are treated as a database of documents, and InMap uses Lucene to search this database using module information as a query. As InMap is highly interactive, it will also use negative evidence to some degree, i.e., a rejected mapping suggestion will not be suggested again. The data from [16] suggest that using only the module names as a search criterion often results in high precision at the expense of recall. This is most likely due to the fact that module names often reflect package names to some degree. Adding more and more module information in the query tends to lower precision but increase recall, e.g., source code comments increase recall but lower precision in the mapping suggestions.
   Garcia et al. discuss the use of package and naming information in software architecture recovery [18]. In general, they found that their ground truth components often spanned or shared several packages. They could not find a correlation between components and single
sented a fairly good correlation, and in one system, they could find a repeating pattern of directories. Possibly the ground truth architectures recovered in their study are more low-level than the modular architectures that we study. Still, it is likely that there is a variation in which dimension of an architecture is expressed in the package structure. This is further supported by Buckley et al., where one system of five studied did not have any clear correlation between packages and modules [19], presenting clear difficulties and significant effort when performing the manual mapping.

6. Discussion and Validity

Our results clearly show that there is a set of entities in the systems that are systematically hard for the state-of-the-art automatic mapping techniques to map. One reason for this is the surprising result that all studied systems exhibit some very small modules. An automated technique would have very little data to use for these modules, lowering the chance for successful mapping.
   In general, unbalanced data is problematic for machine learning techniques, and in particular, the distribution of probabilities is important in Naive Bayes. 30% (237 out of 784) of the problematic entities are mapped to such small modules in the ground truth mappings. This is a significant problem that needs to be solved.
   In essence, small modules need to be flagged (either automatically or manually) and handled separately. One idea in the context of Naive Bayes would be to manipulate the probability distribution appropriately to not wholly disregard small modules in the mapping. Schemes that could be tested are a uniform distribution or different fixed settings (large, medium, small). These should be reasonably easy for an end-user to assign to a module.
   Still, there is a risk that overall performance will drop as potentially more entities will be hard to map. Another approach is to investigate why a few of the small modules do not contain many problematic entities. We suspect that these modules possibly exhibit a unique design, e.g., being very cohesive or having very clear naming, which is perhaps not easy to address directly in a technique as it may simply be the way a module is designed.
   Using the naming strategy and possible ambiguity in naming is an attractive approach to create an initial set using a specialized mapper. It should be possible to prompt an end-user with, e.g., keywords from the package or class name, asking for a mapping of the keyword. This could significantly reduce the effort of creating an initial set that could then be used as a basis for other mapping techniques. However, a complete approach must be prepared to handle subject systems where the naming information does not reflect the modular architecture.
   The data on finding problematic entities among entities
package or directory names. One of their four cases pre-



that lie on the borders of modules is conflicting. On the one hand, we cannot see any correlation
between the external fan ratio and the error rate. On the other hand, we observe a higher median
external fan ratio in problematic entities. We observe very high error rates in combination with
very low external fan ratios and vice versa. This indicates that the external fan ratio is not a
useful metric, and a more refined metric could give better answers. There is possibly a
difference between incoming and outgoing dependencies that could be a factor. In [9], these
entities were specifically detected and only used when suggesting a new module (orphan
kidnapping). Such an approach could also be investigated.
   We studied two different mappings in two versions of JabRef and found six cases where only the
mapping had changed (no change of source code), of which five mapped to large modules. We found
eleven entities where the mapping and source code had changed (though the entity had not changed
package), of which seven were in large modules. For these entities, there was a significant
difference in error rate between the two mappings. We are relatively confident that the
difference in the six entities with no change is due to disagreement among the developers; in the
other eleven, it could also be due to the actual change of the entities’ source code. This would
indicate that between 0.8% and 2.3% of entities are hard to map correctly, even for JabRef
experts.
   It should also be noted that JabRef is only one case and that it was undergoing architectural
refactoring during this time in development. We are reasonably confident that this affects the
results. We can argue that there may be more confusion among the developers during refactoring,
which should increase the chance of disagreements. There is also the possibility that the process
of refactoring has brought the architecture to everyone’s attention, possibly lowering the chance
of disagreements. The low error rate of new entities suggests the latter as more likely.
   The two mappings and versions of JabRef allow us to study entities under refactoring and new
entities. We find 61 entities under refactoring and 348 new entities. If we remove entities from
small modules (with confounding error rates), we find that entities under refactoring are
considerably harder to map correctly. This is likely because architectural refactoring is a
process that can take some time to complete. The functional aspects of the entities are likely
fixed first, possibly with the removal of unwanted dependencies (especially as JabRef has some
tests for this).
   There is, however, a risk that the semantic information (e.g., variable names) will not be
changed to correctly reflect the vocabulary of the module. It would be interesting to see if this
happens to these entities in future versions of JabRef or if the current state is considered good
enough. If so, there is a considerable risk that modules will become less semantically cohesive
as the vocabulary becomes a mix of words from the previous architecture. The error rate of
entities could then be used as a metric to know if an entity is properly aligned to other
entities in the module.
   Comparing a human-made mapping to the mapping made by an automatic technique seems to be a
useful piece of information. The related work [9, 15] shows that this often points to cases where
(further) refactoring or discussion is needed and that the automatic technique is not necessarily
wrong per se. However, if no human mapping exists, it is important for an automated technique to
notify a human user of such issues and not automatically assign the entity a mapping.
   Comparing mappings using several different techniques could be a way forward, similar to what
is done in [7] but with a different intent. This also points to a problematic situation as we
cannot fully trust the ground truth mappings; a perfect mapping technique would thus be flawed.
There is also a general lack of ground truth mappings made by human experts and even fewer
mappings made by different experts on the same system. Four of the systems (JabRef v3.5, JabRef
v3.7, ProM, and TeamMates) have mappings done by experts. The others (ArgoUML, Ant, Lucene, and
Sweet Home 3D) have mappings created by researchers studying the systems’ documentation and
implementations [13]. The architects or developers of these systems would likely not agree with
all of these mappings even if it is likely that large parts of the mappings are correct.
   Two limiting factors in this study are that all systems are implemented in Java and that we
have only studied one set of parameters of the attraction function, i.e., the one from [2] giving
the best mapping performance. Another set of parameters would likely give different error rates;
however, we think the main points of the paper would still hold.

7. Conclusions and Future Work

We investigate the flaws in the automatic mapping of source code to modules in eight open-source
software systems. We show that the state-of-the-art technique has systematic flaws in its
suggested mappings that need to be addressed. We find that a major contributing factor is that
all investigated systems have modules with very few ground truth mappings. We also find that all
systems use a naming strategy, but this strategy is often ambiguous. We found no clear evidence
that entities that have many dependencies to or from entities in other modules are systematically
problematic. Our data indicate that such dependencies can be a factor, but the metrics used are
likely not well suited to clearly show such problems.
   We studied differences in expert mappings in one of




the systems, where we had two different versions and two different ground truths. We found that
disagreements exist and that such entities are likely to have a high error rate in the mappings,
although there are not many such entities. We also studied refactored files and new entities.
Refactored entities tend to have a significantly higher error rate compared to both new entities
and normal entities. There is a risk that refactoring is considered done when the entity is moved
and the functional aspects are fixed. Automatic mapping could indicate when the entity is
properly aligned to other entities in the module or noticeably different.
   Our priority for the future is to address the small modules. We will try different approaches
to manipulating the probability distribution of the modules and find the effect on overall
mapping performance. Another area of interest is the use of naming information to create an
initial set, as this could significantly reduce the mapping effort.

Acknowledgments

The research was supported by the Centre for Data Intensive Sciences and Applications at Linnaeus
University.

References

 [1] T. Olsson, M. Ericsson, A. Wingkvist, Towards improved initial mapping in semi automatic
     clustering, in: Proceedings of the 12th European Conference on Software Architecture:
     Companion Proceedings, ECSA ’18, 2018, pp. 51:1–51:7.
 [2] T. Olsson, M. Ericsson, A. Wingkvist, Semi-automatic mapping of source code using naive
     bayes, in: 13th European Conference on Software Architecture - Volume 2, 2019, pp. 209–216.
 [3] L. De Silva, D. Balasubramaniam, Controlling software architecture erosion: A survey,
     Journal of Systems and Software 85 (2012) 132–151.
 [4] G. C. Murphy, D. Notkin, K. Sullivan, Software reflexion models: Bridging the gap between
     source and high-level models, ACM SIGSOFT Software Engineering Notes 20 (1995) 18–28.
 [5] N. Ali, S. Baker, R. O’Crowley, S. Herold, J. Buckley, Architecture consistency: State of
     the practice, challenges and requirements, Empirical Software Engineering 23 (2017) 1–35.
 [6] J. Knodel, D. Popescu, A comparison of static architecture compliance checking approaches,
     in: The IEEE/IFIP Working Conference on Software Architecture, 2007, pp. 12–21.
 [7] R. A. Bittencourt, G. Jansen de Souza Santos, D. D. S. Guerrero, G. C. Murphy, Improving
     automated mapping in reflexion models using information retrieval techniques, in: IEEE
     Working Conference on Reverse Engineering, 2010, pp. 163–172.
 [8] A. Christl, R. Koschke, M. A. Storey, Automated clustering to support the reflexion method,
     Information and Software Technology 49 (2007) 255–274.
 [9] V. Tzerpos, R. C. Holt, The orphan adoption problem in architecture maintenance, in: IEEE
     Working Conference on Reverse Engineering, 1997, pp. 76–82.
[10] A. Christl, R. Koschke, M. A. Storey, Equipping the reflexion method with automated
     clustering, in: IEEE Working Conference on Reverse Engineering, 2005, pp. 98–108.
[11] T. Olsson, M. Ericsson, A. Wingkvist, s4rdm3x: A tool suite to explore code to architecture
     mapping techniques, Journal of Open Source Software 6 (2021) 2791.
     doi:10.21105/joss.02791.
[12] M. Bibi, O. Maqbool, J. Kanwal, Supervised learning for orphan adoption problem in software
     architecture recovery, Malaysian Journal of Computer Science 29 (2016) 287–313.
[13] J. Brunet, R. A. Bittencourt, D. Serey, J. Figueiredo, On the evolutionary nature of
     architectural violations, in: IEEE Working Conference on Reverse Engineering, 2012,
     pp. 257–266.
[14] J. Lenhard, M. Blom, S. Herold, Exploring the suitability of source code metrics for
     indicating architectural inconsistencies, Software Quality Journal (2018).
[15] T. Olsson, D. Toll, A. Wingkvist, M. Ericsson, Evaluation of a static architectural
     conformance checking method in a line of computer games, in: 10th international ACM Sigsoft
     conference on Quality of software architectures, ACM, 2014, pp. 113–118.
[16] Z. T. Sinkala, S. Herold, Inmap: Automated interactive code-to-architecture mapping
     recommendations, in: IEEE 18th International Conference on Software Architecture (ICSA),
     2021, pp. 173–183.
[17] F. Chen, L. Zhang, X. Lian, An improved mapping method for automated consistency check
     between software architecture and source code, in: IEEE 20th International Conference on
     Software Quality, Reliability and Security (QRS), 2020, pp. 60–71.
[18] J. Garcia, I. Krka, C. Mattmann, N. Medvidovic, Obtaining ground-truth software
     architectures, in: 35th International Conference on Software Engineering (ICSE), 2013,
     pp. 901–910.
[19] J. Buckley, N. Ali, M. English, J. Rosik, S. Herold, Real-time reflexion modelling in
     architecture reconciliation: A multi case study, Information and Software Technology 61
     (2015) 107–123.
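
As an illustration of the idea of manipulating the module probability distribution so that small modules are not wholly disregarded, the following minimal Python sketch compares three prior schemes: one estimated from the number of already mapped entities, a uniform one, and a fixed size setting (large, medium, small). It is not the implementation used in the paper. The multinomial naive Bayes model with Laplace smoothing, the function names, the size weights, and the toy modules "gui" and "cli" are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train(mapped_entities):
    """Count tokens and entities per module from (module, tokens) pairs."""
    token_counts = defaultdict(Counter)
    entity_counts = Counter()
    for module, tokens in mapped_entities:
        token_counts[module].update(tokens)
        entity_counts[module] += 1
    vocabulary = {t for counts in token_counts.values() for t in counts}
    return token_counts, entity_counts, vocabulary


def empirical_priors(entity_counts):
    """Prior proportional to the number of entities already mapped to a module."""
    total = sum(entity_counts.values())
    return {m: c / total for m, c in entity_counts.items()}


def uniform_priors(entity_counts):
    """Every module is equally likely, regardless of its size."""
    return {m: 1.0 / len(entity_counts) for m in entity_counts}


def fixed_priors(size_labels, weights=None):
    """Prior from a user-assigned size label; the weights are hypothetical."""
    weights = weights or {"small": 1.0, "medium": 2.0, "large": 3.0}
    total = sum(weights[label] for label in size_labels.values())
    return {m: weights[label] / total for m, label in size_labels.items()}


def best_module(tokens, token_counts, vocabulary, priors, alpha=1.0):
    """Return the module with the highest log posterior (Laplace smoothing)."""
    best, best_score = None, -math.inf
    for module, counts in token_counts.items():
        denom = sum(counts.values()) + alpha * len(vocabulary)
        score = math.log(priors[module])
        for token in tokens:
            score += math.log((counts[token] + alpha) / denom)
        if score > best_score:
            best, best_score = module, score
    return best


# Toy initial set: one large module (9 entities) and one small module (1 entity).
training = [("gui", ["window", "button"])] * 9 + [("cli", ["args", "parser"])]
token_counts, entity_counts, vocabulary = train(training)

orphan = ["args"]  # vocabulary that only occurs in the small module
for name, priors in [
    ("empirical", empirical_priors(entity_counts)),
    ("uniform", uniform_priors(entity_counts)),
    ("fixed", fixed_priors({"gui": "large", "cli": "small"})),
]:
    print(name, "->", best_module(orphan, token_counts, vocabulary, priors))
```

In this toy setting the empirical prior pulls the orphan into the large module even though its only token occurs exclusively in the small one, while the uniform and fixed-size priors assign it to the small module. Whether such schemes lower overall mapping performance on real systems is exactly the open question raised in the text.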


