=Paper= {{Paper |id=Vol-2978/saerocon-paper6 |storemode=property |title=Hard Cases in Source Code to Architecture Mapping using Naive Bayes |pdfUrl=https://ceur-ws.org/Vol-2978/saerocon-paper6.pdf |volume=Vol-2978 |authors=Tobias Olsson,Morgan Ericsson,Anna Wingkvist |dblpUrl=https://dblp.org/rec/conf/ecsa/OlssonEW21b }} ==Hard Cases in Source Code to Architecture Mapping using Naive Bayes== https://ceur-ws.org/Vol-2978/saerocon-paper6.pdf

Hard Cases in Source Code to Architecture Mapping using
Naive Bayes
Tobias Olsson, Morgan Ericsson and Anna Wingkvist
Department of Computer Science and Media Technology, Linnaeus University, Kalmar/Växjö, Sweden

Abstract
The automatic mapping of source code entities to architectural modules is a challenging problem that is necessary to solve
if we want to increase the use of Static Architecture Conformance Checking in the industry. We apply the state-of-the-art
automatic mapping technique to eight open-source systems and find that there are systematic problems in the automatically
created mappings. All of these eight systems have small modules that are very hard to map correctly since only a few source
code entities are mapped to these. All systems seem to use some naming strategy, mapping source code to modules; however,
naming is often ambiguous. We also find differences in ground truth mappings performed by experts, which affect mappings
based on these, and that architectural refactoring also affects the mapping performance.

Keywords
Orphan Adoption, Software Architecture, Source Code Clustering, Naive Bayes

1. Introduction

Our previous studies [1, 2] of automated techniques to map source code entities to high-level software architectural modules suggest that some entities are much harder to map correctly than others. Even using the best algorithm and different parameters, certain entities always seem to fail to map correctly. We conduct an exploratory study to determine whether our intuition is correct, i.e., that these hard cases exist, and if they do, what their properties are, and what makes them hard to map correctly.

The software architecture of a system captures major design decisions at a high level of abstraction and enables internal and external qualities such as performance, portability, reusability, and maintainability [3]. It serves as a guide for the many decisions that are made during the implementation of a system. As the system evolves, the source code must continue to conform to the architecture or risk accumulating technical debt and no longer possess the desired qualities.

Static Architecture Conformance Checking (SACC) is a collection of methods, such as Reflexion modeling [4], that statically analyze source code to ensure that it does not introduce architectural violations [5, 6]. These methods require an architecture model, with modules and dependencies, and a source code model, with entities and concrete dependencies, e.g., due to inheritance or method invocations. They also require a mapping from the source code model to the architecture model to determine whether the source code dependencies are convergent, absent, or divergent compared to the allowed dependencies specified in the architecture model.

The need for a mapping between the source code and architecture models is a significant reason why SACC has not reached widespread use in the software industry [3, 5, 7, 8]; the tools and methods exist, but the mappings do not or are outdated. Many tools address this by combining manual mapping and regular expressions to filter file, module, and package names. Still, such approaches have proven to be time-consuming and error-prone [5, 7, 8].

If we want to automate the mapping process using, e.g., machine learning, it is vital to understand the hard cases. If there is a class of entities that our approach cannot map automatically or always maps to the wrong modules, we need to ensure that these are part of the initial set that a human expert maps. We perform an exploratory study using eight systems with ground truth mappings to determine whether such a class exists. Once we have established that it exists, we determine its properties to identify its members automatically. We then investigate why these properties make the entities difficult to map to ensure that they will not reduce the effectiveness of the machine learning approach; we do not want it to learn the wrong things from the hard cases.

We hypothesize that at least some hard cases would be difficult for a human to map and that different human experts would disagree on how they should be mapped. This can, for example, be due to poor structuring or the evolution of the system. We rely on different ground-truth mappings of the same system and metrics to identify such cases and study how well these correlate to the hard cases.

ECSA2021 Companion Volume
Email: tobias.olsson@lnu.se (T. Olsson); morgan.ericsson@lnu.se (M. Ericsson); anna.wingkvist@lnu.se (A. Wingkvist)
ORCID: 0000-0003-1154-5308 (T. Olsson); 0000-0003-1173-5187 (M. Ericsson); 0000-0002-0835-823X (A. Wingkvist)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Tobias Olsson et al. CEUR Workshop Proceedings 1–10

2. Automated Mapping

To reason about how well an implementation conforms to the intended architecture using, e.g., Reflexion modeling, we need a mapping from the source code to the architecture. In this section, we discuss how such a mapping can be created semi-automatically, starting from an initial set of mapped source code entities.

The source code model consists of Entities (E) and Dependencies (ED). The entities are, e.g., classes defined in a programming language, and the ED are due to, e.g., method calls and inheritance, see StringChange, ChangeScanner, etc., in Figure 1.

[Figure 1 diagram. Labels: Orphan Entity (StringChange); Structural Relations from the Orphan Entity to the Mapped Entities; Automated Mapping; Architectural Module (GUI, Logic); Allowed Module Dependency; Initially Mapped Set (ChangeScanner, AttachFileAction, DOIChek, XMLUtil, DataBank).]
Figure 1: An example mapping that shows the initial sets of the GUI and Logic modules of JabRef 3.7. A new orphan StringChange is about to be mapped.

The architecture model consists of Modules (M) and Dependencies (MD) between these. The modules represent the major parts of the architecture; see, e.g., GUI and Logic in Figure 1. The directed MD indicates how these modules are allowed to interact and depend on each other. If there, for example, is an MD from GUI to Logic, then entities mapped to GUI are expected to call entities mapped to Logic.

An automated mapping algorithm aims to map each entity to the correct module without human assistance. For example, classes in the implementation that deal with the application’s business rules should be mapped to the module Logic. Once this mapping exists, we can compare the ED of the implementation to the MD allowed by the architecture and determine whether they are convergent, absent, or divergent [4].

We rely on orphan adoption [9] to map entities to modules automatically. An unmapped entity is considered an orphan that should be adopted by one of the modules, e.g., StringChange in Figure 1. Tzerpos and Holt identify four criteria that can affect the mapping. Naming: naming standards can reveal what module is suitable. Structure: dependencies between an orphan and already mapped entities can be used as a mapping criterion. Style: modules are often created using different design principles (e.g., high cohesion or not). Classifying the orphan based on style can give hints on how to use, for example, the structure criteria. Semantics: the source code itself can be analyzed to determine its purpose and its similarity to the purpose of the modules.

A sub-problem of orphan adoption is orphan kidnapping, where software evolution causes a need for remapping an entity to a new module, or in other words, corrective clustering. Tzerpos and Holt identify a fifth criterion related to orphan kidnapping, Interface minimization; it is not a good idea to reassign an entity to another module if the removal of the entity will cause the module to get a larger public interface, i.e., the entity is an entry point/facade to the module.

HuGMe [10, 8] relies on orphan adoption to map from the source code to the architecture model. It starts from an initial set of entities that are manually mapped to the correct module. The remaining entities are considered orphans. HuGMe is applied iteratively, and as the set of mapped entities can grow for each iteration, more orphans have the potential to be automatically mapped. In each iteration, there is also the possibility for human intervention using the result of the failed automatic mapping attempts as a guideline. The automatic mapping is done by calculating the attraction between the orphan and the mapped entities for each module. Christl et al. present two attraction functions, CountAttract and MQAttract, based on dependencies, i.e., the structure criterion. Bittencourt et al. evaluate two new attraction functions based on information retrieval techniques. They use the names of modules and entities and the names of identifiers in the entities to form vocabulary documents for modules and entities, i.e., the naming and semantic criteria. They then use a cosine similarity function, IRAttract, and latent semantic indexing, LSIAttract, to calculate the attraction values.

Our attraction function, NBAttract, combines ideas from the previous two and considers the structure, naming, and semantic criteria [2]. The approach is similar to that of Bittencourt et al., but we instead use a Naive Bayes classifier to determine similarity to other entities. To include the structure criterion, NBAttract uses a novel approach, Concrete Dependency Abstraction (CDA), to encode dependencies as text [2]. NBAttract has outperformed CountAttract in our previous study [2], and CountAttract was not clearly outperformed in [7]. We, therefore, only use NBAttract in the remainder of this paper.

3. Method

Based on our experiences with different attraction functions, we hypothesize that no matter how well the function performs, there is a specific set of entities that are always misclassified. We seek to investigate this further to determine whether our hypothesis is correct or if the misclassifications happen by chance due to randomness in the composition and size of the initial set.
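
To make the conformance check described in Section 2 concrete, the following is a minimal sketch of how entity dependencies (ED) could be compared with allowed module dependencies (MD) once a mapping exists. It follows the usual Reflexion-modeling reading of the three verdicts (convergent: found in the code and allowed; divergent: found in the code but not allowed; absent: allowed but not found in the code). The function name `check_conformance`, the choice to ignore dependencies inside a single module, and the example data (which module each entity belongs to, and the `Model` module) are assumptions made for illustration and are not taken from the paper or its tooling.

```python
from enum import Enum


class Verdict(Enum):
    CONVERGENT = "convergent"  # found in the code and allowed by the architecture
    DIVERGENT = "divergent"    # found in the code but not allowed
    ABSENT = "absent"          # allowed by the architecture but not found in the code


def check_conformance(mapping, entity_deps, module_deps):
    """Classify module-level dependencies by comparing ED with MD.

    mapping:     dict entity name -> module name
    entity_deps: iterable of (source entity, target entity) pairs (ED)
    module_deps: iterable of (source module, target module) pairs (MD)
    """
    allowed = set(module_deps)

    # Lift every entity dependency to the module level via the mapping.
    # Assumption: dependencies within one module are always acceptable.
    observed = set()
    for source, target in entity_deps:
        src_module, dst_module = mapping[source], mapping[target]
        if src_module != dst_module:
            observed.add((src_module, dst_module))

    verdicts = {}
    for dep in observed | allowed:
        if dep in observed and dep in allowed:
            verdicts[dep] = Verdict.CONVERGENT
        elif dep in observed:
            verdicts[dep] = Verdict.DIVERGENT
        else:
            verdicts[dep] = Verdict.ABSENT
    return verdicts


# Hypothetical example data: the module assignments below are assumed.
mapping = {"StringChange": "GUI", "ChangeScanner": "GUI", "XMLUtil": "Logic"}
entity_deps = [
    ("StringChange", "ChangeScanner"),  # stays inside GUI, ignored
    ("StringChange", "XMLUtil"),        # GUI -> Logic
    ("XMLUtil", "ChangeScanner"),       # Logic -> GUI
]
module_deps = [("GUI", "Logic"), ("Logic", "Model")]

verdicts = check_conformance(mapping, entity_deps, module_deps)
for (src, dst), verdict in sorted(verdicts.items()):
    print(f"{src} -> {dst}: {verdict.value}")
```

A wrong mapping changes the lifted dependencies and therefore the verdicts, which is why the quality of the automatic mapping matters for conformance checking.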


We have previously implemented a tool to evaluate different mapping approaches, including reporting detailed mapping results [11]. We use this tool to create a new dataset over the mapping results for each source code entity.

We run NBAttract with the following settings. We use an initial set of mapped entities of random size and composition. We extract package names, filenames (these correspond to the outer class names in Java), attribute identifier names, and variable identifier names from the source code entities in the initial set and tokenize these based on camel case and the characters - and _. The tokens are then stemmed using a Porter stemmer. Tokens that are shorter than three characters are removed. We use our CDA technique to represent dependencies as text strings. We use a binary token frequency (present or not) and 0.9 as the threshold for automatic classification.

These settings correspond to the settings used in [2] with one exception; we do not require the initial set to contain at least one source code entity from each module in this study. We are interested in how individual files are mapped to find possible flaws in the technique, which is why we allow for a module to be empty initially.

As we run several experiments with random initial sets, we get a dataset that shows the correct mapping of each entity and the number of mappings for each entity and module. Based on this information, we can compute an error rate for each entity according to Equation 1. If the attraction function were completely stochastic, the error rate for each entity would converge to the stochastic error rate, defined in Equation 2.

$$err_{nba} = \frac{|\text{erroneous mappings}|}{|\text{mappings}|} \tag{1}$$

$$err_{sto} = \frac{|\text{modules}| - 1}{|\text{modules}|} \tag{2}$$

As NBAttract is not a stochastic function, the $err_{nba}$ for an entity should converge to something less than $err_{sto}$ if there are no systematic problems, i.e., it should systematically produce better mappings than a random mapping. Hence, we can conclude that there are systematic problems if we do not find such a convergence for a source code entity after several iterations. If we find systematic errors in a majority of the systems, we will further analyze all problematic entities to find common, possible causes for the misclassification. An entity is considered problematic if its $err_{nba} \geq 0.5$, i.e., it is misclassified in 50% or more of the mappings. The motivation for this limit is that a non-problematic attraction function should, on average, produce a correct mapping in at least 50% of the cases for each entity. This part of the research is highly exploratory. We investigate the possible reasons for misclassification based on our own experience and the advice from related work, and present exciting findings from the data. The ultimate goal is to construct strategies to detect entities with a high risk of being misclassified so that a human can intervene and classify these manually. More specifically, we will investigate:

Is the set of problematic entities a good candidate for the initial set? This set needs human intervention for automatic mapping to perform well, effectively removing the problem from the automatic mapping. This can be assessed by computing the F1 score of the precision and recall, as we did in [2]. We will compare the F1 scores across the entire range of initial set sizes visually.

Is the set of problematic entities related to small modules? In general, machine learning techniques need good data to perform. In particular, there is a need for a balanced dataset where there is approximately the same amount of data to learn from in each class. If the dataset is imbalanced, there is a high chance that smaller classes will not be properly handled. An architectural module should contain a fair amount of source code entities. Still, there may exist modules that hold source code entities that do not fit well in other modules, or the system may be under evolution, and intended source code has not been created yet, etc. We need to know if such small modules exist and whether they are common or problematic.

Is the set of problematic entities related to entities with poor naming? Tzerpos and Holt [9] define naming as one of the key criteria that influence the mapping. In our experience, it is also a common strategy for developers to create folders, packages, and filenames that reflect the modular architecture to some degree. It would thus be interesting to know if the name of a source code entity includes the name of the module it is mapped to. It is also interesting to know if there are ambiguities in the naming, i.e., if several module names match the name of a source code entity.

Is the set of problematic entities related to entities on the border of a module? Bibi et al. [12], Tzerpos and Holt [9], and Bittencourt et al. [7] state that dependencies have an impact on the mappings. We use a textual representation of dependencies in NBAttract, but this may not be good enough. We will investigate the ratio of external dependencies, e.g., an entity with many external dependencies would likely be an entity that lies on the border of a module. If we find a correlation between the external dependency ratio and the error rate, this could suggest that border entities are problematic.

There are several metrics based on dependencies. We use coupling (the count of all dependencies to or from all other entities) and fan (the existence of a dependency to or from all other entities). The coupling may be very high between two entities, but the fan can at most be one between two entities, i.e., fan is a subset of coupling. While coupling captures the absolute number of dependencies, fan focuses on the diversity of different entities, i.e., a high fan value captures that an entity has many dependencies to other different entities.

Is the set of problematic entities related to problems in the ground truth mapping? We have access to two versions of the JabRef system in which the modules and relations between them are the same (same intended architecture), but the mappings are not the same for all entities. This provides an opportunity to study discrepancies in the ground truth mappings and how these affect the automatic mapping performance. One complicating factor in this analysis is that JabRef underwent an architectural evolution between these two versions. Therefore, we limit our analysis to entities that remain the same (no changes to the source code) but are mapped to different modules.

Is the set of problematic entities related to files that are being refactored due to architectural evolution? The two versions of JabRef provide an opportunity to study entities that have changed packages and mapping (a sign of architectural evolution), have changes to the source code (a sign of refactoring), or were recently added.

We study eight open-source systems implemented in Java. Ant¹ is an API and command-line tool for process automation. ArgoUML² is a desktop application for UML modeling. JabRef³ is a desktop application for managing bibliographical references, and we use the 3.5 and 3.7 versions. Lucene⁴ is an indexing and search library. ProM⁵ is an extensible framework that supports a variety of process mining techniques. Sweet Home 3D⁶ is an interior design application. TeamMates⁷ is a web application for handling student peer reviews and feedback.

Table 1 presents the sizes of the systems in lines of code, number of entities, and number of modules. There exists a documented software architecture as well as a mapping from the implementation to this architecture for each system. JabRef 3.7, TeamMates, and ProM have been the subjects of study at the Software Architecture Erosion and Architectural Consistency Workshop (SAEroCon) 2016, 2017, and 2019 respectively, where a system expert has provided both the architecture and the mapping. The architecture documentation and mappings are available in the SAEroCon repository⁸. ArgoUML, Ant, and Lucene were studied by Brunet et al. and Lenhard et al.; the architectures and mappings for these systems, as well as for Sweet Home 3D, were extracted from the replication package of Brunet et al. JabRef 3.5 was extracted from Lenhard et al.

¹ https://ant.apache.org
² http://argouml.tigris.org
³ https://jabref.org
⁴ https://lucene.apache.org
⁵ http://www.promtools.org
⁶ http://www.sweethome3d.com
⁷ https://teammatesv4.appspot.com
⁸ https://github.com/sebastianherold/SAEroConRepo

Table 1: Mapping Data Overview.

System    Lines     # Mod   # Ent    err ≥ 0.5        err ≥ err_sto
Ant       36 699    16      468      187   39.96%     72   15.38%
A.UML     62 392    19      767      165   21.51%     74    9.65%
JR 3.7    59 235     6      1 017    107   10.52%     40    3.93%
JR 3.5    51 840     6      733       96   13.1%      51    6.96%
Lucene    35 812     7      514       60   11.67%     17    3.31%
ProM       9 947     4      261       18    6.9%       9    3.45%
S.H 3D    34 964     9      167       39   23.35%     19   11.38%
T.Mates   54 904    12      450      115   25.56%     49   10.89%

[Figure 2 plot, titled "Entity Error Rates per Project": entity error rate (0.0 to 1.0) for Ant, A.UML, Jr v3.5, Jr v3.7, Lucene, ProM, S.H 3D, and T.Mates.]
Figure 2: The entity error rates for each project.

4. Results and Analysis

We performed the experiment and collected mapping data per entity for each system. All systems show several entities always being misclassified (an error rate of 1.0) (cf. Figure 2). Table 1 shows an overview of the data collected. Note that each entity has a random chance to be included in the initial set and not be an orphan in that particular run of the experiment. There is also a chance an entity will not be mapped (e.g., due to variations in the initial set). However, each entity has been mapped at least 500 times.

We now construct the initial set using entities with $err_{nba} \geq 0.5$, i.e., only entities with $err_{nba} < 0.5$ are considered orphans, and all the troublesome entities are included in the initial set. We compare this with selecting the initial set randomly from all entities. We collected 14 849 and 13 754 data points from the respective groups. Figure 3 shows the running median (±100 data points) and limits of the running 75th and 25th percentiles of the F1 scores, respectively, for JabRef 3.7. Since the other systems show similar trends, we focus on JabRef. We find that our idea is promising overall, especially when the initial set size increases.
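
The per-entity error rates of Equations 1 and 2, and the split into an initial set and orphans at the 0.5 limit, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the data layout (a list of the modules an entity was mapped to across runs), and the two example entities `A` and `B` are hypothetical and do not come from the authors' tool.

```python
def error_rate(mapped_modules, correct_module):
    """Equation 1: share of an entity's mappings that went to a wrong module."""
    if not mapped_modules:
        raise ValueError("the entity was never mapped")
    erroneous = sum(1 for module in mapped_modules if module != correct_module)
    return erroneous / len(mapped_modules)


def stochastic_error_rate(n_modules):
    """Equation 2: expected error rate of a uniformly random mapping."""
    return (n_modules - 1) / n_modules


def split_by_error_rate(runs, ground_truth, limit=0.5):
    """Put problematic entities (error rate >= limit) in the initial set.

    runs:         dict entity -> list of modules it was mapped to, one per run
    ground_truth: dict entity -> correct module
    Returns (initial_set, orphans) as two sets of entity names.
    """
    initial_set, orphans = set(), set()
    for entity, mapped_modules in runs.items():
        rate = error_rate(mapped_modules, ground_truth[entity])
        (initial_set if rate >= limit else orphans).add(entity)
    return initial_set, orphans


# Hypothetical mapping results for two entities over four runs each.
runs = {
    "A": ["Logic", "Logic", "GUI", "Logic"],  # one wrong mapping out of four
    "B": ["GUI", "GUI", "Logic", "GUI"],      # three wrong mappings out of four
}
ground_truth = {"A": "Logic", "B": "Logic"}

initial_set, orphans = split_by_error_rate(runs, ground_truth)
print("initial set:", sorted(initial_set), "orphans:", sorted(orphans))
print("stochastic error rate for 6 modules:", stochastic_error_rate(6))
```

For a system with six modules, such as JabRef, the stochastic error rate is 5/6, so the 0.5 limit used for problematic entities is considerably stricter than the random baseline.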


[Figure 3 plot, titled "JabRef 3.7 f1 Scores": F1 score (y-axis) against initial set size (x-axis) for the series "All Entities" and "Error Rate < 0.5".]
Figure 3: The running median F1 score with limits of the running 75th and 25th percentiles for JabRef 3.7 and initial set type over the whole interval of initial set sizes.

[Figure 4 plot, titled "Relative Miss-Classifications vs Relative Module Size": relative misclassifications (y-axis) against relative module size (x-axis).]
Figure 4: The relative rate of misclassified entities vs. the relative number of entities for each module. Coordinates are slightly jittered to show data points more clearly.

However, there is also an interval between the initial set sizes of 0.1 to 0.18, marked with vertical lines in Figure 3, where the F1 score is considerably lower than using all entities. This indicates that entities with $err_{nba} \geq 0.5$ are not good representatives of modules. Upon further inspection of the actual modules and entities in JabRef 3.7, we find that there is a set of modules with very few entities in the ground truth mapping, and entities mapped to these all have a high error rate.

The high error rate makes sense in general, as machine learning techniques produce better results if there is more data. More specifically, for Naive Bayes, the probability of finding an entity in such a module is very low, so it does not make sense to map entities to it. We, therefore, investigate if all systems have such small modules prone to misclassification. If entities are equally distributed among the modules of a system, each module would hold a share of 1/|modules| of the entities. We regard a module as small if it holds less than half of that share. Thus we define the limit for a small module as 0.5/|modules|, reported as a percentage in Table 2. It could be argued that the lines of code should be used as a more fine-grained measure of size, i.e., mapping one huge entity in terms of lines of code. However, for example, path and file name information is per entity, and for effectively learning a pattern based on entity names, more entities are needed.

Table 2 shows the limit, the number of small modules, and the rate of misclassification of entities in these modules. A surprising result is that all systems have such small modules, and all systems have small modules where all entities are misclassified. There are 30 (out of a total of 73) modules where all entities are misclassified. Figure 4 shows how the relative number of misclassifications and relative module size are related. Note the cloud of points in the upper left corner. These are the 30 modules where all entities are misclassified. Also, note that there are small modules with a relatively low number of misclassified entities. Another factor to consider is that only 321 entities of 4377 (7.33%) are mapped to these small modules.

To measure the extent of using a naming strategy (NS) when naming concrete entities in each system, we check if the words in the package or class name for an entity contain the module name. Table 2 shows that most systems use a naming strategy (column NS) to a rather high degree. Lower values (e.g., TeamMates) are often due to a module naming discrepancy, e.g., TeamMates defines a module view with a corresponding path word named ui; however, there are also cases where there is no clear naming strategy for an entity. We consider an entity to have ambiguous naming if its path or filename contains several different module name words. For example, net.sf.jabref.logic.net.ProxyPreferences, from JabRef v3.7, contains both the module names logic and preferences. Ambiguity in entity naming strategy (ANS) seems to be quite common in some systems (Ant, ArgoUML, JabRef 3.5, Sweet Home 3D, and TeamMates) and hardly present in others (JabRef v3.7, ProM, and Lucene). In some systems, the ambiguity is caused by having a parent-level package that is also a module. For example, Ant uses ant as both a high-level package and a module. The misclassification rate in the ambiguously named entities (ANSM) seems to follow the inverse pattern of the ANS; the lower the ANS, the higher the ANSM. This makes sense since a higher ANS means there is more data to learn the pattern of the ambiguous naming from (if there is one).

We now turn our attention to whether entities that lie on the border of a module, i.e., have relatively many dependencies to entities in other modules, are problematic. We use the common coupling and fan metrics. Results for
coupling were very similar to the results for fan. However, the fan metric seems less noisy, so we opt only to report these values. We use a scatter plot to check whether there is a correlation between these metrics and the error rate. We do this for entities that are part of large modules since small modules are a confounding factor. Figure 5 shows that there is no clear relation between the external fan ratio and the rate of relative misclassification of an entity. It would, therefore, not make sense to look for a correlation between the two variables.

Table 2: The number of entities for small modules (Limit), the number of small modules (SM), their rate of misclassification (SMM), the rate of entities with a naming strategy (NS), ambiguous naming strategy (ANS), and rate of entities with ambiguous naming that are misclassified (ANSM). All columns except SM are percentages.

System    Limit    SM    SMM      NS       ANS     ANSM
Ant        3.57     9     92.75   100.00   85.90   29.41
A.UML      3.33     7     70.59    84.62   62.45   53.94
JR v3.5    8.33     4     93.75    71.62   30.29   42.71
JR v3.7    8.33     3    100.00    95.58    7.18   82.24
Lucene     7.14     3     57.45    99.22    2.53   90.00
ProM      12.50     1    100.00   100.00    0.77   88.89
S.H 3D     5.56     5    100.00    89.82   12.57   64.10
T.Mates    4.17     5     72.50    68.22   34.67   93.91

[Figure 5 plots: a scatter plot of external fan ratio (y-axis) against error rate (x-axis), and a box plot of external fan ratio for error rate < 0.5 and error rate >= 0.5.]
Figure 5: The error rate versus the external fan ratio of each entity in large modules. Coordinates are jittered for clarity. The box plot shows the difference in the external fan ratio of non-problematic (error rate < 0.5) and problematic entities.

Yet, when we investigate the difference in external coupling for problematic versus non-problematic entities, we find a clear difference in the distribution of the external fan ratio. Problematic entities have a higher external fan ratio in general. This indicates that we need to investigate further how to correctly classify entities that lie on the border of a module. In total, there are 3 502 entities with a low error rate ($err_{nba} < 0.5$) and 497 entities with a high error rate. The number of entities with a low error rate is also higher throughout the distribution of the external fan ratio. This makes the probability of finding a problematic entity using the external fan ratio very small.

To investigate possible cases of disagreement in mappings, we study the entities that have a change in their mapping between versions 3.5 and 3.7 of JabRef. We first specifically look at entities that have only changed their mapping and not moved in the package hierarchy. We consider such entities as having an ambiguous mapping. We find 17 such entities, 5 of which are mapped to a small module in one or both versions, which will make the error rate unrepresentative. We are left with 12 entities.

We investigate the change of source code for these entities using cloc⁹ and find five entities without any code changes and seven with varying degrees of change. Finally, we investigate the difference in error between the versions. All entities move to a lower error rate in JabRef v3.7, and 11 entities have a problematic mapping in JabRef 3.5 (cf. Figure 6). JabRef 3.5 has 36 entities mapped to non-small modules with problematic error rates. Between 4 and 11 of these seem to be due to problems in the ground truth mappings, i.e., 11.1% and 30.6%. Optimally, these are cases where a technique would raise an alert and spark a discussion regarding the ground truth mappings among the developers.

⁹ https://github.com/AlDanial/cloc

It should be noted that JabRef underwent a refactoring towards a new modular architecture at this point in development. Therefore, we do not think that these relatively high percentages are representative of all software systems. The developers likely have a higher degree of conformance in a more stable architecture.

Lastly, we look at architecturally refactored entities between JabRef 3.5 and JabRef 3.7. We define an architecturally refactored entity as an entity that has changed mapping and package. We view the conscious choice to change the package of an entity as a sign that the change in mapping is not a mistake or disagreement but a part of architectural evolution. We find 61 such entities, 5 of which are mapped to a small module in one or both versions. We find that refactored entities have a significantly higher error rate if we compare the error rate of these entities with both new and normal entities from the majority of modules in JabRef 3.7 (cf. Figure 7).

Such refactored entities could still be in a state of transition, and it seems likely to be a practice to make the change of package and mapping before changing major parts of the implementation. An architectural refactoring can also change the purpose of a module itself, though this will be a slower process for an automatic mapper to
detect. The risk is that a module can be quite chaotic during a transition phase with multiple entities in different stages of the refactoring process.

[Figure 6 plot, "Change in Error for Entities with Changed Mapping": y-axis Error (0.0 to 1.0), x-axis Entity (1 to 12); legend: JabRef v3.5, JabRef v3.7, no code change, code changed.]
Figure 6: The change in the error of 12 entities from large modules in JabRef that have changed mapping but not changed package. The first five (blue) entities have had no change in source code and the last seven (orange) have changed source code.

[Figure 7 plot, "JabRef 3.7 Large Module Error Rates": y-axis from 0.0 to 1.0, x-axis categories refactored, new, normal.]
Figure 7: The error rates of entities in large modules that are undergoing refactoring, are new, or are normal in JabRef 3.7.

Another interesting observation is that new files tend to have a lower error rate, indicating that the developers have understood the new architecture and that normal code changes could slowly make an entity harder to classify. This could be due to some form of design erosion, where changes are introduced that make the entity less cohesive over time.

5. Related Work

There is previous work in the area of orphan adoption [9, 10, 8, 7, 12, 15, 16, 17]. The focus is to evaluate and compare the performance of different approaches, but not to specifically analyze problematic cases. We highlight the conclusions of prior work regarding what may explain the performance.

The orphan adoption criteria naming, structure, style, and interface minimization are used in an algorithm evaluated in three case studies [9]. Of these, we find the most interesting to be an evolving industrial system whose architecture was created by researchers with the help of developers. 939 entities were assigned to modules, and in 46 cases (4.9%), the algorithm suggested a different mapping than the developers. In 33 of these cases, the developers agreed with the algorithm's mapping, i.e., the algorithm was able to find developer mistakes. In some of the 13 cases where the suggested module was not accepted, the developers mentioned that (code) changes to the entity were needed for it to conform to the developer mapping.

Bibi et al. compared the structural criteria part of the algorithm proposed by Tzerpos and Holt with supervised machine-learning approaches: Bayesian classification, k-nearest-neighbor, and neural networks. Their study focuses on using dependencies as features (i.e., structural criteria) for incremental clustering. They evaluate the approaches using two versions of six open-source software systems and find that dependencies between entities within the same module are important to avoid misclassifications, especially when there are few dependencies between entities in different modules.

We previously constructed a structure-based heuristic for automatic mapping of source code to Model-View-Controller-based architectures [15]. We evaluated the approach on four products in a product line of games, all using the same game engine. We compared the automatic mapping to the manual mapping, and if they disagreed, then the type was flagged as containing an architectural problem. We compared the mappings of 653 entities and were able to correctly identify 76 out of 101 architectural problems as well as 18 false positives. The heuristic suggested a different mapping in 96 (14.7%) of 653 cases.

Furthermore, two of the projects were refactored to be fully conformant. This refactoring removed 33 true positives and six false positives. The true positives were remedied by refactoring the source code. In the context of evaluating the performance of a method for automatic mapping using the manual mappings as ground truth, these true positives would be regarded as erroneous mappings when they, in fact, are pointing to source code with architectural problems that need to be refactored.

The CountAttract and MQAttract attraction functions of HuGMe have been evaluated in four case studies [10, 8]. The focus is on evaluating the influence of two configuration parameters and comparing the performance of the attraction functions. Both attraction functions assume a modular design based on the high-cohesion, low-coupling style, and mapping would become problematic for modules designed specifically to not use this style. Christl et al. suggest incorporating a detection step to better handle such modules, which would correspond to handling the style criteria. Furthermore, Chen et al. improve on CountAttract in an evolutionary case, i.e., a pre-existing mapping is used.

Bittencourt et al. present two new attraction functions based on information retrieval techniques. They use the semantic information in the source code and calculate attractions based on cosine similarity (IRAttract) and latent semantic indexing (LSIAttract). They quantitatively compare the performance of their attraction functions with that of CountAttract and MQAttract in an evolutionary setting (where a few new files are to be assigned a mapping). They find that a combination of attraction functions (e.g., if CountAttract fails, then try IRAttract) performs best. This is explained by their qualitative analysis, where they find that CountAttract usually misplaces entities on module borders, MQAttract performs better when mapping entities with dependencies to many different modules, and IRAttract and LSIAttract perform better when mapping entities in libraries or entities on module borders, but perform less well if there are modules that share vocabulary but are not related.

Sinkala and Herold present InMap, which is not an automated approach to mapping per se; instead, InMap suggests mappings to the end-user, who can choose to accept the suggested mapping (or not). It is an iterative approach where a number of mappings are presented, and the accepted mappings are used to improve the suggested mappings further. The suggested mappings are produced using information retrieval techniques similar to those of Bittencourt et al., with the addition of a descriptive text for each architectural module. The entities are treated as a database of documents, and InMap uses Lucene to search this database using module information as a query. As InMap is highly interactive, it will also use negative evidence to some degree, i.e., a rejected mapping suggestion will not be suggested again. The data from [16] suggest that using only the module names as a search criterion often results in high precision at the expense of recall. This is most likely due to the fact that module names often reflect package names to some degree. Adding more and more module information to the query tends to lower precision but increase recall, e.g., source code comments increase recall but lower precision in the mapping suggestions.

Garcia et al. discuss the use of package and naming information in software architecture recovery [18]. In general, they found that their ground truth components often spanned or shared several packages. They could not find a correlation between components and single package or directory names. One of their four cases presented a fairly good correlation, and in one system, they could find a repeating pattern of directories. Possibly the ground truth architectures recovered in their study are more low-level than the modular architectures that we study. Still, it is likely that there is variation in which dimension of an architecture is expressed in the package structure. This is further supported by Buckley et al., where one of five studied systems did not have any clear correlation between packages and modules [19], presenting clear difficulties and requiring significant effort when performing the manual mapping.

6. Discussion and Validity

Our results clearly show that there is a set of entities in the systems that are systematically hard for the state-of-the-art automatic mapping techniques to map. One reason for this is the surprising result that all studied systems exhibit some very small modules. An automated technique would have very little data to use for these modules, lowering the chance for successful mapping. In general, unbalanced data is problematic for machine learning techniques, and in particular, the distribution of probabilities is important in Naive Bayes. 30% (237 out of 784) of the problematic entities are mapped to such small modules in the ground truth mappings. This is a significant problem that needs to be solved.

In essence, small modules need to be flagged (either automatically or manually) and handled separately. One idea in the context of Naive Bayes would be to manipulate the probability distribution appropriately so as not to wholly disregard small modules in the mapping. Schemes that could be tested are a uniform distribution or different fixed settings (large, medium, small). These should be reasonably easy for an end-user to assign to a module. Still, there is a risk that overall performance will drop as potentially more entities will be hard to map. Another approach is to investigate why a few of the small modules do not contain many problematic entities. We suspect that these modules possibly exhibit a unique design, e.g., being very cohesive or having very clear naming, which is perhaps not easy to address directly in a technique as it may simply be the way a module is designed.

Using the naming strategy and possible ambiguity in naming is an attractive approach to creating an initial set using a specialized mapper. It should be possible to prompt an end-user with, e.g., keywords from the package or class name asking for a mapping of the keyword. This could significantly reduce the effort of creating an initial set that could then be used as a basis for other mapping techniques. However, a complete approach must be prepared to handle subject systems where the naming information does not reflect the modular architecture.

The data on finding problematic entities among entities


that lie on the borders of modules is conflicting. On the one hand, we cannot see any correlation between the external fan ratio and the error rate. On the other hand, we observe a higher median external fan ratio in problematic entities. We observe very high error rates in combination with very low external fan ratios and vice versa. This indicates that the external fan ratio is not a useful metric, and a more refined metric could give better answers. There is possibly a difference between incoming and outgoing dependencies that could be a factor. In [9], these entities were specifically detected and only used when suggesting a new module (orphan kidnapping). Such an approach could also be investigated.

We studied two different mappings in two versions of JabRef and found six cases where only the mapping had changed (no change of source code), of which five mapped to large modules. We found eleven entities where the mapping and source code had changed (though the entity had not changed package), of which seven were in large modules. For these entities, there was a significant difference in error rate between the two mappings. We are relatively confident that the difference in the six entities with no change is due to disagreement among the developers; in the other eleven, it could also be due to the actual change of the entities' source code. This would indicate that between 0.8% and 2.3% of entities are hard to map correctly, even for JabRef experts.

It should also be noted that JabRef is only one case and that it was undergoing architectural refactoring during this time in development. We are reasonably confident that this affects the results. We can argue that there may be more confusion among the developers during refactoring, which should increase the chance of disagreements. There is also the possibility that the process of refactoring has brought the architecture to everyone's attention, possibly lowering the chance of disagreements. The low error rate of new entities suggests the latter as more likely.

The two mappings and versions of JabRef allow us to study entities under refactoring and new entities. We find 61 entities under refactoring and 348 new entities. If we remove entities from small modules (with confounding error rates), we find that entities under refactoring are considerably harder to map correctly. This is likely because architectural refactoring is a process that can take some time to complete. The functional aspects of the entities are likely fixed first, possibly with the removal of unwanted dependencies (especially as JabRef has some tests for this).

There is, however, a risk that the semantic information (e.g., variable names) will not be changed to correctly reflect the vocabulary of the module. It would be interesting to see if this happens to these entities in future versions of JabRef or if the current state is considered good enough. If so, there is a considerable risk that modules will become less semantically cohesive as the vocabulary becomes a mix of words from the previous architecture. The error rate of entities could then be used as a metric to know if an entity is properly aligned to other entities in the module.

Comparing a human-made mapping to the mapping made by an automatic technique seems to be a useful piece of information. The related work [9, 15] shows that this often points to cases where (further) refactoring or discussion is needed and that the automatic technique is not necessarily wrong per se. However, if no human mapping exists, it is important for an automated technique to notify a human user of such issues and not automatically assign the entity a mapping.

Comparing mappings using several different techniques could be a way forward, similar to what is done in [7] but with a different intent. This also points to a problematic situation as we cannot fully trust the ground truth mappings; a perfect mapping technique would thus be flawed. There is also a general lack of ground truth mappings made by human experts and even fewer mappings made by different experts on the same system. Four of the systems (JabRef v3.5, JabRef v3.7, ProM, and TeamMates) have mappings done by experts. The others (ArgoUML, Ant, Lucene, and Sweet Home 3D) have mappings created by researchers studying the systems' documentation and implementations [13]. The architects or developers of these systems would likely not agree with all of these mappings even if it is likely that large parts of the mappings are correct.

Two limiting factors in this study are that all systems are implemented in Java and that we have only studied one set of parameters of the attraction function, i.e., the one from [2] giving the best mapping performance. Another set of parameters would likely give different error rates; however, we think the main points of the paper would still hold.

7. Conclusions and Future Work

We investigate the flaws in the automatic mapping of source code to modules in eight open-source software systems. We show that the state-of-the-art technique has systematic flaws in its suggested mappings that need to be addressed. We find that a major contributing factor is that all investigated systems have modules with very few ground truth mappings. We also find that all systems use a naming strategy, but this strategy is often ambiguous. We found no clear evidence that entities that have many dependencies to or from entities in other modules are systematically problematic. Our data indicate that such dependencies can be a factor, but the metrics used are likely not well suited to clearly show such problems.

We studied differences in expert mappings in one of


the systems, where we had two different versions and two different ground truths. We found that disagreements exist and that such entities are likely to have a high error rate in the mappings, although there are not many such entities. We also studied refactored files and new entities. Refactored entities tend to have a significantly higher error rate compared to both new entities and normal entities. There is a risk that refactoring is considered done when the entity is moved and the functional aspects are fixed. Automatic mapping could indicate when the entity is properly aligned to other entities in the module or noticeably different.

Our priority for the future is to address the small modules. We will try different approaches to manipulating the probability distribution of the modules and find the effect on overall mapping performance. Another area of interest is the use of naming information to create an initial set, as this could significantly reduce the mapping effort.

Acknowledgments

The research was supported by the Centre for Data Intensive Sciences and Applications at Linnaeus University.

References

[1] T. Olsson, M. Ericsson, A. Wingkvist, Towards improved initial mapping in semi automatic clustering, in: Proceedings of the 12th European Conference on Software Architecture: Companion Proceedings, ECSA '18, 2018, pp. 51:1–51:7.
[2] T. Olsson, M. Ericsson, A. Wingkvist, Semi-automatic mapping of source code using naive bayes, in: 13th European Conference on Software Architecture - Volume 2, 2019, pp. 209–216.
[3] L. De Silva, D. Balasubramaniam, Controlling software architecture erosion: A survey, Journal of Systems and Software 85 (2012) 132–151.
[4] G. C. Murphy, D. Notkin, K. Sullivan, Software reflexion models: Bridging the gap between source and high-level models, ACM SIGSOFT Software Engineering Notes 20 (1995) 18–28.
[5] N. Ali, S. Baker, R. O'Crowley, S. Herold, J. Buckley, Architecture consistency: State of the practice, challenges and requirements, Empirical Software Engineering 23 (2017) 1–35.
[6] J. Knodel, D. Popescu, A comparison of static architecture compliance checking approaches, in: The IEEE/IFIP Working Conference on Software Architecture, 2007, pp. 12–21.
[7] R. A. Bittencourt, G. Jansen de Souza Santos, D. D. S. Guerrero, G. C. Murphy, Improving automated mapping in reflexion models using information retrieval techniques, in: IEEE Working Conference on Reverse Engineering, 2010, pp. 163–172.
[8] A. Christl, R. Koschke, M. A. Storey, Automated clustering to support the reflexion method, Information and Software Technology 49 (2007) 255–274.
[9] V. Tzerpos, R. C. Holt, The orphan adoption problem in architecture maintenance, in: IEEE Working Conference on Reverse Engineering, 1997, pp. 76–82.
[10] A. Christl, R. Koschke, M. A. Storey, Equipping the reflexion method with automated clustering, in: IEEE Working Conference on Reverse Engineering, 2005, pp. 98–108.
[11] T. Olsson, M. Ericsson, A. Wingkvist, s4rdm3x: A tool suite to explore code to architecture mapping techniques, Journal of Open Source Software 6 (2021) 2791. doi:10.21105/joss.02791.
[12] M. Bibi, O. Maqbool, J. Kanwal, Supervised learning for orphan adoption problem in software architecture recovery, Malaysian Journal of Computer Science 29 (2016) 287–313.
[13] J. Brunet, R. A. Bittencourt, D. Serey, J. Figueiredo, On the evolutionary nature of architectural violations, in: IEEE Working Conference on Reverse Engineering, 2012, pp. 257–266.
[14] J. Lenhard, M. Blom, S. Herold, Exploring the suitability of source code metrics for indicating architectural inconsistencies, Software Quality Journal (2018).
[15] T. Olsson, D. Toll, A. Wingkvist, M. Ericsson, Evaluation of a static architectural conformance checking method in a line of computer games, in: 10th international ACM Sigsoft conference on Quality of software architectures, ACM, 2014, pp. 113–118.
[16] Z. T. Sinkala, S. Herold, Inmap: Automated interactive code-to-architecture mapping recommendations, in: IEEE 18th International Conference on Software Architecture (ICSA), 2021, pp. 173–183.
[17] F. Chen, L. Zhang, X. Lian, An improved mapping method for automated consistency check between software architecture and source code, in: IEEE 20th International Conference on Software Quality, Reliability and Security (QRS), 2020, pp. 60–71.
[18] J. Garcia, I. Krka, C. Mattmann, N. Medvidovic, Obtaining ground-truth software architectures, in: 35th International Conference on Software Engineering (ICSE), 2013, pp. 901–910.
[19] J. Buckley, N. Ali, M. English, J. Rosik, S. Herold, Real-time reflexion modelling in architecture reconciliation: A multi case study, Information and Software Technology 61 (2015) 107–123.
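
As an illustration of the prior manipulation proposed for small modules in Section 6, the following minimal sketch shows how the module prior in a token-based Naive Bayes mapper can override the evidence for a small module. It is not the implementation evaluated in the paper: the functions `train` and `classify`, the module names `gui` and `cli`, the toy tokens, and the use of add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train(mapped_entities):
    """Count tokens and entities per module from (tokens, module) pairs."""
    token_counts = defaultdict(Counter)
    entity_counts = Counter()
    for tokens, module in mapped_entities:
        token_counts[module].update(tokens)
        entity_counts[module] += 1
    return token_counts, entity_counts


def classify(tokens, token_counts, prior_weights):
    """Return the module with the highest log posterior for the given tokens.

    prior_weights maps each module to a positive weight; the weights are
    normalized into the prior distribution over modules.
    """
    vocabulary = {t for counts in token_counts.values() for t in counts}
    weight_sum = sum(prior_weights.values())
    scores = {}
    for module, counts in token_counts.items():
        score = math.log(prior_weights[module] / weight_sum)
        module_tokens = sum(counts.values())
        for token in tokens:
            # Laplace (add-one) smoothing over the shared vocabulary.
            score += math.log((counts[token] + 1) / (module_tokens + len(vocabulary)))
        scores[module] = score
    return max(scores, key=scores.get)


# Hypothetical initial set: one large module (20 entities), one small module (1 entity).
mapped = [(["dialog", "panel", "util"], "gui")] * 20 + [(["args", "parse", "util"], "cli")]
token_counts, entity_counts = train(mapped)
orphan = ["parse", "util"]

empirical = dict(entity_counts)                     # prior proportional to module size
uniform = {module: 1 for module in entity_counts}   # every module equally likely a priori

print(classify(orphan, token_counts, empirical))
print(classify(orphan, token_counts, uniform))
```

In this toy data, the size-proportional prior maps the orphan to `gui` even though its tokens fit `cli` better, while the uniform prior maps it to `cli`. Fixed settings such as large, medium, and small could be passed in the same way as user-chosen weights, e.g., `{"gui": 3, "cli": 1}`.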