<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zipani Tom Sinkala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Herold</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, Karlstad University</institution>
          ,
          <addr-line>Karlstad</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automating the mapping of a system's code to its architecture helps improve the adoption of successful Software Architecture Consistency Checking (SACC) methods like Reflexion Modelling. InMap is an interactive code-to-architecture mapping recommendation technique that has been shown to perform this task with good recall and precision using natural language software architecture descriptions of the architectural modules. However, InMap, like most other automated recommendation techniques, maps low-level source code units such as source code files or classes to architectural modules. For large, complex systems this can still be a barrier to adoption because of the effort an architect must spend accepting or rejecting the recommendations. In this study we propose an extension to InMap that provides recommendations for higher-level source code units, that is, packages. It applies InMap's information retrieval capabilities, which need only minimal architecture documentation, to a software's codebase in order to recommend mappings between the software's high-level source code entities and its architectural modules. We show that our proposed hierarchical mapping technique reduces the effort required by the architect, by as much as 6-fold in some cases, while still achieving good precision and fairly good recall.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Mapping</kwd>
        <kwd>Software Architecture Consistency Checking</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Mapping code to architecture is a task that is
common in Software Architecture Consistency
Checking (SACC) [1, 11, 14, 16]. Popular SACC
methods like Reflexion Modelling [9, 12] require
a mapping step in order to be able to identify
conformance or divergence of a system’s code to
its intended software architectural modules [8, 9,
12, 13]. The mapping step is, for the most part, a manual and
labour-intensive task that becomes a
barrier to industry adoption of effective SACC
techniques like Reflexion Modelling, especially
for large complex software systems [1, 7].</p>
      <p>A number of techniques have been created
that attempt to decrease the
burden of mapping on software architects by
automating the mapping step [4–6, 11, 15, 16].
Most of these, however, are class- or file-based
[11, 15, 16]. For systems
developed in an object-oriented programming
language, where classes are the
underlying unit of source code, this means that they automate
mapping at the class level, attempting to predict
which architectural module a class (or class file)
maps to. This has been done quite well with
techniques like InMap [15, 16] and NBC [11]. In
our paper “InMap: Automated Interactive
Code-to-Architecture Mapping Recommendations” we
show that InMap achieved a recall of 0.87-1.00
and a precision of 0.70-0.96 for the systems tested.</p>
      <p>However, in a large system of, say, 1000+
classes, even with recall and precision
of 1, it is still burdensome for an architect to
inspect over a thousand recommendations before
accepting them as correct. In an attempt to reduce
the effort needed, we investigate making mapping
recommendations for higher-level source code
units; that is, we make mapping
recommendations for larger units of code at a time (packages
rather than classes), thereby reducing the amount
of work required by an architect. In this paper, we
present an automated hierarchical package
mapping technique. It builds on the successful
information retrieval-based InMap approach [15,
16], which computes the similarity of an unmapped class
to an architectural module. We exploit the
class-to-module similarity scores produced by InMap to
generate package-to-module similarity scores.
These are filtered using a defined set of heuristics,
from which recommendations, determined
by the system’s package hierarchy, are made. We
show that our proposed hierarchical
mapping technique reduces the effort
required by the architect, by as much as 6-fold in some
cases, while still achieving good precision.</p>
      <p>Section 2 briefly discusses automated mapping
techniques along with their hierarchical mapping
capabilities. In Section 3, we detail the approach,
describing how package scores are computed and
how package-to-module mapping
recommendations are constructed. Section 4 describes the
experiment setup to evaluate the technique and
presents the results obtained. In Section 5, we
interpret and discuss the results and in Section 6
we draw our conclusions on our findings and
present opportunities for further research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Christl et al. conceived HuGME, a
dependency analysis (DA) based automated
mapping recommendation technique. It clusters a
software system’s source code using an
architect’s knowledge about its intended
architecture [4, 5]. HuGME applies an attraction
function, which minimizes coupling and
maximizes cohesion, to produce a matrix of
attraction scores for unmapped entities to modules
[17]. The calculation of the score uses the
dependency values between unmapped entities
and mapped entities. The higher the score, the
higher the likelihood that an unmapped entity
belongs to a given module. All unmapped entities
that result in only one candidate having a
similarity score higher than the arithmetic mean
of all scores produce a single recommendation.
All unmapped entities for which two or more
candidates exist are presented to the user in ranked
order, from highest to lowest, as
recommendations. HuGME presents recommendations to
the user to allow cluster decisions to be made
exclusively by the architect. This process is
incremental, in that HuGME does not attempt to
map all source code entities in one complete step;
rather it maps a subset at a time until no more
mapping is possible. The approach is
non-hierarchical as it views the mapping task from a
clustering perspective in which source code
entities that are mapped to the same hypothesized
entity form a cluster [4].</p>
      <p>In their study, the results for HuGME had on
average about 90% recall and 80-90% accuracy
[5]. To get these results the technique needed
about 20% of the system’s source entities to be
pre-mapped before running the algorithm. Of
interest is that because this mapping technique is
dependency-based, for it to give meaningful
results, the 20% pre-mapped source entities need
to be spread across various modules. In addition,
they must have dependencies to unmapped
entities. This presents a problem in that in order to
benefit from this technique one needs to not only
dedicate some time for pre-mapping but must also
ensure that the mapping is evenly spread across
the modules. Additionally, one must also ensure
that the selected pre-mapped source code entities
have dependencies to the unmapped entities
otherwise entity relationship discovery is poor.
This all becomes a highly labour-intensive
exercise. Furthermore, because it uses clustering
algorithms based on high cohesion and low
coupling, if developers do not follow this
principle in the software’s implementation then
the mapping of the algorithm will be affected [2].</p>
      <p>Bittencourt et al. propose an information
retrieval (IR) based technique that uses the same
automated mapping recommendations approach
as HuGME except it replaces dependency-based
attraction functions with IR based similarity
functions [3]. It calculates the similarity of an
unmapped source entity to a module by searching
for specific terms (a module’s name and mapped
classes, methods and fields) within the source
code of the unmapped class. Similar to HuGME,
Bittencourt et al.’s technique needs some manual
pre-mapping before it can automate mapping.</p>
      <p>Olsson et al. combine IR &amp; DA methods in
their automated mapping technique called Naive
Bayes Classification (NBC) [11]. NBC uses
Bayes’ theorem to build a probabilistic model of
classifications using words taken from the source
code entities. The model gives the probability of
words belonging to a source file entity. This is
augmented with syntactical information of the
dependencies, a method called Concrete
Dependency Abstraction [11]. Just like HuGME,
Olsson et al.’s proposed technique requires a
pre-mapped set in order to perform well. Both
Bittencourt et al.’s and Olsson et al.’s results
showed that when there was a smaller pre-mapped
set there was a decreasing trend in the F1-score of
their techniques [3, 11]. Additionally, neither
addresses package-level mapping.</p>
      <p>Naim et al. present a technique called
Coordinated Clustering of Heterogeneous
Datasets (CCHD), that combines both DA and IR
methods to compute a similarity score for source
code files [10]. CCHD uses an architect’s
feedback on the recovered architecture to
iteratively adjust the results until there are no
suggestions for change. These adjusted results
train a classifier that automatically places new
code added to a codebase in the “right”
architectural module. However, the technique is
not necessarily meant for automated mapping in
SACC but rather for software architecture
recovery tasks. Moreover, it too does not directly
address package-level based mapping.</p>
      <p>Common among industry tools is the use of
naming patterns (or regular expressions). For
example, the expressions **/gui/** or *.gui.* or
net.java.gui.* can be used to map source code
units (whether classes or packages) to an
architecture module named GUI. This is the
technique used by both Sonargraph Architect
and Structure101 Studio in addition to their drag
&amp; drop capabilities. However, the drawback of
using naming patterns and/or drag &amp; drop
functionality is that they are both manual tasks
which makes mapping a tedious exercise –
especially for large software systems that have
complex mapping configurations.</p>
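      <p>As a minimal sketch of how such naming patterns behave, the following Python snippet matches fully qualified names against glob-style patterns using the standard fnmatch module. The rule table and the function name map_by_pattern are illustrative assumptions and do not reproduce the rule syntax of Sonargraph Architect or Structure101 Studio.</p>
      <preformat><![CDATA[
```python
from fnmatch import fnmatchcase

# Hypothetical pattern-to-module rules; real tools use their own syntax.
RULES = [("*.gui.*", "GUI"), ("net.java.core.*", "Core")]


def map_by_pattern(qualified_name, rules=RULES):
    """Return the first module whose pattern matches, or None."""
    for pattern, module in rules:
        if fnmatchcase(qualified_name, pattern):
            return module
    return None
```
]]></preformat>
      <p>Every rule in such a table still has to be written and maintained by hand, which is the manual effort referred to above.</p>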
      <p>In summary, despite advances made, available
techniques that are designed to automate mapping
have shortcomings. Some require an initial set of
the source code to be pre-mapped manually [3–5,
11], while the industry tools that do not require
pre-mapping offer manual methods. Additionally,
the automated mapping techniques that require
pre-mapping in order to “jump-start” mapping, as
it were, require about 15-20% of the source code
to be pre-mapped in order to give worthwhile
results [4, 5, 6, 15].</p>
      <p>InMap [15, 16] addresses the limitations of
these techniques in that it is able to automate
mapping without requiring pre-mapping. Using
simple and concise natural language descriptions
of the architecture modules, it is able to automate
the mapping of a completely unmapped system with
rather good results. Its limitation, though, is that the
mapping recommendations provided are for
low-level source code units, namely classes. This
results in considerable work for an architect in the
case of large software systems. We therefore
explore the following research question:</p>
      <disp-quote>
        <p>How can we exploit InMap’s good
class-to-module mappings to produce
package-to-module mappings, thereby reducing the
effort needed by an architect in accepting
and/or rejecting mapping
recommendations produced by InMap?</p>
      </disp-quote>
      <p>In the following section, we describe our
approach to answering this question.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>We begin by describing the InMap technique
briefly. We then describe a technique for
hierarchical package-to-module mapping that
builds on top of InMap.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. InMap</title>
      <p>InMap is an interactive code-to-architecture
automated mapping technique for SACC methods
that uses information retrieval concepts to
produce class-to-module mapping
recommendations. It does not require manual pre-mapping
in order to produce recommendations; rather, it
uses natural language architectural descriptions of
the architectural modules as input to predict
mappings. It presents its best mapping
recommendations a page/set at a time (the
optimal size being 30 per page), which the
architect can accept or reject. As
recommendations from each page/set are accepted
or rejected, InMap learns from this and adapts its
next page/set of recommendations from the
obtained knowledge. This method works quite
well giving an average recall of 97% and a
precision of 82% for the systems evaluated [16].</p>
    </sec>
    <sec id="sec-5">
      <title>3.1.1. Class-to-Module Similarity</title>
      <p>InMap’s algorithm is made up of seven steps
[16]. Our hierarchical
package-to-module mapping technique uses the first five of these
steps, which generate what are called
class-to-module similarity scores.</p>
      <p>First, the source code files are filtered to
exclude any external or third-party package
libraries, or classes of the system, that the architect
does not want to include in the mapping exercise.
Second, the filtered source files are stripped of
any special characters and programming language
keywords. Third, the pre-processed source code
files are indexed as an inverted index. In the fourth
and fifth steps, InMap formulates a query for each module
and uses it to search the indexed source code files for
similarity to that module. The query draws on up to
four items. In the first iteration,
InMap builds the query from only (1) the name of the module
and (2) the module’s architectural description
(stripped of any special characters and stop
words). Once the first set of classes has been mapped,
InMap adds to the query (3) the names of
classes mapped to the module and (4) the names of
methods contained within classes mapped to the
module. This ‘enriches’ the query used to search
for the similarity of an unmapped class to a
module. Therefore, after each set of newly
mapped classes, the query for the next set of
recommendations looks different. The search
returns a score for every class-module pair
based on the information retrieval similarity
function tf-idf. The tf-idf scores are called
class-to-module similarity scores (<inline-formula><tex-math>S_{cm}</tex-math></inline-formula>), where c and
m are a class-module pair in the system. Specifics
of how tf-idf is calculated can be found in [16].</p>
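      <p>To make the idea of scoring every class-module pair concrete, the following sketch computes a plain tf-idf sum in Python. It is a generic textbook variant under stated assumptions, not InMap's exact weighting (which is specified in [16]); the function name tfidf_scores and the token-list inputs are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf_scores(class_tokens, module_queries):
    """Score every class-module pair with a plain tf-idf sum.

    class_tokens: dict mapping a class name to its list of tokens.
    module_queries: dict mapping a module name to its query terms.
    """
    n_docs = len(class_tokens)
    # Document frequency: in how many classes each term occurs.
    df = Counter()
    for tokens in class_tokens.values():
        df.update(set(tokens))
    scores = {}
    for cls, tokens in class_tokens.items():
        tf = Counter(tokens)
        for module, query in module_queries.items():
            total = 0.0
            for term in set(query):
                if tf[term]:
                    # Smoothed idf (an assumption of this sketch).
                    idf = math.log(n_docs / df[term]) + 1.0
                    total += tf[term] * idf
            scores[(cls, module)] = total
    return scores
```
]]></preformat>
      <p>Enriching a module's query with the names of already mapped classes and methods then amounts to extending the corresponding term list before the next call.</p>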
    </sec>
    <sec id="sec-8">
      <title>3.1.2. Class-to-Module Mapping Recommendations</title>
      <p>In the sixth and seventh steps, InMap recommends, for
each class, its highest-scoring class-to-module pair, which the
architect can either accept or reject.
InMap does not present every such pair, however: it presents only
those scoring above the arithmetic mean of all highest-scoring
class-module pairs or, if more than 30 pairs
exceed the mean, the best 30. After the architect gives
feedback, InMap returns to step 4 and repeats steps 4 to
7 until no more recommendations can be given.</p>
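      <p>The selection rule just described can be sketched as follows. This is an illustrative reading of the rule rather than InMap's implementation; the function name select_recommendations and the dictionary layout of the scores are assumptions.</p>
      <preformat><![CDATA[
```python
def select_recommendations(scores, page_size=30):
    """Pick one page of class-to-module recommendations.

    scores: dict mapping (class, module) to a similarity score.
    """
    # Keep only the highest-scoring module for each class.
    best = {}
    for (cls, module), score in scores.items():
        if cls not in best or score > best[cls][1]:
            best[cls] = (module, score)
    if not best:
        return []
    # Arithmetic mean of the per-class best scores.
    mean = sum(s for _, s in best.values()) / len(best)
    above = [(cls, m, s) for cls, (m, s) in best.items() if s > mean]
    # Best first, capped at the page size.
    above.sort(key=lambda rec: rec[2], reverse=True)
    return above[:page_size]
```
]]></preformat>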
      <p>Our proposed hierarchical package mapping
technique picks up right after the fifth step, that is,
once InMap produces the matrix of
class-to-module similarity scores (<inline-formula><tex-math>S_{cm}</tex-math></inline-formula>).</p>
    </sec>
    <sec id="sec-9">
      <title>3.2. Hierarchical Package Mapping</title>
      <p>Although InMap achieves good
results with the approach described in Section 3.1,
it is based on class-to-module mappings, so the
effort required by architects could still be
significant for large and complex systems.
If we could map entire packages, however,
we could reduce the effort needed. For example, a
package that has 50 classes that all map to the
same module could be (or should be) given as a
single package-to-module mapping
recommendation. Additionally, because packages are
hierarchical in nature, they present even more
opportunity to reduce the number of “necessary”
mapping recommendations to present to an
architect. For example, say we have two packages
A and B that are both sub-packages of C. If A and
B have 50 classes each and all the classes in A
and B map to the same module, then mapping C
to the module suffices: one recommendation replaces 100
class-level ones, saving the
architect from reviewing 99 other mapping
recommendations. Figure 1 illustrates a package
hierarchy that our technique (and certainly
others) can benefit from to reduce the number of
recommendations needed.</p>
    </sec>
    <sec id="sec-10">
      <title>3.2.1. Package-to-Module Similarity</title>
      <p>Our package-to-module mapping technique picks
up from step 5 of the InMap algorithm, after it
produces similarity scores for all class-to-module
pairs. We group the class-to-module similarity
scores <inline-formula><tex-math>S_{cm}</tex-math></inline-formula> according to the packages they
belong to. This means that for each package we have
a set of classes with scores to each identified
module. From this set of class-to-module
similarity scores that have a given package as
their parent, we then calculate the interquartile
mean (<inline-formula><tex-math>\mathit{IQM}_{pm}</tex-math></inline-formula>), where p and m are a
package-module pair in the system. That is, the
values between the first quartile and third quartile
(the interquartile range, IQR) are used to
calculate the arithmetic mean. Module IQRs for a
package taken from Jittac are demonstrated in
Figure 2. The lowest 25% and the highest 25% of
the scores are ignored. Important to note is that the
IQR, and hence the IQM, of a non-terminal package
is calculated from not only the classes that belong
to the package but also the classes of its child
packages. For example,
se.kau.cs.jittac.eclipse.builders.jdt, shown in the package tree in Figure 1,
has its IQR calculated using the 8 classes that
belong to it but also the 3 classes in
se.kau.cs.jittac.eclipse.builders.jdt.commands and
the single class in
se.kau.cs.jittac.eclipse.builders.jdt.util. Formally, we define <inline-formula><tex-math>\mathit{IQM}_{pm}</tex-math></inline-formula> as,</p>
      <disp-formula>
        <tex-math>\mathit{IQM}_{pm} = \frac{2}{n} \sum_{i=\frac{n}{4}+1}^{\frac{3n}{4}} S_{c_i m}</tex-math>
      </disp-formula>
      <p>where p and m are a package-module pair in the
system, c has p as its parent package, n is the
number of classes that make up the package p and
i is the position of <inline-formula><tex-math>S_{c_i m}</tex-math></inline-formula> in the ordered set of
class-to-module similarity scores for the package p.</p>
      <fig>
        <label>Figure 2</label>
        <caption>
          <p>[...] the class distribution inside and outside the IQRs. The x-axis shows the class <inline-formula><tex-math>S_{cm}</tex-math></inline-formula> scores and the y-axis shows the architectural modules of the system. The number in brackets beside a module indicates the total number of classes for the given package that have an <inline-formula><tex-math>S_{cm}</tex-math></inline-formula> score to the module.</p>
        </caption>
      </fig>
      <p>Using the scores within the IQR as opposed to
the full set of scores makes a package-to-module
similarity more resilient to the presence of outlier
classes in the class-to-module similarity scores
that it is derived from. Figure 2 shows outlier
classes, which we define as classes with <inline-formula><tex-math>S_{cm}</tex-math></inline-formula>
scores that are higher than the box plot max,
classes with <inline-formula><tex-math>S_{cm}</tex-math></inline-formula> scores that are lower than the
box plot min, and also classes with <inline-formula><tex-math>S_{cm}</tex-math></inline-formula> scores that
are within the box plot min-max but outside the
IQR. The result of this step is a matrix of IQMs for
each package-module combination.</p>
      <p>[Caption fragment: ... 1.5 (highlighted red) implies it is an outstanding
package-to-module similarity score.]</p>
      <p>We then apply feature scaling to normalize the
IQM module scores for each package. We use
standardization (also known as z-score
normalization), which gives the module scores of each
package a zero mean. In our
hierarchical package mapping technique we call
the resulting z-scores of the standardization
package-to-module similarity scores
(<inline-formula><tex-math>S_{pm}</tex-math></inline-formula>). Formally we define <inline-formula><tex-math>S_{pm}</tex-math></inline-formula> as
follows,</p>
      <disp-formula>
        <tex-math>S_{pm} = \frac{\mathit{IQM}_{pm} - \mu_p}{\sigma_p}</tex-math>
      </disp-formula>
      <p>where p and m are a package-module pair in the
system, <inline-formula><tex-math>\mathit{IQM}_{pm}</tex-math></inline-formula> is the original package-to-module
similarity score, <inline-formula><tex-math>\mu_p</tex-math></inline-formula> is the mean
of the <inline-formula><tex-math>\mathit{IQM}_{pm}</tex-math></inline-formula> scores for a specific package to the
range of given modules, and <inline-formula><tex-math>\sigma_p</tex-math></inline-formula> is the standard
deviation of those scores. Using this method on all
package-module pairs we obtain a matrix of
package-to-module similarity scores for the entire
system. Table 1 shows an extract of these scores.</p>
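      <p>The two computations of this section, the interquartile mean and the z-score standardization, can be sketched in Python as follows. The sketch makes two assumptions that the text does not settle: when the number of scores is not divisible by four, it trims the floor of a quarter of the scores from each end, and it uses the population standard deviation. The function names are illustrative.</p>
      <preformat><![CDATA[
```python
import statistics


def interquartile_mean(values):
    """Mean of the scores left after trimming a quarter at each end."""
    ordered = sorted(values)
    k = len(ordered) // 4  # number of scores trimmed from each end
    middle = ordered[k:len(ordered) - k]
    return sum(middle) / len(middle)


def standardize(iqm_by_module):
    """z-score the IQMs of one package across all modules.

    iqm_by_module: dict mapping a module name to the package's IQM.
    """
    values = list(iqm_by_module.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (assumption)
    if std == 0:
        # All modules score the same, so no module stands out.
        return {m: 0.0 for m in iqm_by_module}
    return {m: (v - mean) / std for m, v in iqm_by_module.items()}
```
]]></preformat>
      <p>For eight scores the trimmed slice keeps positions 3 to 6 of the ordered list, which agrees with the summation bounds n/4 + 1 and 3n/4 of the interquartile mean.</p>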
    </sec>
    <sec id="sec-11">
      <title>3.2.2. Package Mapping Filtering</title>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Package</th><th>architecturemodel</th><th>implmodel</th></tr>
          </thead>
          <tbody>
            <tr><td>se.kau.cs.jittac.model</td><td>2.3</td><td>1.0</td></tr>
            <tr><td>se.kau.cs.jittac.model.am</td><td>2.6</td><td>0.4</td></tr>
            <tr><td>se.kau.cs.jittac.model.am.events</td><td>2.4</td><td>0.3</td></tr>
            <tr><td>se.kau.cs.jittac.model.am.io</td><td>2.3</td><td>0.6</td></tr>
            <tr><td>se.kau.cs.jittac.model.im</td><td>1.0</td><td>2.3</td></tr>
            <tr><td>se.kau.cs.jittac.model.im.events</td><td>0.9</td><td>1.6</td></tr>
            <tr><td>se.kau.cs.jittac.model.im.io</td><td>0.5</td><td>1.6</td></tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>eclipseui scores, in source order (only six of the seven values are recoverable): -0.6, -0.4, -0.4, -0.5, -0.5, -0.6.</p>
        </table-wrap-foot>
      </table-wrap>
      <p>Using the matrix of package-to-module
similarity scores, we then traverse the package
tree bottom-up, starting with the terminal packages
and working our way up to the root package. At
each tree-depth level we retain the
package-to-module similarity scores in two sets for each
package, namely a set of outstanding
package-to-module similarity scores and a set of good
package-to-module similarity scores. Outstanding
mappings are those in which a package has a score
above the outstanding threshold and its child
packages have a score above the good threshold.
Good mappings are those in which a package and
its children have a score above the good threshold.
We formally define this notion with the following
two rules,</p>
      <p>Given:</p>
      <list list-type="bullet">
        <list-item><p>Package p</p></list-item>
        <list-item><p>Module m</p></list-item>
        <list-item><p>Package-to-module score <inline-formula><tex-math>S_{pm}</tex-math></inline-formula></p></list-item>
        <list-item><p>Good score threshold GSt</p></list-item>
        <list-item><p>Outstanding score threshold OSt</p></list-item>
      </list>
      <p>Rule 1: A mapping (p, m) is called good iff <inline-formula><tex-math>S_{pm} \geq \mathit{GSt}</tex-math></inline-formula>
and for all sub-packages <inline-formula><tex-math>p_i</tex-math></inline-formula> of p, (<inline-formula><tex-math>p_i</tex-math></inline-formula>, m) is a good
mapping.</p>
      <p>Rule 2: A mapping (p, m) is called outstanding iff <inline-formula><tex-math>S_{pm} \geq \mathit{OSt}</tex-math></inline-formula>
and for all sub-packages <inline-formula><tex-math>p_i</tex-math></inline-formula> of p, (<inline-formula><tex-math>p_i</tex-math></inline-formula>, m) is a good
mapping.</p>
      <p>For example, se.kau.cs.jittac.model has no good
or outstanding mappings. This is because it fails
to satisfy the second part of Rule 2, that is, that all
its sub-packages must have good mappings to the
same module. One of se.kau.cs.jittac.model’s
sub-packages has a good mapping to the
same module but the other does not, hence no good
or outstanding mappings for the
se.kau.cs.jittac.model package.</p>
      <p>These rules are applied from the bottom of the
package tree, starting with the deepest terminal
packages, then their parent packages, then their
grandparent packages and so on until
we reach the root package at the top of the tree.
This is necessary as packages higher up in the
package tree depend on the results of packages
lower in the package tree.</p>
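      <p>The two filtering rules can be sketched in Python as follows. The sketch evaluates them recursively, which yields the same result as the bottom-up pass described above because a package's status depends only on its descendants. The function names and the dictionary-based representation of the package tree are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def is_good(pkg, module, scores, children, gst):
    """Rule 1: score >= GSt and every sub-package is good too.

    scores: dict mapping (package, module) to a similarity score.
    children: dict mapping a package to its direct sub-packages.
    """
    if scores[(pkg, module)] < gst:
        return False
    return all(is_good(child, module, scores, children, gst)
               for child in children.get(pkg, []))


def is_outstanding(pkg, module, scores, children, gst, ost):
    """Rule 2: score >= OSt and every sub-package is good."""
    if scores[(pkg, module)] < ost:
        return False
    return all(is_good(child, module, scores, children, gst)
               for child in children.get(pkg, []))
```
]]></preformat>
      <p>With the architecturemodel and implmodel scores of Table 1 and illustrative thresholds of 1.0 (good) and 1.5 (outstanding), se.kau.cs.jittac.model.am is outstanding for architecturemodel and se.kau.cs.jittac.model.im for implmodel, while se.kau.cs.jittac.model retains no mapping.</p>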
    </sec>
    <sec id="sec-14">
      <title>3.2.3. Package-to-Module Mapping Recommendation Selection</title>
      <p>Once both sets of good and outstanding
mappings for each package are obtained, we
traverse the package tree top-down. At each
tree level we check whether a package has outstanding
mappings and, if so, pick the highest-scoring one that fulfils the
above criteria for outstanding mappings and
recommend it as the most likely mapping. If a
package is recommended, we stop
following that tree path downwards and do not
recommend any of its sub-packages; instead we
proceed to check its siblings. If a package returns
an empty set, we go one step lower in the
package tree. Figure 3 illustrates this; it shows
two package-to-module mapping
recommendations (in bold). Observe that
architecturemodel is recommended as the module to which
se.kau.cs.jittac.model.am should map and
implmodel as the module to which
se.kau.cs.jittac.model.im should map. Their sub-packages are
skipped since they are already covered as a
result of Rule 2, and se.kau.cs.jittac.model has no
mapping recommendation since it retained no
mappings after the package mapping
filtering step.</p>
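      <p>The top-down selection can be sketched as a short recursive walk. The sketch assumes that the outstanding mappings of every package have already been computed and stored in a dictionary; the function name recommend and the data layout are illustrative.</p>
      <preformat><![CDATA[
```python
def recommend(pkg, outstanding, children):
    """Walk the package tree top-down and collect recommendations.

    outstanding: dict mapping a package to {module: score} for its
    outstanding mappings (missing or empty if it retained none).
    children: dict mapping a package to its direct sub-packages.
    """
    candidates = outstanding.get(pkg, {})
    if candidates:
        best = max(candidates, key=candidates.get)
        # Stop here: the sub-packages are covered by this mapping.
        return [(pkg, best)]
    recs = []
    for child in children.get(pkg, []):
        recs.extend(recommend(child, outstanding, children))
    return recs
```
]]></preformat>
      <p>Stopping at the first package with an outstanding mapping is what keeps the number of recommendations low: one accepted recommendation maps every class below that package.</p>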
    </sec>
    <sec id="sec-15">
      <title>4. Evaluation</title>
      <p>Test Cases: We evaluated our hierarchical
package mapping approach on six Java-based
systems that were used in the evaluation of
InMap’s class-to-module mapping technique.
These are Ant, a command line and API-based
tool for process automation; ArgoUML, a
desktop-based application for UML modelling;
JabRef, a desktop-based bibliographic reference
manager; Jittac, an Eclipse plugin for reflexion
modelling tasks; ProM, a desktop-based process
mining tool; and TeamMates, a web-based
application for handling peer reviews and
feedback. Table 2 shows the attributes of these
systems. The natural language architectural
module descriptions used as input to InMap to
generate class-module similarity scores were
obtained from the previous study of InMap. That
study obtained the oracle
mappings, that is, the correct list of
code-to-module mappings, from experts involved in
developing each respective open-source project.
The oracle package-to-module mappings used in
this study were extracted from these. We retained
in the oracle only packages that had direct 1-1
mappings with a module, and excluded packages
that had child entities that map to more than one
module.</p>
      <p>From the oracle mappings we only extracted
package-to-module mappings, leaving out the
class-to-module mappings, to allow us to evaluate
the performance of the proposed technique strictly at
the package level. Table 2 also shows the number
of packages in the oracle mapping of a system.
This is the number of actual packages our
proposed technique should predict mappings for,
in other words, the packages that are of concern.
For example, if se.kau.cs.jittac.eclipse is part of
the oracle mapping and our technique puts up
se.kau.cs.jittac.eclipse.builders as a possible
mapping we count this as a false positive even though
the latter is a child package of the former. The
reason is the technique must reduce the effort
needed by an architect and therefore must be
penalized for recommending child packages of a
package that is already mapped (or should be).</p>
      <p>Experimentation &amp; Data Collection: To
experiment on the test cases with various good
and outstanding threshold combinations, we
extended the evaluator tool we developed in our
previous studies of InMap to accommodate the
evaluation of package-based mappings. Using the
oracle architecture package-to-module mappings
of each system, the tool automatically simulates a
“human architect” accepting and rejecting the
recommendations produced.</p>
      <p>For all possible single-decimal combinations
within the range -5.0 to 5.0 for the good and
outstanding thresholds, we collected the recall of
the package mappings as well as the technique’s
precision. The min-max of the test range was
based on the highest and lowest package-to-module similarity scores
obtained by all 6 systems. We also collected the
number of recommendations it took to achieve the
given recall &amp; precision. Finally, we also
collected the class coverage (or code reach), that
is, the number of classes that were mapped as a
result of their parent packages being mapped by
our hierarchical mapping technique.</p>
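      <p>The collected measures can be sketched as follows. This is an illustrative reconstruction, not the evaluator tool itself: the simulated architect accepts a recommendation only if the oracle contains exactly that package-module pair, so a recommended child of an oracle package counts as a false positive. The function name evaluate, its parameters, and the assumption that accepted packages do not nest are all hypothetical.</p>
      <preformat><![CDATA[
```python
def evaluate(recommended, oracle, classes_per_package, total_classes):
    """Return (precision, recall, class coverage) for one pass.

    recommended: list of (package, module) recommendations.
    oracle: dict mapping each oracle package to its module.
    classes_per_package: class counts including sub-packages.
    """
    # The simulated architect accepts exact oracle matches only.
    accepted = [(p, m) for p, m in recommended if oracle.get(p) == m]
    precision = len(accepted) / len(recommended) if recommended else 0.0
    recall = len(accepted) / len(oracle) if oracle else 0.0
    # Assumes accepted packages do not contain one another.
    covered = sum(classes_per_package[p] for p, _ in accepted)
    coverage = covered / total_classes if total_classes else 0.0
    return precision, recall, coverage
```
]]></preformat>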
      <p>Results: Table 3 shows the results obtained for
the optimal thresholds for each system, i.e. the thresholds that
gave the best results for the range of values tested.
Three systems, Ant, JabRef and
TeamMates, achieved perfect precision, with TeamMates
also achieving perfect recall and class coverage.
We found 6 of JabRef’s 11 package-to-module
mappings (as package recall) and 9 of Ant’s 14,
which resulted in class coverage of 98% and 50%
respectively. For Jittac, 90% of its classes were
mapped by finding 6 of its 9 package-to-module
mappings with a precision of 0.86. ArgoUML had
fairly good precision but low recall, resulting in
low class coverage as well. ProM appeared to be
an outlier, obtaining poor precision and the lowest
recall of the six systems tested. All results
presented are for a single iteration (or pass) of the
technique.</p>
      <p>In Table 4 we compare the effort required of
an architect by our hierarchical mapping technique
with that of InMap in its original form. We do this by
looking at the class coverage of each technique
and the number of recommendations an architect
has to sift through to achieve the given class
coverage. Table 4 shows this for the systems that
achieved more than 50% class coverage after a
single iteration. In simple terms, the effort is compared using
the number of class-to-module
recommendations needed by the InMap
class-based technique and the number of
package-to-module recommendations needed by our
hierarchical package mapping technique. As an
example, Table 4 shows that in the case of Ant it
would take 390 recommendations to map 50% of
Ant’s classes using the InMap class-to-module
technique, whereas it would take 9
recommendations to map 50% of Ant’s classes
using our hierarchical package-to-module
mapping technique. The effort
saved is more than 800 recommendations for
JabRef, and the effort is reduced by more than 90%
for all 4 systems.</p>
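      <p>As a worked illustration of this comparison, the snippet below computes the effort saved as a difference and the effort reduction as a fraction of the class-based effort. These two formulas are an assumed reading that is consistent with the figures quoted for Ant; the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def effort_saved(n_class_recs, n_package_recs):
    """Recommendations the architect no longer has to review."""
    return n_class_recs - n_package_recs


def effort_reduction(n_class_recs, n_package_recs):
    """Saved effort as a fraction of the class-based effort."""
    return effort_saved(n_class_recs, n_package_recs) / n_class_recs
```
]]></preformat>
      <p>For Ant, 390 class-level recommendations against 9 package-level ones gives 381 recommendations saved, a reduction of roughly 97.7%.</p>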
    </sec>
    <sec id="sec-16">
      <title>5. Discussion</title>
      <p>Table 3 shows that the technique has high
precision, 0.91 excluding ProM. This is
likely due to the fact that our hierarchical package
mapping technique is an extension of InMap’s
class-to-module similarity function. Using simple
natural language descriptions of architecture
modules, the InMap algorithm, which has the
class-to-module similarity score function at
its core, was shown to obtain rather good
precision. Our hierarchical package mapping
technique borrows from InMap’s success by using
the information retrieval based class-to-module similarity score to generate
its own package-to-module similarity score.</p>
      <p>The package recall of our technique is fairly
good considering that these results are obtained
only after 1 iteration (or pass). As outlined in
Section 3.1, InMap is an interactive-iterative
technique that presents a set of recommendations
at a time and progresses by learning from the
feedback of the architect to formulate the next set
of recommendations. However, the number of
iterations (or passes) is proportional to the size of
the system under review. Comparing Tables 2, 3 and
4, we observe that systems with a high number of
source files require a high number of passes (or
iterations) compared to the “smaller” systems.
Table 3 shows that with our hierarchical mapping
technique we are able to obtain a package recall
of more than 50% in the first pass for 4 out of the
6 systems. Of these 4, from the first iteration we
get 50% class coverage for Ant with the other 3
getting more than 90% class coverage. Despite
this, two systems get low package recall and class
coverage. We do not see this as a problem because
it is resolved simply by running more
package-mapping recommendation iterations, which would
still be far fewer than those needed by class-based mapping
recommendation algorithms.</p>
      <p>Table 3 shows the threshold values that give
the optimal results for each system. However, we
observed some similarities across the systems in
our threshold value experiments. The optimal
outstanding score threshold is very close to, or the
same as, the arithmetic mean of the maximum package
similarity scores for each module of the system.
The optimal good score threshold was usually
0.5 less than the optimal outstanding score
threshold. This establishes a basis for developing
an automated approach for deriving threshold
values that will give good results across different
systems.</p>
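      <p>As a minimal sketch of how this observation could be turned into an automated procedure, the following Python snippet derives both thresholds from a table of package-to-module similarity scores. The function name derive_thresholds, the nested-dictionary input layout and the example scores are illustrative assumptions and are not part of InMap. The offset of 0.5 is simply the value observed in our experiments and may not hold for other systems.</p>
      <preformat>
```python
from statistics import mean


def derive_thresholds(scores, good_offset=0.5):
    """Derive (outstanding, good) score thresholds for one system.

    scores maps each module name to a dict of package name -> similarity
    score. This layout is an assumption made for illustration.
    """
    # Highest package similarity score observed for each module.
    maxima = [max(by_package.values()) for by_package in scores.values() if by_package]
    if not maxima:
        raise ValueError("no package similarity scores supplied")
    # Heuristic from the text: outstanding threshold = arithmetic mean of the maxima.
    outstanding = mean(maxima)
    # Heuristic from the text: the good threshold is usually 0.5 below it.
    good = outstanding - good_offset
    return outstanding, good


# Hypothetical scores for a three-module system (not taken from the paper).
example_scores = {
    "gui": {"app.gui": 2.5, "app.gui.dialogs": 1.0},
    "logic": {"app.logic": 1.5, "app.logic.search": 0.75},
    "model": {"app.model": 2.0},
}

outstanding, good = derive_thresholds(example_scores)
print(outstanding, good)
```
      </preformat>
      <p>For these made-up scores the per-module maxima are 2.5, 1.5 and 2.0, so the sketch yields an outstanding score threshold of 2.0 and a good score threshold of 1.5. Whether thresholds derived this way match the experimentally optimal ones would still need to be validated per system.</p>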
      <p>Threats to Validity: Since our package-based
technique is derived from InMap, the external
validity of its results is affected by similar factors,
namely the number of modules and
classes, code commenting style and quality, and
architecture description quality. Therefore, more
case studies with varying attributes would add to
the validity of the results. However, the results for
the six test systems, whose varying attributes are
shown in Table 2, provide a compelling case for an
automated hierarchical package mapping
technique.</p>
      <p>With regard to construct validity, the effort
required by an architect using our technique needs
to be evaluated against other package-based
mapping methods provided by industry tools, such as
drag &amp; drop, naming patterns or regular
expressions. For example, how does our
hierarchical package-based technique compare
with manually mapping packages? Evaluations
such as these would require enhanced user studies
with software architects in appropriately planned
and controlled experiments.</p>
    </sec>
    <sec id="sec-17">
      <title>6. Conclusion &amp; Future Work</title>
      <p>We have presented a solution to
hierarchical package-based mapping. It
builds on InMap, an information retrieval
class-based mapping technique that uses concise natural
language architectural descriptions of modules.
Our hierarchical package-based mapping
technique provides almost perfect precision,
fairly good recall and great code coverage. Most
importantly, our technique helps reduce the
effort required by an architect in
accepting and rejecting mapping
recommendations in interactive techniques like
InMap. The technique is an improvement over the
manual package mapping methods used in today’s
state-of-the-art reflexion modelling tools.</p>
      <p>Despite reducing the effort required, the drawback
of using a purely package-based approach is that,
due to their 1-1 package-to-module mapping style,
such methods do not work well for systems that
have more complex mapping configurations. It is
not always the case that packages and their
members map directly to modules in a 1-1
manner. It is more likely that a software
system’s code-to-architecture mapping has a
combination of both package and class mappings.
Cases where package members are spread across
multiple modules require a class-based
technique. Therefore, as future work we plan to
derive an approach that combines InMap’s
class-based approach with the package
hierarchy-based approach presented in this paper.
The aim is to combine class and package mapping
recommendations in a way that benefits from the
advantages, and negates the disadvantages, of
both mapping styles. Nevertheless, the
hierarchical package-based mapping technique
presented in this paper remains
useful in cases where it is appropriate to map
entire packages.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>7. References</title>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          et al.
          <year>2018</year>
          .
          <article-title>Architecture Consistency: State of the Practice, Challenges and Requirements</article-title>
          .
          <source>Empirical Software Engineering</source>
          .
          <volume>23</volume>
          ,
          <issue>1</issue>
          (
          <year>2018</year>
          ),
          <fpage>224</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>DOI:https://doi.org/10.1007/s10664-017-9515-3.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Trifu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Architecture-aware adaptive clustering of OO systems</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>Eighth European Conference on Software Maintenance and Reengineering, 2004. CSMR 2004. Proceedings</source>
          . (
          <year>2004</year>
          ),
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Bittencourt</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          et al.
          <year>2010</year>
          .
          <article-title>Improving automated mapping in reflexion models using information retrieval techniques</article-title>
          .
          <source>Proceedings - Working Conference on Reverse Engineering</source>
          , WCRE. (
          <year>2010</year>
          ),
          <fpage>163</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          DOI:https://doi.org/10.1109/WCRE.2010.26.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Christl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.
          <year>2007</year>
          .
          <article-title>Automated Clustering to Support the Reflexion Method</article-title>
          .
          <source>Information and Software Technology</source>
          .
          <volume>49</volume>
          ,
          <issue>3</issue>
          (
          <year>2007</year>
          ),
          <fpage>255</fpage>
          -
          <lpage>274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          DOI:https://doi.org/10.1016/j.infsof.2006.10.015.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Christl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.
          <year>2005</year>
          .
          <article-title>Equipping the reflexion method with automated clustering</article-title>
          .
          <source>12th Working Conference on Reverse Engineering (WCRE'05)</source>
          (
          <year>2005</year>
          ), 10 pp. - 98
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Fontana</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          et al.
          <year>2016</year>
          .
          <article-title>Tool Support for Evaluating Architectural Debt of an Existing System: An Experience Report</article-title>
          .
          <source>Proceedings of the 31st Annual ACM Symposium on Applied Computing</source>
          (New York, NY, USA,
          <year>2016</year>
          ),
          <fpage>1347</fpage>
          -
          <lpage>1349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Knodel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Sustainable Structures in Software Implementations by Live Compliance Checking</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Knodel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>A Comparison of Static Architecture Compliance Checking Approaches</article-title>
          .
          <source>Proceedings of the Sixth Working IEEE/IFIP Conference on Software Architecture</source>
          (USA,
          <year>2007</year>
          ),
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>G.C.</given-names>
          </string-name>
          et al.
          <year>2001</year>
          .
          <article-title>Software Reflexion Models: Bridging the Gap between Source and High-Level Models</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          .
          <volume>27</volume>
          ,
          <issue>4</issue>
          (Apr.
          <year>2001</year>
          ),
          <fpage>364</fpage>
          -
          <lpage>380</lpage>
          . DOI:https://doi.org/10.1109/32.917525.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Naim</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          et al.
          <year>2017</year>
          .
          <article-title>Reconstructing and Evolving Software Architectures Using a Coordinated Clustering Framework</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>Automated Software Engineering</source>
          .
          <volume>24</volume>
          ,
          <issue>3</issue>
          (
          <year>2017</year>
          ),
          <fpage>543</fpage>
          -
          <lpage>572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>DOI:https://doi.org/10.1007/s10515-017-0211-8.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Olsson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          et al.
          <year>2019</year>
          .
          <article-title>Semi-Automatic Mapping of Source Code using Naive Bayes</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ACM International Conference Proceeding Series</source>
          .
          <volume>2</volume>
          , (
          <year>2019</year>
          ),
          <fpage>209</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>DOI:https://doi.org/10.1145/3344948.3344984.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          et al.
          <year>2010</year>
          .
          <article-title>Static Architecture-Conformance Checking: An Illustrative Overview</article-title>
          .
          <source>IEEE Software</source>
          .
          <volume>27</volume>
          ,
          <issue>5</issue>
          (
          <year>2010</year>
          ),
          <fpage>82</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          DOI:https://doi.org/10.1109/MS.2009.117.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Rosik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.
          <year>2011</year>
          .
          <article-title>Assessing Architectural Drift in Commercial Software Development: A Case Study</article-title>
          .
          <source>Softw., Pract. Exper.</source>
          <volume>41</volume>
          , (
          <year>2011</year>
          ),
          <fpage>63</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>de Silva</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Balasubramaniam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>Controlling software architecture erosion: A survey</article-title>
          .
          <source>Journal of Systems and Software</source>
          .
          <volume>85</volume>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>132</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          DOI:https://doi.org/10.1016/j.jss.2011.07.036.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Sinkala</surname>
            ,
            <given-names>Z.T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Herold</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>InMap: Automated interactive code-to-architecture mapping</article-title>
          .
          <source>Proceedings of the ACM Symposium on Applied Computing</source>
          (Mar.
          <year>2021</year>
          ),
          <fpage>1439</fpage>
          -
          <lpage>1442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Sinkala</surname>
            ,
            <given-names>Z.T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Herold</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>InMap: Automated Interactive Code-to-Architecture Mapping Recommendations</article-title>
          .
          <source>Proceedings - IEEE 18th International Conference on Software Architecture, ICSA 2021</source>
          (Mar.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Wiggerts</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Using Clustering Algorithms in Legacy Systems Remodularization</article-title>
          .
          <source>Proceedings of the Fourth Working Conference on Reverse Engineering</source>
          (
          <year>1997</year>
          ),
          <fpage>33</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>