What is in a Name? An Analysis of Associations Among
Java Packaging and Artifact Names
Farshad Ghassemi Toosi, Anila Mjeda
Computer Science Department at Munster Technological University, Cork Campus
Computer Science Department at Munster Technological University, Cork Campus


Abstract

Modern programming languages, in particular object-oriented languages, are equipped with sophisticated mechanisms to assist developers in organizing the source code. For instance, Java and Python use package names to resolve symbols. In Java, a package is a namespace declared at the top of each class or interface.

There are several reasons for using packages in the source code: 1) Packages can prevent naming conflicts (e.g., two packages can each contain a class with the same name without conflict). 2) Packages can group relevant and/or similar classes or interfaces into conceptual and logical containers that make maintenance easier and help developers understand the design of the software’s architecture. 3) Structured packaging is one of the core components of a clean architecture design. Developers may apply different strategies to structure the packages, and these differences have repercussions for the quality and maintainability of the software architecture.

In this work, we run a set of experiments on a number of open-source Java projects and analyse the packaging structures from a source-code structural and artefact (class, method, variable) names perspective. These experiments aim to investigate 1) whether any associations exist between the packaging structure and textual factors (artefact names) of the classes inside the package; and 2) which textual factors (artefact names) tend to be more associated with the package structure. The results of this research indicate that, on average, class names and inheritance (super class names) tend to be used as a packaging strategy. The focus on identifying ‘naturally’ occurring similarities in the packaging of software in the ‘wild’ is underpinned by the long-term objective to build developer-friendly architecture conformance protocols which help prevent architectural erosion.


ECSA2021 Companion Volume
farshad.toosi@mtu.ie (F. G. Toosi); anila.mjeda@mtu.ie (A. Mjeda)
https://www.linkedin.com/in/farshad-ghassemi-toosi-428a5852/ (F. G. Toosi); https://www.linkedin.com/in/anila-mjeda-32a5064/ (A. Mjeda)
ORCID: 0000-0002-1105-4819 (F. G. Toosi); 0000-0003-1311-6320 (A. Mjeda)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Object oriented programming is underpinned by the idea of creating classes and using objects of those classes for higher reusability and better maintenance. The object oriented programming paradigm is based on bringing related fields and functions/methods together for a particular concept that is called a class. Different objects can then be instantiated from classes with different data and implementation, but they all share the same original type, i.e., the class. For example, a class may represent a car and its objects can be a hatchback or a sport utility vehicle. In object oriented programming, methods and fields within a given class are expected to be logically grouped in one container called a class.

Some of the modern object oriented languages, including Java and Python, have another mechanism called packaging that lets developers have a higher level of grouping where related classes can be located in a high-level container or module, called a package.

Usually, the visual representation of a software’s architecture is a graph-like design where the software components are program packages that, in their turn, may contain other packages (hierarchical packages) [1, 2]. In most software architecture design practices, modules or components are seen as a package or a set of packages [3, 4, 5]. Hence, the intuition is that package structure can have a direct impact on the quality of the software architecture. Indeed, this intuition has attracted the interest of other researchers in the field, such as Ebad et al. [6].

One of the fundamental aspects of an architectural design is to consider the functionalities and interactions between components at different granularities [7] with a view to facilitating work among the components in a package.

Researchers [8, 9, 10, 7] show that a clean software architecture has a direct relation to structured packaging; furthermore, they show how implicit packaging can cause architectural mismatching. They use the term unstructured packaging for a lack of packaging strategy. For instance, all classes are located in one package, or there are random packages and classes are assigned to them based on no particular strategy. As a result of such a packaging structure (or unstructured packaging), a package will contain several unrelated classes with no naming or textual relevance to each other [8]. Naming relevancy, in particular, is important since artefacts (class, method, variable) are meant to be named by
developers according to their responsibilities and functionalities.

Java is one of the object oriented languages that offers the packaging mechanism. Every Java class is inside a package (if no package is declared, the class is part of the default package). In this work, we use Java as the language of our case study to answer the following research questions:

      1. Are there any existing associations between the package structure and textual factors within the package? The textual factors in question include artefact names, e.g., class, method and variable names.
      2. What type of names, and at what granularity, tend to have more weight in influencing the packaging structure?

By answering the above two research questions, we try to discover the level of textual cohesion among the components of each package, to understand whether there is any textual packaging structure in the project and, if so, which type of artefact name plays a heavier role.

2. Background

In large programs, it is difficult to have an architecture for a software system that conforms to the system’s packaging structure. Object-oriented software has an inherent affinity for structure, such as packages, as one of its appealing promises. However, that affinity does not automatically translate to a structure that is relevant to the architecture of the system. This issue has seeded research into improving the packaging structure of the software system. Shaw et al. [10] propose that the potential existing problem with reusing components of a software system is not necessarily due to bad architectural design alone, but to the packaging strategy as well. Shaw et al., in a different work [7], emphasise the importance of a packaging strategy to enforce compatible components to be located in the same package.

The quality of the software architecture depends on several factors, one of which is the applied packaging strategy [8, 9, 10, 7]. The packaging strategy refers to the criteria that are used to combine components in packages.

One of the first empirical studies to investigate the structure of written code [11] relied on static and dynamic analysis (of FORTRAN code) and looked at it at a statement level. Existing research tends to look at improving existing package design, such as through package structure analysis [12], using package cohesion to assess organization and reusability of code [13, 14], or using artificial intelligence algorithms or multi-objective approaches based on remodularization objectives [15].

Additionally, there is considerable research on automatically optimising inter-package dependencies [16]. A review that looks at object-oriented code issues in this space as refactoring opportunities can be found in [17].

Interestingly for our research, Baxter et al. [18] investigated some of the reasons behind the structures and structural relationships in Java code, while Abedeen et al. [16] proposed a set of metrics to assess modularity principles for packages in large legacy systems (namely information hiding, changeability and reusability principles) [19]. Coming up to twenty years ago, Hautus [12] proposed a tool to run a package structure analysis through Java code and highlight potential weak areas to the human with an aim to refactor the source code.

Yet, there is still no standard and unique definition of relevant and/or similar classes, and developers might consider different criteria when placing two or more classes into a package. The latter becomes problematically evident when analysing code in the wild. Furthermore, since packages typically appear in software architecture documentation as indivisible components of package diagrams, drilling down within packages and investigating the validity of the relevancy within them has an added value.

It is exactly this gap that is the focus of this research. Indeed, the research reported in this paper represents the initial steps towards identifying relevancy (through similarity) factors within packages (or architecture components), with a long-term view of building developer-friendly architecture conformance protocols so as to prevent architectural erosion.

3. Experiment Design

In this work, six open-source Java projects are under study and their details are presented in Table 1.

The experiment tries to find whether there are factors that can define the relation among the members of each package. It is worth noting that the factors are mostly textual factors (e.g., artefact names) and not functional factors (e.g., the functionality of the artefacts), unless the functionality of the artefacts is reflected in their names. The details of these factors are discussed in Section 3.2.

The logic of the experiment is as follows:

      1. All classes are put in a pool without considering the package structure (the bottom-left rectangle in Figure 1).
      2. Pairwise similarity between every pair of classes is calculated based on some similarity factor (see Section 3.2).
      3. A clustering technique is applied to the members of the class pool and 𝑃 clusters are generated (𝑃 is the number of packages in the project).
Table 1
Six Open-Source Java projects.

            Name                 #classes    #packages    Details
            JHotDraw             730         64           Visualization tool, MIT license.
            Galaxy               39          17           Galaxy Artifacts is an open-source and freeware 4x game, written in Java.
            JavaFX               38          9            JavaFX is a cross-platform GUI toolkit, MIT license.
            JavaParserCore       516         29           Java parser tool, LGPL license.
            JavaParserSymbol     167         21           Symbol solver tool, Apache License.
            Jung                 227         14           Visualization tool, open source, Jung license.


Figure 1: The high-level picture of the proposed model for comparison.


      4. The clustering result (the bottom-right rectangle in Figure 1) is compared to the package structure of the project (the top rectangle in Figure 1), where each package can be seen as an existing cluster.

Figure 1 shows the general flowchart of the experiment. All the original packages in the system are also seen as clusters of classes, and the objective is to compare the existing packaging to the one arrived at by the proposed clustering algorithm.

3.1. Comparison Analysis

All the experiments in this work are at source-code level and focus on three different types of artefact names: 1) Class names, 2) Method names and 3) Variable/Field names. The comparisons are based on textual/term comparison. Therefore, a simple pre-processing step is required prior to the actual comparison on each name (i.e., Class name, Method name and Variable/Field name). Each name is converted to a set of simple-names by the pre-processing. The following list indicates the required pre-processing actions.

      • Camel Case removal. E.g., StudentGrade → Student Grade (StudentGrade as a name is converted to two simple-names: Student and Grade)
      • Snake Case removal. E.g., Employee_tax → Employee tax
      • Digits removal. E.g., distance100km → distance km, salary100k → salary (Note, words with one character are ignored).
      • All lower case. PensionCalculator → pension calculator

3.2. Comparison Factors

As mentioned earlier, the pool of classes is grouped via a clustering algorithm. Clustering algorithms work based
on a similarity or dissimilarity matrix where the similarity/dissimilarity between every pair of entities (classes in this case) is known. Therefore, a similarity needs to be defined between every two classes. Each Java class has several different features and characteristics such as the class name, the method names within the class, field names and many more. In this work, we make use of nine different features of each class and use them as similarity factors for the clustering algorithm. The nine factors that are examined are as follows:

     • Class Names. Two classes are compared according to their names, (CN).
     • Outgoing Methods. Two classes are compared according to their outgoing method names, (OM).
     • Incoming Methods. Two classes are compared according to their incoming method names, (IM).
     • Field Declaration. Two classes are compared according to their declared field names, (FD).
     • Variable Accessed. Two classes are compared according to their accessed variables’ names, (AV).
     • Outgoing Class. Two classes are compared according to the names of the classes they instantiate, (OC).
     • Incoming Class. Two classes are compared according to the names of the classes in which they are instantiated, (IC).
     • Class Methods Names. Two classes are compared according to their method names, (CM).
     • Super Class Names. Two classes are compared according to their super class names, (SC).

Each Java program is analysed and the nine types of information mentioned above are extracted. In order to extract the details from the Java projects, a Java parser is employed. Among different choices of parsers, JavaParser [20] was selected due to its simplicity of implementation and high reputation.

3.2.1. Class Names

Class Names (CN) is the first factor that is used for comparison. Two classes are said to be similar if their names are similar or, in other words, if they share some simple-terms. Figure 2 has two packages and each package has two classes. A set of simple-terms is generated for each class in the project:

    1. 𝐶𝑖𝑟𝑐𝑙𝑒_𝑎𝑟𝑒𝑎 class: {𝑐𝑖𝑟𝑐𝑙𝑒, 𝑎𝑟𝑒𝑎}.
    2. 𝐷𝑟𝑎𝑤𝐶𝑖𝑟𝑐𝑙𝑒 class: {𝑑𝑟𝑎𝑤, 𝑐𝑖𝑟𝑐𝑙𝑒}.
    3. 𝐶𝑜𝑙𝑜𝑟𝑠 class: {𝑐𝑜𝑙𝑜𝑟𝑠}.
    4. 𝑆𝑢𝑟𝑓𝑎𝑐𝑒𝐶𝑖𝑟𝑐𝑙𝑒 class: {𝑠𝑢𝑟𝑓𝑎𝑐𝑒, 𝑐𝑖𝑟𝑐𝑙𝑒}.

The first class has a degree of similarity with the second class and the fourth class as they share 𝑐𝑖𝑟𝑐𝑙𝑒. Likewise, the second class and fourth class have a degree of similarity, while the third class is not similar to any class.

Figure 2: Two packages with their classes.

3.2.2. Outgoing Methods

Outgoing Methods (OM) is the second factor that is considered to measure the similarity between two classes. For Class A, all the methods that are called from Class A in the project are collected and their names are pre-processed, so a set of simple names is generated for Class A. A similar process is repeated for Class B.

Figure 3 shows two classes with their methods and the callee (outgoing) methods inside of them. The set of simple names that can be extracted for Class A based on its callee methods is {𝑔𝑟𝑒𝑒𝑛, 𝑐𝑖𝑟𝑐𝑙𝑒, 𝑎𝑟𝑒𝑎} and the set of simple names for Class B is {𝑔𝑟𝑒𝑒𝑛, 𝑠𝑢𝑟𝑓𝑎𝑐𝑒, 𝑏𝑙𝑎𝑐𝑘, 𝑤ℎ𝑖𝑡𝑒}. There is one simple term (𝑔𝑟𝑒𝑒𝑛) common to these two sets; therefore, a degree of similarity exists between Class A and Class B.

Figure 3: Two Classes with their callee methods.

3.2.3. Incoming Methods

Incoming Methods (IM) is the other selected factor to measure the similarity between two classes. This factor, similar to the last one, works based on the method calls. Two classes are said to be similar if their contained methods are called by methods with similar names (common simple terms).
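
The pre-processing steps of Section 3.1 and the shared-term comparison used by the factors above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `simple_terms` and `shared_terms` are hypothetical, and the regular expressions are one assumed reading of the camel-case, snake-case and digit rules (for instance, runs of capitals such as acronyms are not split).

```python
import re


def simple_terms(name):
    """Convert an artefact name into a set of lower-case simple terms."""
    # Digits and underscores act as separators (digit and snake-case removal).
    spaced = re.sub(r"[\d_]+", " ", name)
    # Camel case: split where a lower-case letter is followed by an upper-case one.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", spaced)
    # Lower-case everything and ignore one-character words.
    return {word.lower() for word in spaced.split() if len(word) > 1}


def shared_terms(name_a, name_b):
    """Simple terms that two artefact names have in common."""
    return simple_terms(name_a) & simple_terms(name_b)


# Examples taken from Section 3.1.
print(simple_terms("StudentGrade"))   # two terms: student, grade
print(simple_terms("salary100k"))     # one term: salary ("k" is ignored)

# Class names from Section 3.2.1.
print(shared_terms("Circle_area", "DrawCircle"))  # one shared term: circle
print(shared_terms("Colors", "SurfaceCircle"))    # no shared term
```

Applied to the class names of Section 3.2.1, the sketch reproduces the observation that the first, second and fourth classes share the term circle, while Colors shares no term with any other class.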

3.2.4. Field Declarations

Field Declaration (FD) is another selected factor and it measures the similarity between classes based on the fields declared within each class. Therefore, two classes with similarly named declared fields are considered similar. Figure 4 shows two classes with their declared fields. Class A has the following set of simple-terms extracted from its declared fields: {𝑐𝑖𝑟𝑐𝑙𝑒, 𝑐𝑜𝑙𝑜𝑟, 𝑓𝑢𝑙𝑙, 𝑎𝑟𝑒𝑎} and Class B has the following: {𝑠𝑢𝑟𝑓𝑎𝑐𝑒, 𝑠ℎ𝑎𝑝𝑒, 𝑐𝑜𝑙𝑜𝑟}. Therefore, Class A and B are similar due to the existence of 𝑐𝑜𝑙𝑜𝑟 in both sets of simple terms.

Figure 4: Two classes with their declared fields.

3.2.5. Accessed Variables

Variable Accessed (AV) is the other factor we use to measure the class similarities. Two classes are considered similar if they access variables/fields with similar names.

3.2.6. Outgoing Classes

The next factor to measure the class similarity is Outgoing Class names (OC). Being an Outgoing Class is a role that a class plays relative to another class. Given two classes (Class A and Class B), Class B is said to be an outgoing class for Class A if Class B is instantiated in Class A. Figure 5 shows two classes; each class has two methods and each method instantiates another class. The names of the instantiated classes for each class are extracted, pre-processed and compared. Class A contains the following set of simple terms extracted from instantiated classes: {𝑙𝑎𝑟𝑔𝑒, 𝑐𝑖𝑟𝑐𝑙𝑒, 𝑜𝑣𝑎𝑙} and the set associated with Class B is: {𝑙𝑎𝑟𝑔𝑒, 𝑐𝑖𝑟𝑐𝑙𝑒, 𝑜𝑣𝑎𝑙, 𝑔𝑟𝑒𝑒𝑛, 𝑐𝑜𝑙𝑜𝑟}. As shown in Figure 5, three simple terms are common to these two sets: {𝑙𝑎𝑟𝑔𝑒, 𝑐𝑖𝑟𝑐𝑙𝑒, 𝑜𝑣𝑎𝑙}. Therefore, Class A and B are similar to some degree. More details of how the degree of similarity is taken into account for comparison are discussed in later sections.

Figure 5: Two classes with their details.

3.2.7. Incoming Classes

Incoming class (IC) is another notion we use in this experiment as a similarity factor. Class A is considered an incoming class for Class B if Class B is instantiated in Class A. In Figure 6, DrawCircle is the incoming class for the PaintSurface class. Two classes are said to be similar if they are instantiated by the same class or by classes with similar names.

Figure 6: Two classes, one instantiates the other one.

3.2.8. Class Methods

The other employed factor in this work is method name (CM). Two classes are considered similar if they have methods with similar names. Figure 7 shows two classes with their contained methods. Class A has the following set of simple-names extracted from method names: {𝑝𝑎𝑖𝑛𝑡, 𝑠𝑢𝑟𝑓𝑎𝑐𝑒, 𝑔𝑒𝑡, 𝑐𝑜𝑙𝑜𝑟} and Class B has the following set: {𝑐𝑜𝑙𝑜𝑟, 𝑐𝑖𝑟𝑐𝑙𝑒, 𝑜𝑣𝑎𝑙, 𝑔𝑟𝑒𝑒𝑛, 𝑙𝑎𝑟𝑔𝑒}. Since there is one term common to both sets, Class A and B are similar to some degree.

3.2.9. Super Classes

Classes are also compared by their super classes. For each class, all the super class names are collected and pre-processed, and a set of simple-terms is generated. Similar to other similarity factors, the common simple-terms for each pair of classes are an indication of the degree of similarity. In Figure 8, there are two classes with some super classes for each. The set of simple-terms for A is: {𝑠ℎ𝑎𝑝𝑒, 𝑔𝑒𝑜𝑚𝑒𝑡𝑟𝑦} and for B is: {𝑠𝑞𝑢𝑎𝑟𝑒}. Since there is no common simple-term in these two sets, there is also
no similarity between these two classes based on this factor.

Figure 7: Two classes with their contained classes and methods.

Figure 8: Two classes with their super classes.

4. Clustering

Clustering is the task of splitting individuals into a number of groups or clusters where the members of a cluster are more similar to other members of the same cluster than to the members of other clusters. In this experiment, classes are considered as individuals; therefore, classes with more similarity would be clustered in one group. As mentioned in the previous section, there are nine different similarity factors considered in this work; therefore, for each Java project, clustering runs nine times, each time with a different factor. The goal is to measure how much a given factor, as a similarity criterion, conforms to the existing packaging in the system.

Clustering is an unsupervised learning technique that has applications in many different fields and domains. K-Means [21], Affinity propagation [22], DBSCAN [23] and Spectral Clustering [24] are a number of clustering algorithms, and the choice of algorithm depends on the nature of the data.

4.1. Spectral Clustering

In this experiment, we employ Spectral Clustering [24] to cluster the pool of classes (see Figure 1). The Spectral Clustering algorithm is based on eigendecomposition and is more suitable for a set of individuals where connectivity relations (e.g., similarity between two individuals) can be defined between them. Unlike other clustering algorithms (e.g., K-Means), Spectral Clustering requires the relations/similarity between individuals to be computed as a matrix in advance, so that eigendecomposition can be applied to that matrix. Therefore, Spectral Clustering was found to be a good fit as the clustering algorithm in this work. As seen in the previous section, each class gets a set of simple-terms (based on the applied similarity factor); the number of common simple-terms between the sets of two classes is considered as the measure of relation/similarity between the two classes. Therefore, an 𝑁 × 𝑁 similarity matrix (𝑁 is the number of classes) should be created for Spectral Clustering.

Spectral Clustering, like many other clustering algorithms, needs to know the number of clusters/groups in advance. As shown in Figure 1, the number of clusters for the applied clustering algorithm is 𝑃, where 𝑃 is the number of existing packages in the project. Once Spectral Clustering returns 𝑃 clusters of similar classes (based on a given similarity factor), one can compare those 𝑃 clusters against the existing 𝑃 packages in the system.

4.2. Clustering vs Packaging

The objective is to analyse the individual similarity factors and see how much each of them conforms to the packaging structure. To do this, the clustering that results from each factor needs to be compared against the packaging structure. Since the clustering produces 𝑃 clusters (𝑃 is the number of packages in the project), there are two sets of groups, each containing 𝑃 groups of classes. In order to measure the similarity between the two sets of groups, we make use of the Normalized Mutual Information [25] technique from scikit-learn (SKlearn) in Python. Normalized Mutual Information measures the similarity between two clusterings [26] and returns a value between 0 and 1. Given two clusterings produced by two different techniques, Normalized Mutual Information specifies how much these two clusterings are correlated. Figure 9 shows two clusterings where each clustering has three clusters with their members. As shown, there are some differences between the results of these two techniques. For instance, the first cluster of PS contains 𝑎1, 𝑎3 and 𝑎3 and the first cluster of CR contains 𝑎1, 𝑎3 and 𝑧1. The degree of similarity between the results of these two techniques by Normalized Mutual Information is 0.2804.

Figure 9: Two different clusterings.

Table 2
Accumulated Factors.
             Name                 Similarity
             JHotDraw             0.364
             Galaxy               0.818
             JavaFX               0.4562
             JavaParserCore       0.622
             JavaParserSymbol     0.58140
             Jung                 0.387

5. Evaluation

In this experiment, six different open-source Java projects are analysed (see Table 1). For each project, nine different similarity factors are separately employed to apply a clustering technique, and the result is compared against the packaging structure in the system. The nine factors are fully described in Section 3.2.
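
The per-factor comparison can be illustrated with a small, dependency-free Python sketch. It is not the authors' code: `similarity_matrix`, `entropy` and `nmi` are hypothetical helper names, the package and cluster labels are made up, and normalising by the arithmetic mean of the two entropies is an assumption about the variant of Normalized Mutual Information that is used. The clustering step itself is omitted; the matrix built here is the kind of precomputed 𝑁 × 𝑁 input that a spectral clustering implementation would take.

```python
from collections import Counter
from math import log


def similarity_matrix(term_sets):
    """N x N matrix whose (i, j) entry counts the simple terms shared by classes i and j."""
    n = len(term_sets)
    return [[len(term_sets[i] & term_sets[j]) for j in range(n)] for i in range(n)]


def entropy(labels):
    """Shannon entropy (natural logarithm) of a labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information divided by the mean of the two entropies (assumed variant)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(labels_a) + entropy(labels_b)) / 2
    # Two single-group labellings carry no information; treat them as identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


# Term sets of the four classes from Section 3.2.1 (class-name factor).
term_sets = [{"circle", "area"}, {"draw", "circle"}, {"colors"}, {"surface", "circle"}]
S = similarity_matrix(term_sets)

# Hypothetical labels: the package of each class and two possible clusterings.
packages = ["p1", "p1", "p2", "p2"]
same_grouping = [1, 1, 0, 0]        # same partition, different label names
unrelated_grouping = [0, 1, 0, 1]   # independent of the package labels

print(S[0])                               # shared-term counts for the first class
print(nmi(packages, same_grouping))       # close to 1: the partitions agree
print(nmi(packages, unrelated_grouping))  # close to 0: no agreement
```

With scikit-learn, a matrix such as `S` could for example be passed to `SpectralClustering(n_clusters=P, affinity="precomputed")`, and `normalized_mutual_info_score` could then compare the resulting cluster labels with the package labels.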

5.1. Results and Discussion
In total, 54 + 6 experiments are performed. The first 54 experiments cover the 6 projects, with each of the 9 individual similarity factors tested on each project. We run an extra experiment for each project in which the similarity factor is the accumulation of all 9 individual factors.

Figure 10: The percentage of each similarity between the applied clustering technique and the packaging structure using 9 similarity factors.
   Figures 10 to 15 show the percentage similarity between the applied clustering technique (Spectral Clustering) and the packaging structure.
   The first observation from the results is the association between class names and packaging. In four of the six projects, the class name is the factor with the highest similarity to the packaging; in Java Parser Core and JavaFX it has the second-highest score (see Table 3). The second observation is the association between super class names and packaging. Except for the JavaFX and Galaxy projects, super class names rank among the top three factors: first for Java Parser Core, second for JHotDraw and Jung, and third for Java Parser Symbol. For one project (JavaFX), method names have a relatively higher association with the packaging than in the other projects; in Table 3, the outgoing method name is the highest-scoring factor for JavaFX. On the other hand, class instantiation (incoming and outgoing classes) has, on average, a smaller association with the packaging.
   As mentioned earlier, six extra experiments are performed to see the overall impact of the similarity factors when they are all accumulated. Table 2 shows the result for each project. The Galaxy project has the strongest naming association with the packaging (0.818), followed by Java Parser Core (0.622) and Java Parser Symbol (0.5814).

Figure 11: The percentage of each similarity between the applied clustering technique and the packaging structure using 9 similarity factors.

6. Conclusion
In this work, we presented a comparative analysis of six different Java projects to discover the applied packaging strategy from a textual and naming point of view. Our findings (see Table 3) illustrate that there is, to some extent, textual similarity among the components of each package (the first research question). On average, the textual similarity is strongest when class names are chosen as the similarity factor (the second research question).
Table 3
Details of all experiments for the 6 subject systems. Green indicates the applied factor that shows the highest similarity between the packaging structure and the clustering technique, orange indicates the second highest, and blue indicates the third highest.

              Class     Method    Field     Variable  Super     In        Out       In        Out
              Name      Name      Name      Accessed  Class     Class     Class     Method    Method
                                            Name      Name      Name      Name      Name      Name
JHotDraw      0.6264    0.4674    0.454     0.3415    0.4989    0.4636    0.3658    0.312     0.3638
JavaFX        0.449     0.392     0.4139    0.446     0.357     0.3286    0.368     0.425     0.453
J-P Core      0.6159    0.332     0.392     0.3937    0.635     0.455     0.336     0.393     0.3977
J-P Symbol    0.6783    0.552     0.423     0.4392    0.534     0.3219    0.361     0.452     0.37
Jung          0.458     0.313     0.243     0.254     0.331     0.24      0.162     0.226     0.189
Galaxy        0.803     0.755     0.748     0.7518    0.734     0.713     0.6913    0.689     0.714
Average       0.6051    0.468567  0.44565   0.4377    0.514983  0.42035   0.380683  0.416167  0.414583
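
As a small sanity check on the table above, the following sketch recomputes the per-factor averages and ranks the factors. The scores are transcribed from Table 3; the variable and function names are hypothetical and chosen only for this illustration.

```python
# Factor names in the column order of Table 3.
FACTORS = [
    "Class Name", "Method Name", "Field Name", "Variable Accessed Name",
    "Super Class Name", "In Class Name", "Out Class Name",
    "In Method Name", "Out Method Name",
]

# Similarity scores transcribed from Table 3 (one row per project).
TABLE_3 = {
    "JHotDraw":   [0.6264, 0.4674, 0.454, 0.3415, 0.4989, 0.4636, 0.3658, 0.312, 0.3638],
    "JavaFX":     [0.449, 0.392, 0.4139, 0.446, 0.357, 0.3286, 0.368, 0.425, 0.453],
    "J-P Core":   [0.6159, 0.332, 0.392, 0.3937, 0.635, 0.455, 0.336, 0.393, 0.3977],
    "J-P Symbol": [0.6783, 0.552, 0.423, 0.4392, 0.534, 0.3219, 0.361, 0.452, 0.37],
    "Jung":       [0.458, 0.313, 0.243, 0.254, 0.331, 0.24, 0.162, 0.226, 0.189],
    "Galaxy":     [0.803, 0.755, 0.748, 0.7518, 0.734, 0.713, 0.6913, 0.689, 0.714],
}


def factor_averages(table):
    """Average each factor's score over all projects."""
    n = len(table)
    return {
        factor: sum(row[i] for row in table.values()) / n
        for i, factor in enumerate(FACTORS)
    }


def ranked(scores):
    """Return the keys of `scores` ordered from highest to lowest value."""
    return sorted(scores, key=scores.get, reverse=True)


averages = factor_averages(TABLE_3)
for factor in ranked(averages):
    print(f"{factor:24s}{averages[factor]:.4f}")

# Highest-scoring factor for each individual project.
top_per_project = {
    project: FACTORS[row.index(max(row))] for project, row in TABLE_3.items()
}
print(top_per_project)
```

On these values, the three highest column averages are the class name, the super class name and the method name, in that order.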


Figure 12: The percentage of each similarity between the applied clustering technique and the packaging structure using 9 similarity factors.

Figure 13: The percentage of each similarity between the applied clustering technique and the packaging structure using 9 similarity factors.

Figure 14: The percentage of each similarity between the applied clustering technique and the packaging structure using 9 similarity factors.

Figure 15: The percentage of each similarity between the applied clustering technique and the packaging structure using 9 similarity factors.


The second factor, after class names, that shows strong similarities among packages’ components is, on average, the super class name. This also indicates that most inheritance relationships stay within packages, which is potentially an indication of high cohesion within packages and low coupling between packages. Method names, as the third strongest factor, on average show relatively high similarity among the packages’ components.
   Although we can confirm that a couple of patterns are common to all projects (similarity of class names), almost every project still behaves differently. This can be further confirmed by the results in Table 2, where each project shows a different aggregated degree of packaging similarity, ranging from 0.364 to 0.818.
   Looking from another angle, since class names score high as a similarity factor among the contents of a package, they can potentially be used to validate the relevancy within a package or other architectural construct. This claim, however, requires more experimentation on a larger number of subject systems.
   This research is based only on artifact (class, method and variable) names; therefore, the developers’ naming style plays an important role in the results.
   In future work, we plan to include other similarity factors, such as factors that define the functionality of the artefacts. The long-term objective is to use these ‘naturally’ occurring similarities in the packaging of software in the ‘wild’ to build developer-friendly architecture conformance protocols that help prevent architectural erosion.

References
 [1] M.-A. Storey, C. Best, J. Michaud, Shrimp views: An interactive environment for exploring java programs, in: Proceedings 9th International Workshop on Program Comprehension. IWPC 2001, IEEE, 2001, pp. 111–112.
 [2] M. Shaw, R. DeLine, D. V. Klein, T. L. Ross, D. M. Young, G. Zelesnik, Abstractions for software architecture and tools to support them, IEEE transactions on software engineering 21 (1995) 314–335.
 [3] J. Veit, Modules, Components, and Elements – Software Architecture Terms explained (2021). URL: https://dev.to/jessica_veit/modules-components-and-elements-software-architecture-terms-explained-g59.
 [4] Tutisani, Modular Software Architecture - Tutisani Consulting, 2021. URL: https://www.tutisani.com/software-architecture/modular-software-architecture.html.
 [5] J. T. Taylor, W. T. Taylor, Software architecture, in: Patterns in the Machine, Springer, 2021, pp. 63–82.
 [6] S. A. Ebad, M. Ahmed, Investigating the effect of software packaging on modular structure stability, Computer Systems Science and Engineering 34 (2019) 283–296.
 [7] M. Shaw, D. Garlan, Formulations and formalisms in software architecture, in: Computer Science Today, Springer, 1995, pp. 307–323.
 [8] Vasiliy, 5 Most Popular Package Structures for Software Projects, 2020. URL: https://www.techyourchance.com/popular-package-structures/.
 [9] R. C. Martin, J. Grenning, S. Brown, Clean architecture: a craftsman’s guide to software structure and design, Prentice Hall, 2018.
[10] M. Shaw, Architectural issues in software reuse: It’s not just the functionality, it’s the packaging, in: Proceedings of the 1995 Symposium on Software reusability, 1995, pp. 3–6.
[11] D. E. Knuth, An empirical study of fortran programs, Software: Practice and experience 1 (1971) 105–133.
[12] E. Hautus, Improving java software through package structure analysis, in: IASTED International Conference Software Engineering and Applications, 2002, pp. 1–5.
[13] V. Gupta, J. K. Chhabra, Package coupling measurement in object-oriented software, Journal of computer science and technology 24 (2009) 273–283.
[14] P. J. Kaur, S. Kaushal, A. K. Sangaiah, F. Piccialli, A framework for assessing reusability using package cohesion measure in aspect oriented systems, International Journal of Parallel Programming 46 (2018) 543–564.
[15] A. Prajapati, J. K. Chhabra, Madhs: Many-objective discrete harmony search to improve existing package design, Computational Intelligence 35 (2019) 98–123.
[16] H. Abdeen, S. Ducasse, H. Sahraoui, I. Alloui, Automatic package coupling and cycle minimization, in: 2009 16th Working Conference on Reverse Engineering, IEEE, 2009, pp. 103–112.
[17] J. Al Dallal, Identifying refactoring opportunities in object-oriented code: A systematic literature review, Information and software Technology 58 (2015) 231–249.
[18] G. Baxter, M. Frean, J. Noble, M. Rickerby, H. Smith, M. Visser, H. Melton, E. Tempero, Understanding the shape of java software, in: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications, 2006, pp. 397–412.
[19] H. Abdeen, S. Ducasse, H. Sahraoui, Modularization metrics: Assessing package organization in legacy large object-oriented software, in: 2011 18th Working Conference on Reverse Engineering, IEEE, 2011, pp. 394–398.
[20] JavaParser.org, JavaParser - Home, 2021. URL: https://javaparser.org.
[21] J. A. Hartigan, M. A. Wong, A k-means clustering algorithm, Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1979) 100–108.
[22] K. Wang, J. Zhang, D. Li, X. Zhang, T. Guo, Adaptive affinity propagation clustering, arXiv preprint arXiv:0805.1096 (2008).
[23] K. Khan, S. U. Rehman, K. Aziz, S. Fong, S. Sarasvady, Dbscan: Past, present and future, in: The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), IEEE, 2014, pp. 232–238.
[24] J. Liu, J. Han, Spectral clustering, in: Data Clustering, Chapman and Hall/CRC, 2018, pp. 177–200.
[25] R. Koopman, S. Wang, Mutual information based labelling and comparing clusters, Scientometrics 111 (2017) 1157–1167.
[26] A. F. McDaid, D. Greene, N. Hurley, Normalized mutual information to evaluate overlapping community finding algorithms, arXiv preprint arXiv:1110.2515 (2011).