Experiences in Replicating an Experiment on Comparing Static and Dynamic Coupling Metrics

Richard Müller¹, Dirk Mahler² and Christopher Klinkmüller³

¹ ipoque GmbH, Leipzig, Germany
² BUSCHMAIS GbR, Dresden, Germany
³ CSIRO Data61, Sydney, Australia

SSP'21: Symposium on Software Performance, November 09–10, 2021, Leipzig, Germany
richard.mueller@rohde-schwarz.com (R. Müller); dirk.mahler@buschmais.com (D. Mahler); christopher.klinkmueller@data61.csiro.au (C. Klinkmüller)
ORCID: 0000-0001-6730-4082 (R. Müller); 0000-0002-5926-2238 (C. Klinkmüller)

Abstract
In software engineering, coupling metrics are used to assess the quality of a software system's architecture, especially its maintainability and understandability. On an abstract level, two types of coupling metrics can be distinguished: static metrics are based solely on source and/or byte code, whereas dynamic metrics also take observed run-time behavior into account. While both metric types complement each other, a recent study by Schnoor and Hasselbring suggests that the two types are correlated. This observation indicates that, to a certain degree, both metric types encode redundant information. In this paper, we replicate the original experiment using the same data but a different tool set. Our successful replication hence substantiates the original findings. Moreover, the summary of our experience provides valuable insights to researchers who want to ensure reproducibility and replicability of their experiments. Following open science principles, we publish all data and scripts online.

Keywords
Software Metrics, Monitoring, Dynamic Analysis, Static Analysis, Open Science, Replication

1. Introduction

In software engineering, a common design principle for improving the quality of a system architecture, especially its maintainability and understandability, is to ensure high cohesion within modules and low coupling between modules [1, 2]. Coupling metrics serve as indicators for the degree to which this design principle is adhered to and thus for the quality of the architecture. A fundamental metric is the coupling degree of a module, which measures the number of connections between a module and other system modules [3], for example, between classes or packages. It can be restricted to certain kinds of connections, such as method calls, types of member variables, or types of thrown exceptions. Depending on whether the connections were derived via static analysis of source and/or byte code, or via dynamic analysis of monitoring logs, coupling degrees (or metrics) are referred to as static or dynamic, respectively.

Schnoor and Hasselbring [3] recently investigated the extent to which the information provided by dynamic coupling metrics complements the information captured by static metrics, and vice versa. To this end, they empirically analyzed coupling orders of modules for a specific software system. That is, they compared rankings of the system's modules obtained by ordering the modules with respect to a variety of static and dynamic coupling degrees. The observation that rankings based on static coupling degrees are not statistically independent from rankings based on dynamic coupling degrees led Schnoor and Hasselbring to the conclusion that the information provided by both types of coupling degrees is related.
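To make the notion of a coupling degree more concrete, the following minimal Python sketch counts, for each module of a toy system, the number of other modules it calls (import coupling) and is called by (export coupling), two variants that are described in Section 2. The module names and call edges are purely illustrative, and the sketch does not reproduce the exact counting rules of the original study (e.g., the treatment of constructor and lambda calls discussed in Section 4).

    from collections import defaultdict

    # Illustrative aggregated call relationships (caller module, callee module).
    # The module names are made up for this example.
    calls = [
        ("shop.Order", "shop.Customer"),
        ("shop.Order", "billing.Invoice"),
        ("billing.Invoice", "shop.Customer"),
        ("web.OrderController", "shop.Order"),
    ]

    import_partners = defaultdict(set)  # modules that a given module calls
    export_partners = defaultdict(set)  # modules that call a given module

    for caller, callee in calls:
        if caller != callee:  # self-calls do not contribute to coupling
            import_partners[caller].add(callee)
            export_partners[callee].add(caller)

    for module in sorted(set(import_partners) | set(export_partners)):
        imp = len(import_partners[module])
        exp = len(export_partners[module])
        # Combined coupling is the sum of import and export coupling (cf. Section 2).
        print(f"{module}: import={imp}, export={exp}, combined={imp + exp}")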
In this paper, we replicate the experiment conducted by Schnoor and Hasselbring [3] to verify the previously observed findings. While we rely on the input data provided by the authors and follow their procedure for computing and comparing the coupling degrees, we apply a different tool set to process and analyze the data. This change implies that we go beyond a reproduction of the original experiment¹. Additionally, we report our experiences from the replication study, explicating challenges that we faced and that are of interest to researchers who want to ensure reproducibility and replicability of their experiments.

In the remainder of the paper, we summarize the original experiment in Section 2. We then outline our replication setup and results in Section 3 and discuss our experiences in Section 4. Finally, Section 5 concludes the paper. Following open science principles [4], all scripts used in this replication study are publicly available and can be executed online².

2. Original Experiment

Schnoor and Hasselbring [3] empirically examined the extent to which different coupling degrees provide similar information. For this purpose, they investigated Atlassian Jira³, a commercial issue tracking system, in a series of four experiments. Each experiment spanned a period of four weeks in which the system was actively used by students participating in a mandatory programming course and in which data reflecting its runtime behavior was collected. Note that the experiments relied on different Jira versions, as Schnoor and Hasselbring updated their installation over time; within each experiment, however, the same version was used. As a consequence, the system's architecture and runtime behavior varied, albeit slightly, across the experiments. Table 1 summarizes the main characteristics of the experiments.

For each experiment, the same analytical procedure was followed. First, static and dynamic dependency graphs based on method calls were created. The static dependency graph was derived from a byte code scan of the corresponding Jira version using BCEL⁴. By contrast, the dynamic dependency graph was derived from monitoring data collected at runtime during the experiment with the monitoring tool Kieker [5]. For each experiment, the monitoring data is provided as a publicly available dataset⁵⁻⁸. Each dataset yields compressed binary files (.bin.xz) and a Kieker map file (.map) that contains the pseudonymized fully qualified class names.

¹ We follow the terminology defined by the Association for Computing Machinery at https://www.acm.org/publications/policies/artifact-review-and-badging-current.
² https://github.com/softvis-research/coupling-metrics-replication
³ https://www.atlassian.com/de/software/jira
⁴ https://commons.apache.org/proper/commons-bcel/
⁵ https://doi.org/10.5281/zenodo.3648094 (experiment 1)
⁶ https://doi.org/10.5281/zenodo.3648228 (experiment 2)
⁷ https://doi.org/10.5281/zenodo.3648240 (experiment 3)
⁸ https://doi.org/10.5281/zenodo.3648269 (experiment 4)

Table 1
Number of users, number of monitored method calls, and Jira version used in each of the four experiments [3].
#  Date            Users  Method calls     Jira version
1  February 2017      19    196,442,044    7.3.0
2  September 2017     48    854,657,027    7.4.3
3  February 2018      16    475,357,185    7.7.1
4  September 2018     58  2,409,688,701    7.7.1

Next, the coupling degrees were derived from the dependency graphs. The nodes of these graphs are program modules, i.e., either classes or packages. The edges are aggregated call relationships between modules and can carry weights representing the number of calls. Depending on whether a coupling metric takes these weights into account, it is referred to as weighted or unweighted. Note that weighted metrics are omitted for the static dependency graph, because it is derived from byte code and compiler optimizations may produce different call counts. Furthermore, the direction of the calls is distinguished: outgoing and incoming calls yield the import and export coupling degree of a program module, respectively. The sum of import and export coupling is referred to as combined coupling.

Finally, the differences between the coupling metrics were studied by comparing the rankings obtained by ordering the program modules by their coupling degrees, using the Kendall-Tau distance [6]. Distance values smaller than 0.5 indicate that two orders are closer together than expected for two random orders; values larger than 0.5 indicate the opposite. Combining the above-mentioned variants (class or package modules; import, export, or combined coupling; and the three pairings of static, dynamic unweighted, and dynamic weighted analysis) results in 18 comparisons. The results were presented using triples of the form 𝛼 : 𝛽1 ↔ 𝛽2, where
• 𝛼 is 𝑐 or 𝑝, expressing class or package coupling,
• 𝛽1 is 𝑠 or 𝑢, expressing whether the left-hand side analysis is static or dynamic unweighted,
• 𝛽2 is 𝑢 or 𝑤, expressing whether the right-hand side analysis is dynamic unweighted or dynamic weighted.

Schnoor and Hasselbring observed that the distance values for all pairs of compared rankings were smaller than 0.5. Table 2 shows the obtained distance values for all four experiments. These observations led the authors to the conclusion that coupling metrics obtained from static and dynamic analysis encode similar information.

Table 2
Coupling analysis results of all four original experiments [3], including the differences (+/-) of the replication study.

(a) Experiment 1
          𝑐:𝑠↔𝑢       𝑐:𝑠↔𝑤       𝑐:𝑢↔𝑤       𝑝:𝑠↔𝑢       𝑝:𝑠↔𝑤       𝑝:𝑢↔𝑤
import    0.31 +0.01  0.36 +0.01  0.13 -0.01  0.33        0.36 -0.01  0.08
export    0.41        0.41 -0.01  0.24        0.30        0.32 -0.01  0.21 -0.01
combined  0.35        0.41 -0.01  0.29 -0.01  0.29        0.33 -0.01  0.23 -0.01
average   0.35        0.39        0.22 -0.01  0.31        0.33        0.17

(b) Experiment 2
          𝑐:𝑠↔𝑢       𝑐:𝑠↔𝑤       𝑐:𝑢↔𝑤       𝑝:𝑠↔𝑢       𝑝:𝑠↔𝑤       𝑝:𝑢↔𝑤
import    0.30 +0.02  0.36 +0.01  0.14        0.31        0.35        0.09 -0.01
export    0.41        0.43 -0.01  0.26 -0.01  0.30        0.33        0.22 -0.01
combined  0.34 +0.01  0.41 -0.01  0.31 -0.01  0.28        0.33 -0.01  0.23
average   0.35 +0.01  0.40        0.24 -0.01  0.30        0.33        0.18

(c) Experiment 3
          𝑐:𝑠↔𝑢       𝑐:𝑠↔𝑤       𝑐:𝑢↔𝑤       𝑝:𝑠↔𝑢       𝑝:𝑠↔𝑤       𝑝:𝑢↔𝑤
import    0.38        0.42 +0.01  0.12        0.37        0.39        0.06
export    0.38        0.40        0.22        0.28        0.31        0.20
combined  0.36        0.40 +0.01  0.28        0.30        0.33        0.23 +0.01
average   0.37        0.41        0.21        0.32        0.35        0.17

(d) Experiment 4
          𝑐:𝑠↔𝑢       𝑐:𝑠↔𝑤       𝑐:𝑢↔𝑤       𝑝:𝑠↔𝑢       𝑝:𝑠↔𝑤       𝑝:𝑢↔𝑤
import    0.37        0.42        0.12        0.36        0.39        0.06
export    0.38        0.40        0.23        0.28        0.32        0.20
combined  0.35        0.40 +0.01  0.29        0.30        0.33        0.24
average   0.37        0.41        0.21        0.31        0.35        0.17

3. Replication

The major steps of the replication process with the corresponding inputs and outputs are summarized in Figure 1. First, we used jQAssistant⁹ and its scanner plugins for Kieker [7, 8] and for Java byte code [9] to process the Kieker traces provided by Schnoor and Hasselbring as well as Jira's Java byte code. This resulted in static and dynamic dependency graphs that contain call relationships at method level, which we stored in a Neo4j database¹⁰. Second, we aggregated the method calls at class and package level using custom Neo4j Cypher queries specified as jQAssistant concepts [9]. Third, we queried the graphs using Cypher to calculate the import, export, and combined coupling degrees for each module. Finally, we compared the coupling orders of the modules using the Kendall-Tau distance. We implemented steps 1 and 2 as batch scripts and steps 3 and 4 in a Jupyter notebook¹¹. The complete replication package, including the batch scripts, the Jupyter notebook, and instructions, is provided in a GitHub repository and can be executed online, see Footnote 2.

Figure 1: Flowchart of the replication process. Kieker traces (*.bin.xz, *.map) and Java byte code (*.class, *.jar) are (1) scanned into dynamic and static call graphs at method level, (2) aggregated into call graphs at class and package level, (3) queried for coupling metrics (import, export, combined), and (4) compared via the Kendall-Tau distance of the resulting coupling orders.

⁹ https://jqassistant.org/
¹⁰ https://neo4j.com/
¹¹ https://jupyter.org/

The results of the replication study are also shown in Table 2. The slight differences between the originally reported and the replicated values are probably due to the following two circumstances. First, for unknown reasons, a few classes that we discovered in the static dependency analysis were not included in the static dependency graphs of the original experiment, although they were included in its dynamic dependency graphs. Second, the Kieker map files of experiments 1 and 2 contain an empty key-value pair in the second line ($1=). Here, a class that was present in the original experiments might be missing from the published datasets.

4. Discussion

Using our setup, we were able to replicate the Kendall-Tau distances from the original experiment and thus to substantiate the findings by Schnoor and Hasselbring. However, we faced some challenges that complicated the replication. As we could eventually resolve all of them, not least thanks to the help of the authors, we share our experiences in the following, hoping to contribute to a better understanding of the level of detail with which experiments must be reported and documented in order to warrant their successful reproduction or replication.

We encountered two challenges regarding the data that was available for the replication. First, in addition to different Jira versions, Atlassian also distinguishes between different Jira variants, for example, Jira Core and Jira Software. As information regarding the variant was not specified in [3], it was unclear which variant served as the basis for the static analysis. Second, the names of packages, classes, and methods were pseudonymized in the publicly available monitoring data. In more detail, the fully qualified Java names were replaced with versions in which each of a name's components was replaced with its hash. For example, a fully qualified class name like package.subpackage.class$innerclass would be represented as hash(package).hash(subpackage).hash(class$innerclass).
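To illustrate how two coupling orders were compared, the following Python sketch computes a normalized Kendall-Tau distance between two module rankings by counting discordant pairs: 0 means identical orders, 1 means completely reversed orders, and roughly 0.5 is expected for two random orders. This is only a sketch under simplifying assumptions: it ignores ties and the secondary sorting by clear names discussed in Section 4, and it is neither the implementation used by Schnoor and Hasselbring nor the one in our replication package. The module names are hypothetical.

    from itertools import combinations

    def kendall_tau_distance(order_a, order_b):
        """Normalized Kendall-Tau distance between two orderings of the same items.

        0.0 for identical orders, 1.0 for reversed orders; about 0.5 is the
        expectation for two random orders. Ties are not handled here.
        """
        assert set(order_a) == set(order_b), "both orders must rank the same modules"
        pos_a = {item: i for i, item in enumerate(order_a)}
        pos_b = {item: i for i, item in enumerate(order_b)}
        discordant = 0
        for x, y in combinations(order_a, 2):
            # A pair is discordant if the two orders disagree on its relative order.
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
                discordant += 1
        n = len(order_a)
        return discordant / (n * (n - 1) / 2)

    # Hypothetical rankings of four modules, e.g., by static vs. dynamic import coupling.
    static_order = ["ModuleA", "ModuleB", "ModuleC", "ModuleD"]
    dynamic_order = ["ModuleB", "ModuleA", "ModuleC", "ModuleD"]
    print(kendall_tau_distance(static_order, dynamic_order))  # 1 of 6 pairs discordant -> ~0.17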
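The following minimal Python sketch illustrates this pseudonymization scheme. The concrete hash function used for the published datasets is not stated here, so MD5 serves purely as a stand-in for illustration.

    import hashlib

    def pseudonymize(fully_qualified_name: str) -> str:
        """Replace each dot-separated component of a fully qualified Java name
        with a hash of that component, as described above (hash function assumed)."""
        return ".".join(
            hashlib.md5(component.encode("utf-8")).hexdigest()
            for component in fully_qualified_name.split(".")
        )

    # "class$innerclass" is hashed as a single component, so the distinction
    # between a class and its inner classes is no longer visible afterwards.
    print(pseudonymize("package.subpackage.class$innerclass"))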
On the one hand, we were thus not able to distinguish between class and inner class calls. On the other hand, the pseudonymization also affected the creation of the module rankings, because the clear names were used as a secondary sorting criterion in the original experiment. We therefore had to resolve the pseudonymized names in the monitoring data and replace them with the clear names from the Java byte code.

Furthermore, a few challenges occurred during the analysis of the dependency graphs and module rankings. First, we tested different Python implementations of the Kendall-Tau distance. All of them produced slightly different results, none of which matched the results from the original experiment. In the end, we implemented our own calculation of the Kendall-Tau distance. Second, not all call relationships between classes and packages were actually considered by Schnoor and Hasselbring, and we found that the selection criteria specified in [3] did not yield the desired results. With the help of the authors, we were able to determine that lambda method calls indeed had to be excluded, but constructor calls had to be included. Lastly, we identified a mistake in the documentation of the results of the original experiment: Table 10 in the original paper shows the average export coupling degrees of the four experiments, and for the static dependency graphs these values were calculated using the static weighted dependency graph, even though only unweighted coupling degrees were used for the static analysis in the original experiment. As we used the reported averages in Table 10 to confirm the correctness of our replication setup, we first suspected a problem in our own setup, which led to significant verification efforts.

In summary, we encountered a variety of challenges related to the available data, the applied method, and the documented results. We would like to emphasize that none of the identified challenges led to significant differences between the original and our replicated results. However, investigating and rectifying these issues required time and effort, especially since their effects influenced each other and fixing one issue often revealed a new one.

5. Conclusion

We successfully replicated the experiment by Schnoor and Hasselbring [3], which empirically investigated differences between static and dynamic coupling metrics. Hence, we could substantiate the authors' finding that static and dynamic coupling degrees are not statistically independent. Furthermore, we discussed our experiences and, following open science principles, made all scripts of the replication study available in a reproduction package that can be executed online, see Footnote 2.

In general, researchers who strive to ensure reproducibility or replicability should check whether the empirical data, the applied methods, and the obtained results are sufficiently well described in the paper and/or documented elsewhere. In this regard, we greatly benefited from the help of the authors and hence encourage researchers who plan a replication study to contact the authors of the original study as early as possible. A specific recommendation arising from our replication study is that researchers who pseudonymize or encrypt (parts of) their data should review and explicate how this affects the reproduction or replication of their results.
Acknowledgments

We would like to thank the authors of the original experiment, Henning Schnoor and Wilhelm Hasselbring, for publishing their data and for their great support in solving the challenges of this replication study.

References

[1] I. Candela, G. Bavota, B. Russo, R. Oliveto, Using Cohesion and Coupling for Software Remodularization: Is It Enough?, ACM Trans. Softw. Eng. Methodol. 25 (2016). doi:10.1145/2928268.
[2] J. Bogner, S. Wagner, A. Zimmermann, Automatically Measuring the Maintainability of Service- and Microservice-Based Systems: A Literature Review, in: Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement, IWSM Mensura '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 107–115. doi:10.1145/3143434.3143443.
[3] H. Schnoor, W. Hasselbring, Comparing Static and Dynamic Weighted Software Coupling Metrics, Computers 9 (2020) 24. doi:10.3390/computers9020024.
[4] D. Mendez, D. Graziotin, S. Wagner, H. Seibold, Open Science in Software Engineering, in: M. Felderer, G. H. Travassos (Eds.), Contemporary Empirical Methods in Software Engineering, Springer International Publishing, Cham, 2020, pp. 477–501. doi:10.1007/978-3-030-32489-6_17.
[5] W. Hasselbring, A. van Hoorn, Kieker: A monitoring framework for software engineering research, Software Impacts 5 (2020) 1–5. doi:10.1016/j.simpa.2020.100019.
[6] L. Briand, K. E. Emam, S. Morasca, On the application of measurement theory in software engineering, Empirical Software Engineering 1 (1996) 61–88. doi:10.1007/BF00125812.
[7] R. Müller, M. Fischer, Graph-Based Analysis and Visualization of Software Traces, in: 10th Symposium on Software Performance, Würzburg, Germany, 2019.
[8] R. Müller, T. Strempel, Graph-Based Performance Analysis at System- and Application-Level, in: 11th Symposium on Software Performance, Leipzig, Germany, 2020.
[9] R. Müller, D. Mahler, M. Hunger, J. Nerche, M. Harrer, Towards an Open Source Stack to Create a Unified Data Source for Software Analysis and Visualization, in: 6th IEEE Working Conference on Software Visualization, IEEE, Madrid, Spain, 2018. doi:10.1109/VISSOFT.2018.00019.