       Towards Meaningful Software Metrics Aggregation

                     Maria Ulan, Welf Löwe, Morgan Ericsson, Anna Wingkvist
                      Department of Computer Science and Media Technology
                                 Linnaeus University, Växjö, Sweden
                  {maria.ulan | welf.lowe | morgan.ericsson | anna.wingkvist}@lnu.se



[Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: D. Di Nucci, C. De Roover (eds.): Proceedings of the 18th Belgium-Netherlands Software Evolution Workshop, Brussels, Belgium, 28-11-2019, published at http://ceur-ws.org]

                        Abstract

    Aggregation of software metrics is a challenging task, and it becomes even more complex when weights are used to indicate the relative importance of individual metrics. These weights are mostly determined manually, which results in subjective quality models that are hard to interpret. To address this challenge, we propose an automated aggregation approach based on the joint distribution of software metrics. To evaluate the effectiveness of our approach, we conduct an empirical study on maintainability assessment for around 5 000 classes from open source software systems written in Java and compare our approach with a classical weighted linear combination approach in the context of maintainability scoring and anomaly detection. The results show that the two approaches assign similar scores, while our approach is more interpretable, sensitive, and actionable.

  Index terms— Software metrics, Aggregation, Weights, Copula

1    Introduction

Quality models provide a basic understanding of what data to collect and what software metrics to use. However, they do not prescribe how software (sub-)characteristics should be quantified or how metrics should be aggregated.
   The problem of metrics aggregation has been addressed by the research community. Metrics are often defined at a method or class level, but quality assessment sometimes requires insights at the system level. One bad metric value can be evened out by other good metric values when summing them up or computing their mean [1]. Some effort has been directed into metrics aggregation based on inequality indices [2, 3] and based on thresholds [4–8] to map source code level measurements to a software system rating.
   In this research, we do not consider aggregation along the structure of software artifacts, e.g., from classes to the system. We focus on another type of metrics aggregation, from low-level to higher-level quality properties; Mordal-Manet et al. call this type of aggregation metrics composition [9].
   Different software quality models that use weighted metrics aggregation have been proposed, such as QMOOD [10], QUAMOCO [11], SIG [12], SQALE [13], and SQUALE [14]. The weights in these models are defined based on experts' opinions or surveys. It is questionable whether manual weighting and combination of the values with an arbitrary (not necessarily linear) function are acceptable operations for metrics of different scales and distributions.
   As a countermeasure, we propose to use a probabilistic approach for metrics aggregation. In previous research, we considered software metrics to be equally important and developed a software metrics visualization tool. This tool allowed the user to define and manipulate quality models to reason about where quality problems were located, and to detect patterns, correlations, and anomalies [15].
   Here, we define metrics scores by probability as complementary Cumulative Distribution Functions and link them with the joint probability by the so-called copula function. We determine weights from the joint distribution and aggregate software metrics by a weighted product of the scores. We formalize quality models to express quality as the probability of observing a software artifact with equal or better quality.
This approach is objective since it relies solely on data, it allows quality models to be modified on the fly, and it creates a realistic scale since the distribution represents the quality scores of a set of software artifacts.

2    Approach Overview

We consider a joint distribution of software metrics values, and for each software artifact, we assign a probabilistic score. W.l.o.g., we assume that all software metrics are defined such that larger values indicate lower quality. The joint distribution of software metrics provides the means of objective comparison of software artifacts in terms of their quality scores, which represent the relative rank of a software artifact within the set of all software artifacts observed so far, i.e., how good or bad a quality score is compared to other quality scores.
   Let A = {a_1, ..., a_k} be a set of k software artifacts, and M = {m_1, ..., m_n} be a set of n software metrics. Each software artifact is assessed by metrics from M, and the result of this assessment is represented as a k × n performance matrix of metrics values.
   We denote by e_j(a_i), for all i ∈ {1, ..., k} and j ∈ {1, ..., n}, the (i, j)-entry, which shows the degree of performance for a software artifact a_i measured for metric m_j. We denote by E_j = [e_j(a_1), \dots, e_j(a_k)]^T \in \mathbb{E}_j^k the j-th column of the performance matrix, which represents the metrics values of all software artifacts with respect to metric m_j, where \mathbb{E}_j is the domain of these values.
   For each software artifact a_i ∈ A and metric m_j ∈ M, we define a score s_j(a_i), which indicates the degree to which this software artifact meets the requirements for the metric. Formally, for each metric m_j we define a score function s_j:

    e_j(a) : A \mapsto \mathbb{E}_j
    s_j(e) : \mathbb{E}_j \mapsto [0, 1]                                    (1)

   Based on the score functions s_j for each metric, our goal is to define an overall score function such that, for any software artifact, it indicates the degree to which this software artifact satisfies all metrics. Formally, we are looking for a function:

    F(s_1, \dots, s_n) : [0, 1]^n \mapsto [0, 1]                            (2)

   Such an aggregation function takes an n-tuple of metrics scores and returns a single overall score. We require the following properties:

  1. If a software artifact does not meet the requirements for one of the metrics, the overall score should be close to zero.

    F(s_1, \dots, s_n) \to 0 \text{ as } s_j \to 0                          (3)

  2. If all scores of one software artifact are greater than or equal to all scores of another software artifact, the same should be true for the overall scores.

    s_{i1} \geq s_{l1} \wedge \dots \wedge s_{in} \geq s_{ln} \Rightarrow F(s_{i1}, \dots, s_{in}) \geq F(s_{l1}, \dots, s_{ln}),
    \text{where } s_{ij} = s_j(e_j(a_i)), \; s_{lj} = s_j(e_j(a_l))         (4)

  3. If the software artifact perfectly meets all but one metric, the overall score is equal to that metric's score.

    F(1, \dots, 1, s_j, 1, \dots, 1) = s_j                                  (5)

   We propose to express the degree of satisfaction with respect to a metric using probability. We define the score function of Equation (1) as follows:

    s_j(e_j(a)) = \Pr(E_j > e_j(a)) = \mathrm{CCDF}_{e_j}(a)                (6)

   We calculate the Complementary Cumulative Distribution Function (CCDF). This score represents the probability of finding another software artifact with an evaluation value greater than the given value. For the multi-criteria case, we can specify a joint distribution in terms of n marginal distributions and a so-called copula function [16]:

    \mathrm{Cop}(\mathrm{CCDF}_{e_1}(a), \dots, \mathrm{CCDF}_{e_n}(a)) = \Pr(E_1 > e_1(a), \dots, E_n > e_n(a))     (7)

   The copula representation of a joint probability distribution allows us to model both marginal distributions and dependencies. The copula function Cop satisfies the signature (2) and fulfills the required properties (3), (4), and (5).
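To make Equations (6) and (7) concrete, the sketch below estimates both quantities empirically from a k × n performance matrix. It is a minimal illustration under the paper's assumption that larger metric values indicate lower quality; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def ccdf_scores(E):
    """Per-metric scores s_j(e_j(a)) = Pr(E_j > e_j(a)), cf. Equation (6).

    E is a k x n performance matrix (rows: artifacts, columns: metrics),
    where larger values indicate lower quality."""
    k, n = E.shape
    S = np.empty((k, n))
    for j in range(n):
        col = E[:, j]
        # fraction of artifacts with a strictly larger (i.e., worse) value
        S[:, j] = (col[None, :] > col[:, None]).mean(axis=1)
    return S

def empirical_copula_value(E, i):
    """Joint survival probability Pr(E_1 > e_1(a_i), ..., E_n > e_n(a_i)), cf. Equation (7),
    estimated as the fraction of artifacts that are strictly worse on every metric."""
    return np.mean(np.all(E > E[i], axis=1))

# Toy example: 6 artifacts measured by 2 metrics.
E = np.array([[1, 10], [2, 12], [3, 30], [5, 25], [8, 40], [13, 45]], dtype=float)
S = ccdf_scores(E)                    # per-metric scores in [0, 1]
c = empirical_copula_value(E, 0)      # joint score of artifact 0
```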
   We consider a weight vector, where each w_i represents the relative importance of metric m_i compared to the others:

    w = [w_1, \dots, w_n]^T, \text{ where } \sum_{i=1}^{n} w_i = 1          (8)

   We compute weights using a non-linear exponential regression model for a sample of software artifacts, mapping the metrics scores of Equation (6) to the copula value of Equation (7). Note that these weights regard dependencies between software metrics. Finally, we define software metrics aggregation as a weighted composition of metrics score functions:

    F(s_1, \dots, s_n) = \prod_{j=1}^{n} s_j^{w_j}                          (9)
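Taking logarithms of Equation (9) gives log Cop ≈ Σ_j w_j log s_j, so the weights can be estimated by a regression in log space. The sketch below uses non-negative least squares followed by normalization to satisfy Equation (8); this approximates the non-linear exponential regression described above and is not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.optimize import nnls

def fit_weights(S, c):
    """Estimate weights w such that Cop ~ prod_j s_j^{w_j}, i.e.
    log c ~ sum_j w_j * log s_j, then normalize so the weights sum to one (Eq. 8).

    S: k x n matrix of per-metric scores (Eq. 6); c: length-k vector of copula values (Eq. 7)."""
    mask = (c > 0) & np.all(S > 0, axis=1)   # the logarithm is undefined at zero
    w, _ = nnls(np.log(S[mask]), np.log(c[mask]))
    return w / w.sum()

def weighted_product_score(S, w):
    """Aggregate the per-metric scores into one overall score per artifact, cf. Equation (9)."""
    return np.prod(S ** w, axis=1)
```

With the matrices from the previous sketch, c can be computed for every artifact via empirical_copula_value and the fitted weights plugged into Equation (9) to obtain one overall score per class.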
   We consider a software artifact a_l to be better than or equally good as another software artifact a_i if the total score of a_l according to Equation (2) is greater than or equal to the total score of a_i:

    a_l \succeq a_i \Leftrightarrow F(a_l) \geq F(a_i)                      (10)

   Aggregation is defined as a composition of the product, exponential, and CCDF functions, which are monotonic. Hence, the score obtained by aggregation allows us to rank the set A of software artifacts with respect to the metrics set M:

    \mathrm{Rank}(a_l) \leq \mathrm{Rank}(a_i) \Leftrightarrow F(a_l) \geq F(a_i)      (11)

   From a practical point of view, the probabilities can be calculated empirically, and each score can be obtained as the ratio of the number of software artifacts with a worse value for a given metric to the total number |A| of software artifacts.
   The proposed aggregation approach makes it possible to express the score for a software artifact as the probability of observing something with equal or worse metrics values, based on all software artifacts observed. Once the quality scores are computed, the software artifacts can trivially be ranked by ordering their scores. We assign the same rank to software artifacts whose total scores are equal. Low (high) ranks correspond to high (low) probabilities. This interpretation is the same on all levels of aggregation, from metrics scores to the total quality scores.
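As a small illustration of this ranking rule (artifacts with equal scores share a rank, and low ranks correspond to high scores), assuming the total scores produced by the previous sketches:

```python
import numpy as np
from scipy.stats import rankdata

def rank_artifacts(total_scores):
    """Competition ranking: higher total scores get lower (better) ranks,
    and artifacts with equal total scores share the same rank, cf. Equation (11)."""
    return rankdata(-np.asarray(total_scores), method="min").astype(int)
```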
3    Preliminary Evaluation

We apply our approach to assess Maintainability and compare the results with an aggregation approach based on a weighted linear combination of software metrics. We measure the difference between the rankings obtained by these approaches and study the agreement between the aggregated scores. Finally, we compare the approaches by means of sensitivity and the ability to detect extreme values and Pareto optimal solutions.
   In the following subsections, we investigate Java classes and their quality assessment using two research questions:

RQ1 How effective is our approach for quality assessment?

RQ2 How actionable is our approach by means of sensitivity and anomaly detection?

3.1    Quality Model Description

We consider a quality model for maintainability assessment of classes, which relies on well-known software metrics from the Chidamber & Kemerer [17] software metrics suite:

CBO, Coupling Between Objects
DIT, Depth of Inheritance Tree
LCOM, Lack of Cohesion in Methods
NOC, Number Of Children
RFC, Response For a Class
WMC, Weighted Method Count (using Cyclomatic Complexity as method weight)

3.2    Data Set Description

We chose to investigate three open-source software systems. The systems were chosen according to the following criteria: (i) they are written in Java, (ii) they are available on GitHub, (iii) they were forked at least once, (iv) they are sufficiently large (several tens of thousands of lines of code and several hundreds of classes), and (v) they have been under active development for several years. The projects we selected are three well-known and frequently used systems: JabRef^1, JUnit^2, and RxJava^3. Table 1 shows descriptive statistics for these systems.

   ^1 JabRef, Graphical Java application for managing BibTeX and biblatex databases, https://github.com/JabRef/jabref
   ^2 JUnit, A framework to write repeatable tests for the Java programming language, https://github.com/junit-team/junit5
   ^3 RxJava, Reactive Extensions for the JVM – a library for composing asynchronous and event-based programs using observable sequences for the Java VM, https://github.com/ReactiveX/RxJava

Table 1: Descriptive statistics of investigated systems

                              JabRef      JUnit     RxJava
Number of classes (NOC)        1 532      1 119      2 744
Lines of code (LOC)          136 039     44 082    378 987
Version                        4.3.1      5.3.2      3.0.0

3.3    Measures

The result of the aggregation is a maintainability score and a ranked list of software artifacts according to their maintainability score. To evaluate our approach, we compare it to a well-known approach considering the following measures:
   Correlation  We study Spearman's correlation [18] between the maintainability scores to assess the ordering, relative spacing, and possible functional dependency.
   Ranking distance  We measure the distance between the two rankings based on the Kendall tau distance, which counts the number of pairwise disagreements between two lists [19].
   Agreement  We measure the agreement between the maintainability scores using Bland-Altman statistics [20].
   To evaluate if the aggregated scores can be used to detect extreme values and Pareto optimal solutions, we consider the following measures:
   Sensitivity  We study the variety of values to understand the percentage of software artifacts that have the same maintainability score. The overall sensitivity is the ratio of unique scores to the number of software artifacts.
   Anomaly detection  We compare the approaches in terms of their ability to detect anomalies (extreme values and Pareto optimal solutions), using the ratio of the number of detected anomalies to the total number of anomalies in a sample data set.
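For two score vectors produced by the two aggregation approaches, these measures can be computed along the following lines; a minimal sketch using SciPy, where the normalized Kendall distance is derived from the tau coefficient (exact only in the absence of ties) and the Bland-Altman limits follow the usual mean ± 1.96 SD convention.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def comparison_measures(scores_a, scores_b):
    """Agreement measures between two maintainability score vectors (a sketch)."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    rho, _ = spearmanr(scores_a, scores_b)            # Spearman's correlation
    tau, _ = kendalltau(scores_a, scores_b)
    kendall_distance = (1.0 - tau) / 2.0              # fraction of discordant pairs (no ties)
    diff = scores_a - scores_b                        # Bland-Altman: differences between scores
    mean_diff, sd = diff.mean(), diff.std(ddof=1)
    limits = (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd)
    sensitivity = len(np.unique(scores_a)) / len(scores_a)   # ratio of unique scores
    return rho, kendall_distance, limits, sensitivity
```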

3.4    Preliminary Results and Analysis

We implemented all algorithms and statistical analyses in R.^4 The metrics data for the analysis was collected with VizzMaintenance.^5 We collected the metrics values for the classes of the JabRef, JUnit, and RxJava software systems (5 317 classes in total). We considered their package structure to group classes and applied the Kolmogorov-Smirnov statistical test [21] to select a subset for further statistical analysis, which was composed of 5 101 classes. Moreover, we consider the quality assessment of each system separately to study potential differences between software systems. We apply our aggregation approach (see Equation (12)) and compare the results with a weighted linear sum of metrics (see Equation (13)), which we normalized by the min-max transformation.

   ^4 The R Project for Statistical Computing, https://www.r-project.org
   ^5 VizzMaintenance, Eclipse plug-in, http://www.arisa.se/products.php

    s_{CBO}^{w_1} \times s_{DIT}^{w_2} \times s_{LCOM}^{w_3} \times s_{NOC}^{w_4} \times s_{RFC}^{w_5} \times s_{WMC}^{w_6}        (12)

    w_1 \times CBO + w_2 \times DIT + w_3 \times LCOM + w_4 \times NOC + w_5 \times RFC + w_6 \times WMC                            (13)
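A short sketch of the baseline in Equation (13): each metric column is min-max normalized before the weighted sum. The weight vector here is just a placeholder, not the weights used in the study.

```python
import numpy as np

def weighted_linear_score(E, w):
    """Baseline of Equation (13): min-max normalize each metric column,
    then compute the weighted linear sum."""
    E = np.asarray(E, dtype=float)
    col_min, col_max = E.min(axis=0), E.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return ((E - col_min) / span) @ np.asarray(w)

# Placeholder: equal weights over the six metrics CBO, DIT, LCOM, NOC, RFC, WMC.
w = np.full(6, 1 / 6)
```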
RQ1-effectiveness

We compare the approaches within each single software system and on the merged data set. First, we study the correlation between the aggregation results. Second, we rank the software classes based on the maintainability scores obtained by the two approaches. Table 2 shows the Kendall tau distance and Spearman's rho correlation. We observe a strong correlation between the maintainability scores and a low distance between the rankings.

Table 2: Agreement between approaches

          Correlation (Spearman)    Distance (Kendall)
JabRef                   0.93397               0.04829
jUnit                    0.98899               0.02483
RxJava                   0.96978               0.03083
Merged                   0.98953               0.03382

   Third, we study the agreement. In a Bland-Altman plot, each class is represented by a point with the average of the maintainability scores obtained by the two approaches as the x-value and the difference between these two scores as the y-value. The blue line represents the mean difference between the scores and the red lines the 95% confidence interval (mean ± 1.96 SD). We can observe that the plots for JabRef and RxJava have a similar shape (cf. Figure 1, Figure 3) compared to jUnit (cf. Figure 2). We observe a similar shape for the merged data set (cf. Figure 4), since in total JabRef and RxJava have almost four times more classes than jUnit. We can observe that in all plots the measurements are mostly concentrated near the blue line and only a few of them are outside of the red lines. The difference for jUnit is slightly smaller than for JabRef and RxJava. In sum, we conclude that the approaches agree, i.e., the aggregation results do not differ statistically and may be used interchangeably for the ranking of software classes.

Figure 1: Bland-Altman plot for JabRef
Figure 2: Bland-Altman plot for jUnit
Figure 3: Bland-Altman plot for RxJava
Figure 4: Bland-Altman plot for Merged data

RQ2-actionability

First, we study the variety of values for each metric and the number of extreme values, which we define by means of outliers. We detected 19 extreme values in total. In Table 3 we can observe that the metrics have quite low sensitivity; for each metric, 40 values on average are unique. We consider a multi-objective optimization problem based on the metrics, and we detect five possible Pareto optimal solutions, i.e., solutions for which none of the metrics values can be improved without degrading some of the other metrics values. Second, we study the sensitivity and the ability to detect anomalies (extreme values and Pareto optimal solutions) for both approaches. In Table 4 we can observe that our approach is more sensitive and more suitable for anomaly detection.

Table 3: Metrics variety of values

        Number of Extreme Values    Sensitivity
CBO                            3        0.00768
DIT                            7        0.00109
LCOM                           3        0.05746
NOC                            2        0.00365
RFC                            2        0.01848
WMC                            2        0.02086

Table 4: Comparison of approaches by actionability

                            Aggregation (Eq. 12)    Aggregation (Eq. 13)
Sensitivity                              0.41317                 0.31656
Extreme values                           0.94736                 0.63158
Pareto optimal solutions                       1                     0.6
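The anomaly ground truth described here, extreme values defined as outliers per metric and Pareto optimal classes under the multi-objective view, can be obtained along the following lines. The sketch uses an IQR-based outlier rule and a plain non-dominated-point test; it is our illustration of the idea, not the exact procedure of the study.

```python
import numpy as np

def extreme_values(col, k=1.5):
    """IQR-based outliers for one metric column (larger values indicate worse quality)."""
    q1, q3 = np.percentile(col, [25, 75])
    upper = q3 + k * (q3 - q1)
    return np.flatnonzero(col > upper)

def pareto_optimal(E):
    """Indices of non-dominated artifacts: no other artifact is at least as good on
    every metric and strictly better on at least one (smaller values are better here)."""
    E = np.asarray(E, dtype=float)
    keep = []
    for i in range(len(E)):
        dominated = np.any(np.all(E <= E[i], axis=1) & np.any(E < E[i], axis=1))
        if not dominated:
            keep.append(i)
    return keep
```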
3.5    Discussion

We define metric scores by means of probability, as this provides a simple interpretation of a quality score by means of the joint distribution. In contrast, quality scores obtained by a weighted linear combination of metrics do not provide a clear interpretation, especially when metrics are incomparable. We assume that larger metrics values indicate worse quality; however, both too small and too large values can be problematic for some of the software metrics. Note that this is not a limitation since we could transform metrics to have this property. We extracted the weights from the joint distribution, which we consider as a ground truth. This might be a threat to internal validity. We compare our approach only with a weighted linear combination of metrics; this might be a threat as well since we do not compare it with other approaches. In this preliminary evaluation, we consider six metrics, three software systems written in Java, and focus on maintainability. This might be a threat to external validity.

4    Conclusion and Future Work

In conclusion, we defined an automated aggregation approach for software quality assessment. We defined probabilistic scores based on software metrics distributions and aggregated them using a weighted product; we obtained the weights from the joint distribution. To evaluate the effectiveness and actionability of our approach, we conducted an empirical study on maintainability assessment. We collected the CBO, DIT, LCOM, NOC, RFC, and WMC metrics from the Chidamber & Kemerer metrics suite for the classes of the JabRef, JUnit, and RxJava software systems, and compared our approach with a weighted linear combination of metrics. The results showed that the approaches agree and can be used interchangeably for ranking software artifacts. However, our approach is more effective and actionable, i.e., it has a clear interpretation, higher sensitivity, and is better at detecting extreme values and Pareto optimal solutions.
   Our approach is mathematically well-defined, its generalization is not questionable, and it can be theoretically validated. For example, we can conduct simulation experiments to study the deviation between our and other approaches depending on the number of classes, the number of metrics, the levels of aggregation, etc. However, there is still a need for empirical validation of our approach. In the future, we plan to evaluate our approach on other data sets, such as the GitHub Java corpus, which contains around 15 000 software systems [22]. We also plan to compare our approach with the probabilistic approach of Bakota et al. [23].

Acknowledgments

We thank the anonymous reviewers whose comments and suggestions helped us improve and clarify this paper.
References

[1] Bogdan Vasilescu, Alexander Serebrenik, and Mark van den Brand. By no means: A study on aggregating software metrics. In Proceedings of the 2nd International Workshop on Emerging Trends in Software Metrics, pages 23–26. ACM, 2011.

[2] Rajesh Vasa, Markus Lumpe, Philip Branch, and Oscar Nierstrasz. Comparative analysis of evolving software systems using the Gini coefficient. In 2009 IEEE International Conference on Software Maintenance, pages 179–188. IEEE, 2009.

[3] Alexander Serebrenik and Mark van den Brand. Theil index for aggregation of software metrics values. In 2010 IEEE International Conference on Software Maintenance, pages 1–9. IEEE, 2010.

[4] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A practical model for measuring maintainability. In 6th International Conference on the Quality of Information and Communications Technology (QUATIC), pages 30–39. IEEE, 2007.

[5] José Pedro Correia and Joost Visser. Certification of technical quality of software products. In Proceedings of the International Workshop on Foundations and Techniques for Open Source Software Certification, pages 35–51, 2008.

[6] Tiago L. Alves, José Pedro Correia, and Joost Visser. Benchmark-based aggregation of metrics to ratings. In 2011 Joint Conference of the 21st International Workshop on Software Measurement and the 6th International Conference on Software Process and Product Measurement, pages 20–29. IEEE, 2011.

[7] Paloma Oliveira, Fernando P. Lima, Marco Tulio Valente, and Alexander Serebrenik. RTTool: A tool for extracting relative thresholds for source code metrics. In 2014 IEEE International Conference on Software Maintenance and Evolution, pages 629–632. IEEE, 2014.

[8] Kazuhiro Yamashita, Changyun Huang, Meiyappan Nagappan, Yasutaka Kamei, Audris Mockus, Ahmed E. Hassan, and Naoyasu Ubayashi. Thresholds for size and complexity metrics: A case study from the perspective of defect density. In 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), pages 191–201. IEEE, 2016.

[9] Karine Mordal, Nicolas Anquetil, Jannik Laval, Alexander Serebrenik, Bogdan Vasilescu, and Stéphane Ducasse. Software quality metrics aggregation in industry. Journal of Software: Evolution and Process, 25(10):1117–1135, 2013.

[10] Jagdish Bansiya and Carl G. Davis. A hierarchical model for object-oriented design quality assessment. IEEE Transactions on Software Engineering, 28(1):4–17, 2002.

[11] Stefan Wagner, Andreas Goeb, Lars Heinemann, Michael Kläs, Constanza Lampasona, Klaus Lochmann, Alois Mayr, Reinhold Plösch, Andreas Seidl, Jonathan Streit, et al. Operationalised product quality models and assessment: The Quamoco approach. Information and Software Technology, 62:101–123, 2015.

[12] Robert Baggen, José Pedro Correia, Katrin Schill, and Joost Visser. Standardized code quality benchmarking for improving software maintainability. Software Quality Journal, 20(2):287–307, 2012.

[13] Jean-Louis Letouzey and Thierry Coq. The SQALE analysis model: An analysis model compliant with the representation condition for assessing the quality of software source code. In 2010 Second International Conference on Advances in System Testing and Validation Lifecycle (VALID), pages 43–48. IEEE, 2010.

[14] K. Mordal-Manet, F. Balmas, S. Denier, S. Ducasse, H. Wertz, J. Laval, F. Bellingard, and P. Vaillergues. The Squale model: a practice-based industrial quality model. In 2009 IEEE International Conference on Software Maintenance (ICSM), pages 531–534. IEEE, 2009.

[15] Maria Ulan, Sebastian Hönel, Rafael M. Martins, Morgan Ericsson, Welf Löwe, Anna Wingkvist, and Andreas Kerren. Quality models inside out: Interactive visualization of software metrics by means of joint probabilities. In 2018 IEEE Working Conference on Software Visualization (VISSOFT), pages 65–75. IEEE, 2018.

[16] Roger B. Nelsen. An Introduction to Copulas. Springer Science & Business Media, 2007.

[17] Shyam R. Chidamber and Chris F. Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.

[18] C. Spearman. General intelligence, objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.

[19] Maurice Kendall. Rank Correlation Methods. Griffin, 1948.

[20] J. Martin Bland and Douglas Altman. Measuring agreement in method comparison studies. Statistical Methods in Medical Research, 8(2):135–160, 1999.

[21] Myles Hollander and Douglas A. Wolfe. Nonparametric Statistical Methods. Wiley-Interscience, 1999.

[22] Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories, pages 207–216. IEEE Press, 2013.

[23] Tibor Bakota, Péter Hegedűs, Péter Körtvélyesi, Rudolf Ferenc, and Tibor Gyimóthy. A probabilistic software quality model. In 2011 27th IEEE International Conference on Software Maintenance (ICSM), pages 243–252. IEEE, 2011.