                   6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018)



What do we know about software security evaluation? A preliminary study

Séverine Sentilles, School of Innovation, Design and Engineering, Mälardalen University, Västerås, Sweden (severine.sentilles@mdh.se)
Efi Papatheocharous, RISE Research Institutes of Sweden, ICT SICS, Stockholm, Sweden (efi.papatheocharous@ri.se)
Federico Ciccozzi, School of Innovation, Design and Engineering, Mälardalen University, Västerås, Sweden (federico.ciccozzi@mdh.se)



Abstract—In software development, software quality is nowadays acknowledged to be as important as software functionality, and there exists an extensive body-of-knowledge on the topic. Yet, software quality is still marginalized in practice: there is no consensus on what software quality exactly is, or on how it is achieved and evaluated. This work investigates the state-of-the-art of software quality by focusing on the description of evaluation methods for a subset of software qualities, namely those related to software security. The main finding of this paper is the lack of information regarding fundamental aspects that ought to be specified in an evaluation method description. This work follows up the authors’ previous work on the Property Model Ontology by carrying out a systematic investigation of the state-of-the-art on evaluation methods for software security. Results show that only 25% of the papers studied provide enough information on the security evaluation methods they use in their validation processes, whereas the rest of the papers lack important information about various aspects of the methods (e.g., benchmarking and comparison to other properties, parameters, applicability criteria, assumptions and available implementations). This is a major hindrance to their further use.

Index Terms—Software security, software quality evaluation, systematic review, property model ontology.

I. INTRODUCTION

Software quality measurement quantifies to what extent a software complies with or conforms to a specific set of desired requirements or specifications. Typically, these are classified as: (i) functional requirements, pertaining to what the software delivers, and (ii) non-functional requirements, reflecting how well it performs according to the specifications. While there exists a plethora of options for functional requirements, non-functional requirements have been studied extensively and classified through standards and models (e.g., ISO/IEC 9126 [1], ISO/IEC 25010 [2] and ISO/IEC 25000 [3]). They are often referred to as extra-functional properties, non-functional properties, quality properties, quality of service, product quality or simply metrics.

Despite the classification and terminology accompanying these properties, their evaluation and the extent to which a software system performs well under specific circumstances are both hard to quantify without substantial knowledge of the particular context, concurring measurements and measurement methods. In practice, quality assessment is still marginalized: there is no consensus on what software quality exactly is, or on how it is achieved and evaluated.

In our research we investigate to what extent properties and their evaluation methods are explicitly defined in the existing literature. We are interested in properties such as reliability, efficiency, security, and maintainability. In our previous work [4], we investigated which properties are most often mentioned in the literature and how they are defined in a safety-critical context. We found that the most prominent properties were cost, performance and reliability/safety, albeit not always well-defined by the authors: divergent or non-existent definitions of the properties were commonly observed.

The target of this work is to investigate evaluation methods related to software security, which is a critical extra-functional property directly and significantly impacting different development stages (i.e., requirements, design, implementation and testing). Security is defined by McGraw [5] as “engineering software so that it continues to function correctly under malicious attack”. The high diversity of components and the complexity of current systems make it almost impossible to identify, assess and address all possible attacks and aspects of security vulnerabilities.

More specifically, in this paper we want to answer the following research question: “What percentage of evaluation method descriptions contains pertinent information to facilitate its use?”. For this, we identified a set of papers representing the state-of-the-art from the literature in the field and we assessed the degree of explicit information about evaluation methods. The set of papers was analyzed to identify definition and knowledge representation issues with respect to security, and to answer the following three sub-questions:
• RQ1: What proportion of properties is explicitly defined?
• RQ2: What proportion of evaluation methods provide explicit key information to enable their use (e.g., formula, description)?
• RQ3: What proportion of other supporting elements of the evaluation methods is explicit (e.g., assumptions, available implementations)?
We used a systematic mapping methodology [6], started from 63 papers and selected 24 papers, which we then analyzed further. After thorough analysis, we excluded eight papers, resulting in a final set of 16 papers, which we used for synthesizing and reporting our results. The final set of papers is listed in Table V.
The main contribution of this paper is represented by the identification of the following aspects:
• the relation between evaluation methods and a generic property definition (i.e., whether it exists explicitly and how it is described),
• which aspects of other supporting elements that improve the understandability and applicability of the method (i.e., assumptions, applicability, output, etc.) are explicitly mentioned, and
• the missing information regarding fundamental aspects which ought to be specified in a property and method description.

The remainder of the paper is structured as follows. In Section II we introduce the related work on the topic, while in Section III we describe our methodology. Section IV summarizes quantitative and qualitative interpretations of the extracted data, while Section V reports the main results. Section VI discusses the threats to validity and Section VII concludes the paper and delineates possible future directions.

II. RELATED WORK

An extensive body-of-knowledge already exists on software security, and many studies have investigated security assessment. For example, the U.S. National Institute of Standards and Technology (NIST) in [7] developed a performance measurement guide for information security, describing in particular how an organization can, through the use of metrics, identify the adequacy of controls, policies and procedures. The approach is mostly focused on the level of technical security controls in the organization rather than on the technical security level of specific products.

In [8], a taxonomy for information security-oriented metrics is proposed in alignment with common business goals (e.g., cost-benefit analysis, business collaboration, risk analysis, information security management, and security, dependability and trust for ICT products, systems and services), covering both organizational and product levels.

Verendel [9] quantitatively analyzed information security based on a set of articles published between 1981 and 2008, concluding that theoretical methods are difficult to apply in practice and that experimental repeatability is limited. A recent literature survey [10] on ontologies and taxonomies of security assessment identified, among others, a gap in works addressing research issues such as knowledge reuse, automatic processes, increasing the assessment coverage, defining security standards and measuring security.

Morrison et al. [11] carried out a systematic mapping study on software security metrics to create a catalogue of metrics, their subject of measurement, validation methods and mappings based on the software development life cycle (SDLC). A major problem emerging from the vast catalogue of metrics and definitions collected is that the metrics are extremely hard to compare, as they typically measure different aspects of the property (something obvious from the emergent categories proposed in the paper). Thus, there is no agreement on their degree of assessment coverage or on how to achieve their evaluation. Moreover, in their work, the generic property of security, including its definition, relation to a reference and a detailed explanation of how to achieve assessment, is not addressed.

Therefore, we decided to investigate the literature thoroughly and to quantify the quality of the descriptions of security evaluation methods. Our approach differs from the work of Morrison et al. [11] in that we are not interested in collecting metrics and their mapping to the SDLC. Instead, we examine evidence on properties’ explicit definitions, their evaluation methods and parameters (including explanations of how they are used) for assessing security. We develop a mapping from an applicability perspective of the evaluations, which may be used by researchers and practitioners at different phases of security evaluations, i.e., at the low-level internal or at the external (often called operational) phase of software engineering.

III. METHODOLOGY

We base our work on the mapping study previously conducted by Morrison et al. in [11]. The reasoning is that the basis of that work is well aligned with and applicable to answering our RQs. The details of the methodology are explained in the following.

A. Search and selection strategy

The data sources used in [11] are the online databases, conference proceedings and academic journals of ACM Digital Library, IEEE Xplore and Elsevier. The search terms include the keywords: “software”, “security”, “measure”, “metric” and “validation”. Considering another 31 synonyms, the following search string was used:

“(security OR vulnerability) AND (metric OR measure OR measurement OR indicator OR attribute OR property)”,

in which the terms are successively replaced with the identified synonyms.
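To make the successive-replacement procedure concrete, the following minimal sketch (ours, for illustration only; the 31 synonyms from [11] are not listed here) shows how such boolean search strings can be assembled from groups of interchangeable terms:

# Python sketch: build the boolean search string described above.
# The term lists below are examples, not the full set from the study.
term_groups = [
    ["security", "vulnerability"],
    ["metric", "measure", "measurement",
     "indicator", "attribute", "property"],
]

def build_query(groups):
    """Join each group of interchangeable terms with OR and the
    groups with AND, yielding one concrete search string."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)

print(build_query(term_groups))
# -> (security OR vulnerability) AND (metric OR measure OR
#    measurement OR indicator OR attribute OR property)

Replacing terms inside the groups with their synonyms and rebuilding the string reproduces the successive replacement described above.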
The selection criteria of Morrison et al. include the sets of inclusion and exclusion items listed below.

Inclusion criteria:
• Paper is primarily related to measuring software security in the software development process and/or its artifacts, for example software artifacts (e.g., source code files, binaries), software process (e.g., requirements phase, design, implementation, testing), and/or software process artifacts (e.g., design and functional specifications).
• Measurements and/or metrics are the main paper subject.
• Refereed paper.
• Paper published since 2000.

Exclusion criteria:
• Related to sensors.
• Related to identity, anonymity, privacy.
• Related to forgery and/or biometrics.
• Related to network security (or vehicles).





• Related to encryption.
• Related to database security.
• Related to imagery, audio, or video.
• Specific to single programming languages.

As explained above, the search string, search strategy, and inclusion and exclusion criteria defined by Morrison et al. are applicable to our work. Thus, we reuse the list of selected papers from their mapping study as the initial set for our study. However, as we wanted to investigate the applicability and quality of the security evaluation methods found, we extended the selection and data extraction methodology to isolate the subset of selected papers that is considered most relevant. In order to identify the subset, we performed the following actions:
• Define new exclusion criteria
• Design a new selection process
• Define a new data collection scheme
• Extract the data
• Interpret data and synthesize results

B. New exclusion criteria

We decided to exclude sources according to the following criteria:
• Full-text is not available
• Sources are Ph.D. dissertations or books
• Sources are not industrially validated or do not use well-known software, tools or platforms for their validations
• Sources are model predictions that do not assess any property

C. New selection process

The selection is performed on the list of 63 selected papers from [11] by three independent researchers by looking at the title and abstract and by quickly reading the content. Papers independently selected by all researchers are included. Similarly, papers which are discarded by all researchers are excluded. For papers on which there is a disagreement, a discussion is held between the involved researchers to unanimously decide whether to include or exclude the paper. In one case (p18), an extended version of the paper was found and used instead (i.e., we used [12] instead of [13]). This selection process resulted in 24 included papers and 39 rejected ones. Furthermore, 8 papers were later excluded during the data extraction as they did not contain useful information or were not of sufficient quality. Hence, our selection set is based on the 16 papers listed in Table V.
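The decision rule of this selection process can be summarized programmatically. The sketch below is merely an illustration of the rule (no such tool was part of the study):

# Python sketch of the three-researcher selection rule.
def selection_decision(votes):
    """votes: one boolean per researcher (True = include).
    Unanimous inclusion includes the paper, unanimous rejection
    excludes it; any disagreement triggers a discussion."""
    votes = list(votes)
    if all(votes):
        return "include"
    if not any(votes):
        return "exclude"
    return "discuss"

print(selection_decision([True, True, True]))    # include
print(selection_decision([False, False, False])) # exclude
print(selection_decision([True, False, True]))   # discuss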
D. Data collection scheme

As support for answering the RQs, we created a form based on Excel spreadsheets. The form consists of the questions specified in Table I, with the list of possible values for each question. The questions are derived from our previous work on the Property Model Ontology (PMO), which formally specifies which concepts should be described for properties and their evaluation methods, and how these concepts relate to one another. For details on the PMO, the reader is directed to [4].

E. Data extraction and interpretation

The data extraction is carried out by the researchers using independent spreadsheet columns (one for each paper). The data extraction for each paper is reviewed by another researcher (i.e., a reviewer) and discussions are carried out until an agreement is reached. The data is then interpreted and analyzed by all researchers together, as explained in the next section.

IV. DATA INTERPRETATION

A. Quantitative analysis

One interesting, albeit not very surprising, fact is that most of the descriptions contained in the papers explicitly state what is supposed to be calculated (i.e., the property and the output) and how (i.e., through the overall approach description, the parameters involved in the approach, and to some extent the advantages and disadvantages). As visible in Table II, which shows the results of the quantitative analysis, the property is explicitly referred to in 87.5% of the papers, the output in 75%, the approach description in 87.5%, the parameters in 81%, the advantages in 62.5% and the disadvantages in 69%.

No method explicitly specifies the unit to be used for the output. This might be explained by the fact that software security is a rather recent research field and there is currently no widely acknowledged metric with corresponding units to be used. As a result, most methods typically fall back on standard data types for which units are less relevant (e.g., percentage, probability, item counts).

Applicability criteria are rarely explicitly mentioned: only in 19% of the papers. When they are mentioned, it is not obvious whether the set of provided criteria is complete, or at least sufficient to ensure the applicability of the evaluation method.

Only half of the papers clearly specify the assumptions and hypotheses that are assumed to hold when using the method. This is a setback, as these aspects are important for correctly applying a method.

Drivers are rarely explicitly mentioned: only in 12.5% of the papers. Drivers can be important as they can implicitly affect the output of a method.

Despite tools often being mentioned in the papers as part of the solution being implemented, in practice only a few implementations or programs are available to directly support the method evaluation. We could find mentions of available tool support in only 2 out of the 16 papers (i.e., the WEKA toolkit and Fortify Source Code Analyzer (SCA) version 5.10.0.0102, in p43 and p47). A few additional papers (4) refer to available tool support, but no explicit link or reference is provided.

Out of the 16 analyzed papers, only 1 (i.e., p60) performed some kind of benchmarking or comparison to similar properties or evaluation methods. This corresponds to only 6% of all the papers.

To a large extent, the information provided in the papers is insufficient to directly apply the method, especially by non-experts (in 12 papers, 75%).




TABLE I: Data collection form

      Data to collect                                                                                   Possible values
q1    PaperId                                                                                           Px, with x a number
q2    Method name                                                                                       Free text
q3    Does the evaluation method provide an explicit reference to a property or a metric?               Yes | no
q4    If yes on q3, how?                                                                                Name + def. + ref. | name + def. | name + ref. | other
q5    If other on q4, how?                                                                              Free text
q6    If yes on q3, which one? Give the example                                                         Free text
q7    If yes on q3, does the reference match what is needed for the evaluation method?                  Yes | no | I don’t know
q8    Does the evaluation method explicitly state the output of the method?                             Yes | no
q9    If yes on q8, how?                                                                                With data format and unit | with data only | other
q10   If other on q9, how?                                                                              Free text
q11   If yes on q8, which one? Give the example                                                         Free text
q14   Does the evaluation method explicitly state applicability criteria?                               Yes | no
q15   If yes on q14, which ones?                                                                        Free text
q16   Does the evaluation method explicitly state how the property is evaluated?                        Yes | no | I don’t know
q17   If yes on q16, how?                                                                               Free text
q18   Does the evaluation method explicitly explain the parameters involved in the evaluation?          Yes | no
q19   If yes on q18, which parameters (with their explanations)?                                        Free text
q25   Does the evaluation method explicitly describe additional drivers that might affect the evaluation?   Yes | no
q26   If yes on q25, which ones?                                                                        Free text
q27   Does the evaluation method explicitly state the assumptions and hypotheses on which the method is based?   Yes | no
q28   If yes on q27, which ones?                                                                        Free text
q29   Does the evaluation method explicitly mention advantages of the method?                           Yes | no
q30   If yes on q29, which ones?                                                                        Free text
q31   Does the evaluation method explicitly mention disadvantages of the method?                        Yes | no
q32   If yes on q31, which ones?                                                                        Free text
q33   Does the evaluation method explicitly reference an implementation or a program that can be used?  Yes | no
q34   If yes on q33, which ones?                                                                        Free text
q35   If yes on q33, is the implementation or program accessible?                                       Yes | no
q36   Is there any comparison to other properties (e.g., benchmark, validation)?                        Yes | no
q37   If yes on q36, how?                                                                               Free text
q38   Additional comments                                                                               Free text
q39   Is the information provided in the paper sufficient to understand the method and directly use it?   Yes | no
q40   Extractor’s name                                                                                  Séverine | Efi | Federico
q41   Reviewer’s name                                                                                   Séverine | Efi | Federico
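The yes/no questions in this form translate directly into the percentages reported in Table II. The following sketch illustrates that aggregation, assuming the extraction spreadsheet is exported as one dictionary per paper (the rows shown are invented, not actual extraction data):

# Python sketch: aggregate yes/no form answers into percentages.
from collections import Counter

extractions = [            # one dict per analyzed paper (invented data)
    {"q3": "yes", "q8": "yes", "q14": "no"},
    {"q3": "yes", "q8": "no",  "q14": "no"},
]

def yes_no_percentages(rows, question):
    """Count 'yes'/'no' answers for one question and convert the
    counts into percentages over all analyzed papers."""
    counts = Counter(row[question] for row in rows)
    total = len(rows)
    return {answer: (n, round(100.0 * n / total, 1))
            for answer, n in counts.items()}

print(yes_no_percentages(extractions, "q8"))
# -> {'yes': (1, 50.0), 'no': (1, 50.0)}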



B. Qualitative analysis

From analyzing the answers to the questions in Table II, the papers can be categorized into three groups based on their main purpose. The first group focuses on defining a new property or metric to assess some specific software security aspects (p1, p18, p27, p47, p63). The papers belonging to the second group (p15, p25, p37, p51, p60) base their work on already defined properties; their main objective is to either find ways to combine existing metrics to evaluate a given security aspect, define a new evaluation method for existing properties, or validate previously specified methods by applying them to a specific system. The last group aims at finding correlations between already defined properties or at performing predictions to evaluate the accuracy of previously defined models (p38, p43, p53, p55, p59). There are papers that do not belong to these groups, as they do not explicitly refer to any property, method or metric (p22, p50). This is shown in the answers to the questions on whether the paper explicitly refers to a property and, if yes, how (i.e., q3 and q4): papers belonging to the first group use “definition without reference”, papers from the second group use “definition plus reference” and papers from the last group use “other”.

From the data extracted, 34 definitions of and references on security are collected from the papers, as listed in Table IV. We categorized the collected properties into two different levels, depending on their degree of specificity (less coarse-grained). A few papers shared some definitions (e.g., p25, p59, p63), whereas most papers defined their own security aspect to evaluate. The security aspect expressed (even if in some cases explained in much detail) is only able to capture some facets of the property; thus it is hard to know whether the property definitions, together with their evaluations, are enough to assess security in a rational way.

Related to the descriptions of the properties and evaluations extracted, we captured the level of applicability between them as described in the papers: most commonly the deployment and operation phase (post-release), the implementation phase (development or code level, including testing) and the design phase, as shown in Table III. The level at which security evaluation is carried out varies among the papers: 7 of them (44%) apply the evaluation method at the implementation (code-level) phase, 1 paper at the testing phase, 6 papers (37.5%) at the operational phase, 1 paper during maintenance, and 4 papers (25%) at the design level.
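The grouping described at the beginning of this subsection follows mechanically from the q4 answers. A minimal illustrative mapping (our restatement, not tooling from the study):

# Python sketch: map q4 answer styles to the three paper groups.
GROUPS = {
    "definition without reference": "group 1: defines a new property or metric",
    "definition plus reference": "group 2: builds on already defined properties",
    "other": "group 3: correlation or prediction studies",
}

def paper_group(q4_answer):
    """Return the group for a paper's q4 answer; papers with no
    explicit property reference (e.g., p22, p50) fall outside."""
    return GROUPS.get(q4_answer, "no explicit property reference")

print(paper_group("definition plus reference"))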




TABLE II: Quantitative results

Question                                  Count of ‘Yes’ (%)   Count of ‘No’ (%)
q3: Explicit property                     14 (87.5%)           2 (12.5%)
q8: Explicit method output                12 (75%)             4 (25%)
q14: Explicit applicability criteria      3 (19%)              13 (81%)
q16: Explicit description                 14 (87.5%)           2 (12.5%)
q18: Explicit method parameters           13 (81%)             3 (19%)
q25: Explicit drivers                     2 (12.5%)            14 (87.5%)
q27: Explicit assumptions                 8 (50%)              8 (50%)
q29: Explicit advantages                  10 (62.5%)           6 (37.5%)
q31: Explicit disadvantages               11 (69%)             5 (31%)
q33: Explicit tool reference              6 (37.5%)            10 (62.5%)
q35: Tool accessibility                   2 (12.5%)            4 (25%)
q36: Benchmark                            1 (6%)               15 (94%)
q39: Information is sufficient            4 (25%)              12 (75%)

TABLE III: Security level applicability

pID   Applicability level                              Analysis level*
p1    system architecture - design - implementation    internal
p15   implementation                                   internal
p18   operational                                      external
p22   operational - system architecture - design       int. & ext.
p25   operational                                      int. & ext.
p27   operational                                      external
p37   implementation                                   internal
p38   design                                           internal
p43   design - implementation                          internal
p47   operational                                      external
p50   operational                                      external
p51   implementation                                   internal
p53   maintenance                                      external
p59   testing                                          external
p60   detailed design - implementation                 internal
p63   implementation                                   internal
* Based on the definitions from [1] on internal and external metrics for product quality.

As mentioned above, no units are explicitly specified. However, the output of the evaluation methods falls back onto implicit scales of measurement: nominal (p60), ordinal (p1, p18, p37, p47, p50, p60, p63), interval (p53) and ratio (p18, p25, p43, p60).

Regarding advantages of the methods, the most commonly reported ones are: reliability of the method (p1, p15, p25), simplicity of the method (p1, p18, p25), accuracy of the method (p18, p43) and objectivity of the results (p47, p60). The disadvantages mostly refer to the accuracy of the results being dependent on the quality of the available data (p18, p25, p37, p43) and to the subjectivity involved in the method (p22, p37, p63).

V. RESULTS AND DISCUSSIONS

As an answer to our initial research question, “What percentage of evaluation method descriptions contains pertinent information to facilitate its use?”, only 25% of the papers that we investigated were judged to provide enough information to enable a direct application of the method. Overall, the papers were good at explicitly stating the property under study, describing how the property is evaluated and which parameters are involved. On the other hand, in several cases the papers lacked information regarding the unit representing the value of the property, the applicability criteria for the method as well as its assumptions, possible advantages and disadvantages, eventual openly available tool support, and comparison to other properties.

Security being a relatively new area in software engineering can be a major contributing factor to these results. For example, advantages, disadvantages and applicability criteria are difficult to identify initially: they require time and perspective on the topic, as well as the methods having been used over time in a wide range of applications and scenarios. Given that many of the papers included in our set focused on defining new properties for security, it is not so surprising that so few papers mentioned those aspects. Furthermore, following the same reasoning, even where those criteria were explicitly stated in the papers, it is not certain that the proposed lists of advantages, disadvantages and applicability criteria are exhaustive, or at least sufficient to guarantee the proper usage of the evaluation method they accompany.

When comparing to the results in [11], we found fewer metrics and definitions. This is due to the fact that our purpose was to identify the main property (or properties) of each paper and its corresponding evaluation methods. Some of the metrics identified in [11] are classified in our results as parameters. Others have been ignored, as they were used for comparison purposes and are therefore not as relevant for our work. However, the conclusions by Morrison et al. still hold and are further supported by our results. Security properties are not mature, most of them have been proposed and evaluated solely by their authors, and few comparisons between already defined properties exist: “Despite the abundance of metrics found in the literature, those available give us an incomplete, disjointed, hazy view of software security.”

These results are to be put in perspective, since none of the authors of this paper is an expert in software security and the number of papers studied in this work is small. However, we are confident that the observations resulting from this study are good indications of the issues occurring with property and evaluation method descriptions in software security.

A by-product of our analysis is the following interesting aspect. Especially for the works assessing software vulnerability, evaluation methods exploit well-established prediction models that leverage a discernible set of software metrics (related to the code itself, developer activity, versioning data, etc.). A seemingly common step in the definition of security-related evaluation methods, especially when dealing with vulnerability, is the comparison of existing prediction models, or the comparison of a newly defined prediction model with a set of existing ones, in order to identify the “best” model to use as a basis for evaluation purposes. Another interesting aspect is that, in several papers, the authors exercise their reasoning and methods on well-known codebases (e.g., Windows Vista, Mozilla); this shows an interesting, strong inclination towards assessing the applicability of research (theoretical) results to practical cases, which is too seldom seen in other research branches within software engineering.




TABLE IV: Collection of properties or coarse-grained metrics explicitly defined in the papers¹
(For each paper: the generic/coarse-grained definition, followed by the less generic definitions, if any.)

p1    Security to mean control over data confidentiality [...] and data integrity [...]¹
      • Total Security Index (TSI): the sum of the security design principle metrics
p15   Vulnerability as a weakness in a software system that allows an attacker to use the system for a malicious purpose
p18   End-user software vulnerability exposure as a combination of lifespans and vulnerability announcement rates
      • Median Active Vulnerabilities (MAV): the median number of software vulnerabilities which are known to the vendor of a particular piece of software but for which no patch has been publicly released by the vendor
      • Vulnerability Free Days (VFD): captures the probability that a given day has exactly zero active vulnerabilities
p25   Vulnerability (defined in [14])
p27   Operational security as a representation, as accurate as possible, of the security of the system in operation, i.e., its ability to resist possible attacks or, equivalently, the difficulty for an attacker to exploit the vulnerabilities present in the system and defeat the security objectives
      • Mean Effort To security Failure (METF): mean effort for a potential attacker to reach the specified target
p37   Security through an attack surface metric
      • Attack surface metric: a measure of a system’s attack surface along three dimensions, estimating the total contribution of the methods, the total contribution of the channels, and the total contribution of the data items to the system’s attack surface [...]¹
p38   Security through the number of violations of the least privilege principle and surface metrics; maintainability as coupling between components and components instability
      • LP principle metric (defined in [15])
      • Attack surface metric (defined in [16])
      • Coupling between components (CBM) (defined in [17])
      • Components instability (CI) (defined in [18])
p43   Vulnerability through dependency graphs (*)
p47   Vulnerability through static analysis (*)
      • Static-analysis vulnerability density (SAVD): number of vulnerabilities a static-analysis tool finds per thousand LOC
      • Density of post-release reported vulnerabilities (NVD) (*)
      • Application vulnerability rank (SAVI) (*)
p50   Vulnerability as flaws or weaknesses in a system’s design, implementation, or operation and management that could be exploited to violate the system’s security policy. Any flaw or weakness in an information system could be exploited to gain unauthorized access to, damage or compromise the information system.
p51   Application security (defined in [5]); software security (defined in [5]); security at program level (defined in [19])
      • Stall Ratio (SR): a measure of how much a program’s progress is impeded by frivolous activities
      • Coupling Corruption Propagation (CCP): the number of child methods invoked with the parameter(s) based on the parameter(s) of the original invocation
      • Critical Element Ratio (CER) (*)
p53   Security through vulnerability density (*)
      • Static-analysis vulnerability density (SAVD) (see p47)
      • Security Resources Indicator (SRI): the sum of four indicator items, ranging from 0 to 4. The items are: the documentation of the security implications of configuring and installing the application, a dedicated e-mail alias to report security problems, a list or database of security vulnerabilities specific to the application, and the documentation of secure development practices, such as coding standards or techniques to avoid common secure programming errors.
p59   Vulnerability (defined in [14])
p60   Software vulnerability (defined in [19])
      • Structural severity: uses software attributes to evaluate the risk of an attacker reaching a vulnerability location from attack surface entry points [...]¹
      • Attack Surface Entry Points (defined in [20])
      • Reachability Analysis (*)
p63   Software vulnerability (defined in [14])
      • Vulnerability-Contributing Commits (VCCs): original coding mistakes or commits in the version control repository that contributed to the introduction of a post-release vulnerability

¹ The definitions have been shortened to comply with the page limitation. Readers are referred to the original papers for the complete definitions.
(*) No explicit definition is found.
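Several of the simpler metrics in Table IV can be read directly as formulas. As an illustration only (the cited papers define the exact procedures and data sources), SAVD (p47, p53), SRI (p53) and MAV/VFD (p18) could be computed as follows, assuming the counts are already available:

# Python sketch of four metrics from Table IV (illustrative only).
import statistics

def savd(vulns_found, lines_of_code):
    """Static-analysis vulnerability density (p47, p53): reported
    vulnerabilities per thousand lines of code."""
    return vulns_found / (lines_of_code / 1000.0)

def sri(docs, alias, vuln_list, secure_dev_docs):
    """Security Resources Indicator (p53): sum of four binary
    indicator items, ranging from 0 to 4."""
    return sum([docs, alias, vuln_list, secure_dev_docs])

def mav(daily_active_vulns):
    """Median Active Vulnerabilities (p18): median over days of the
    number of known-but-unpatched vulnerabilities."""
    return statistics.median(daily_active_vulns)

def vfd(daily_active_vulns):
    """Vulnerability Free Days (p18): fraction of days with exactly
    zero active vulnerabilities."""
    return sum(1 for n in daily_active_vulns if n == 0) / len(daily_active_vulns)

# Invented example: 42 findings in 120 kLOC, two SRI items present,
# and a five-day window of active-vulnerability counts.
print(savd(42, 120_000))              # 0.35 per kLOC
print(sri(True, True, False, False))  # 2
days = [0, 1, 2, 0, 1]
print(mav(days), vfd(days))           # 1 0.4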



VI. THREATS TO VALIDITY

Construct validity relates to what extent the phenomenon under study represents what the researchers wanted to investigate and what is specified by the research questions. We explicitly defined the context of the work and discussed related terms and concepts. Also, the research questions were formulated based on these clarified notions, and the research followed a methodological procedure known as systematic mapping.

The selection of papers was based on the work of Morrison et al. [11]; thus we inherit that work’s limitations. Therefore, papers not listed in the sources used, i.e., ACM, IEEE and Elsevier, have been missed. In addition, we excluded Ph.D. dissertations and books, since we limited our selection to peer-reviewed conference and scholarly journal publications only.

The way the involved researchers individually interpreted the methods described in the papers reflects their own biases and views. However, we worked hard to reduce this bias by discussing the content of the papers in pairs to dissolve uncertainties. In several cases, where the decision whether to include a paper was disputed, a third researcher was involved to reach a consensus.




Morrison et al. [11] excluded sources dealing with network, sensor, database and vehicle security, and this limited the set of papers analyzed in our study. However, this opens up opportunities for further research targeting these particular types of systems.

External validity concerns to what extent the findings are generalizable. Due to the focus on the security aspect and the small selection of papers, one should avoid generalizing the results to other properties such as reliability and safety. However, despite the limitations of our study (i.e., low number of papers, one method per paper, no snowballing to find complementary information), our conclusions are representative of the current issues in software security. Furthermore, other works in the state-of-the-art on software quality share similar conclusions to Morrison et al. in [11], which points towards the applicability of our results to other properties.

VII. CONCLUSIONS

Despite the low number of papers this work is based on, we believe it is, to some extent, representative of the current issues in the state-of-the-art on software quality in general and security in particular: a number of useful properties and methods to evaluate them do exist today. However, it is difficult to know about them and to understand whether they are applicable in a given context and, if they are, how to use them. This can be attributed to the lack of information in the descriptions of their evaluation methods. This hampers the activities towards better quality assurance in software engineering, and it limits at the same time the application of these activities or newly specified methods in industrial settings. For example, knowing when to apply a method (e.g., applicability at design time, at run-time, etc.) restricts which methods can be used in a given context. Knowing the advantages and disadvantages makes it possible to trade off available methods and limits selection bias. Older, more established or traditional software quality fields provide more reference properties and methods to systematically compare to; however, the level of detail and the quality of the given information are still relatively low.

As future work, we plan to expand the selection of papers to include those from the references in the analyzed papers, so as to investigate whether our conclusions still hold. Similarly, we will explore the quality of assessments of other quality properties in the literature, such as reliability, safety and maintainability. The results of these studies will be included in PROMOpedia [21], an online encyclopedia of software properties and their evaluation methods. Lastly, we plan to propose an improved and validated ontology to express several critical and time-sensitive properties in a more systematic (in terms of consistency) and formal (in terms of codification) way, and to approach a better trade-off support between the properties.

ACKNOWLEDGMENTS

The work is supported by a research grant for the ORION project (reference number 20140218) from The Knowledge Foundation in Sweden. Part of the work is also supported by the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No 737422. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and Austria, Spain, Finland, Ireland, Sweden, Germany, Poland, Portugal, Netherlands, Belgium, Norway.

REFERENCES

[1] ISO/IEC, ISO/IEC 9126. Software engineering – Product quality. ISO/IEC, 2001.
[2] ISO/IEC, Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models, 2011.
[3] ISO/IEC, “ISO/IEC 25000 software and system engineering – software product quality requirements and evaluation (SQuaRE) – guide to SQuaRE,” International Organization for Standardization, 2005.
[4] S. Sentilles, E. Papatheocharous, F. Ciccozzi, and K. Petersen, “A property model ontology,” in Software Engineering and Advanced Applications (SEAA), 2016 42nd Euromicro Conference on. IEEE, 2016, pp. 165–172.
[5] G. McGraw, Software security: building security in. Addison-Wesley Professional, 2006, vol. 1.
[6] K. Petersen, R. Feldt, S. Mujtaba, and M. Mattsson, “Systematic mapping studies in software engineering,” in EASE, vol. 8, 2008, pp. 68–77.
[7] E. Chew, M. M. Swanson, K. M. Stine, N. Bartol, A. Brown, and W. Robinson, “Performance measurement guide for information security,” Tech. Rep., 2008.
[8] R. Savola, “Towards a security metrics taxonomy for the information and communication technology industry,” in Software Engineering Advances, 2007. ICSEA 2007. International Conference on. IEEE, 2007, pp. 60–60.
[9] V. Verendel, “Quantified security is a weak hypothesis: a critical survey of results and assumptions,” in Proceedings of the 2009 Workshop on New Security Paradigms. ACM, 2009, pp. 37–50.
[10] F. d. F. Rosa, R. Bonacin, and M. Jino, “The security assessment domain: A survey of taxonomies and ontologies,” arXiv preprint arXiv:1706.09772, 2017.
[11] P. Morrison, D. Moye, and L. A. Williams, “Mapping the field of software security metrics,” North Carolina State University, Dept. of Computer Science, Tech. Rep., 2014.
[12] J. L. Wright, M. McQueen, and L. Wellman, “Analyses of two end-user software vulnerability exposure metrics (extended version),” Information Security Technical Report, vol. 17, no. 4, pp. 173–184, 2013.
[13] J. L. Wright, M. McQueen, and L. Wellman, “Analyses of two end-user software vulnerability exposure metrics,” in 2012 Seventh International Conference on Availability, Reliability and Security. IEEE, 2012, pp. 1–10.
[14] I. V. Krsul, Software vulnerability analysis. Purdue University, West Lafayette, IN, 1998.
[15] K. Buyens, B. De Win, and W. Joosen, “Identifying and resolving least privilege violations in software architectures,” in Availability, Reliability and Security, 2009. ARES ’09. International Conference on. IEEE, 2009, pp. 232–239.
[16] P. K. Manadhata, D. K. Kaynar, and J. M. Wing, “A formal model for a system’s attack surface,” Carnegie Mellon University, School of Computer Science, Tech. Rep., 2007.
[17] M. Lindvall, R. T. Tvedt, and P. Costa, “An empirically-based process for software architecture evaluation,” Empirical Software Engineering, vol. 8, no. 1, pp. 83–108, 2003.
[18] R. C. Martin, Agile software development: principles, patterns, and practices. Prentice Hall, 2002.
[19] C. P. Pfleeger and S. L. Pfleeger, Security in computing. Prentice Hall Professional Technical Reference, 2002.
[20] P. K. Manadhata and J. M. Wing, “An attack surface metric,” IEEE Transactions on Software Engineering, no. 3, pp. 371–386, 2010.
[21] S. Sentilles, F. Ciccozzi, and E. Papatheocharous, “PROMOpedia: a web-content management-based encyclopedia of software property models,” in Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. ACM, 2018, pp. 45–48.








TABLE V: List of selected papers with their ID

pID   Paper reference
p1    B. Alshammari, C. Fidge, and D. Corney, “A hierarchical security assessment model for object-oriented programs,” in Quality Software (QSIC), 2011 11th International Conference on. IEEE, 2011, pp. 218–227.
p15   Y. Shin and L. Williams, “An empirical model to predict security vulnerabilities using code complexity metrics,” in Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 2008, pp. 315–317.
p18   J. L. Wright, M. McQueen, and L. Wellman, “Analyses of two end-user software vulnerability exposure metrics (extended version),” Information Security Technical Report, vol. 17, no. 4, pp. 173–184, 2013.
p22   M. Almorsy, J. Grundy, and A. S. Ibrahim, “Automated software architecture security risk analysis using formalized signatures,” in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 662–671.
p25   Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities,” IEEE Transactions on Software Engineering, vol. 37, no. 6, pp. 772–787, 2011.
p27   R. Ortalo, Y. Deswarte, and M. Kaaniche, “Experimenting with quantitative evaluation tools for monitoring operational security,” IEEE Transactions on Software Engineering, no. 5, pp. 633–650, 1999.
p37   P. Manadhata, J. Wing, M. Flynn, and M. McQueen, “Measuring the attack surfaces of two FTP daemons,” in Proceedings of the 2nd ACM Workshop on Quality of Protection. ACM, 2006, pp. 3–10.
p38   K. Buyens, R. Scandariato, and W. Joosen, “Measuring the interplay of security principles in software architectures,” in Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement. IEEE Computer Society, 2009, pp. 554–563.
p43   V. H. Nguyen and L. M. S. Tran, “Predicting vulnerable software components with dependency graphs,” in Proceedings of the 6th International Workshop on Security Measurements and Metrics. ACM, 2010, p. 3.
p47   J. Walden and M. Doyle, “SAVI: Static-analysis vulnerability indicator,” IEEE Security & Privacy, no. 1, 2012.
p50   J. A. Wang, H. Wang, M. Guo, and M. Xia, “Security metrics for software systems,” in Proceedings of the 47th Annual Southeast Regional Conference. ACM, 2009, p. 47.
p51   I. Chowdhury, B. Chan, and M. Zulkernine, “Security metrics for source code structures,” in Proceedings of the Fourth International Workshop on Software Engineering for Secure Systems. ACM, 2008, pp. 57–64.
p53   J. Walden, M. Doyle, G. A. Welch, and M. Whelan, “Security of open source web applications,” in Empirical Software Engineering and Measurement, 2009. ESEM 2009. 3rd International Symposium on. IEEE, 2009, pp. 545–553.
p59   M. Gegick, P. Rotella, and L. Williams, “Toward non-security failures as a predictor of security faults and failures,” in International Symposium on Engineering Secure Software and Systems. Springer, 2009, pp. 135–149.
p60   A. A. Younis, Y. K. Malaiya, and I. Ray, “Using attack surface entry points and reachability analysis to assess the risk of software vulnerability exploitability,” in High-Assurance Systems Engineering (HASE), 2014 IEEE 15th International Symposium on. IEEE, 2014, pp. 1–8.
p63   A. Meneely, H. Srinivasan, A. Musa, A. R. Tejeda, M. Mokary, and B. Spates, “When a patch goes bad: Exploring the properties of vulnerability-contributing commits,” in Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on. IEEE, 2013, pp. 65–74.




Copyright © 2018 for this paper by its authors.