=Paper=
{{Paper
|id=Vol-2273/QuASoQ-04
|storemode=property
|title=What Do We Know about Software Security Evaluation? A Preliminary Study
|pdfUrl=https://ceur-ws.org/Vol-2273/QuASoQ-04.pdf
|volume=Vol-2273
|authors=Séverine Sentilles,Efi Papatheocharous,Federico Ciccozzi
|dblpUrl=https://dblp.org/rec/conf/apsec/SentillesPC18
}}
==What Do We Know about Software Security Evaluation? A Preliminary Study==
6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018)

Séverine Sentilles, School of Innovation, Design and Engineering, Mälardalen University, Västerås, Sweden, severine.sentilles@mdh.se
Efi Papatheocharous, ICT SICS, RISE Research Institutes of Sweden, Stockholm, Sweden, efi.papatheocharous@ri.se
Federico Ciccozzi, School of Innovation, Design and Engineering, Mälardalen University, Västerås, Sweden, federico.ciccozzi@mdh.se

Abstract: In software development, software quality is nowadays acknowledged to be as important as software functionality and there exists an extensive body-of-knowledge on the topic. Yet, software quality is still marginalized in practice: there is no consensus on what software quality exactly is, how it is achieved and evaluated. This work investigates the state-of-the-art of software quality by focusing on the description of evaluation methods for a subset of software qualities, namely those related to software security. The main finding of this paper is the lack of information regarding fundamental aspects that ought to be specified in an evaluation method description. This work follows up the authors' previous work on the Property Model Ontology by carrying out a systematic investigation of the state-of-the-art on evaluation methods for software security. Results show that only 25% of the papers studied provide enough information on the security evaluation methods they use in their validation processes, whereas the rest of the papers lack important information about various aspects of the methods (e.g., benchmarking and comparison to other properties, parameters, applicability criteria, assumptions and available implementations). This is a major hindrance to their further use.

Index Terms: Software security, software quality evaluation, systematic review, property model ontology.

I. INTRODUCTION

Software quality measurement quantifies to what extent a software complies with or conforms to a specific set of desired requirements or specifications. Typically, these are classified as: (i) functional requirements, pertaining to what the software delivers, and (ii) non-functional requirements, reflecting how well it performs according to the specifications. While there exists a vast plethora of options for functional requirements, non-functional requirements have been studied extensively and classified through standards and models (i.e., ISO/IEC 9126 [1], ISO/IEC 25010 [2] and ISO/IEC 25000 [3]). They are often referred to as extra-functional properties, non-functional properties, quality properties, quality of service, product quality or simply metrics.

Despite the classification and terminology accompanying these properties, their evaluation and the extent of how well a software system performs under specific circumstances are both hard to quantify without substantial knowledge on the particular context, concurring measurements and measurement methods. In practice, quality assessment is still marginalized: there is no consensus on what software quality exactly is, how it is achieved and evaluated.

In our research we investigate to what extent properties and their evaluation methods are explicitly defined in the existing literature. We are interested in properties such as reliability, efficiency, security, and maintainability. In our previous work [4], we investigated which properties are most often mentioned in literature and how they are defined in a safety-critical context. We found that the most prominent properties of cost, performance and reliability/safety were used, albeit not always well defined by the authors: divergent or non-existent definitions of the properties were commonly observed.

The target of this work is to investigate evaluation methods related to software security, a critical extra-functional property that directly and significantly impacts different development stages (i.e., requirements, design, implementation and testing). Security is defined by McGraw [5] as "engineering software so that it continues to function correctly under malicious attack". The high diversity of components and the complexity of current systems make it almost impossible to identify, assess and address all possible attacks and aspects of security vulnerabilities.
More specifically, in this paper we want to answer the following research question: "What percentage of evaluation method descriptions contains pertinent information to facilitate its use?". For this, we identified a set of papers representing the state-of-the-art from the literature in the field and assessed the degree of explicit information about evaluation methods. The set of papers was analyzed to identify definition and knowledge representation issues with respect to security, and to answer the following three sub-questions:
• RQ1: What proportion of properties is explicitly defined?
• RQ2: What proportion of evaluation methods provides explicit key information to enable their use (e.g., formula, description)?
• RQ3: What proportion of other supporting elements of the evaluation methods is explicit (e.g., assumptions, available implementations)?

We used a systematic mapping methodology [6], starting from 63 papers and selecting 24 papers, which we then analyzed further. After thorough analysis, we excluded eight papers, resulting in a final set of 16 papers, which we used for synthesizing and reporting our results. The final set of papers is listed in Table V. The main contribution of this paper is the identification of the following aspects:
• the relation between evaluation methods and a generic property definition (i.e., whether it exists explicitly and how it is described),
• which aspects of other supporting elements that improve the understandability and applicability of the method (i.e., assumptions, applicability, output, etc.) are explicitly mentioned, and
• the missing information regarding fundamental aspects which ought to be specified in a property and method description.

The remainder of the paper is structured as follows. In Section II we introduce the related work on the topic, while in Section III we describe our methodology. Section IV summarizes quantitative and qualitative interpretations of the extracted data, while Section V reports the main results. Section VI discusses the threats to validity and Section VII concludes the paper and delineates possible future directions.
II. RELATED WORK

An extensive body-of-knowledge already exists on software security and many studies have investigated security assessment as well. For example, the U.S. National Institute of Standards and Technology (NIST) in [7] developed a performance measurement guide for information security, and in particular how an organization, through the use of metrics, can identify the adequacy of controls, policies and procedures. The approach is mostly focused on the level of technical security controls in the organization rather than the technical security level of specific products.

In [8], a taxonomy for information security-oriented metrics is proposed in alignment with common business goals (e.g., cost-benefit analysis, business collaboration, risk analysis, information security management, and security, dependability and trust for ICT products, systems and services), covering both organizational and product levels.

Verendel [9] quantitatively analyzed information security from a set of articles published from 1981 to 2008 and concluded that theoretical methods are difficult to apply in practice and that there is limited experimental repeatability. A recent literature survey [10] on ontologies and taxonomies of security assessment identified, among others, a gap in works addressing research issues like knowledge reuse, automatic processes, increasing the assessment coverage, defining security standards and measuring security.

Morrison et al. [11] carried out a systematic mapping study on software security metrics to create a catalogue of metrics, their subject of measurement, validation methods and mappings based on the software development life cycle (SDLC). Based on the vast catalogue of metrics and definitions collected, a major problem is that they are extremely hard to compare, as they typically measure different aspects of the property (something obvious from the emergent categories proposed in the paper). Thus, there is no agreement on their degree of assessment coverage or how to achieve their evaluation. Moreover, in their work, the generic property of security, including its definition, relation to a reference and detailed explanation on how to achieve assessment, is not addressed.

Therefore, we decided to investigate the literature thoroughly and quantify the quality of the description of security evaluation methods. Our approach differs from the work of Morrison et al. [11] in that we are not interested in collecting metrics and their mapping to the SDLC. Instead, we examine evidence on properties' explicit definitions, their evaluation methods and parameters (including explanations of how they are used) for assessing security. We develop a mapping from an applicability perspective of the evaluations, which may be used by researchers and practitioners at different phases of security evaluations, i.e., at the low-level internal or at the external (often called operational) phase of software engineering.

III. METHODOLOGY

We base our work on the mapping study previously conducted by Morrison et al. in [11]. The reasoning is that the basis of that work is well aligned and applicable to answering our RQs. The details of the methodology are explained in the following.

A. Search and selection strategy

The data sources used in [11] are the online databases, conference proceedings and academic journals of ACM Digital Library, IEEE Xplore and Elsevier. The search terms include the keywords "software", "security", "measure", "metric" and "validation". Considering another 31 synonyms, the following search string was used:

"(security OR vulnerability) AND (metric OR measure OR measurement OR indicator OR attribute OR property)",

in which the terms are successively replaced with the identified synonyms.
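To illustrate the search-string construction described above, the following minimal sketch enumerates the query variants obtained by successively substituting terms. It is an illustrative reconstruction only: the keyword lists below are abbreviated placeholders, not the full set of 31 synonyms used by Morrison et al.

```python
from itertools import product

# Abbreviated, hypothetical stand-ins for the keyword groups; the full
# synonym list (31 terms) from Morrison et al. is not reproduced here.
SUBJECTS = ["security", "vulnerability"]
QUALIFIERS = ["metric", "measure", "measurement", "indicator", "attribute", "property"]

def combined_query(subjects, qualifiers):
    """Build the combined Boolean search string used as the base query."""
    return "({}) AND ({})".format(" OR ".join(subjects), " OR ".join(qualifiers))

def query_variants(subjects, qualifiers):
    """Yield one query per substitution, mirroring the successive
    replacement of terms with their synonyms."""
    for subject, qualifier in product(subjects, qualifiers):
        yield "{} AND {}".format(subject, qualifier)

if __name__ == "__main__":
    print(combined_query(SUBJECTS, QUALIFIERS))
    for variant in query_variants(SUBJECTS, QUALIFIERS):
        print(variant)
```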
The selection criteria of Morrison et al. include the sets of inclusion and exclusion items listed below.

Inclusion criteria:
• The paper is primarily related to measuring software security in the software development process and/or its artifacts, for example software artifacts (e.g., source code files, binaries), software process (e.g., requirements phase, design, implementation, testing), and/or software process artifacts (e.g., design and functional specifications).
• Measurements and/or metrics are the main paper subject.
• Refereed paper.
• Paper published since 2000.

Exclusion criteria:
• Related to sensors
• Related to identity, anonymity, privacy
• Related to forgery and/or biometrics
• Related to network security (or vehicles)
• Related to encryption
• Related to database security
• Related to imagery, audio, or video
• Specific to single programming languages

As explained above, the search string, search strategy, and inclusion and exclusion criteria defined by Morrison et al. are applicable to our work. Thus, we reuse the list of selected papers from their mapping study as the initial set for our study. However, as we wanted to investigate the applicability and quality of the security evaluation methods found, we extended the selection and data extraction methodology to isolate the subset of selected papers that is considered most relevant. In order to identify the subset, we performed the following actions:
• Define new exclusion criteria
• Design a new selection process
• Define a new data collection scheme
• Extract the data
• Interpret data and synthesize results

B. New exclusion criteria

We decided to exclude sources according to the following criteria:
• Full text is not available
• Sources are Ph.D. dissertations or books
• Sources are not industrially validated or do not use well-known software, tools or platforms for their validations
• Sources are model predictions that do not assess any property

C. New selection process

The selection is performed on the list of 63 selected papers from [11] by three independent researchers by looking at the title and abstract and by going through a quick reading of the content. Papers independently selected by all researchers are included. Similarly, papers discarded by all researchers are excluded. For papers on which there is a disagreement, a discussion is held between the involved researchers to unanimously decide whether to include or exclude them. In one case (p18), an extended version of the paper was found and used instead (i.e., we used [12] instead of [13]). This selection process resulted in 24 included papers and 39 rejected. Furthermore, 8 papers were later excluded during the data extraction as they did not contain useful information or were not of sufficient quality. Hence, our selection set is based on the 16 papers listed in Table V.
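The selection rule applied by the three researchers can be summarized in a few lines of code. This is a minimal sketch of the decision logic as described above, with a hypothetical function name and vote encoding; it is not tooling used in the study.

```python
def selection_decision(votes):
    """Decide on a paper from three independent votes (True = include).

    All researchers include the paper -> "include"; all discard it ->
    "exclude"; any disagreement -> "discuss" (resolved unanimously in a
    follow-up discussion between the involved researchers).
    """
    votes = list(votes)
    if all(votes):
        return "include"
    if not any(votes):
        return "exclude"
    return "discuss"

# Example: two researchers would include the paper, one would not.
print(selection_decision([True, True, False]))  # "discuss"
```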
D. Data collection scheme

As support for answering the RQs, we created a form based on Excel spreadsheets. The form consists of the questions specified in Table I, with the list of possible values for each question. The questions are derived from our previous work on the Property Model Ontology (PMO), which formally specifies which concepts should be described for properties and their evaluation methods and how these concepts relate to one another. For details on the PMO, the reader is directed to [4].

E. Data extraction and interpretation

The data extraction is carried out by the researchers using independent spreadsheet columns (one for each paper). The data extraction for each paper is reviewed by another researcher (i.e., a reviewer) and discussions are carried out until an agreement is reached. The data is then interpreted and analyzed by all researchers together, as explained in the next section.

IV. DATA INTERPRETATION

A. Quantitative analysis

One interesting, albeit not very surprising, fact is that most of the descriptions contained in the papers explicitly state what is supposed to be calculated (i.e., the property and the output) and how (i.e., through the overall approach description, the parameters involved in the approach, and to some extent the advantages and disadvantages). As visible in Table II, which shows the results of the quantitative analysis, the property is explicitly referred to in 87.5% of the papers, the output in 75%, the approach description in 87.5%, the parameters in 81%, the advantages in 62% and the disadvantages in 69%.

No method explicitly specifies the unit to be used for the output. This might be explained by the fact that software security is a rather recent research field and there is currently no widely acknowledged metric and corresponding units to be used. As a result, most methods typically fall back on standard data types for which units are less relevant (e.g., percentage, probability, item counts).

Applicability criteria are rarely explicitly mentioned, only in 19% of the papers. When they are mentioned, it is not obvious whether the set of provided criteria is complete or at least sufficient to ensure the applicability of the evaluation method.

Only half of the papers clearly specify the assumptions and hypotheses that are assumed to hold when using the method. This is a setback, as these aspects are important for correctly applying a method.

Drivers are rarely explicitly mentioned, only in 12.5% of the papers. Drivers can be important as they can implicitly affect the output of a method.

Despite being often mentioned in the papers as part of the solution being implemented, in practice only a few implementations or programs are available to directly support the method evaluation. We could find mentions of available tool support in only 2 out of the 16 papers (i.e., the WEKA toolkit and Fortify Source Code Analyzer (SCA) version 5.10.0.0102 in p43 and p47). A few additional papers (4) refer to available tool support, but no explicit link or reference is provided.

Out of the 16 analyzed papers, only 1 (i.e., p60) performed some kind of benchmarking or comparison to similar properties or evaluation methods. This corresponds to only 6% of all the papers.

To a large extent, the information provided in the papers is insufficient to directly apply the method, especially by non-experts (in 12 papers, 75%).
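The percentages reported for the quantitative analysis (Table II) follow directly from the counts of "yes" answers over the 16 analyzed papers. The snippet below is a hypothetical reconstruction of that tally, not the authors' actual spreadsheet.

```python
def share_of_papers(yes_count, total=16):
    """Percentage of papers answering 'yes' to a question, as in Table II."""
    return 100.0 * yes_count / total

# Examples taken from the text: the property is explicit in 14 of 16 papers,
# and only 4 of 16 papers provide sufficient information for direct use.
print(round(share_of_papers(14), 1))  # 87.5
print(round(share_of_papers(4), 1))   # 25.0
```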
TABLE I: Data collection form (data to collect, with possible values)
• q1. PaperId (Px, with x a number)
• q2. Method name (free text)
• q3. Does the evaluation method provide an explicit reference to a property or a metric? (yes | no)
• q4. If yes on q3, how? (name + def. + ref. | name + def. | name + ref. | other)
• q5. If other on q4, how? (free text)
• q6. If yes on q3, which one? Give the example (free text)
• q7. If yes on q3, does the reference match what is needed for the evaluation method? (yes | no | I don't know)
• q8. Does the evaluation method explicitly state the output of the method? (yes | no)
• q9. If yes on q8, how? (with data format and unit | with data only | other)
• q10. If other on q9, how? (free text)
• q11. If yes on q8, which one? Give the example (free text)
• q14. Does the evaluation method explicitly state applicability criteria? (yes | no)
• q15. If yes, which ones? (free text)
• q16. Does the evaluation method explicitly state how the property is evaluated? (yes | no | I don't know)
• q17. If yes on q16, how? (free text)
• q18. Does the evaluation method explicitly explain the parameters involved in the evaluation? (yes | no)
• q19. If yes on q18, which parameters (with their explanations)? (free text)
• q25. Does the evaluation method explicitly describe additional drivers that might affect the evaluation? (yes | no)
• q26. If yes on q25, which ones? (free text)
• q27. Does the evaluation method explicitly state the assumptions and hypotheses which the method is based on? (yes | no)
• q28. If yes on q27, which ones? (free text)
• q29. Does the evaluation method explicitly mention advantages of the method? (yes | no)
• q30. If yes on q29, which ones? (free text)
• q31. Does the evaluation method explicitly mention disadvantages of the method? (yes | no)
• q32. If yes on q31, which ones? (free text)
• q33. Does the evaluation method explicitly reference an implementation or a program that can be used? (yes | no)
• q34. If yes on q33, which ones? (free text)
• q35. If yes on q33, is the implementation or program accessible? (yes | no)
• q36. Is there any comparison to other properties (e.g., benchmark, validation)? (yes | no)
• q37. If yes on q36, how? (free text)
• q38. Additional comments (free text)
• q39. Is the information provided in the paper sufficient to understand the method and directly use it? (yes | no)
• q40. Extractor's name (Séverine | Efi | Federico)
• q41. Reviewer's name (Séverine | Efi | Federico)
B. Qualitative analysis

From analyzing the answers to the questions in Table II, the papers can be categorized into three groups based on their main purpose. The first group focuses on defining a new property or metric to assess some specific software security aspects (p1, p18, p27, p47, p63). The papers belonging to the second group (p15, p25, p37, p51, p60) base their work on already defined properties. Their main objective is to either find ways to combine existing metrics to evaluate a given security aspect, define a new evaluation method for existing properties, or validate previously specified methods by applying them on a specific system. The last group aims at finding correlations between already defined properties or performing predictions to evaluate the accuracy of previously defined models (p38, p43, p53, p55, p59). There are papers that do not belong to these groups as they do not explicitly refer to any property, method or metric (p22, p50). This is shown in the answers to the questions on whether the paper explicitly refers to a property and, if yes, how (i.e., q3 and q4). Papers belonging to the first group use "definition without reference", papers from the second group use "definition plus reference" and papers from the last group use "other".

From the data extracted, 34 definitions and references on security are collected from the papers, as listed in Table IV. We categorized the collected properties in two different levels, depending on their level of specificity (less coarse-grained). A few papers shared some definitions (e.g., p25, p59, p63) whereas most papers defined their own security aspect to evaluate. The security aspect expressed (even if in some cases it is explained in much detail) is only able to capture some facets of the property; thus it is hard to know whether the property definitions together with their evaluations are enough to assess security in a rational way.

Related to the descriptions of the properties and evaluations extracted, we captured the level of applicability between them, as they were described in the papers: most commonly the deployment and operation phase (post-release), the implementation (development or code level, including testing) and the design phase, as shown in Table III. The level at which security evaluation is carried out varies among the papers: 7 of them (44%) apply the evaluation method at the implementation (code-level) phase, 1 paper at the testing phase, 6 papers (37.5%) at the operational phase, 1 paper during maintenance and 4 papers (25%) at the design level.

TABLE II: Quantitative results (count and share of 'yes' and 'no' answers per question)
• q3: Explicit property: yes 14 (87.5%), no 2 (12.5%)
• q8: Explicit method output: yes 12 (75%), no 4 (25%)
• q14: Explicit applicability criteria: yes 3 (19%), no 13 (81%)
• q16: Explicit description: yes 14 (87.5%), no 2 (12.5%)
• q18: Explicit method parameters: yes 13 (81%), no 3 (19%)
• q25: Explicit drivers: yes 2 (12.5%), no 14 (87.5%)
• q27: Explicit assumptions: yes 8 (50%), no 8 (50%)
• q29: Explicit advantages: yes 10 (62.5%), no 6 (37.5%)
• q31: Explicit disadvantages: yes 11 (69%), no 5 (31%)
• q33: Explicit tool reference: yes 6 (37.5%), no 10 (62.5%)
• q35: Tool accessibility: yes 2 (12.5%), no 4 (25%)
• q36: Benchmark: yes 1 (6%), no 15 (94%)
• q39: Information is sufficient: yes 4 (25%), no 12 (75%)

TABLE III: Security level applicability (pID; applicability level; analysis level*)
• p1: system architecture, design, implementation; internal
• p15: implementation; internal
• p18: operational; external
• p22: operational, system architecture, design; internal & external
• p25: operational; internal & external
• p27: operational; external
• p37: implementation; internal
• p38: design; internal
• p43: design, implementation; internal
• p47: operational; external
• p50: operational; external
• p51: implementation; internal
• p53: maintenance; external
• p59: testing; external
• p60: detailed design, implementation; internal
• p63: implementation; internal
* Based on the definitions from [1] on internal and external metrics for product quality.

As mentioned above, no units are explicitly specified. However, the output of the evaluation methods falls back onto implicit scales of measurement: nominal (p60), ordinal (p1, p18, p37, p47, p50, p60, p63), interval (p53) and ratio (p18, p25, p43, p60).

Regarding advantages of the methods, the most commonly reported are: reliability of the method (p1, p15, p25), simplicity of the method (p1, p18, p25), accuracy of the method (p18, p43) and objectivity of the results (p47, p60). The disadvantages mostly refer to the accuracy of the results being dependent on the quality of the available data (p18, p25, p37, p43) and the subjectivity involved in the method (p22, p37, p63).
V. RESULTS AND DISCUSSIONS

As an answer to our initial research question, "What percentage of evaluation method descriptions contains pertinent information to facilitate its use?", only 25% of the papers that we investigated were judged to provide enough information to enable a direct application of the method. Overall, the papers were good at explicitly stating the property under study, describing how the property is evaluated and which parameters are involved. On the other hand, in several cases the papers lacked information regarding the unit representing the value of the property, the applicability criteria for the method as well as its assumptions, possible advantages and disadvantages, eventual openly available tool support, and comparison to other properties.

Security being a relatively new area in software engineering can be a major contributing factor to these results. For example, advantages, disadvantages, and applicability criteria are difficult to identify initially. They require time and perspective on the topic, as well as the methods having been used over time in a wide range of applications and scenarios. Given that many of the papers included in our set focused on defining new properties for security, it is not so surprising that so few papers mentioned those aspects. Furthermore, following the same reasoning, even if those criteria were explicitly stated in the papers, it is not certain that the proposed lists of advantages, disadvantages and applicability criteria are exhaustive, or at least sufficient to guarantee the proper usage of the evaluation method that they introduce.

When comparing to the results in [11], we found fewer metrics and definitions. This is due to the fact that our purpose was to identify the main property (or properties) of each paper and their corresponding evaluation methods. Some of the metrics identified in [11] are classified in our results as parameters. Others have been ignored as they were used for comparison purposes and are therefore not as relevant for our work. However, the conclusions by Morrison et al. still hold and are further supported by our results. Security properties are not mature, most of them have been proposed and evaluated solely by their authors, and few comparisons between already defined properties exist. "Despite the abundance of metrics found in the literature, those available give us an incomplete, disjointed, hazy view of software security."

These results are to be put in perspective, since none of the authors of this paper is an expert in software security and due to the small number of papers that were studied in this work. However, we are confident that the observations resulting from this study are good indications of the issues occurring with property and evaluation method descriptions in software security.

A by-product of our analysis is the following interesting aspect. Especially for the works assessing software vulnerability, evaluation methods exploit well-established prediction models that leverage a discernible set of software metrics (related to the code itself, developer activity, versioning data, etc.).
A (what seems to be) common step in the definition of security-related evaluation methods, especially when dealing with vulnerability, is the comparison of existing prediction models, or the comparison of a newly defined prediction model with a set of existing ones, in order to identify the "best" model to use as a basis for evaluation purposes. Another interesting aspect is represented by the fact that, in several papers, the authors exercise their reasoning and methods on well-known code-bases (e.g., Windows Vista, Mozilla); this shows an interesting, strong inclination towards assessing the applicability of research (theoretical) results to practical cases, which is too seldom seen in other research branches within software engineering.

TABLE IV: Collection of properties or coarse-grained metrics explicitly defined in the papers¹ (pID: generic/coarse-grained definition; less generic definition)
• p1: Security to mean control over data confidentiality [...] and data integrity [...]¹. Total Security Index (TSI): the sum of the security design principle metrics.
• p15: Vulnerability as a weakness in a software system that allows an attacker to use the system for a malicious purpose.
• p18: End-user software vulnerability exposure as a combination of lifespans and vulnerability announcement rates. Median Active Vulnerabilities (MAV): the median number of software vulnerabilities which are known to the vendor of a particular piece of software but for which no patch has been publicly released by the vendor. Vulnerability Free Days (VFD): captures the probability that a given day has exactly zero active vulnerabilities.
• p25: Vulnerability (defined in [14]).
• p27: Operational security as a representation, as accurately as possible, of the security of the system in operation, i.e., its ability to resist possible attacks or, equivalently, the difficulty for an attacker to exploit the vulnerabilities present in the system and defeat the security objectives. Mean Effort To security Failure (METF): mean effort for a potential attacker to reach the specified target.
• p37: Security through an attack surface metric. Attack surface metric: a measure of a system's attack surface along three dimensions, by estimating the total contribution of the methods, the total contribution of the channels, and the total contribution of the data items to the system's attack surface [...]¹.
• p38: Security through the number of violations of the least privilege principle and surface metrics; maintainability as coupling between components and component instability. LP principle metric (defined in [15]); attack surface metric (defined in [16]); coupling between components (CBM) (defined in [17]); components instability (CI) (defined in [18]).
• p43: Vulnerability through dependency graphs (*).
• p47: Vulnerability through static analysis (*). Static-analysis vulnerability density (SAVD): number of vulnerabilities a static-analysis tool finds per thousand LOC; density of post-release reported vulnerabilities (NVD) (*); application vulnerability rank (SAVI) (*).
• p50: Vulnerability as flaws or weaknesses in a system's design, implementation, or operation and management that could be exploited to violate the system's security policy; any flaw or weakness in an information system could be exploited to gain unauthorized access to, damage or compromise the information system.
• p51: Application security (defined in [5]); software security (defined in [5]); security at program level (defined in [19]). Stall Ratio (SR): a measure of how much a program's progress is impeded by frivolous activities; Coupling Corruption Propagation (CCP): number of child methods invoked with the parameter(s) based on the parameter(s) of the original invocation; Critical Element Ratio (CER) (*).
• p53: Security through vulnerability density (*). Static-analysis vulnerability density (SAVD) (see p47); Security Resources Indicator (SRI): the sum of four indicator items, ranging from 0 to 4; the items are the documentation of the security implications of configuring and installing the application, a dedicated e-mail alias to report security problems, a list or database of security vulnerabilities specific to the application, and the documentation of secure development practices, such as coding standards or techniques to avoid common secure programming errors.
• p59: Vulnerability (defined in [14]).
• p60: Software vulnerability (defined in [19]). Structural severity: uses software attributes to evaluate the risk of an attacker reaching a vulnerability location from attack surface entry points [...]¹; attack surface entry points (defined in [20]); reachability analysis (*).
• p63: Software vulnerability (defined in [14]). Vulnerability-Contributing Commits (VCCs): original coding mistakes or commits in the version control repository that contributed to the introduction of a post-release vulnerability.
¹ The definitions have been shortened to comply with the page limitation. Readers are referred to the original papers for the complete definitions.
(*) No explicit definition is found.
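Two of the less generic metrics collected in Table IV are simple enough to restate as formulas: static-analysis vulnerability density (SAVD, findings per thousand lines of code, p47 and p53) and the Security Resources Indicator (SRI, the sum of four binary indicator items, p53). The snippet below is a hedged reading of those definitions with made-up input values; it is not code from the cited papers.

```python
def savd(static_analysis_findings, lines_of_code):
    """Static-analysis vulnerability density: findings per thousand LOC."""
    return static_analysis_findings / (lines_of_code / 1000.0)

def sri(security_config_docs, report_email_alias, vulnerability_list, secure_dev_docs):
    """Security Resources Indicator: sum of four indicator items, 0 to 4."""
    return sum([security_config_docs, report_email_alias,
                vulnerability_list, secure_dev_docs])

# Hypothetical application: 18 findings in 45,000 LOC, and two of the four
# security resources present.
print(savd(18, 45000))                # 0.4 findings per KLOC
print(sri(True, True, False, False))  # SRI = 2
```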
VI. THREATS TO VALIDITY

Construct validity relates to what extent the phenomenon under study represents what the researchers wanted to investigate and what is specified by the research questions. We explicitly defined the context of the work and discussed related terms and concepts. Also, the research questions were formulated based on these clarified notions, and the research followed a methodological procedure known as systematic mapping.

The selection of papers was based on the work of Morrison et al. [11], thus we inherit that work's limitations. Therefore, papers not listed in the sources used, i.e., ACM, IEEE and Elsevier, have been missed. In addition, we excluded Ph.D. dissertations and books since we limited our selection to peer-reviewed conference and scholarly journal publications only.

The way the involved researchers individually interpreted the methods described in the papers reflects their own biases and views. However, we worked hard to reduce this bias by discussing the content of the papers in pairs to dissolve uncertainties. In several cases where the decision to include a paper or not was disputed, a third researcher was involved to reach a consensus.
This opens however up search and innovation programme and Austria, Spain, Finland, for opportunities of further research targeting these particular Ireland, Sweden, Germany, Poland, Portugal, Netherlands, types of systems. Belgium, Norway. External validity is about to what extent the findings are generalizable. Due to the focus on the security aspect and the small selection of papers, one should avoid generalizing R EFERENCES the results over other properties such as reliability and safety. However, despite the limitations of our study (i.e., low number [1] ISO/IEC, ISO/IEC 9126. Software engineering – Product quality. ISO/IEC, 2001. of papers, one method per paper, no snowballing to find com- [2] ——, Systems and software engineering-Systems and software Quality plementary information), our conclusions are representative Requirements and Evaluation (SQuaRE) - System and software quality of the current issues in software security. Furthermore, other models, 2011. works in the state-of-the-art on software quality share similar [3] ——, “ISO/IEC 25000 software and system engineering–software prod- uct quality requirements and evaluation (SQuaRE)–guide to SQuaRE,” conclusions as Morrison et. al. in [11], which points towards International Organization for Standardization, 2005. the applicability of our results to other properties. [4] S. Sentilles, E. Papatheocharous, F. Ciccozzi, and K. Petersen, “A property model ontology,” in Software Engineering and Advanced Ap- VII. C ONCLUSIONS plications (SEAA), 2016 42th Euromicro Conference on. IEEE, 2016, pp. 165–172. Despite the low number of papers this work is based on, [5] G. McGraw, Software security: building security in. Addison-Wesley we believe it is, to some extent, representative of the current Professional, 2006, vol. 1. issues in the state-of-the-art on software quality in general and [6] K. Petersen, R. Feldt, S. Mujtaba, and M. Mattsson, “Systematic mapping studies in software engineering.” in EASE, vol. 8, 2008, pp. security in particular: a number of useful properties and meth- 68–77. ods to evaluate them do exist today. However, it is difficult to [7] E. Chew, M. M. Swanson, K. M. Stine, N. Bartol, A. Brown, and know about them and understand whether they are applicable W. Robinson, “Performance measurement guide for information secu- rity,” Tech. Rep., 2008. in a given context and if they are, how to use them. This can be [8] R. Savola, “Towards a security metrics taxonomy for the information and attributed to the lack of information in the descriptions of their communication technology industry,” in Software Engineering Advances, evaluation methods. This hampers the activities towards better 2007. ICSEA 2007. International Conference on. IEEE, 2007, pp. 60– quality assurance in software engineering and it limits at the 60. [9] V. Verendel, “Quantified security is a weak hypothesis: a critical survey same time the application of these activities or newly specified of results and assumptions,” in Proceedings of the 2009 workshop on methods in industrial settings. For example, knowing when to New security paradigms workshop. ACM, 2009, pp. 37–50. apply a method (e.g., applicability at design time, at run-time, [10] F. d. F. Rosa, R. Bonacin, and M. Jino, “The security assessment domain: A survey of taxonomies and ontologies,” arXiv preprint etc.) restricts which methods can be used in a given context. arXiv:1706.09772, 2017. Knowing the advantages and disadvantages allows to trade- [11] P. Morrison, D. Moye, and L. A. 
[12] J. L. Wright, M. McQueen, and L. Wellman, "Analyses of two end-user software vulnerability exposure metrics (extended version)," Information Security Technical Report, vol. 17, no. 4, pp. 173–184, 2013.
[13] J. L. Wright, M. McQueen, and L. Wellman, "Analyses of two end-user software vulnerability exposure metrics," in 2012 Seventh International Conference on Availability, Reliability and Security. IEEE, 2012, pp. 1–10.
[14] I. V. Krsul, Software Vulnerability Analysis. Purdue University, West Lafayette, IN, 1998.
[15] K. Buyens, B. De Win, and W. Joosen, "Identifying and resolving least privilege violations in software architectures," in Availability, Reliability and Security, 2009. ARES '09. International Conference on. IEEE, 2009, pp. 232–239.
[16] P. K. Manadhata, D. K. Kaynar, and J. M. Wing, "A formal model for a system's attack surface," Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, Tech. Rep., 2007.
[17] M. Lindvall, R. T. Tvedt, and P. Costa, "An empirically-based process for software architecture evaluation," Empirical Software Engineering, vol. 8, no. 1, pp. 83–108, 2003.
[18] R. C. Martin, Agile Software Development: Principles, Patterns, and Practices. Prentice Hall, 2002.
[19] C. P. Pfleeger and S. L. Pfleeger, Security in Computing. Prentice Hall Professional Technical Reference, 2002.
[20] P. K. Manadhata and J. M. Wing, "An attack surface metric," IEEE Transactions on Software Engineering, no. 3, pp. 371–386, 2010.
[21] S. Sentilles, F. Ciccozzi, and E. Papatheocharous, "PROMOpedia: a web-content management-based encyclopedia of software property models," in Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. ACM, 2018, pp. 45–48.

TABLE V: List of selected papers with their ID
• p1: B. Alshammari, C. Fidge, and D. Corney, "A hierarchical security assessment model for object-oriented programs," in Quality Software (QSIC), 2011 11th International Conference on. IEEE, 2011, pp. 218–227.
• p15: Y. Shin and L. Williams, "An empirical model to predict security vulnerabilities using code complexity metrics," in Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 2008, pp. 315–317.
• p18: J. L. Wright, M. McQueen, and L. Wellman, "Analyses of two end-user software vulnerability exposure metrics (extended version)," Information Security Technical Report, vol. 17, no. 4, pp. 173–184, 2013.
• p22: M. Almorsy, J. Grundy, and A. S. Ibrahim, "Automated software architecture security risk analysis using formalized signatures," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 662–671.
• p25: Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, "Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities," IEEE Transactions on Software Engineering, vol. 37, no. 6, pp. 772–787, 2011.
• p27: R. Ortalo, Y. Deswarte, and M. Kaaniche, "Experimenting with quantitative evaluation tools for monitoring operational security," IEEE Transactions on Software Engineering, no. 5, pp. 633–650, 1999.
• p37: P. Manadhata, J. Wing, M. Flynn, and M. McQueen, "Measuring the attack surfaces of two FTP daemons," in Proceedings of the 2nd ACM Workshop on Quality of Protection. ACM, 2006, pp. 3–10.
• p38: K. Buyens, R. Scandariato, and W. Joosen, "Measuring the interplay of security principles in software architectures," in Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement. IEEE Computer Society, 2009, pp. 554–563.
• p43: V. H. Nguyen and L. M. S. Tran, "Predicting vulnerable software components with dependency graphs," in Proceedings of the 6th International Workshop on Security Measurements and Metrics. ACM, 2010, p. 3.
• p47: J. Walden and M. Doyle, "SAVI: Static-analysis vulnerability indicator," IEEE Security & Privacy, no. 1, 2012.
• p50: J. A. Wang, H. Wang, M. Guo, and M. Xia, "Security metrics for software systems," in Proceedings of the 47th Annual Southeast Regional Conference. ACM, 2009, p. 47.
• p51: I. Chowdhury, B. Chan, and M. Zulkernine, "Security metrics for source code structures," in Proceedings of the Fourth International Workshop on Software Engineering for Secure Systems. ACM, 2008, pp. 57–64.
• p53: J. Walden, M. Doyle, G. A. Welch, and M. Whelan, "Security of open source web applications," in Empirical Software Engineering and Measurement, 2009. ESEM 2009. 3rd International Symposium on. IEEE, 2009, pp. 545–553.
• p59: M. Gegick, P. Rotella, and L. Williams, "Toward non-security failures as a predictor of security faults and failures," in International Symposium on Engineering Secure Software and Systems. Springer, 2009, pp. 135–149.
• p60: A. A. Younis, Y. K. Malaiya, and I. Ray, "Using attack surface entry points and reachability analysis to assess the risk of software vulnerability exploitability," in High-Assurance Systems Engineering (HASE), 2014 IEEE 15th International Symposium on. IEEE, 2014, pp. 1–8.
• p63: A. Meneely, H. Srinivasan, A. Musa, A. R. Tejeda, M. Mokary, and B. Spates, "When a patch goes bad: Exploring the properties of vulnerability-contributing commits," in Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on. IEEE, 2013, pp. 65–74.

Copyright © 2018 for this paper by its authors.