Quality Characteristics of Artificially Intelligent Systems

Adam Leon Smith, Dragonfly, Email: adam@wearedragonfly.co
Raphaël Clifford, Email: raphael@clifford.net

Abstract—This paper explores the quality characteristics of systems using artificial intelligence components, referencing existing work in this area from research, the private sector and the international standards community. It proposes a new quality model that considers existing work and, where applicable, proposes measures that may be appropriate for each quality characteristic.

Index Terms—artificial intelligence, machine learning, quality, systems engineering, quality management, testing

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

I. INTRODUCTION

Artificial Intelligence (AI) can be loosely defined as applying acquired knowledge to make decisions, in contrast to using explicit logic. This presents both opportunities and problems [1] for the field of software and systems quality. One of the most challenging aspects is the probabilistic nature of sub-symbolic systems, and the difficulty of reproducing and explaining results. Another challenge is convincing stakeholders that AI systems can be trusted with important decisions; some researchers propose that what AI systems actually need to achieve is reliability [2], which is closely related to quality.

Artificial intelligence can include symbolic rule-based expert knowledge systems and sub-symbolic (statistical) machine learning systems. Machine learning is the most common AI method, and it is difficult to specify its quality and to analyse how to test it. Research in Japan involving 278 machine learning engineers identified that the biggest new challenges they face when integrating machine learning are decision making with customers and testing/quality assurance. Further, they identify the lack of a test oracle¹ and imperfection as the top causes of this [3].

¹ In software testing, a test oracle is a source to determine an expected result to compare with the actual result of the system under test [4].

Should a practitioner wish to define a strategy or approach to holistically evaluate the quality of an AI system, at present it requires a review of many scholarly articles in order to identify the relevant properties. It is therefore important that a standard quality model for AI systems is developed to support practitioners.

This paper reviews existing quality models in the context of AI, that is, acquiring knowledge, applying the knowledge and producing decisions. Robustness and context completeness are introduced as characteristics that relate to the input domain; bias, functional correctness and ex-post explainability (run transparency) as relating to the output decision domain; and adaptability, transparency, and societal and ethical risk mitigation as non-functional characteristics. This paper gives examples of measures that can be used, but it is not intended to be exhaustive on this matter.

II. EXISTING WORK

A. SQuaRE

ISO/IEC 25010 [5] is a popular standard in software and systems quality management defining system and software quality models, alongside ISO/IEC 25012 [6], which defines a data quality model. ISO/IEC 25010 has been identified by practitioners [7] as requiring modification for AI systems. ISO/IEC have also commenced a project to create a standardised model [8] that, as an International Standard consistent with ISO/IEC 25010, will drive efficiency in industry. This is expected to be published in 2023.

B. DIN SPEC 92001-1

The DIN SPEC 92001-1 [9] is a freely available standard published in April 2019 by the German standardisation body (DIN). It aims to provide an outline of AI lifecycle processes and quality requirements. It outlines three quality pillars: functionality and performance, robustness, and comprehensibility. This paper refers to functionality and performance as functional correctness and completeness in order to stay consistent with existing ISO standards [5]. Similarly, this paper refers to comprehensibility as transparency. Nevertheless, the scope of the three quality pillars covered in the DIN SPEC is also covered in this proposed model.
C. ISO/IEC Standards on Artificial Intelligence

ISO/IEC are working on a large number of new standards relating to AI, including those that relate to quality, verification and validation. The first standard relevant to the quality topic is a technical report giving an overview of trustworthiness in AI [10]. This identifies vulnerabilities in AI systems including security and privacy threats, bias, unpredictability, opaqueness, and more. Some of these vulnerabilities map directly onto the requirements for a quality model for AI.

III. PROPOSED QUALITY MODEL

A. Model Types

Quality cannot be quantified as a single metric. It requires the definition of characteristics and terminology that can be used when specifying quality requirements and evaluating them. ISO/IEC 25010 defines two models, a product quality model and a quality-in-use model. The former describes the characteristics of a static system, and the latter the characteristics of a system within a specified context of use. To draw an analogy, AI product quality (PQ) may be determined once at the point of release, whereas AI quality in use (QiU) can only be determined in each actual context of use. In traditional software systems, few characteristics overlap. Table 1, below, lists the quality characteristics discussed in this paper.

Quality model in ISO/IEC 25010 | Characteristic         | Discussed sub-characteristics
Product quality                | Functional suitability | Functional correctness
Product quality                | Functional suitability | Bias
Product quality                | Portability            | Adaptability
Product quality                | Security               | Robustness to adversarial examples
Product quality                | Usability              | Run transparency
Product quality                | Usability              | Controllability
Product quality                | Maintainability        | Functional transparency
Quality in use                 | Context coverage       | Context completeness
Quality in use                 | Freedom from risk      | Societal and ethical risk mitigation

Table 1 - New quality sub-characteristics to be discussed

B. Functional Suitability

1) Functional Correctness: Whilst there are significant challenges relating to the verification of functional correctness and completeness, how to statistically analyse the results of the classification and regression systems common in AI is a mature topic. Measurement of type I (α) and type II (β) errors is one common approach [11] to presenting classifier results. For example, where H0 represents all negative predictions and H1 represents all type II errors (false negative predictions), the type II rate can be calculated as:

β = H1 / H0

Regression problems also have various metrics, of which one of the most common is Mean Absolute Error (MAE); this gives no indication of the directionality of the error, simply its scale. Where Yi represents the predicted values and Xi represents the ground truth, the error can be expressed as:

MAE = (1/n) Σ |Xi − Yi|
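To make these measures concrete, the following Python sketch computes the type II rate exactly as defined above (false negative predictions divided by all negative predictions) and the MAE. The function and variable names are illustrative only, not taken from any particular library.

    # Illustrative sketch only: classifier error rate and MAE as defined above.
    def type_ii_rate(y_true, y_pred):
        # H0: all negative predictions; H1: false negative predictions.
        h0 = sum(1 for p in y_pred if p == 0)
        h1 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        return h1 / h0 if h0 else 0.0

    def mean_absolute_error(y_true, y_pred):
        # MAE = (1/n) * sum(|Xi - Yi|); reports the scale of error, not its direction.
        return sum(abs(x - y) for x, y in zip(y_true, y_pred)) / len(y_true)

    # Example usage with toy data:
    # type_ii_rate([1, 0, 1, 1], [0, 0, 1, 1])      -> 0.5
    # mean_absolute_error([2.0, 3.0], [2.5, 2.0])   -> 0.75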
2) Bias: Bias is a term that is frequently used differently by different stakeholders. It is common for data scientists to think of bias as a statistical property which can be positive or negative in any given context, and for ethicists to think of bias as an unfair outcome. In the context of an overall AI system bias is both: it is a property of the data and of the algorithm. Bias also manifests as cognitive biases that exist in the development team, and societal biases that exist in historical datasets.

Overall, bias in an AI system is a property of the system that results in different treatment for different people, objects or groups. In this context, it is an accuracy issue that exists in relation to the functional correctness and completeness of a system. Bias can be measured using MAE or α and β as described above, but in a way that filters the results for a particular cohort of transactions that belong to a specific group. In this way the results for that cohort can be compared with those for the general population to identify bias. Another approach is to use statistical parity [12]. This uses S as a variable that identifies the cohort under analysis, where S = 1 indicates membership of the relevant cohort:

DI = P(Y = 1 | S = 0) / P(Y = 1 | S = 1)

Other metrics are required when assessing ranked outputs [13] or continuous variables, but the principle of comparing the group under analysis to the general population remains.

Bias belongs to both the product quality model (product bias) and the quality in use model (bias in use). This is because it is a property of a single system and the data inputs used in the production of that system, but it is also a property of the system in actual use, where the inputs may be very different.
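As an illustration, the disparate impact ratio defined above can be computed in a few lines of Python. The names are hypothetical, and S is assumed to be supplied as a 0/1 indicator alongside each binary decision.

    # Illustrative sketch only: statistical parity / disparate impact.
    def disparate_impact(y_pred, s):
        # DI = P(Y = 1 | S = 0) / P(Y = 1 | S = 1), where s[i] = 1 marks
        # membership of the cohort under analysis.
        outside = [y for y, g in zip(y_pred, s) if g == 0]
        inside = [y for y, g in zip(y_pred, s) if g == 1]
        if not outside or not inside or sum(inside) == 0:
            return None  # ratio undefined for this sample
        return (sum(outside) / len(outside)) / (sum(inside) / len(inside))

    # A value far from 1.0 suggests the cohort receives positive outcomes
    # at a different rate from the rest of the population.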
C. Adaptability

Adaptability is defined as a product quality characteristic in ISO/IEC 25010:

"degree to which a product or system can effectively and efficiently be adapted for different or evolving hardware, software or other operational or usage environments"

and is part of the portability characteristic, which refers to the environment. This paper proposes that the definition of adaptability is extended. It is much more the case with AI systems that the data observed by the system can now be part of the environment, in real time with reinforcement learning, or as models are "retrained" with new datasets. This is starkly different to making a change to existing logic, as the model is completely re-baselined rather than incrementally changed, and the change may be interactive, dynamic, periodic or even in real time.

Adaptability could be defined as the time taken for a system to start to react differently based on a change in observed data, or the ease with which it can be retrained.
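The first of these definitions could be operationalised roughly as sketched below. This is a toy sketch that assumes an online model exposing predict() and partial_fit() methods (a hypothetical interface); it simply counts how many observations from a shifted data stream are needed before the model's output for a fixed probe input changes. Wall-clock time could be recorded instead, which matches the "time taken to start to react differently" wording more literally.

    # Illustrative sketch only: adaptability as observations-to-adapt.
    def observations_until_adaptation(model, shifted_stream, probe_input):
        # 'model' is assumed to expose predict() and partial_fit();
        # 'shifted_stream' yields (x, y) pairs drawn after the change in
        # observed data; 'probe_input' is a fixed reference input.
        baseline = model.predict(probe_input)
        for count, (x, y) in enumerate(shifted_stream, start=1):
            model.partial_fit(x, y)
            if model.predict(probe_input) != baseline:
                return count  # observations seen before behaviour changed
        return None  # behaviour never changed within the stream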
D. Controllability

The degree to which a system can be controlled is not a new concept [34], and is typically a functional aspect of a system; however, systems are increasingly able to operate without human intervention or control. Therefore, if human interaction becomes optional or impossible, it is important to consider how controllable an AI system is for its presumptive human operator. Controllability can be considered to be the ability to move a system from an arbitrary initial state into another desired state, by exercising a control function, within a certain number of steps and within the required time.

E. Robustness and adversarial examples

The environment in which AI must operate may be subject to change through natural perturbations and volatility, drift in the characteristics of input data over time, or the malicious intention of an adversary. The term AI robustness attempts to capture the extent to which an AI system will safely maintain its intended behaviour in such scenarios. This is distinct from context completeness, which does not focus on unanticipated changes in input distributions. Robustness is, however, to some extent captured under the catch-all term context coverage. Ensuring robustness poses some of the most difficult and important questions in AI and raises a number of issues which we introduce below.

1) Distributional Change: Perhaps the most common challenging issue in AI is how to maintain the desired behaviour of a system when the input distribution changes over time. If the test data has the same statistical properties as the training set then we can expect a well specified AI system to work correctly. However, when encountering new and unexpected input the situation can be much more difficult. As an example, in "Concrete Problems in AI Safety" [16] an AI robot cleaner is posited which was trained to clean an empty office. On encountering a pet dog for the first time it might attempt to wash it, giving unpredictable results. In more serious settings, such as trading on the stock market or military applications, the consequences could be disastrous. In general, when the testing distribution differs from the training distribution, AI systems might not only perform in unexpected ways but may also report that they have been functioning without problems. This makes the diagnosis of faults in the AI system problematic.

2) Adversarial Inputs: Attempts to fool AI systems date back at least 15 years to the early days of spam filters. Those wanting to send bulk unsolicited email started to find ways to avoid the linear classifiers used to filter them out. Since the resurgence of deep neural networks, adversarial techniques have become of increasing interest and importance. It is now well known that computer vision systems can be fooled into making wildly inaccurate classifications if given a suitably perturbed image [17]. This failure of AI is in fact caused by an unanticipated distributional change in the input that was not captured in the training set. What sets it apart is that the difference has been carefully crafted by a malicious adversary to make the AI system give an incorrect response. However, these challenges are not limited to AI-based computer vision systems. Every year more classes of inputs, including malware detection [18] and natural language texts [19], are shown to be susceptible to adversarial attacks.

3) Maintaining Explainability: If the input distribution is very different from the training data, the AI system will make decisions which may be unexpected or undesired. Preliminary work now exists that tries to use AI explainability to counter adversarial attacks, and this remains a promising research avenue [19], [20]. The importance of explainability and comprehensibility is set out below.

F. Transparency

The DIN quality pillars introduce the term comprehensibility, which measures the degree to which a stakeholder with defined needs can comprehend the reasons for an AI component's outputs. This is synonymous with explainability. There are wider concerns than explainability relating to transparency. In order for a system to be transparent it is necessary to understand the provenance and quality of input data, including labels, annotations and rules.

Kuwajima & Ishikawa [7], when considering quality models for AI systems, identify transparency as comprising traceability, explainability and communication. This again has a focus on explainability, but includes the ability to communicate the information. Creel [21] identifies transparency in three different forms:
• Functional transparency. Knowledge of the whole operation of the system.
• Structural transparency. Knowledge of the implementation.
• Run transparency. Knowledge of how the system was run in a particular instance.

Functional transparency implies that a human is able to understand the system sufficiently well to accurately predict how an algorithm will behave given particular inputs. This is clearly possible with traditional systems, but with more complex algorithms comprising multiple layers of neural networks and other AI components, it can become nearly impossible.

Structural transparency becomes more important the less it is possible to gain functional transparency. This may be understanding the implementation in code, or could be documentation of the provenance of training data, and the statistical analysis done on that data to reduce concerns about accuracy and bias.

Run transparency is the same as post-hoc explainability. Explainability can be ex-ante (before the system has made a decision) or ex-post (after a system has made a decision). Ex-ante techniques include exploring a dataset to understand and analyse its various characteristics. This can lead to the identification of class imbalances that heavily influence the system's behaviour [22]. In this context explainability is not a characteristic of the system at all, but a process that is undertaken. Similarly, various mathematical techniques for explainability during the modelling process can be conducted ex-ante, but these merely provide insight into the nature of the system for its creators.

DeepAI [23] draw a distinction between a directly interpretable system that is intrinsically understood by users, and a system that would require an ex-post explanation for a user to understand a specific prediction. They draw a further distinction between a static explanation, and an interactive one that allows users to drill down and interrogate.

Explainability as a quality characteristic of a system applies only to ex-post explainability. This in itself can take the form of either internal or external users obtaining an explanation, and communicating it if necessary. Given that, regardless of the method used to obtain an explanation, it can be measured in terms of the availability of an explanation (µ), the accuracy of the explanation (α), and the time in which an operator is able to obtain and/or communicate the explanation (T):

explainability = f(µ, α, T)

Run transparency is a product quality characteristic that affects the usability of an AI system, while functional and structural transparency most affect maintainability.
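The function f is left unspecified above. One possible, deliberately simple instantiation is sketched below in Python, where availability gates the score, accuracy scales it, and time is penalised linearly up to an assumed acceptable maximum; the weighting scheme is an assumption, not something prescribed by this paper.

    # Illustrative sketch only: one possible instantiation of f(mu, alpha, T).
    def explainability_score(available, accuracy, seconds_to_obtain,
                             max_acceptable_seconds=60.0):
        # available:         mu    - whether an ex-post explanation exists
        # accuracy:          alpha - accuracy of the explanation, in [0, 1]
        # seconds_to_obtain: T     - time to obtain/communicate the explanation
        if not available:
            return 0.0
        timeliness = max(0.0, 1.0 - seconds_to_obtain / max_acceptable_seconds)
        return accuracy * timeliness

    # e.g. explainability_score(True, 0.9, 15.0) -> 0.675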
G. Context coverage, context completeness

The operational environments of traditional software systems are typically limited in range and context. The need to define and quantify the impact of the context is recognized in the ISO/IEC 25010 [5] QiU characteristic of context completeness:

"[...] degree to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, freedom from risk and satisfaction in all the intended contexts of use, or by the presence of product properties that support use in all the intended contexts of use."

Dynamic AI systems are expected to maintain performance when faced with previously unseen data. This objective is captured by the ISO/IEC 25010 [5] QiU characteristic of context coverage:

"[...] degree to which a product or system can be used with effectiveness, efficiency, freedom from risk and satisfaction in both specified contexts of use and in contexts beyond those initially explicitly identified." [emphasis added]

Context coverage can be expressed simply as the ability to maintain stability and effectiveness with previously unseen datasets or inputs. Stability can be bought at the cost of statistical bias, by the use of relatively inflexible AI models. Such procedures will lead to under-fitting, where predictions are stable but not accurate enough. By contrast, complex AI models with high-dimensional parameter sets and a low level of statistical bias may be affected by minor features of the training data, which can lead to "over-fitting" and high variance. For supervised learning, a model that is trained to give very accurate results for one dataset will achieve less accurate predictions with new input. There is a direct trade-off between bias and variance [14].

To evaluate a traditional system's context coverage, it would be reasonable to use boundary value analysis or equivalence partitioning [15], perturb data inputs, and measure performance. However, this is based on the assumption that traditional programming works using explicit logic and will respond equivalently or proportionally to different inputs. This is not the case with many AI systems, which use statistical techniques to arrive at particular outputs. Deep learning in particular has a very large internal feature space which bears little resemblance to the input domain.

The degree to which a system is flexible given an unseen context requires decomposition of the context into a set of input values. For example, for an automated system which links location sensors on a user device with climate control in the user's residence, the inputs could be decomposed as follows:

Input                               | Data Structure
S. Climate Control Status           | boolean
R. Room Temperature                 | float
Ta. User Minutes Until Arrival      | float
Tb. Minutes To Reduce Temp          | float
D. User Heading To Room             | boolean
Dt. User Heading To Room: Duration  | float

We will assume the system has been trained on historical data which exhibit a certain data profile. In this hypothetical example, one can envisage multiple prediction algorithms being used to arrive at a boolean decision that determines whether the system should start cooling the room. Although consistent stability of the procedure in the context of extensive historical data provides some limited quality assurance, QiU context coverage is not guaranteed.

Although the use case is simple, it is clear that a number of the inputs are continuous variables, and as such could have arbitrarily high or low values. It is not possible to measure context completeness within an infinite input domain for an arbitrary AI predictor, even when a reasonable range of intended use is defined, for example with the temperature values restricted to the range (−100, +100), the boolean values to 0 or 1, and the remaining floats given predefined limits.

Given that context completeness is the degree to which the system maintains performance in an unseen context, it can be measured as the maximum root mean square error (RMSE) for a regression problem (or α and β for a classification problem, see Functional Correctness) that can be obtained by varying any input. Continuing to use Xi and Yi to represent ground truth and predicted values respectively, and with n representing the number of observations, RMSE can be defined as follows:

RMSE = sqrt( Σ (Xi − Yi)² / n )
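For the hypothetical climate-control example, the worst-case RMSE could be estimated by sampling the continuous inputs across their declared ranges and keeping the largest error observed, as sketched below. The predictor, the reference source of ground truth and the sampling budget are all assumptions for illustration, and boolean inputs would need to be enumerated separately.

    # Illustrative sketch only: worst-case RMSE over a bounded input domain.
    import math
    import random

    def rmse(truth, preds):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(truth, preds)) / len(truth))

    def worst_case_rmse(predict, ground_truth, input_ranges, batches=200, batch_size=50):
        # input_ranges: e.g. {"room_temperature": (-100.0, 100.0), ...}
        # predict / ground_truth: functions mapping an input dict to a value.
        worst = 0.0
        for _ in range(batches):
            samples = [{name: random.uniform(lo, hi)
                        for name, (lo, hi) in input_ranges.items()}
                       for _ in range(batch_size)]
            truth = [ground_truth(s) for s in samples]
            preds = [predict(s) for s in samples]
            worst = max(worst, rmse(truth, preds))
        return worst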
H. Societal and Ethical Risk Mitigation

AI systems usually intend to treat everyone differently; that is part of their purpose and one reason there is a risk of disparate impact [24]. Issues relating to bias in AI systems sometimes result from propagating existing unfairness in society, known as societal or historical bias [25], or can relate to unfair outcomes resulting exclusively from the system implementation. There are metrics that can be used to quantify unfairness [26], but the actual nature of the unfairness to be avoided is derived from the legal, ethical and cultural context of use.

Freedom from risk metrics [5] assess the degree to which the quality of the AI system mitigates potential risk to users, organisations and the wider community. Existing quality models focus on health, safety and the environment; however, for AI systems it is necessary to consider the wider risks to the rights and freedoms of members of society. The metrics for health and safety, for instance, tend to relate to reported injuries or hazards. In the context of disparate impact [24] such reporting may not be forthcoming without public investigations such as those by ProPublica [27].

Fairness is a concept that varies by culture and context; however, in the context of AI and automated decision-making, the most prevalent example is that an attribute of a person or object unfairly influences the output. Whilst there are many possible metrics for fairness [28], the most generic and flexible way to measure it is through counterfactual fairness [29], as it supports consideration of inputs that are unrelated to fairness in the data used for training. It is, however, unclear how this could be implemented on an existing system, or independently of explainability methods. Given an existing system, there is no difference between the metrics used to measure fairness and those outlined in the above discussion on bias. Nevertheless, fairness remains a recommended characteristic for a quality model, because the groups that are measured, and what is acceptable, may be different in the context of fairness.

IV. TRUSTWORTHINESS

The definitions of quality and trustworthiness can be considered different but related; however, there are different definitions of trustworthiness. The EU HLEG on AI defines [30] trustworthy AI as

"lawful (respecting all applicable laws and regulations), ethical (respecting ethical principles and values) and robust (both from a technical and social perspective)."

Competing views on this are emerging from the international standards community. The first is that trustworthiness in artificial intelligence is the quality of being dependable and reliable [10]. We can contrast that definition with quality, which is defined by ISO/IEC as conformance to specified requirements [5]. A difference between these definitions of trustworthiness and quality is the need for requirements to be specified by stakeholders, and their verifiability. Taking into account the EU view, there is a requirement for systems to deliver against unstated legal, ethical and social requirements as well as technical ones. A second definition is under development within ISO/IEC [31] that defines ICT trustworthiness as the

"demonstrable likelihood that the system performs according to designed behavior under a typical set of conditions as evidenced by its characteristics, such as safety, security, privacy, reliability and resilience."

This definition is based on the NIST Framework for Cyber-Physical Systems [32], and notably includes the word "designed", which implies the specification of requirements.

Verification is the process of confirmation, through the provision of objective evidence, that specified requirements have been fulfilled. So system owners can make a system trustworthy by specifying verifiable requirements, including consideration of legal, ethical and social issues. Engineers, be they developers or testers, can make a system trustworthy by delivering and verifying requirements, and in theory stakeholders then trust a system because it is objectively trustworthy. Given the broad scope of the verification, it is very likely that new techniques, business models and certification bodies will spring up in this area.

However, it is not clear that trustworthiness is an independent quality characteristic in its own right; rather, it appears to be a superset of a particular set of measurable quality characteristics. Garbuk [33] proposed that a functional characteristics vector could be composed of quality measurements, with appropriate weightings, and that this could be compared to standards for particular AI tasks. These standards would contain measurement methods, the minimum quantity of data involved in evaluation, and the minimum observed quality characteristics permitted for a specific task.
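A minimal sketch of that idea, assuming the measurements, weightings and per-task minima are all supplied as dictionaries keyed by characteristic, might look as follows; the structure and field names are assumptions rather than anything defined in [33].

    # Illustrative sketch only: weighted characteristics vector vs. task standard.
    def assess_against_task_standard(measurements, weights, minima):
        # measurements: observed values per characteristic (e.g. accuracy, DI)
        # weights:      weighting of each characteristic in the aggregate score
        # minima:       minimum permitted value per characteristic for the task
        aggregate = sum(weights[name] * measurements[name] for name in weights)
        compliant = all(measurements[name] >= minima[name] for name in minima)
        return aggregate, compliant

    # e.g. assess_against_task_standard(
    #          {"accuracy": 0.92, "explainability": 0.6},
    #          {"accuracy": 0.7,  "explainability": 0.3},
    #          {"accuracy": 0.9,  "explainability": 0.5})
    # returns approximately (0.824, True)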
V. PROPOSED CHARACTERISTICS NOT INCLUDED

A. Privacy

Privacy issues are far from unique to AI systems; they can relate to any system that processes personal data. There are numerous pieces of regulation that specifically cover algorithmic decision making [35], which is far more common and complex in the context of AI systems than in traditional systems. GDPR [36] is the most commonly cited², as it provides a right to request that a human makes a decision, where a system has made a decision that could have a substantive effect on a data subject.

² It is often said that GDPR [36] provides a right to an explanation of how an algorithm reached a particular decision. This is not the case. Whilst it is discussed in the recitals, it is not present in the articles of the regulation, as it was removed during the legislative process [37]. Nevertheless, explainable AI is a significant focus for industry, as it allows algorithms that are not well understood to be analysed in order to find metamorphic relationships between groups of system inputs and outputs.

Beyond the relevance of explainability, there are no obvious unique and novel quality characteristics of privacy relating to AI systems.

B. Collaborability

Some research [7] has suggested that collaborability should be included as an extension to the usability of an AI system. However, metrics are not proposed, and there is limited other literature that covers this topic.

VI. CONCLUSION

In this paper we have explored numerous aspects of quality for AI systems, their measurement, and their relationship with trustworthiness. This work is not exhaustive, due to the volume and diversity of use cases that AI is being applied to. It is notable that most of the measures proposed, with the exception of controllability and run transparency, are statistical metrics intended to operate across a group of outputs. This speaks to the statistical nature of sub-symbolic AI systems. Whilst most of the measures are not new, they are typically used by system developers and data scientists during the production of a system. These candidate metrics can also be used to evaluate the holistic quality of deployed systems, for which the evaluators may or may not have access to the logic, design, training data or parameters associated with a particular system.
REFERENCES

[1] Y. Zhuang, F. Wu, C. Chen, and Y. Pan, 'Challenges and opportunities: from big data to knowledge in AI 2.0', Frontiers Inf Technol Electronic Eng, vol. 18, no. 1, pp. 3–14, Jan. 2017, doi: 10.1631/FITEE.1601883.
[2] M. Ryan, 'In AI We Trust: Ethics, Artificial Intelligence, and Reliability', Sci Eng Ethics, Jun. 2020, doi: 10.1007/s11948-020-00228-y.
[3] F. Ishikawa and N. Yoshioka, 'How Do Engineers Perceive Difficulties in Engineering of Machine-Learning Systems? - Questionnaire Survey', in 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP), Montreal, QC, Canada, May 2019, doi: 10.1109/CESSER-IP.2019.00009.
[4] 'ISTQB Glossary'. https://glossary.istqb.org/en/search/oracle (accessed Sep. 23, 2020).
[5] ISO 25010. https://iso25000.com/index.php/en/iso-25000-standards/iso-25010 (accessed Aug. 20, 2020).
[6] 'ISO/IEC 25012:2008 - Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Data quality model'. https://www.iso.org/standard/35736.html (accessed Nov. 02, 2020).
[7] H. Kuwajima and F. Ishikawa, 'Adapting SQuaRE for Quality Assessment of Artificial Intelligence Systems', p. 6.
[8] IEC blog - IEC and ISO joint committee on AI adds focus on related data ecosystem. https://blog.iec.ch/2020/05/iec-and-iso-joint-committee-on-ai-adds-focus-on-related-data-ecosystem/ (accessed Aug. 20, 2020).
[9] DIN SPEC 92001-1 - 2019-04 - Beuth.de. https://www.beuth.de/en/technical-rule/din-spec-92001-1/303650673 (accessed Aug. 20, 2020).
[10] 'ISO/IEC TR 24028:2020 - Information technology — Artificial intelligence — Overview of trustworthiness in artificial intelligence'. https://www.iso.org/standard/77608.html.
[11] A. Banerjee, U. B. Chitnis, S. L. Jadhav, J. S. Bhawalkar, and S. Chaudhury, 'Hypothesis testing, type I and type II errors', Ind Psychiatry J, vol. 18, no. 2, pp. 127–131, 2009, doi: 10.4103/0972-6748.62274.
[12] P. Besse, E. del Barrio, P. Gordaliza, J.-M. Loubes, and L. Risser, 'A survey of bias in Machine Learning through the prism of Statistical Parity for the Adult Data Set', arXiv:2003.14263 [cs, stat], Apr. 2020. [Online]. Available: http://arxiv.org/abs/2003.14263.
[13] K. Yang and J. Stoyanovich, 'Measuring Fairness in Ranked Outputs', in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago IL USA, Jun. 2017, pp. 1–6, doi: 10.1145/3085504.3085526.
[14] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, vol. 103. New York, NY: Springer New York, 2013.
[15] S. C. Reid, 'An empirical analysis of equivalence partitioning, boundary value analysis and random testing', in Proceedings Fourth International Software Metrics Symposium, Albuquerque, NM, USA, 1997, pp. 64–73, doi: 10.1109/METRIC.1997.637166.
[16] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, 'Concrete problems in AI safety', arXiv preprint arXiv:1606.06565, 2016.
[17] A. Kurakin, I. Goodfellow, and S. Bengio, 'Adversarial machine learning at scale', arXiv preprint arXiv:1611.01236, 2016.
[18] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, 'Adversarial examples for malware detection', in European Symposium on Research in Computer Security, Springer, 2017, pp. 62–79.
[19] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, and K.-W. Chang, 'Generating natural language adversarial examples', in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2890–2896.
[20] N. Liu, H. Yang, and X. Hu, 'Adversarial detection with model interpretation', in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1803–1811.
[21] K. A. Creel, 'Transparency in Complex Computational Systems', Philosophy of Science, Apr. 2020, doi: 10.1086/709729.
[22] B. Khaleghi, 'The How of Explainable AI: Pre-modelling Explainability', Medium, Aug. 15, 2019. https://towardsdatascience.com/the-how-of-explainable-ai-pre-modelling-explainability-699150495fe4 (accessed Aug. 20, 2020).
[23] V. Arya et al., 'One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques', arXiv:1909.03012 [cs, stat], Sep. 2019. [Online]. Available: http://arxiv.org/abs/1909.03012.
[24] S. Barocas and A. D. Selbst, 'Big Data's Disparate Impact', SSRN Journal, 2016, doi: 10.2139/ssrn.2477899.
[25] H. Suresh and J. V. Guttag, 'A Framework for Understanding Unintended Consequences of Machine Learning', arXiv:1901.10002 [cs, stat], Feb. 2020. [Online]. Available: http://arxiv.org/abs/1901.10002.
[26] S. Verma and J. Rubin, 'Fairness definitions explained', in Proceedings of the International Workshop on Software Fairness - FairWare '18, Gothenburg, Sweden, 2018, pp. 1–7, doi: 10.1145/3194770.3194776.
[27] 'Machine Bias — ProPublica'. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed Sep. 23, 2020).
[28] S. Verma and J. Rubin, 'Fairness definitions explained', in Proceedings of the International Workshop on Software Fairness - FairWare '18, Gothenburg, Sweden, 2018, pp. 1–7, doi: 10.1145/3194770.3194776.
[29] M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva, 'Counterfactual Fairness', in Advances in Neural Information Processing Systems, 2017.
[30] Ethics Guidelines for Trustworthy AI, High-Level Expert Group on Artificial Intelligence, EU.
[31] 'ISO/IEC WD TS 24462', ISO. https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/07/88/78828.html (accessed Nov. 02, 2020).
[32] E. R. Griffor, C. Greer, D. A. Wollman, and M. J. Burns, 'Framework for cyber-physical systems: volume 2, working group reports', National Institute of Standards and Technology, Gaithersburg, MD, NIST SP 1500-202, Jun. 2017, doi: 10.6028/NIST.SP.1500-202.
[33] A. Kuleshov, 'Formalizing AI System Parameters in Standardization of AI', in 2018 International Conference on Artificial Intelligence Applications and Innovations (IC-AIAI), Nicosia, Cyprus, 2018, pp. 51–54, doi: 10.1109/IC-AIAI.2018.8674446.
[34] M. A. PK, M. R. Sheriff, and D. Chatterjee, 'Measure of quality of finite-dimensional linear systems: A frame-theoretic view', arXiv:1902.04548 [cs, math], Feb. 2019. [Online]. Available: http://arxiv.org/abs/1902.04548.
[35] A. Chaudhuri, A. L. Smith, A. Gardner, L. Gu, M. B. Salem, and M. Lévesque, 'Regulatory frameworks relating to data privacy and algorithmic decision making in the context of emerging standards on algorithmic bias', p. 6.
[36] European Parliament, Regulation (EU) 2016/679 of the European Parliament and of the Council on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 2016.
[37] S. Wachter, B. Mittelstadt, and L. Floridi, 'Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation' (December 28, 2016), International Data Privacy Law, 2017. Available at SSRN: https://ssrn.com/abstract=2903469 or http://dx.doi.org/10.2139/ssrn.2903469.
[38] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth, 'Fairness in Criminal Justice Risk Assessments: The State of the Art', p. 43.