            Quality Characteristics of Artificially Intelligent Systems

                    Adam Leon Smith (Dragonfly)            Raphaël Clifford
                    Email: adam@wearedragonfly.co          Email: raphael@clifford.net

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

   Abstract—This paper explores the quality characteristics of systems using artificial intelligence components, referencing existing work in this area from research, the private sector and the international standards community. It proposes a new quality model that considers existing work and, where applicable, proposes measures that may be appropriate for each quality characteristic.

   Index Terms—artificial intelligence, machine learning, quality, systems engineering, quality management, testing

                          I. INTRODUCTION

   Artificial Intelligence (AI) can be loosely defined as applying acquired knowledge to make decisions, in contrast to using explicit logic. This presents both opportunities and problems [1] for the field of software and systems quality. One of the most challenging aspects is the probabilistic nature of sub-symbolic systems, and the difficulty of reproducing and explaining results. Another challenge is convincing stakeholders that AI systems can be trusted with important decisions; some researchers propose that what AI systems actually need to achieve is reliability [2], which is closely related to quality.
   Artificial intelligence can include symbolic rule-based expert knowledge systems and sub-symbolic (statistical) machine learning systems. Machine learning is the most common AI method, and it is difficult to specify its quality and to analyse how to test it. Research in Japan involving 278 machine learning engineers identified that the biggest new challenges they face when integrating machine learning are decision making with customers and testing/quality assurance. Further, they identify the lack of a test oracle¹ and imperfection as the top causes of this [3].
   Should a practitioner wish to define a strategy or approach to holistically evaluate the quality of an AI system, at present it requires review of many scholarly articles in order to identify the relevant properties. It is therefore important that a standard quality model for AI systems is developed to support practitioners.
   This paper reviews existing quality models in the context of AI, that is, acquiring knowledge, applying the knowledge and producing decisions. Robustness and context completeness are introduced as characteristics that relate to the input domain; bias, functional correctness, and ex-post explainability (run transparency) as relating to the output decision domain; and adaptability, transparency, and societal and ethical risk mitigation as non-functional characteristics. This paper gives examples of measures that can be used, but is not intended to be exhaustive on this matter.

   ¹ In software testing, a test oracle is a source to determine an expected result to compare with the actual result of the system under test [4].

                          II. EXISTING WORK

A. SQuaRE
   ISO/IEC 25010 [5] is a popular standard in software and systems quality management defining system and software quality models, alongside ISO/IEC 25012 [6], which defines a data quality model. ISO/IEC 25010 has been identified by practitioners [7] as requiring modification for AI systems. ISO/IEC have also commenced a project to create a standardised model [8] which, as an International Standard consistent with ISO/IEC 25010, will drive efficiency in industry. This is expected to be published in 2023.

B. DIN SPEC 92001-1
   The DIN SPEC 92001-1 [9] is a freely available standard published in April 2019 by the German standardisation body (DIN). It aims to provide an outline of AI lifecycle processes and quality requirements. It outlines three quality pillars: functionality and performance, robustness, and comprehensibility. This paper refers to functionality and performance as functional correctness and completeness in order to stay consistent with existing ISO standards [5]. Similarly, this paper refers to comprehensibility as transparency. Nevertheless, the scope of the three quality pillars covered in the DIN SPEC is also covered in this proposed model.

C. ISO/IEC Standards on Artificial Intelligence
   ISO/IEC are working on a large number of new standards relating to AI, including those that relate to quality, verification and validation. The first standard relevant to the quality topic is a technical report giving an overview of trustworthiness in AI [10]. This identifies vulnerabilities in AI systems including security and privacy threats, bias, unpredictability, opaqueness, and more. Some of these vulnerabilities map directly onto the requirements for a quality model for AI.

                    III. PROPOSED QUALITY MODEL

A. Model Types
   Quality cannot be quantified as a single metric. It requires the definition of characteristics and terminology
that can be used when specifying quality requirements, and evaluating them. ISO/IEC 25010 defines two models, a product quality model and a quality-in-use model. The former describes the characteristics of a static system, while the latter describes the characteristics of a system within a specified context of use. To draw an analogy, AI product quality (PQ) may be determined once at the point of release, whereas AI quality in use (QiU) can only be determined in each actual context of use. In traditional software systems, few characteristics overlap. Table 1, below, lists the quality characteristics discussed in this paper.

  Quality model in     Characteristic            Discussed
  ISO/IEC 25010                                  sub-characteristics
  Product quality      Functional suitability    Functional correctness
  Product quality      Functional suitability    Bias
  Product quality      Portability               Adaptability
  Product quality      Security                  Robustness to adversarial examples
  Product quality      Usability                 Run transparency
  Product quality      Usability                 Controllability
  Product quality      Maintainability           Functional transparency
  Quality in use       Context coverage          Context completeness
  Quality in use       Freedom from risk         Societal and ethical risk mitigation

Table 1 - New quality sub-characteristics to be discussed
B. Functional Suitability
   1) Functional Correctness: Whilst there are significant challenges relating to the verification of functional correctness and completeness, how to statistically analyse the results of the classification and regression systems common in AI is a mature topic. Measurement of type I (α) and type II (β) errors is one common approach [11] to presenting classifier results. For example, where H0 represents all negative predictions and H1 represents all type II errors (false negative predictions), the type II error rate can be calculated as:

                          β = H1 / H0
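A minimal sketch of this measurement follows. β implements the H1/H0 ratio above; the paper gives no formula for α, so the conventional false positive rate is assumed:

```python
import numpy as np

def classifier_error_rates(y_true: np.ndarray, y_pred: np.ndarray):
    """Type I (alpha) and type II (beta) error measures for a binary classifier.

    beta follows the ratio above: H1 (false negative predictions) divided by
    H0 (all negative predictions). alpha is assumed to be the conventional
    false positive rate, as no formula for it is given in the text."""
    neg_preds = np.sum(y_pred == 0)                      # H0
    false_negs = np.sum((y_pred == 0) & (y_true == 1))   # H1
    beta = false_negs / max(neg_preds, 1)
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    alpha = false_pos / max(np.sum(y_true == 0), 1)
    return float(alpha), float(beta)

# Hypothetical labels and predictions for eight transactions
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])
print(classifier_error_rates(y_true, y_pred))  # (0.25, 0.25)
```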
   Regression problems also have various metrics, of which one of the most common is Mean Absolute Error (MAE); this gives no indication of the directionality of the error, simply its scale. Where Yi represents the predicted values, Xi represents the ground truth, and n is the number of observations, the error can be expressed as:

                     MAE = (1/n) Σ |Xi − Yi|
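A direct transcription of the formula, with hypothetical values:

```python
import numpy as np

def mean_absolute_error(x_true: np.ndarray, y_pred: np.ndarray) -> float:
    """MAE = (1/n) * sum(|Xi - Yi|): average error magnitude, sign ignored."""
    return float(np.mean(np.abs(x_true - y_pred)))

# Hypothetical regression outputs
x_true = np.array([20.0, 22.5, 19.0])   # ground truth Xi
y_pred = np.array([21.0, 22.0, 17.5])   # predictions Yi
print(mean_absolute_error(x_true, y_pred))  # 1.0
```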
   2) Bias: Bias is a term that is frequently used differently by different stakeholders. It is common for data scientists to think of bias as a statistical property, which can be positive or negative in any given context, and for ethicists to think of bias as an unfair outcome. In the context of an overall AI system, bias is both: it is a property of data and of an algorithm. Bias also manifests as cognitive biases that exist on the development team, and societal biases that exist in historical datasets.
   Overall, bias in an AI system is a property of the system that results in different treatment for different people, objects or groups. In this context, it is an accuracy issue that exists in relation to the functional correctness and completeness of a system. Bias can be measured using MAE or α and β as described above, but in a way that filters out the results for a particular cohort of transactions that belong to a specific group. In this way the results can be compared with those for the general population to identify bias. Another approach is to use statistical parity [12]. This uses S as a variable that identifies the cohort under analysis, with S = 1 indicating membership of the relevant cohort:

                DI = P(Y = 1 | S = 0) / P(Y = 1 | S = 1)

Other metrics are required when assessing ranked outputs [13] or continuous variables, but the principle of comparing the group under analysis to the general population remains.
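As an illustration, a minimal sketch of the disparate impact ratio under these definitions (the variable names and example data are ours):

```python
import numpy as np

def disparate_impact(y_pred: np.ndarray, s: np.ndarray) -> float:
    """DI = P(Y=1 | S=0) / P(Y=1 | S=1).

    y_pred: binary decisions from the system (1 = favourable outcome).
    s:      cohort membership indicator (1 = cohort under analysis)."""
    p_out = np.mean(y_pred[s == 0])  # favourable-outcome rate outside the cohort
    p_in = np.mean(y_pred[s == 1])   # favourable-outcome rate inside the cohort
    return float(p_out / p_in)

# Hypothetical decisions for eight transactions; DI = 1 would indicate parity
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(disparate_impact(y_pred, s))  # 3.0: those outside the cohort favoured 3x more often
```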
   Bias belongs to both the product quality model (product bias) and the quality in use model (bias in use). This is because it is a property of a single system and the data inputs used in the production of that system, but it is also a property of the system in actual use, where the inputs may be very different.

C. Adaptability
   Adaptability is defined as a product quality characteristic in ISO/IEC 25010:
        degree to which a product or system can effectively and efficiently be adapted for different or evolving hardware, software or other operational or usage environments
and is part of the portability characteristic, which refers to the environment. This paper proposes that the definition of adaptability is extended. It is much more the case with AI systems that the data observed by the system can now be part of the environment, in real time with reinforcement learning, or as models are “retrained” with new datasets. This is starkly different to making a change to existing logic, as the model is completely re-baselined rather than incrementally changed, and the change may be interactive, dynamic, periodic or even in real time.
   Adaptability could be defined as the time taken for a system to start to react differently based on a change in observed data, or the ease with which it can be retrained.
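One way such a measure might be operationalised, as a sketch under our own assumptions (scikit-learn's LogisticRegression is used purely as a stand-in model, and the drifted dataset is synthetic): time the retraining step and record how many decisions on a fixed probe set change afterwards.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Original training data and a drifted dataset (features shifted by +2)
X_old = rng.normal(0, 1, (1000, 3)); y_old = (X_old[:, 0] > 0).astype(int)
X_new = X_old + 2.0;                 y_new = (X_new[:, 0] > 2.0).astype(int)

model = LogisticRegression().fit(X_old, y_old)
probe = rng.normal(1, 1, (200, 3))          # fixed probe inputs
before = model.predict(probe)

start = time.perf_counter()
model.fit(X_new, y_new)                     # full re-baseline on new data
retrain_seconds = time.perf_counter() - start
after = model.predict(probe)

# Two candidate adaptability measures: retraining cost and behavioural change
print(f"retraining time: {retrain_seconds:.3f}s")
print(f"share of probe decisions that changed: {np.mean(before != after):.2%}")
```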
D. Controllability
   The degree to which a system can be controlled is not a new concept [34], and is typically a functional aspect of a system; however, systems are increasingly able to operate without human intervention or control. Therefore, if human interaction becomes optional or impossible, it is important to consider how controllable an AI system is for its presumptive human operator. Controllability can be considered to be the ability to move a system from an arbitrary initial state into another desired state, by exercising a control function, within a certain number of steps and within the required time.
E. Robustness and adversarial examples
   The environment in which AI must operate may be subject to change through natural perturbations and volatility, drift in the characteristics of input data over time, or the malicious intention of an adversary. The term AI robustness attempts to capture the extent to which an AI system will safely maintain its intended behaviour in such scenarios. This is distinct from context completeness, which does not focus on unanticipated changes in input distributions. Robustness is, however, to some extent captured under the catch-all term context coverage. Ensuring robustness poses some of the most difficult and important questions in AI and raises a number of issues which we introduce below.
   1) Distributional Change: Perhaps the most common challenging issue in AI is how to maintain the desired behaviour of a system when the input distribution changes over time. If the test data has the same statistical properties as the training set then we can expect a well specified AI system to work correctly. However, when encountering new and unexpected input the situation can be much more difficult. As an example, in “Concrete Problems in AI Safety” [16] an AI robot cleaner is posited which was trained to clean an empty office. On encountering a pet dog for the first time it might attempt to wash it, giving unpredictable results. In more serious settings, such as when trading on the stock market or in military applications, the consequences could be disastrous. In general, when the testing distribution differs from the training distribution, AI systems might not only perform in unexpected ways but may also report that they have been functioning without problems. This makes the diagnosis of faults in the AI system problematic.
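A minimal sketch of one common way to surface such change (our choice of technique, not one prescribed here): a two-sample Kolmogorov-Smirnov test comparing a production feature against its training distribution.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

train_feature = rng.normal(0.0, 1.0, 5000)   # distribution seen in training
live_feature = rng.normal(0.4, 1.0, 1000)    # same feature observed in production

# Two-sample KS test: a small p-value suggests the input distribution has drifted
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"distribution shift detected (KS={stat:.3f}, p={p_value:.2e})")
else:
    print("no significant shift detected")
```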
   2) Adversarial Inputs: Attempts to fool AI systems date back at least 15 years to the early days of spam filters. Those wanting to send bulk unsolicited email started to find ways to evade the linear classifiers used to filter them out. Since the resurgence of deep neural networks, adversarial techniques have become of increasing interest and importance. It is now well known that computer vision systems can be fooled into making wildly inaccurate classifications if given a suitably perturbed image [17]. This failure of AI is in fact caused by an unanticipated distributional change in the input that was not captured in the training set. What sets this apart is that the difference has been carefully crafted by a malicious adversary to make the AI system give an incorrect response.
   However, these challenges are not limited to AI-based computer vision systems. Every year more and more classes of inputs, including malware detection [18] and natural language texts [19], are being shown to be susceptible to adversarial attacks.
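To make the idea concrete, a self-contained sketch (our own construction, in the spirit of fast-gradient-sign attacks [17]) perturbs an input against a simple logistic-regression classifier:

```python
import numpy as np

# A hypothetical trained linear classifier: p(y=1|x) = sigmoid(w.x + b)
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def predict_proba(x: np.ndarray) -> float:
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([0.8, -0.5, 0.2])          # correctly classified as positive
print(f"before: p={predict_proba(x):.3f}")   # ~0.92

# Fast-gradient-sign-style perturbation: for a linear model the gradient of
# the score w.r.t. the input is just w, so step against the positive class.
eps = 0.8
x_adv = x - eps * np.sign(w)
print(f"after:  p={predict_proba(x_adv):.3f}  (perturbation max-norm {eps})")  # ~0.31
```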
   3) Maintaining Explainability: If the input distribution is very different from the training data, the AI system will make decisions which may be unexpected or undesired. Preliminary work now exists to try to use AI explainability to counter adversarial attacks, and this remains a promising research avenue [19], [20]. The importance of explainability and comprehensibility is set out below.

F. Transparency
   The DIN quality pillars introduce the term comprehensibility, which measures the degree to which a stakeholder with defined needs can comprehend the reasons for an AI component's outputs. This is synonymous with explainability. There are wider concerns than explainability relating to transparency. In order for a system to be transparent it is necessary to understand the provenance and quality of input data, including labels, annotations and rules.
   Kuwajima & Ishikawa [7], when considering quality models for AI systems, identify transparency as comprising traceability, explainability and communication. This again has a focus on explainability, but includes the ability to communicate the information. Creel [21] identifies transparency in three different forms:
   • Functional transparency. Knowledge of the whole operation of the system.
   • Structural transparency. Knowledge of the implementation.
   • Run transparency. Knowledge of how the system was run in a particular instance.
   Functional transparency implies that a human is able to understand the system sufficiently well to accurately predict how an algorithm will behave given particular inputs. This is clearly possible with traditional systems, but with more complex algorithms comprising multiple layers of neural networks and other AI components, it can become nearly impossible.
   Structural transparency becomes more important the less it is possible to gain functional transparency. This may mean understanding the implementation in code, or could be documentation of the provenance of training data and the statistical analysis done on that data to reduce concerns about accuracy and bias.
   Run transparency is the same as post-hoc explainability. Explainability can be ex-ante (before the system has made a decision) or ex-post (after a system has made a decision). Ex-ante techniques include exploring a dataset to understand and analyse its various characteristics. This can lead to the identification of class imbalances that heavily influence the system's behaviour [22]. In this context explainability is not a characteristic of the system at all, but a process that is undertaken. Similarly, various mathematical techniques for explainability during the modelling process can be conducted ex-ante, but these merely provide insight into the nature of the system for its creators.
   DeepAI [23] draw a distinction between a directly interpretable system that is intrinsically understood by users, and a system that would require an ex-post explanation for a specific prediction to be understood. It draws a further distinction between a static explanation and an interactive one that allows users to drill down and interrogate.
   Explainability as a quality characteristic of a system applies only to ex-post explainability. This in itself can take the form
of either internal or external users obtaining an explanation, and communicating it if necessary. Given that, regardless of the method used to obtain an explanation, it can be measured in terms of the availability of an explanation (µ), the accuracy of the explanation (α), and the time in which an operator is able to obtain and/or communicate the explanation (T):

                     explainability = f(µ, α, T)
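The function f is left unspecified above; a minimal sketch of one possible instantiation follows, in which the multiplicative weighting and the normalising time budget are entirely our own assumptions:

```python
def explainability(mu: float, alpha: float, t_seconds: float,
                   t_budget: float = 60.0) -> float:
    """One hypothetical f(mu, alpha, T) in [0, 1].

    mu:        availability of an explanation, fraction of decisions in [0, 1]
    alpha:     accuracy of the explanation, in [0, 1]
    t_seconds: time for an operator to obtain/communicate the explanation
    t_budget:  assumed acceptable time budget (an illustrative choice)"""
    timeliness = min(1.0, t_budget / max(t_seconds, 1e-9))
    return mu * alpha * timeliness

# Explanations available for 90% of decisions, 80% judged accurate,
# obtained in 120s against a 60s budget:
print(explainability(0.9, 0.8, 120.0))  # 0.36
```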
   Run transparency is a product quality characteristic that affects the usability of an AI system, while functional and structural transparency most affect its maintainability.
G. Context coverage, context completeness
   The operational environments of traditional software systems are typically limited in range and context. The need to define and quantify the impact of the context is recognised in the ISO/IEC 25010 [5] QiU characteristic of context completeness:
        [. . . ] degree to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, freedom from risk and satisfaction in all the intended contexts of use, or by the presence of product properties that support use in all the intended contexts of use.
Dynamic AI systems are expected to maintain performance when faced with previously unseen data. This objective is captured by the ISO/IEC 25010 [5] QiU characteristic of context coverage:
        [. . . ] degree to which a product or system can be used with effectiveness, efficiency, freedom from risk and satisfaction in both specified contexts of use and in contexts beyond those initially explicitly identified. [emphasis added]
   Context coverage can be expressed simply as the ability to maintain stability and effectiveness with previously unseen datasets or inputs. Stability can be bought at the cost of statistical bias, by the use of relatively inflexible AI models. Such procedures will lead to under-fitting, where predictions are stable but not accurate enough. By contrast, complex AI models with high-dimensional parameter sets and a low level of statistical bias may be affected by minor features of the training data, which can lead to “over-fitting” and high variance. For supervised learning, a model that is trained to give very accurate results for one dataset will achieve less accurate predictions with new input. There is a direct trade-off between bias and variance [14].
   To evaluate a traditional system's context coverage, it would be reasonable to use boundary value analysis or equivalence partitioning [15], perturb data inputs, and measure performance. However, this is based on the assumption that traditional programming works using explicit logic and will respond equivalently or proportionally to different inputs. This is not the case with many AI systems that use statistical techniques to arrive at particular outputs. Deep learning in particular has a very large internal feature space which bears little resemblance to the input domain.
   The degree to which a system is flexible given an unseen context requires decomposition of the context into a set of input values. For example, to envisage an automated system which links location sensors on a user device with climate control in the user's residence, the inputs could be decomposed as follows:

  Input                                     Data Structure
  S.  Climate Control Status                boolean
  R.  Room Temperature                      float
  Ta. User Minutes Until Arrival            float
  Tb. Minutes To Reduce Temp                float
  D.  User Heading To Room                  boolean
  Dt. User Heading To Room: Duration        float

We will assume the system has been trained on historical data which exhibit a certain data profile. In this hypothetical example, one can envisage multiple prediction algorithms being used to arrive at a boolean decision that determines whether the system should start cooling the room. Although consistent stability of the procedure in the context of extensive historical data provides some limited quality assurance, QiU context coverage is not guaranteed.
   Although the use case is simple, it is clear that a number of the inputs are continuous variables, and as such could have arbitrarily high or low values. It is not possible to measure context completeness within an infinite input domain for an arbitrary AI predictor, even when a reasonable range of intended use is defined, for example with the temperature values restricted to the range (−100, +100), the boolean values 0 to 1, and the arbitrary floats given predefined limits.
   Given that context completeness is the degree to which the system maintains performance in an unseen context, it can be measured as the maximum root mean square error (RMSE) for a regression problem (or α and β for a classification problem, see Functional Correctness) that can be obtained by varying any input.
   Continuing to use Xi and Yi to represent ground truth and predicted values respectively, and with n representing the number of observations, RMSE can be defined as follows:

                    RMSE = √( Σ (Xi − Yi)² / n )
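A sketch of how that worst-case measurement might look in practice; the stand-in predictor, the oracle, the input range and the sweep granularity are all illustrative assumptions:

```python
import numpy as np

def rmse(x_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((x_true - y_pred) ** 2)))

def predict_minutes_to_cool(room_temp: np.ndarray) -> np.ndarray:
    """Stand-in for a trained regressor over one input of the climate example."""
    return np.clip(0.8 * room_temp - 5.0, 0.0, None)

def oracle_minutes_to_cool(room_temp: np.ndarray) -> np.ndarray:
    """Stand-in ground truth for the same quantity."""
    return np.clip(0.75 * room_temp - 4.0, 0.0, None)

# Sweep the intended-use range of one continuous input and record the
# worst RMSE over local windows: a crude context-completeness measure.
worst = 0.0
for lo in range(-100, 100, 10):
    temps = np.linspace(lo, lo + 10, 50)
    worst = max(worst, rmse(oracle_minutes_to_cool(temps),
                            predict_minutes_to_cool(temps)))
print(f"max RMSE across the input range: {worst:.2f}")
```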
H. Societal and Ethical Risk Mitigation
   AI systems usually intend to treat everyone differently; that is part of their purpose, and one reason there is a risk of disparate impact [24]. Issues relating to bias in AI systems sometimes result from propagating existing unfairness in society, known as societal bias or historical bias [25], or can relate to unfair outcomes resulting exclusively from the system implementation. There are metrics that are used to quantify unfairness [26], but the actual nature of the unfairness that is to be avoided is derived from the legal, ethical and cultural context of use.
   Freedom from risk metrics [5] assess the degree to which the quality of the AI system mitigates potential risk to users, organisations and the wider community. Existing quality models focus on health, safety and the environment; however, for AI systems it is necessary to consider the wider risks to the rights
and freedoms of members of society. The metrics for health and safety, for instance, tend to relate to reported injuries or hazards. In the context of disparate impact [24], such reporting may not be forthcoming without public investigations such as those by ProPublica [27].
   Fairness is a concept that varies by culture and context; however, in the context of AI and automated decision-making, the most prevalent example is that an attribute of a person or object unfairly influences the output. Whilst there are many possible metrics for fairness [28], the most generic and flexible way to measure it is through counterfactual fairness [29], as it supports consideration of inputs in the training data that are unrelated to fairness; however, it is unclear how this could be implemented on an existing system, or independently of explainability methods. Given an existing system, the metrics used to measure fairness do not differ from those outlined in the above discussion on bias. Nevertheless, fairness remains a recommended characteristic for a quality model, because the groups that are measured, and what is acceptable, may be different in the context of fairness.
                       IV. TRUSTWORTHINESS
   The definitions of quality and trustworthiness can be considered different but related; however, there are different definitions of trustworthiness. The EU HLEG on AI defines [30] trustworthy AI as
        lawful (respecting all applicable laws and regulations), ethical (respecting ethical principles and values) and robust (both from a technical and social perspective).
Competing views on this are emerging from the international standards community. The first is that trustworthiness in artificial intelligence is the quality of being dependable and reliable [10]. We can contrast that definition with quality, which is defined by ISO/IEC as conformance to specified requirements [5]. A difference between these definitions of trustworthiness and quality is the need for requirements to be specified by stakeholders, and the verifiability of them. Taking into account the EU view, there is a requirement for systems to deliver against unstated legal, ethical and social requirements as well as technical ones. A second definition is under development within ISO/IEC [31] that defines ICT trustworthiness as a
        demonstrable likelihood that the system performs according to designed behavior under a typical set of conditions as evidenced by its characteristics, such as safety, security, privacy, reliability and resilience.
This definition is based on the NIST Framework for Cyber-Physical Systems [32], and notably includes the word “designed”, which implies the specification of requirements.
   Verification is the process of confirmation, through the provision of objective evidence, that specified requirements have been fulfilled. So system owners can make a system trustworthy by specifying verifiable requirements, including consideration for legal, ethical and social issues. Engineers, be they developers or testers, can make a system trustworthy by delivering and verifying requirements, and in theory, stakeholders then trust a system because it is objectively trustworthy. Given the broad scope of the verification, it is very likely that new techniques, business models and certification bodies will spring up in this area.
   However, it is not clear that trustworthiness is an independent quality characteristic in its own right; rather, it appears to be a superset of a particular set of measurable quality characteristics. Garbuk [33] proposed that a functional characteristics vector could be composed of quality measurements, with appropriate weightings, and that this could be compared to standards for particular AI tasks. These standards would contain measurement methods, the minimum quantity of data involved in evaluation, and the minimum observed quality characteristics permitted for a specific task.

           V. PROPOSED CHARACTERISTICS NOT INCLUDED
A. Privacy
   Privacy issues are far from unique to AI systems; they can relate to any system that processes personal data. There are numerous pieces of regulation that specifically cover algorithmic decision making [35], which is far more common and complex in the context of AI systems in comparison to traditional systems. GDPR [36] is the most commonly cited², as it provides a right to request that a human makes a decision, where a system has made a decision that could have a substantive effect on a data subject.
   Beyond the relevance of explainability, there are no obviously unique and novel quality characteristics of privacy relating to AI systems.

   ² It is often said that GDPR [36] provides a right to an explanation of how an algorithm reached a particular decision. This is not the case. Whilst it is discussed in the recitals, it is not present in the articles of the regulation, as it was removed during the legislative process [37]. Nevertheless, explainable AI is a significant focus for industry as it allows algorithms that are not well understood to be analysed in order to find metamorphic relationships between groups of system inputs and outputs.

B. Collaborability
   Some research [7] has suggested that collaborability should be included as an extension to the usability of an AI system. However, metrics are not proposed, and there is limited other literature that covers this topic.

                         VI. CONCLUSION
   In this paper we have explored numerous aspects of quality for AI systems, their measurements, and their relationship with trustworthiness. This work is not exhaustive, due to the volume and diversity of use cases that AI is being applied to. It is notable that most of the measures proposed, with the exception of controllability and run transparency, are statistical metrics intended to operate across a group of outputs. This speaks to the statistical nature of sub-symbolic AI systems. Whilst most of the measures are not new, they are typically used by system developers and data scientists during the production of a system. These candidate metrics can also be used to evaluate the holistic quality of deployed systems, for which
the evaluators may or may not have access to the logic, design,
training data or parameters associated with a particular system.
                          REFERENCES
 [1] Y. Zhuang, F. Wu, C. Chen, and Y. Pan, 'Challenges and opportunities: from big data to knowledge in AI 2.0', Frontiers Inf Technol Electronic Eng, vol. 18, no. 1, pp. 3–14, Jan. 2017, doi: 10.1631/FITEE.1601883.
 [2] M. Ryan, 'In AI We Trust: Ethics, Artificial Intelligence, and Reliability', Sci Eng Ethics, Jun. 2020, doi: 10.1007/s11948-020-00228-y.
 [3] F. Ishikawa and N. Yoshioka, 'How Do Engineers Perceive Difficulties in Engineering of Machine-Learning Systems? - Questionnaire Survey', in 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP), Montreal, QC, Canada, May 2019, doi: 10.1109/CESSER-IP.2019.00009.
 [4] 'ISTQB Glossary'. https://glossary.istqb.org/en/search/oracle (accessed Sep. 23, 2020).
 [5] ISO 25010. https://iso25000.com/index.php/en/iso-25000-standards/iso-25010 (accessed Aug. 20, 2020).
 [6] 'ISO/IEC 25012:2008 - Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Data quality model'. https://www.iso.org/standard/35736.html (accessed Nov. 02, 2020).
 [7] H. Kuwajima and F. Ishikawa, 'Adapting SQuaRE for Quality Assessment of Artificial Intelligence Systems', p. 6.
 [8] IEC blog - IEC and ISO joint committee on AI adds focus on related data ecosystem. https://blog.iec.ch/2020/05/iec-and-iso-joint-committee-on-ai-adds-focus-on-related-data-ecosystem/ (accessed Aug. 20, 2020).
 [9] DIN SPEC 92001-1 - 2019-04 - Beuth.de. https://www.beuth.de/en/technical-rule/din-spec-92001-1/303650673 (accessed Aug. 20, 2020).
[10] 'ISO/IEC TR 24028:2020 - Information technology — Artificial intelligence — Overview of trustworthiness in artificial intelligence'. https://www.iso.org/standard/77608.html.
[11] A. Banerjee, U. B. Chitnis, S. L. Jadhav, J. S. Bhawalkar, and S. Chaudhury, 'Hypothesis testing, type I and type II errors', Ind Psychiatry J, vol. 18, no. 2, pp. 127–131, 2009, doi: 10.4103/0972-6748.62274.
[12] P. Besse, E. del Barrio, P. Gordaliza, J.-M. Loubes, and L. Risser, 'A survey of bias in Machine Learning through the prism of Statistical Parity for the Adult Data Set', arXiv:2003.14263 [cs, stat], Apr. 2020. [Online]. Available: http://arxiv.org/abs/2003.14263.
[13] K. Yang and J. Stoyanovich, 'Measuring Fairness in Ranked Outputs', in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago IL USA, Jun. 2017, pp. 1–6, doi: 10.1145/3085504.3085526.
[14] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, vol. 103. New York, NY: Springer New York, 2013.
[15] S. C. Reid, 'An empirical analysis of equivalence partitioning, boundary value analysis and random testing', in Proceedings Fourth International Software Metrics Symposium, Albuquerque, NM, USA, 1997, pp. 64–73, doi: 10.1109/METRIC.1997.637166.
[16] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, 'Concrete problems in AI safety', arXiv preprint arXiv:1606.06565, 2016.
[17] A. Kurakin, I. Goodfellow, and S. Bengio, 'Adversarial machine learning at scale', arXiv preprint arXiv:1611.01236, 2016.
[18] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, 'Adversarial examples for malware detection', in European Symposium on Research in Computer Security, pp. 62–79. Springer, 2017.
[19] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, and K.-W. Chang, 'Generating natural language adversarial examples', in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2890–2896, 2018.
[20] N. Liu, H. Yang, and X. Hu, 'Adversarial detection with model interpretation', in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1803–1811, 2018.
[21] K. A. Creel, 'Transparency in Complex Computational Systems', Philosophy of Science, p. 709729, Apr. 2020, doi: 10.1086/709729.
[22] B. Khaleghi, 'The How of Explainable AI: Pre-modelling Explainability', Medium, Aug. 15, 2019. https://towardsdatascience.com/the-how-of-explainable-ai-pre-modelling-explainability-699150495fe4 (accessed Aug. 20, 2020).
[23] V. Arya et al., 'One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques', arXiv:1909.03012 [cs, stat], Sep. 2019. [Online]. Available: http://arxiv.org/abs/1909.03012.
[24] S. Barocas and A. D. Selbst, 'Big Data's Disparate Impact', SSRN Journal, 2016, doi: 10.2139/ssrn.2477899.
[25] H. Suresh and J. V. Guttag, 'A Framework for Understanding Unintended Consequences of Machine Learning', arXiv:1901.10002 [cs, stat], Feb. 2020. [Online]. Available: http://arxiv.org/abs/1901.10002.
[26] S. Verma and J. Rubin, 'Fairness definitions explained', in Proceedings of the International Workshop on Software Fairness - FairWare '18, Gothenburg, Sweden, 2018, pp. 1–7, doi: 10.1145/3194770.3194776.
[27] 'Machine Bias — ProPublica'. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed Sep. 23, 2020).
[28] S. Verma and J. Rubin, 'Fairness definitions explained', in Proceedings of the International Workshop on Software Fairness - FairWare '18, Gothenburg, Sweden, 2018, pp. 1–7, doi: 10.1145/3194770.3194776.
[29] M. J. Kusner, J. R. Loftus, C. Russell, and R. Silva, 'Counterfactual Fairness', in Advances in Neural Information Processing Systems, 2017.
[30] 'Ethics Guidelines for Trustworthy AI', High-Level Expert Group on Artificial Intelligence, EU.
[31] 'ISO/IEC WD TS 24462', ISO. https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/07/88/78828.html (accessed Nov. 02, 2020).
[32] E. R. Griffor, C. Greer, D. A. Wollman, and M. J. Burns, 'Framework for cyber-physical systems: volume 2, working group reports', National Institute of Standards and Technology, Gaithersburg, MD, NIST SP 1500-202, Jun. 2017, doi: 10.6028/NIST.SP.1500-202.
[33] A. Kuleshov, 'Formalizing AI System Parameters in Standardization of AI', in 2018 International Conference on Artificial Intelligence Applications and Innovations (IC-AIAI), Nicosia, Cyprus, 2018, pp. 51–54, doi: 10.1109/IC-AIAI.2018.8674446.
[34] M. A. PK, M. R. Sheriff, and D. Chatterjee, 'Measure of quality of finite-dimensional linear systems: A frame-theoretic view', arXiv:1902.04548 [cs, math], Feb. 2019. [Online]. Available: http://arxiv.org/abs/1902.04548.
[35] A. Chaudhuri, A. L. Smith, A. Gardner, L. Gu, M. B. Salem, and M. Lévesque, 'Regulatory frameworks relating to data privacy and algorithmic decision making in the context of emerging standards on algorithmic bias', p. 6.
[36] European Parliament, Regulation (EU) 2016/679 of the European Parliament and of the Council on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 2016.
[37] S. Wachter, B. Mittelstadt, and L. Floridi, 'Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation', International Data Privacy Law, 2017. Available at SSRN: https://ssrn.com/abstract=2903469 or http://dx.doi.org/10.2139/ssrn.2903469.
[38] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth, 'Fairness in Criminal Justice Risk Assessments: The State of the Art', p. 43.