Opening the Software Engineering Toolbox for the Assessment of Trustworthy AI

Mohit Kumar Ahuja¹, Mohamed-Bachir Belaid¹, Pierre Bernabé¹, Mathieu Collet¹, Arnaud Gotlieb¹, Chhagan Lal¹, Dusica Marijan¹, Sagar Sen¹, Aizaz Sharif¹, and Helge Spieker¹

¹ Simula Research Laboratory, Dept. of Validation Intelligence for Autonomous Software Systems, Oslo, Norway, {mohit, bachir, pierbernabe, mathieu, arnaud, chhagan, dusica, sagar, aizaz, helge}@simula.no. Funding: This work has received funding from the European Union under grant agreement no. 825619 (AI4EU), the EU landmark project to develop a European AI on-demand platform and ecosystem. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Trustworthiness is a central requirement for the acceptance and success of human-centered artificial intelligence (AI). To deem an AI system as trustworthy, it is crucial to assess its behaviour and characteristics against a gold standard of Trustworthy AI, consisting of guidelines, requirements, or merely expectations. While AI systems are highly complex, their implementations are still based on software. The software engineering community has a long-established toolbox for the assessment of software systems, especially in the context of software testing. In this paper, we argue for the application of software engineering and testing practices for the assessment of trustworthy AI. We make the connection between the seven key requirements as defined by the European Commission's AI high-level expert group and established procedures from software engineering, and we raise questions for future work.

1 INTRODUCTION

Artificial Intelligence (AI) has increasing relevance for many aspects of current and future everyday life. Many of these aspects interfere directly with the personal space of humans: their perception, their actions, and, more generally, their data, both online and offline. Due to this close integration, it is crucial to develop AI systems in a human-centered fashion such that they are trustworthy and can be accepted by providers, who develop and deploy the AI systems; users, who operate the AI systems; regulatory bodies, who oversee the usage and effects of the AI systems; and affected humans, who act in cooperation with or next to the AI systems, or whose data is subject to processing by the AI systems.

To arrive at a more specific definition of trustworthy AI, a high-level expert group (AI-HLEG) set up by the European Commission identified guidelines and requirements that an AI system needs to fulfill sufficiently to be regarded as trustworthy [12]. On the highest level, an AI system is deemed trustworthy if it behaves according to four ethical principles: respect for human autonomy, prevention of harm, fairness, and explicability [12, p. 12]. On a more technical level, requirements have been formulated that are supposed "to be continuously evaluated and addressed throughout the AI system's life cycle" [12, p. 15].

Having a definition of trustworthiness formulates a goal for the development of AI systems. The second step is to evaluate whether a system fulfills the definition sufficiently and can be deemed trustworthy. This evaluation should be transparent and accessible in its results, robust and reproducible, and as automated and generic as possible, to allow a low barrier for application to new AI systems. Since there is no single trustworthiness criterion, or even a single metric, one evaluation technique alone is not sufficient, and maybe not even possible. The trustworthiness assessment has to consist of multiple techniques, each appropriate for some of the requirements of trustworthy AI, and each robust and mature enough to be reliable.

Tools and techniques for the assessment of trustworthy AI can be taken from the established methods of software engineering research, especially the subarea of software testing. For 50 years, these communities have proposed methods for the realization and assessment of large-scale, complex software systems. While the criteria for trustworthy AI cover more than technical aspects, the AI system itself is still mostly a software system. Even though there are differences in their engineering, many software engineering principles apply to AI systems or are transferable to them [1, 5]. Recently, motivated by the breakthroughs of AI and especially deep learning, the software engineering community has increased its attention on machine learning, both as a tool within software engineering and as an area for the application of software engineering principles.
Throughout the remainder of this short paper, we argue to open the software engineering toolbox, with its wide range of methods, for the realization and assessment of trustworthy AI systems. Following the structure of the key requirements for trustworthy AI [12], we link existing techniques with the goals for the fulfillment of these requirements. It is important to note that, even though many methods are already available, the research on trustworthy AI is by far not complete. Our current toolbox provides a strong starting position, but it needs adjustments and further experience to be adapted to the specific characteristics of modern AI.

2 TRUSTWORTHY AI

This section gives an overview of approaches from software engineering and adjacent subfields that relate to the key requirements for Trustworthy AI from the HLEG's Ethics Guidelines. We aim to analyse how software engineering maps onto the requirements and to show examples of techniques and case studies that have already been explored. At the same time, this allows a discussion of existing techniques to identify where future research is required, or areas where the software engineering toolbox might be insufficient to properly address aspects of a given requirement.

2.1 Human agency and oversight

The first of the requirements is the necessity for the AI to support human autonomy and the option for the human to inspect and influence the AI's actions. Human agency directly affects the collaboration between AI and human, and to support this interaction, it is important to take appropriate design measures, such as ergonomic and accessible user interfaces (UI) and an excellent user experience (UX). Human oversight requires the inspection of the AI decision making, either by having interpretable models or by having access to design decision documents, source code, or data, depending on the level of expertise of the inspecting party.

It is also one of the main challenges in AI to find the right balance between enhancing human agency and preserving a degree of responsibility [11]. Some "black box" AI techniques prevent humans from understanding the underlying process and thus from exercising control. We believe that software engineering and testing frameworks can contribute to achieving a better degree of human understanding and control of AI techniques. Software testing techniques are often built around the goal of designing the simplest test cases that determine a system's quality. Having such tests for AI systems will improve the ability to understand the expected AI behaviour and deviations from it. While there is already work on applying and adapting current testing techniques to AI [29], future work is required to extend their capabilities and expressiveness for human oversight.

2.2 Technical robustness and safety

The technical robustness of AI systems is central to their reliability. While AI systems may perform well on their main performance metrics, e.g. classification accuracy, additional safety and robustness metrics, as well as resilience to attacks, often remain open challenges [20]. Of particular relevance are adversarial inputs, which are specially crafted to attack an AI system, for example, to cause misclassification or to extract internal information about training data.

These challenges have recently been picked up by the software testing community, and several testing techniques have been adapted towards the testing of AI systems, especially for deep learning. To highlight two techniques that have been successfully applied to the testing of deep learning, we briefly discuss differential and metamorphic testing (see Figure 1).

Figure 1. Schematic overview of two testing techniques for deep learning systems: (a) differential testing, where the same test inputs are fed to several systems and their outputs are compared to find fault-generating inputs; (b) metamorphic testing, where transformed test inputs are checked against a metamorphic relation to find fault-generating inputs.

In differential testing, a system is evaluated by comparing its behaviour against a set of reference implementations for the same task. For the same inputs, it is expected that all systems provide similar outputs; if a system diverges, this is an indicator of faulty behaviour. The advantage of differential testing is that specific test oracles for the inputs, i.e. the precise expected outputs, are not required, which allows an easier setup of the test cases, especially when defining the test oracles is too costly or complex. DeepXplore [21] first explored differential testing for deep learning. The paper proposes a controlled way to generate test inputs, similar to adversarial examples, that are likely to identify diverging behaviours, and it showed promising results on multiple datasets and models.
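As a minimal illustration of this idea, consider the following sketch (ours, not the DeepXplore implementation; the model objects and their predict method are hypothetical stand-ins for real systems under test):

```python
# Minimal differential-testing sketch: several models for the same task are
# run on the same inputs, and any disagreement is flagged as a potential
# fault -- no precise test oracle is needed.

def differential_test(models, test_inputs):
    """Return the inputs on which the models disagree, with their outputs."""
    divergent = []
    for x in test_inputs:
        predictions = [model.predict(x) for model in models]
        if len(set(predictions)) > 1:  # at least one model diverges
            divergent.append((x, predictions))
    return divergent

# Usage with toy stand-ins for real classifiers (any object with .predict works):
class ThresholdClassifier:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return "positive" if x >= self.threshold else "negative"

models = [ThresholdClassifier(0.5), ThresholdClassifier(0.5), ThresholdClassifier(0.7)]
print(differential_test(models, [0.1, 0.6, 0.9]))
# -> [(0.6, ['positive', 'positive', 'negative'])]
```

In practice, exact equality would be replaced by a task-appropriate similarity check, e.g. a tolerance on output probabilities.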
Metamorphic testing also alleviates the problem of defining precise test oracles. Here, new test cases are generated with the help of metamorphic relations. These relations describe a property of the behaviour, e.g. of the output, when a change in the input is made. For example, for an AI-based HR system that ranks the resumes of applicants, adding relevant keywords to a resume should improve its ranking, even though there is no precise definition of the final expected ranking. In the context of testing AI, metamorphic testing has been applied to testing autonomous driving systems [30], image classifiers [10], and ranking algorithms [19].
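The resume-ranking example can be phrased as an executable metamorphic test. The following sketch is illustrative; rank_score stands in for the AI system under test, and the only thing being checked is the relation between original and transformed input, not an absolute expected output:

```python
# Metamorphic-testing sketch for a resume-ranking system: the metamorphic
# relation states that adding relevant keywords to a resume must not lower
# its score, even though no exact "correct" score is known.

def add_relevant_keywords(resume, keywords):
    """Transform a source test input into a follow-up test input."""
    return resume + " " + " ".join(keywords)

def satisfies_relation(rank_score, resume, keywords):
    """Check the metamorphic relation for one resume."""
    return rank_score(add_relevant_keywords(resume, keywords)) >= rank_score(resume)

# Usage with a toy scoring function standing in for the system under test:
def toy_rank_score(resume):
    relevant = {"python", "testing", "ml"}
    return sum(word in relevant for word in resume.lower().split())

assert satisfies_relation(toy_rank_score, "Engineer with Python", ["testing", "ML"])
```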
2.3 Privacy and data governance

Privacy protection of individuals who contribute their personal data towards the development of AI is of paramount importance in human-centered AI. Any party that curates datasets needs to ensure that the data does not provide means of re-identifying individuals while, at the same time, being effective for predicting patterns of business or societal value. Secure data-intensive systems storing personal data typically contain identifying, quasi-identifying, non-identifying, and sensitive attributes about individuals.

Software tools such as ARX [22] can anonymize large datasets and perform re-identification risk analysis on them to quantify the risk of prosecutor, journalist, and marketer attacks before the data is used in AI. ARX can anonymize data based on criteria [8] such as k-anonymization (personal attributes are suppressed or generalized until each row is identical with at least k-1 other rows), l-diversity (entails reducing the granularity of the data), and t-closeness (a refined reduction of granularity that maintains an underlying data distribution).

However, in specific cases, quasi-identifying attributes such as the birth date of an individual are required to train AI models. Therefore, controlled fuzzification of quasi-identifying attributes [27] can minimize the risk of re-identification while maintaining underlying patterns of interest in the data. For instance, in cervical cancer screening, attributes such as birth date or screening exam date can be perturbed within certain bounds. This is primarily due to the fact that the human papillomavirus has an average latency period of 3 months. Therefore, database commands can fuzzify all dates to the 15th of a month (the middle) and move months by ±2 months without affecting disease progression patterns or increasing the risk of re-identification.
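A minimal sketch of such date fuzzification is given below. It implements the two perturbations just described, snapping the day to the 15th and shifting the month by a bounded random offset, and is only an illustration of the idea behind [27], not the tooling used in the cited work:

```python
# Sketch of controlled fuzzification of a quasi-identifying date attribute:
# snap the day to the middle of the month and shift the month by up to
# +/- 2 months, hindering re-identification while keeping coarse temporal
# patterns (such as disease progression) intact.
import random
from datetime import date

def fuzzify_date(d, max_month_shift=2):
    shift = random.randint(-max_month_shift, max_month_shift)
    months = d.year * 12 + (d.month - 1) + shift  # 0-based month arithmetic
    year, month = divmod(months, 12)              # handles year roll-over
    return date(year, month + 1, 15)              # day snapped to the 15th

print(fuzzify_date(date(1980, 1, 23)))  # somewhere in 1979-11-15 .. 1980-03-15
```

In a real pipeline, one would typically apply the same shift to all dates belonging to one individual, so that intervals such as the time between two screening exams are preserved.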
2.4 Transparency

The transparency of an AI system is closely related to its interpretability and explainability, but also to the documentation of its purpose and how it has been designed. An approach for transparency documentation is the concept of model cards [18], which aims to provide an accessible overview of a model for people of different expertise, including developers, testers, and technical end-users, similar to a package insert for pharmaceutical products.

Lower-level measures for transparency can be achieved via strict traceability and static analysis [26] to allow the documentation of system behaviour, e.g. in autonomous vehicles [3], in combination with requirements engineering [7]. These techniques allow higher transparency of the AI during development, evaluation, and certification tasks, where they serve mostly the technical needs for the development and integration of the AI component.

2.5 Diversity, non-discrimination and fairness

Adequate diversity in the data used to train AI systems is necessary to avoid discrimination and maintain fairness in human-centered AI. History has taught us that bias in using personal data has harmed several generations of ethnic minorities. Lundy Braun [4] reports the implications of biased data in spirometers, which measure a person's lung function after a forced exhale. The predicted values of a lung's forced vital capacity (in litres of air exhaled) have for over a century been lower for black people than for white people. One of the reasons was that the data was collected from black men working in cotton fields, where lint from the cotton severely damaged lung function. This has resulted in black people receiving very little help from medical insurance companies for several generations. Even today, race, and not socio-economic factors, is used as a parameter to predict lung capacity in spirometers used worldwide. This unfortunate trend continues in AI systems, where a recent study [15] shows that millions of black people are victims of biased decision making in health care systems.

Data needs to be carefully curated for training AI systems such that variations in human attributes, such as different ethnic groups, genders, ages, weights, heights, geographical areas, and medical histories, are taken into account for unbiased decision making. However, discovering if a data set satisfies all possible combinations of attributes is often computationally intractable. Combinatorial interaction testing (CIT) of software has been very effective, finding over 95% of all faults in a wide range of software systems using a very small set of tests covering all 2-wise/pairwise combinations of features [14]. CIT has been extended to verify if the data in a large relational database contains all pairwise interactions between attribute values of interest [24]. Verifying the presence of all pairwise interactions between human attribute values in a data set can clarify limitations or guarantee adequate diversity in human-centered AI systems, as sketched below.
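The verification step can be approximated in a few lines of code: for every pair of attributes, enumerate all value combinations and report those that never occur together. This sketch is a simplified illustration of pairwise coverage checking, not the system described in [24]:

```python
# Sketch: verify 2-wise (pairwise) coverage of attribute values in a data set.
# Reports every pair of attribute values that never occurs together, which
# indicates a diversity gap in the training data.
from itertools import combinations, product

def missing_pairwise(records, domains):
    """records: list of dicts; domains: dict mapping attribute -> set of values."""
    missing = []
    for a, b in combinations(domains, 2):
        seen = {(r[a], r[b]) for r in records}
        for pair in product(domains[a], domains[b]):
            if pair not in seen:
                missing.append(((a, b), pair))
    return missing

records = [
    {"gender": "f", "age_group": "18-35"},
    {"gender": "m", "age_group": "18-35"},
    {"gender": "f", "age_group": "36-65"},
]
domains = {"gender": {"f", "m"}, "age_group": {"18-35", "36-65"}}
print(missing_pairwise(records, domains))
# -> [(('gender', 'age_group'), ('m', '36-65'))] : this combination is absent
```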
The importance of fairness in software has received attention as a dedicated topic in software engineering research [28], with close connections to its assessment via software testing methodology [6].

2.6 Societal and environmental well-being

Human-centered AI systems need to benefit society and not cause harm. It is necessary to see an AI system not merely as a software system but as a socio-technical system, where the interaction between people and the system is used to evaluate its benefit. Learning from epidemiology, we can evaluate an AI system as if it were an intervention on the public. For instance, in [25], the authors visualize the paths a patient takes after different screening exams for cervical cancer. Similarly, there is a need to understand how decisions made by the AI system affect the decisions made by people and the paths they take in life. Are people making healthier life choices, being environmentally conscious, or giving a helping hand in society after an AI intervention? Evidence-based software engineering [13], inspired by epidemiology and clinical studies, presents numerous approaches to evaluate the impact of AI on people. These approaches include randomized controlled trials, observational studies, and focus group discussions, to name a few. All these approaches, however, require careful data collection after a target audience has been exposed to an AI system.

2.7 Accountability

In human-centered AI, access to the personal data used in AI systems should be controlled by its owner. The owner can give consent for use and take away access to personal data whenever he or she wants to. This implies that the AI system would need to be re-trained with or without a specific person's data. The proof of this operation should be made known to the owner to ensure accountability. The blockchain has the potential to facilitate the accountability of such transactions between a data owner and an AI system. The blockchain is a distributed ledger which was initially designed to record financial transactions. Numerous models of using the blockchain and smart contracts have now been proposed for data access control [9] and AI [23]. Tal Rapke [16] suggests that people own and access their health and life records on a decentralized blockchain that does not rely on a central storage facility. This would liberate organizations from the liability of storing personal data. The data would reside on the latest secure technology, and, using verifiable cryptography, the owners of the data would be empowered to decide who they share their data with.
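As a toy illustration of such an auditable trail, the sketch below hash-chains consent events so that a recorded history of grants, revocations, and re-training confirmations cannot be naively rewritten. A real deployment would rely on an actual blockchain or smart-contract platform [9, 23] rather than this in-memory stand-in:

```python
# Toy append-only "consent ledger": each event is chained to its predecessor
# via a hash, so naive tampering with past entries is detectable on verification.
import hashlib
import json

def _digest(record):
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

class ConsentLedger:
    def __init__(self):
        self.entries = []

    def append(self, owner, event):
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        record = {"owner": owner, "event": event, "prev": prev}
        record["hash"] = _digest(record)
        self.entries.append(record)

    def verify(self):
        prev = "0" * 64
        for e in self.entries:
            body = {k: e[k] for k in ("owner", "event", "prev")}
            if e["prev"] != prev or e["hash"] != _digest(body):
                return False
            prev = e["hash"]
        return True

ledger = ConsentLedger()
ledger.append("alice", "consent_granted")
ledger.append("alice", "consent_revoked")
ledger.append("provider", "model_retrained_without_alice_data")
assert ledger.verify()  # altering any past event would make this fail
```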
3 THE SOFTWARE ENGINEERING TOOLBOX

The discussion of the key requirements on trustworthy AI [12] shows that there are many challenges to be addressed, but also a set of methods available that can be embraced and extended. As a general approach towards these challenges, we propose to adopt three main considerations (see Figure 2).

Figure 2. Concepts from the software engineering toolbox for the preparation, monitoring, and evaluation of trustworthiness in AI projects, ranging from a high level of manual effort (questionnaires and self-assessment; requirements engineering and analysis; process control and monitoring via code and data reviews and manual inspection) to a high level of automation (software testing of trustworthy AI; scoring and labeling).

First, since the expectations on trustworthy AI cannot be presented as a strict set of guidelines and rules only, it is recommended to understand their impact on the AI that is being developed. Performing a thorough requirements analysis helps to gather these requirements in a systematic way [2] and to formalize the requirements' impact on the AI, including final acceptance criteria and whether they can be assessed automatically or require manual intervention. One method to guide the requirements analysis at this point could be to formulate checklists for each of the requirements, e.g. similar to the proposition for fairness in [17].

Second, the realization of a trustworthy AI should be continuously accompanied by regular monitoring instruments. The goal of this monitoring is to ensure awareness of trustworthiness measures during development. These monitoring instruments can include dedicated questions to consider during code and data reviews, as well as during retrospective meetings.

Third, automated testing should be used to allow automated, repeated, and comparable assessment of trustworthiness. Where possible, testable acceptance criteria should be defined, or test processes that can quantify the behaviour of the AI system. For example, the technical robustness of an AI system against adversarial inputs can be assessed through automatic techniques, as sketched below.
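To make the third consideration concrete, the sketch below phrases robustness as a testable acceptance criterion: the share of predictions that remain stable under small, bounded input perturbations must exceed a threshold. Random noise is only a weak stand-in for real adversarial attacks [20], and all names and thresholds here are illustrative:

```python
# Sketch of an automated, repeatable robustness acceptance test: the system
# passes if its prediction stays stable under bounded random perturbations
# for at least `threshold` of the test inputs.
import random

def robustness_score(predict, inputs, epsilon=0.05, trials=20, seed=0):
    rng = random.Random(seed)  # fixed seed -> reproducible, comparable runs
    stable = 0
    for x in inputs:
        baseline = predict(x)
        stable += all(
            predict([v + rng.uniform(-epsilon, epsilon) for v in x]) == baseline
            for _ in range(trials))
    return stable / len(inputs)

def robustness_acceptance_test(predict, inputs, threshold=0.9):
    score = robustness_score(predict, inputs)
    return score >= threshold, score

# Usage with a toy classifier over two-dimensional feature vectors:
toy_predict = lambda x: "A" if x[0] + x[1] > 1.0 else "B"
print(robustness_acceptance_test(toy_predict, [[0.2, 0.1], [0.9, 0.9], [1.0, 0.01]]))
# -> (False, 0.66...) : the input near the decision boundary is not robust
```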
Finally, in all cases, the qualitative and quantitative summary of the results, e.g. via a score or a badge that attests the quality of an AI system, provides valuable information to the different stakeholder groups, e.g. the providers, their customers, or the users. A common scoring scheme, similar to the maturity levels in engineering projects, could allow for comparability and accessibility of the results.

4 CONCLUSION

The realization of trustworthy AI systems is one of the big challenges for the success of ethical and human-centered AI. This has been acknowledged by both politics [12] and academia. For the implementation of trustworthiness principles, we argue for the adoption of methods and technologies from software engineering. Software engineering has a long-standing tradition in the principled construction of complex systems, and much of the fundamental work is already available, as shown throughout this paper.

Still, further work is necessary to cover all the requirements on trustworthy AI and to provide the tools and guidelines necessary for the widespread realization of trustworthy AI. Are the current software engineering tools sufficient to assess AI systems, or do we need to develop dedicated tools? How can we converge on a set of acceptance criteria for trustworthiness? How many of the requirements can effectively be assessed in a mostly automated way? What are the challenges for assessing trustworthy AI by non-specialists or external users? How can we present the results in an accessible way?

The software engineering community has already taken up the challenge of software engineering for AI/ML, but often with a focus on general system engineering, maintenance requirements, and general validation. However, as the requirements discussed in this paper have shown, the engineering efforts need to cast a wider net and address more concerns in the context of trustworthy AI. This will be an interdisciplinary challenge, and the software engineering toolbox can be of central relevance during its development.

REFERENCES

[1] Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann, 'Software Engineering for Machine Learning: A Case Study', in International Conference on Software Engineering: Software Engineering in Practice, (2019).
[2] Hrvoje Belani, Marin Vukovic, and Željka Car, 'Requirements Engineering Challenges in Building AI-Based Complex Systems', in International Requirements Engineering Conference Workshops, (2019).
[3] Markus Borg, Cristofer Englund, and Boris Duran, 'Traceability and deep learning: safety-critical systems with traces ending in deep neural networks', in Proc. of the Grand Challenges of Traceability: The Next Ten Years, (2017).
[4] Lundy Braun, Breathing Race into the Machine: The Surprising Career of the Spirometer from Plantation to Genetics, University of Minnesota Press, 2014.
[5] Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley, 'The ML test score: A rubric for ML production readiness and technical debt reduction', in IEEE International Conference on Big Data, (2017).
[6] Yuriy Brun and Alexandra Meliou, 'Software fairness', in ACM ESEC/FSE, pp. 754–759, (2018).
[7] L. M. Cysneiros, M. Raffi, and J. C. Sampaio do Prado Leite, 'Software transparency as a key requirement for self-driving cars', in International Requirements Engineering Conference (RE), (2018).
[8] Anil Prakash Dangi and Ravindar Mogili, 'Privacy preservation measure using t-closeness with combined l-diversity and k-anonymity', International Journal of Advanced Research in Computer Science and Electronics Engineering (IJARCSEE), 1(8), (2012).
[9] George Drosatos and Eleni Kaldoudi, 'Blockchain applications in the biomedical domain: A scoping review', Computational and Structural Biotechnology Journal, (2019).
[10] A. Dwarakanath, M. Ahuja, S. Sikand, R. M. Rao, R. P. Bose, N. Dubash, and S. Podder, 'Identifying implementation bugs in machine learning based image classifiers using metamorphic testing', in ACM International Symposium on Software Testing and Analysis (ISSTA), (2018).
[11] L. Floridi, J. Cowls, M. Beltrametti, R. Chatila, P. Chazerand, V. Dignum, C. Luetge, R. Madelin, U. Pagallo, F. Rossi, et al., 'AI4People: An ethical framework for a good AI society: Opportunities, risks, principles, and recommendations', Minds and Machines, 28(4), (2018).
[12] High-Level Expert Group on Artificial Intelligence, 'Ethics Guidelines for Trustworthy AI', Technical report, European Commission, (2019).
[13] Barbara A. Kitchenham, Tore Dybå, and Magne Jørgensen, 'Evidence-based software engineering', in International Conference on Software Engineering (ICSE), pp. 273–281, (2004).
[14] D. Richard Kuhn and Michael J. Reilly, 'An investigation of the applicability of design of experiments to software testing', in 27th Annual NASA Goddard/IEEE Software Engineering Workshop, (2002).
[15] Heidi Ledford, 'Millions of black people affected by racial bias in health-care algorithms', Nature, 574(7780), (2019).
[16] Laure A. Linn and Martha B. Koo, 'Blockchain for health data and its potential use in health IT and health care related research', in ONC/NIST Use of Blockchain for Healthcare and Research Workshop, (2016).
[17] Michael A. Madaio, Luke Stark, Jennifer Wortman Vaughan, and Hanna Wallach, 'Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI', in CHI, (2020).
[18] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru, 'Model cards for model reporting', in Conference on Fairness, Accountability, and Transparency (FAT*), (2019).
[19] Christian Murphy, Gail Kaiser, Lifeng Hu, and Leon Wu, 'Properties of Machine Learning Applications for Use in Metamorphic Testing', in 20th International Conference on Software Engineering and Knowledge Engineering (SEKE), (2008).
[20] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman, 'SoK: Security and Privacy in Machine Learning', in IEEE European Symposium on Security and Privacy (EuroS&P), (2018).
[21] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana, 'DeepXplore: Automated whitebox testing of deep learning systems', in Symposium on Operating Systems Principles (SOSP), (2017).
[22] Fabian Prasser and Florian Kohlmayer, 'Putting statistical disclosure control into practice: The ARX data anonymization tool', in Medical Data Privacy Handbook, (2015).
[23] Khaled Salah, M. Habib Ur Rehman, Nishara Nizamuddin, and Ala Al-Fuqaha, 'Blockchain for AI: Review and open research challenges', IEEE Access, 7, (2019).
[24] Sagar Sen, Dusica Marijan, Carlo Ieva, Astrid Grime, and Atle Sander, 'Modeling and verifying combinatorial interactions to test data intensive systems: Experience at the Norwegian Customs Directorate', IEEE Transactions on Reliability, 66(1), (2016).
[25] Sagar Sen, Manoel Horta Ribeiro, Racquel C. De Melo Minardi, Wagner Meira, and Mari Nygård, 'Portinari: A data exploration tool to personalize cervical cancer screening', in International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), (2017).
[26] Caterina Urban, 'Static analysis of data science software', in International Static Analysis Symposium (SAS), pp. 17–23, (2019).
[27] Giske Ursin, Sagar Sen, Jean-Marie Mottu, and Mari Nygård, 'Protecting privacy in large datasets: First we assess the risk; then we fuzzy the data', Cancer Epidemiology, Biomarkers & Prevention, 26(8), (2017).
[28] S. Verma and J. Rubin, 'Fairness definitions explained', in IEEE/ACM International Workshop on Software Fairness (FairWare), (2018).
[29] Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu, 'Machine Learning Testing: Survey, Landscapes and Horizons', IEEE Transactions on Software Engineering, (2020).
[30] Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid, 'DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems', in International Conference on Automated Software Engineering (ASE), (2018).