Opening the Software Engineering Toolbox for the Assessment of Trustworthy AI

Mohit Kumar Ahuja¹, Mohamed-Bachir Belaid¹, Pierre Bernabé¹, Mathieu Collet¹, Arnaud Gotlieb¹, Chhagan Lal¹, Dusica Marijan¹, Sagar Sen¹, Aizaz Sharif¹, and Helge Spieker¹

¹ Simula Research Laboratory, Dept. of Validation Intelligence for Autonomous Software Systems, Oslo, Norway, {mohit, bachir, pierbernabe, mathieu, arnaud, chhagan, dusica, sagar, aizaz, helge}@simula.no. Funding: This work has received funding from the European Union under grant agreement no. 825619 (AI4EU), the EU landmark project to develop a European AI on-demand platform and ecosystem. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Trustworthiness is a central requirement for the acceptance and success of human-centered artificial intelligence (AI). To deem an AI system as trustworthy, it is crucial to assess its behaviour and characteristics against a gold standard of Trustworthy AI, consisting of guidelines, requirements, or merely expectations. While AI systems are highly complex, their implementations are still based on software. The software engineering community has a long-established toolbox for the assessment of software systems, especially in the context of software testing. In this paper, we argue for the application of software engineering and testing practices for the assessment of trustworthy AI. We make the connection between the seven key requirements as defined by the European Commission's AI high-level expert group and established procedures from software engineering, and we raise questions for future work.

1 INTRODUCTION

Artificial Intelligence (AI) has increasing relevance for many aspects of current and future everyday life. Many of these aspects interfere directly with the personal space of humans: their perception, their actions, and, more generally, their data, both online and offline. Due to this close integration, it is crucial to develop AI systems in a human-centered fashion such that they are trustworthy and can be accepted by providers, who develop and deploy the AI systems; users, who operate the AI systems; regulatory bodies, who oversee the usage and effects of the AI systems; and affected humans, who act in cooperation with or next to the AI systems, or whose data is subject to processing by the AI systems.

To arrive at a more specific definition of trustworthy AI, a high-level expert group (AI-HLEG) set up by the European Commission identified guidelines and requirements that an AI system needs to fulfill sufficiently to be regarded as trustworthy [12]. On the highest level, an AI system is deemed trustworthy if it behaves according to four ethical principles: respect for human autonomy, prevention of harm, fairness, and explicability [12, p. 12]. On a more technical level, requirements have been formulated that are supposed "to be continuously evaluated and addressed throughout the AI system's life cycle" [12, p. 15].

Having a definition of trustworthiness formulates a goal for the development of AI systems. The second step is to evaluate whether a system fulfills the definition sufficiently and can be deemed trustworthy. This evaluation should be transparent and accessible in its results, robust and reproducible, and as automated and generic as possible, to allow a low barrier for application to new AI systems. Since there is no single trustworthiness criterion, or even a single metric, one evaluation technique alone is not sufficient, and maybe not even possible. The trustworthiness assessment has to consist of multiple techniques, each appropriate for some of the requirements of trustworthy AI, and each robust and mature enough to be reliable.

Tools and techniques for the assessment of trustworthy AI can be taken from the established methods of software engineering research, especially the subarea of software testing. For 50 years, these communities have proposed methods for the realization and assessment of large-scale, complex software systems. While the criteria for trustworthy AI cover more than technical aspects, the AI system itself is still mostly a software system. Even though there are differences in their engineering, many software engineering principles apply to AI systems or are transferable to them [1, 5]. Recently, motivated by the breakthroughs of AI and especially deep learning, the software engineering community has increased its attention on machine learning, both as a tool within software engineering and as an area for the application of software engineering principles.
Throughout the remainder of this short paper, we argue to open the software engineering toolbox, with its wide range of methods, for the realization and assessment of trustworthy AI systems. Following the structure of the key requirements for trustworthy AI [12], we link existing techniques with the goals for the fulfillment of these requirements. It is important to note that, even though many methods are already available, the research on trustworthy AI is by far not complete. Our current toolbox provides a strong starting position, but it needs adjustments and further experience to be adapted to the specific characteristics of modern AI.

2 TRUSTWORTHY AI

This section gives an overview of approaches from software engineering and adjacent subfields that relate to the key requirements for Trustworthy AI from the HLEG's Ethics Guidelines. We aim to analyse how software engineering maps onto the requirements and to show examples of techniques and case studies that have already been explored. At the same time, this allows a discussion of existing techniques to identify where future research is required, or areas where the software engineering toolbox might be insufficient to properly address aspects of a given requirement.

2.1 Human agency and oversight

The first of the requirements is the necessity for the AI to support human autonomy and the option for the human to inspect and influence the AI's actions. Human agency directly affects the collaboration between AI and human, and to support this interaction, it is important to take appropriate design measures, such as ergonomic and accessible user interfaces (UI) and an excellent user experience (UX). Human oversight requires the inspection of the AI decision making, either by having interpretable models or by having access to design decision documents, source code, or data, depending on the level of expertise of the inspecting party.

It is also one of the main challenges in AI to find the right balance between enhancing human agency and preserving a degree of responsibility [11]. Some "black box" AI techniques prevent humans from understanding the underlying process and thus from exercising control. We believe that software engineering and testing frameworks can contribute to achieving a better degree of human understanding and control of AI techniques. Software testing techniques are often built around the goal of designing the simplest test cases that determine a system's quality. Having such tests for AI systems will improve the ability to understand the expected AI behaviour and deviations from it. While there is already work on applying and adapting current testing techniques to AI [29], future work is required to extend their capabilities and expressiveness for human oversight.

2.2 Technical robustness and safety

The technical robustness of AI systems is central to their reliability. While AI systems may perform well on their main performance metrics, e.g. classification accuracy, additional safety and robustness metrics, as well as resilience to attacks, often remain open challenges [20]. Of particular relevance are adversarial inputs, which are specially crafted to attack an AI system, for example, to cause misclassification or to extract internal information about training data.

These challenges have recently been picked up by the software testing community, and several testing techniques have been adapted towards the testing of AI systems, especially for deep learning. To highlight two techniques that have been successfully applied to the testing of deep learning, we briefly discuss differential and metamorphic testing (see Figure 1).

Figure 1. Schematic overview of two testing techniques for deep learning systems: (a) differential testing, where the same test inputs are fed to several systems and their outputs are compared to find fault-generating inputs; (b) metamorphic testing, where transformed test inputs are checked against a metamorphic relation to find fault-generating inputs.

In differential testing, a system is evaluated by comparing its behaviour against a set of reference implementations for the same task. For the same inputs, it is expected that all systems provide similar outputs; if a system diverges, this is an indicator of faulty behaviour. The advantage of differential testing is that specific test oracles for the inputs, i.e. the precise expected outputs, are not required, which allows an easier setup of the test cases, especially when defining the test oracles is too costly or complex. DeepXplore [21] first explored differential testing for deep learning. The paper proposes a controlled way to generate test inputs, similar to adversarial examples, that are likely to identify diverging behaviours, and it showed promising results on multiple datasets and models.
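As a minimal illustration of this idea, consider the following sketch (ours, not the DeepXplore implementation; the model objects and their predict method are hypothetical stand-ins for real systems under test):

```python
# Minimal differential-testing sketch: several models for the same task are
# run on the same inputs, and any disagreement is flagged as a potential
# fault -- no precise test oracle is needed.

def differential_test(models, test_inputs):
    """Return the inputs on which the models disagree, with their outputs."""
    divergent = []
    for x in test_inputs:
        predictions = [model.predict(x) for model in models]
        if len(set(predictions)) > 1:  # at least one model diverges
            divergent.append((x, predictions))
    return divergent

# Usage with toy stand-ins for real classifiers (any object with .predict works):
class ThresholdClassifier:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return "positive" if x >= self.threshold else "negative"

models = [ThresholdClassifier(0.5), ThresholdClassifier(0.5), ThresholdClassifier(0.7)]
print(differential_test(models, [0.1, 0.6, 0.9]))
# -> [(0.6, ['positive', 'positive', 'negative'])]
```

In practice, exact equality would be replaced by a task-appropriate similarity check, e.g. a tolerance on output probabilities.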
Metamorphic testing also alleviates the problem of defining precise test oracles. Here, new test cases are generated with the help of metamorphic relations. These relations describe a property of the behaviour, e.g. of the output, when a change in the input is made. For example, for an AI-based HR system that ranks the resumes of applicants, adding relevant keywords to a resume should improve its ranking, even though there is no precise definition of the final expected ranking. In the context of testing AI, metamorphic testing has been applied to testing autonomous driving systems [30], image classifiers [10], and ranking algorithms [19].
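The resume-ranking example can be phrased as an executable metamorphic test. The following sketch is illustrative; rank_score stands in for the AI system under test, and the only thing being checked is the relation between original and transformed input, not an absolute expected output:

```python
# Metamorphic-testing sketch for a resume-ranking system: the metamorphic
# relation states that adding relevant keywords to a resume must not lower
# its score, even though no exact "correct" score is known.

def add_relevant_keywords(resume, keywords):
    """Transform a source test input into a follow-up test input."""
    return resume + " " + " ".join(keywords)

def satisfies_relation(rank_score, resume, keywords):
    """Check the metamorphic relation for one resume."""
    return rank_score(add_relevant_keywords(resume, keywords)) >= rank_score(resume)

# Usage with a toy scoring function standing in for the system under test:
def toy_rank_score(resume):
    relevant = {"python", "testing", "ml"}
    return sum(word in relevant for word in resume.lower().split())

assert satisfies_relation(toy_rank_score, "Engineer with Python", ["testing", "ML"])
```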
2.3 Privacy and data governance

Privacy protection of individuals who contribute their personal data towards the development of AI is of paramount importance in human-centered AI. Any party that curates datasets needs to ensure that the data does not provide means of re-identifying individuals while, at the same time, being effective for predicting patterns of business or societal value. Secure data-intensive systems storing personal data typically contain identifying, quasi-identifying, non-identifying, and sensitive attributes about individuals.

Software tools such as ARX [22] can anonymize large datasets and perform re-identification risk analysis on them to quantify the risk of prosecutor, journalist, and marketer attacks before the data is used in AI. ARX can anonymize data based on criteria [8] such as k-anonymization (personal attributes are suppressed or generalized until each row is identical with at least k-1 other rows), l-diversity (entails reducing the granularity of the data), and t-closeness (a refined reduction of granularity that maintains an underlying data distribution).

However, in specific cases, quasi-identifying attributes such as the birth date of an individual are required to train AI models. Therefore, controlled fuzzification of quasi-identifying attributes [27] can minimize the risk of re-identification while maintaining underlying patterns of interest in the data. For instance, in cervical cancer screening, attributes such as birth date or screening exam date can be perturbed within certain bounds. This is primarily due to the fact that the human papillomavirus has an average latency period of 3 months. Therefore, database commands can fuzzify all dates to the 15th of a month (the middle) and move months by ±2 months without affecting disease progression patterns or increasing the risk of re-identification.
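A minimal sketch of such date fuzzification is given below. It implements the two perturbations just described, snapping the day to the 15th and shifting the month by a bounded random offset, and is only an illustration of the idea behind [27], not the tooling used in the cited work:

```python
# Sketch of controlled fuzzification of a quasi-identifying date attribute:
# snap the day to the middle of the month and shift the month by up to
# +/- 2 months, hindering re-identification while keeping coarse temporal
# patterns (such as disease progression) intact.
import random
from datetime import date

def fuzzify_date(d, max_month_shift=2):
    shift = random.randint(-max_month_shift, max_month_shift)
    months = d.year * 12 + (d.month - 1) + shift  # 0-based month arithmetic
    year, month = divmod(months, 12)              # handles year roll-over
    return date(year, month + 1, 15)              # day snapped to the 15th

print(fuzzify_date(date(1980, 1, 23)))  # somewhere in 1979-11-15 .. 1980-03-15
```

In a real pipeline, one would typically apply the same shift to all dates belonging to one individual, so that intervals such as the time between two screening exams are preserved.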
2.4 Transparency

The transparency of an AI system is closely related to its interpretability and explainability, but also to the documentation of its purpose and how it has been designed. An approach for transparency documentation is the concept of model cards [18], which aims to provide an accessible overview of a model for people of different expertise, including developers, testers, and technical end-users, similar to a package insert for pharmaceutical products.

Lower-level measures for transparency can be achieved via strict traceability and static analysis [26] to allow the documentation of system behaviour, e.g. in autonomous vehicles [3], in combination with requirements engineering [7]. These techniques allow higher transparency of the AI during development, evaluation, and certification tasks, where they serve mostly the technical needs for the development and integration of the AI component.

2.5 Diversity, non-discrimination and fairness

Adequate diversity in the data used to train AI systems is necessary to avoid discrimination and maintain fairness in human-centered AI. History has taught us that bias in using personal data has harmed several generations of ethnic minorities. Lundy Braun [4] reports the implications of biased data in spirometers, which measure a person's lung function after a forced exhale. The predicted values of a lung's forced vital capacity (in litres of air exhaled) have for over a century been lower for black people than for white people. One of the reasons was that the data was collected from black men working in cotton fields, where lint from the cotton severely damaged lung function. This has resulted in black people receiving very little help from medical insurance companies for several generations. Even today, race, and not socio-economic factors, is used as a parameter to predict lung capacity in spirometers used worldwide. This unfortunate trend continues in AI systems, where a recent study [15] shows that millions of black people are victims of biased decision making in health care systems.

Data needs to be carefully curated for training AI systems such that variations in human attributes, such as different ethnic groups, genders, ages, weights, heights, geographical areas, and medical histories, are taken into account for unbiased decision making. However, discovering if a data set satisfies all possible combinations of attributes is often computationally intractable. Combinatorial interaction testing (CIT) of software has been very effective, finding over 95% of all faults in a wide range of software systems using a very small set of tests covering all 2-wise/pairwise combinations of features [14]. CIT has been extended to verify if the data in a large relational database contains all pairwise interactions between attribute values of interest [24]. Verifying the presence of all pairwise interactions between human attribute values in a data set can clarify limitations or guarantee adequate diversity in human-centered AI systems, as sketched below.
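The verification step can be approximated in a few lines of code: for every pair of attributes, enumerate all value combinations and report those that never occur together. This sketch is a simplified illustration of pairwise coverage checking, not the system described in [24]:

```python
# Sketch: verify 2-wise (pairwise) coverage of attribute values in a data set.
# Reports every pair of attribute values that never occurs together, which
# indicates a diversity gap in the training data.
from itertools import combinations, product

def missing_pairwise(records, domains):
    """records: list of dicts; domains: dict mapping attribute -> set of values."""
    missing = []
    for a, b in combinations(domains, 2):
        seen = {(r[a], r[b]) for r in records}
        for pair in product(domains[a], domains[b]):
            if pair not in seen:
                missing.append(((a, b), pair))
    return missing

records = [
    {"gender": "f", "age_group": "18-35"},
    {"gender": "m", "age_group": "18-35"},
    {"gender": "f", "age_group": "36-65"},
]
domains = {"gender": {"f", "m"}, "age_group": {"18-35", "36-65"}}
print(missing_pairwise(records, domains))
# -> [(('gender', 'age_group'), ('m', '36-65'))] : this combination is absent
```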
The importance of fairness in software has received attention as a dedicated topic in software engineering research [28], with close connections to its assessment via software testing methodology [6].

2.6 Societal and environmental well-being

Human-centered AI systems need to benefit society and not cause harm. It is necessary to see an AI system not merely as a software system but as a socio-technical system, where the interaction between people and the system is used to evaluate its benefit. Learning from epidemiology, we can evaluate an AI system as if it were an intervention on the public. For instance, in [25], the authors visualize the paths a patient takes after different screening exams for cervical cancer. Similarly, there is a need to understand how decisions made by the AI system affect the decisions made by people and the paths they take in life. Are people making healthier life choices, being environmentally conscious, or giving a helping hand in society after an AI intervention? Evidence-based software engineering [13], inspired by epidemiology and clinical studies, presents numerous approaches to evaluate the impact of AI on people. These approaches include randomized controlled trials, observational studies, and focus group discussions, to name a few. All these approaches, however, require careful data collection after a target audience has been exposed to an AI system.

2.7 Accountability

In human-centered AI, access to the personal data used in AI systems should be controlled by its owner. The owner can give consent for use and take away access to personal data whenever he or she wants to. This implies that the AI system would need to be re-trained with or without a specific person's data. The proof of this operation should be made known to the owner to ensure accountability. The blockchain has the potential to facilitate the accountability of such transactions between a data owner and an AI system. The blockchain is a distributed ledger which was initially designed to record financial transactions. Numerous models of using the blockchain and smart contracts have now been proposed for data access control [9] and AI [23]. Tal Rapke [16] suggests that people own and access their health and life records on a decentralized blockchain that does not rely on a central storage facility. This would liberate organizations from the liability of storing personal data. The data would reside on the latest secure technology, and, using verifiable cryptography, the owners of the data would be empowered to decide who they share their data with.
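As a toy illustration of such an auditable trail, the sketch below hash-chains consent events so that a recorded history of grants, revocations, and re-training confirmations cannot be naively rewritten. A real deployment would rely on an actual blockchain or smart-contract platform [9, 23] rather than this in-memory stand-in:

```python
# Toy append-only "consent ledger": each event is chained to its predecessor
# via a hash, so naive tampering with past entries is detectable on verification.
import hashlib
import json

def _digest(record):
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

class ConsentLedger:
    def __init__(self):
        self.entries = []

    def append(self, owner, event):
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        record = {"owner": owner, "event": event, "prev": prev}
        record["hash"] = _digest(record)
        self.entries.append(record)

    def verify(self):
        prev = "0" * 64
        for e in self.entries:
            body = {k: e[k] for k in ("owner", "event", "prev")}
            if e["prev"] != prev or e["hash"] != _digest(body):
                return False
            prev = e["hash"]
        return True

ledger = ConsentLedger()
ledger.append("alice", "consent_granted")
ledger.append("alice", "consent_revoked")
ledger.append("provider", "model_retrained_without_alice_data")
assert ledger.verify()  # altering any past event would make this fail
```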
3 THE SOFTWARE ENGINEERING TOOLBOX

The discussion of the key requirements on trustworthy AI [12] shows that there are many challenges to be addressed, but also a set of methods available that can be embraced and extended. As a general approach towards these challenges, we propose to adopt three main considerations (see Figure 2).

Figure 2. Concepts from the software engineering toolbox for the preparation, monitoring, and evaluation of trustworthiness in AI projects, ranging from a high level of manual effort (questionnaires and self-assessment; requirements engineering and analysis; process control and monitoring via code and data reviews and manual inspection) to a high level of automation (software testing of trustworthy AI; scoring and labeling).

First, since the expectations on trustworthy AI cannot be presented as a strict set of guidelines and rules only, it is recommended to understand their impact on the AI that is being developed. Performing a thorough requirements analysis helps to gather these requirements in a systematic way [2] and to formalize the requirements' impact on the AI, including final acceptance criteria and whether they can be assessed automatically or require manual intervention. One method to guide the requirements analysis at this point could be to formulate checklists for each of the requirements, e.g. similar to the proposition for fairness in [17].

Second, the realization of a trustworthy AI should be continuously accompanied by regular monitoring instruments. The goal of this monitoring is to ensure awareness of trustworthiness measures during development. These monitoring instruments can include dedicated questions to consider during code and data reviews, as well as during retrospective meetings.

Third, automated testing should be used to allow automated, repeated, and comparable assessment of trustworthiness. Where possible, testable acceptance criteria should be defined, or test processes that can quantify the behaviour of the AI system. For example, the technical robustness of an AI system against adversarial inputs can be assessed through automatic techniques, as sketched below.
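To make the third consideration concrete, the sketch below phrases robustness as a testable acceptance criterion: the share of predictions that remain stable under small, bounded input perturbations must exceed a threshold. Random noise is only a weak stand-in for real adversarial attacks [20], and all names and thresholds here are illustrative:

```python
# Sketch of an automated, repeatable robustness acceptance test: the system
# passes if its prediction stays stable under bounded random perturbations
# for at least `threshold` of the test inputs.
import random

def robustness_score(predict, inputs, epsilon=0.05, trials=20, seed=0):
    rng = random.Random(seed)  # fixed seed -> reproducible, comparable runs
    stable = 0
    for x in inputs:
        baseline = predict(x)
        stable += all(
            predict([v + rng.uniform(-epsilon, epsilon) for v in x]) == baseline
            for _ in range(trials))
    return stable / len(inputs)

def robustness_acceptance_test(predict, inputs, threshold=0.9):
    score = robustness_score(predict, inputs)
    return score >= threshold, score

# Usage with a toy classifier over two-dimensional feature vectors:
toy_predict = lambda x: "A" if x[0] + x[1] > 1.0 else "B"
print(robustness_acceptance_test(toy_predict, [[0.2, 0.1], [0.9, 0.9], [1.0, 0.01]]))
# -> (False, 0.66...) : the input near the decision boundary is not robust
```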
Finally, in all cases, the qualitative and quantitative summary of the results, e.g. via a score or a badge that attests the quality of an AI system, provides valuable information to the different stakeholder groups, e.g. the providers, their customers, or the users. A common scoring scheme, similar to the maturity levels in engineering projects, could allow for comparability and accessibility of the results.

4 CONCLUSION

The realization of trustworthy AI systems is one of the big challenges for the success of ethical and human-centered AI. This has been acknowledged by both politics [12] and academia. For the implementation of trustworthiness principles, we argue for the adoption of methods and technologies from software engineering. Software engineering has a long-standing tradition in the principled construction of complex systems, and much of the fundamental work is already available, as shown throughout this paper.

Still, further work is necessary to cover all the requirements on trustworthy AI and to provide the tools and guidelines necessary for the widespread realization of trustworthy AI. Are the current software engineering tools sufficient to assess AI systems, or do we need to develop dedicated tools? How can we converge on a set of acceptance criteria for trustworthiness? How many of the requirements can effectively be assessed in a mostly automated way? What are the challenges for assessing trustworthy AI by non-specialists or external users? How can we present the results in an accessible way?

The software engineering community has already taken up the challenge of software engineering for AI/ML, but often with a focus on general system engineering, maintenance requirements, and general validation. However, as the requirements discussed in this paper have shown, the engineering efforts need to cast a wider net and address more concerns in the context of trustworthy AI. This will be an interdisciplinary challenge, and the software engineering toolbox can be of central relevance during its development.

REFERENCES

[1] Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann, 'Software Engineering for Machine Learning: A Case Study', in International Conference on Software Engineering: Software Engineering in Practice, (2019).
[2] Hrvoje Belani, Marin Vukovic, and Željka Car, 'Requirements Engineering Challenges in Building AI-Based Complex Systems', in International Requirements Engineering Conference Workshops, (2019).
[3] Markus Borg, Cristofer Englund, and Boris Duran, 'Traceability and deep learning: safety-critical systems with traces ending in deep neural networks', in Proc. of the Grand Challenges of Traceability: The Next Ten Years, (2017).
[4] Lundy Braun, Breathing Race into the Machine: The Surprising Career of the Spirometer from Plantation to Genetics, University of Minnesota Press, 2014.
[5] Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley, 'The ML test score: A rubric for ML production readiness and technical debt reduction', in IEEE International Conference on Big Data, (2017).
[6] Yuriy Brun and Alexandra Meliou, 'Software fairness', in ACM ESEC/FSE, pp. 754–759, (2018).
[7] L. M. Cysneiros, M. Raffi, and J. C. Sampaio do Prado Leite, 'Software transparency as a key requirement for self-driving cars', in International Requirements Engineering Conference (RE), (2018).
[8] Anil Prakash Dangi and Ravindar Mogili, 'Privacy preservation measure using t-closeness with combined l-diversity and k-anonymity', International Journal of Advanced Research in Computer Science and Electronics Engineering (IJARCSEE), 1(8), (2012).
[9] George Drosatos and Eleni Kaldoudi, 'Blockchain applications in the biomedical domain: A scoping review', Computational and Structural Biotechnology Journal, (2019).
[10] A. Dwarakanath, M. Ahuja, S. Sikand, R. M. Rao, R. P. Bose, N. Dubash, and S. Podder, 'Identifying implementation bugs in machine learning based image classifiers using metamorphic testing', in ACM International Symposium on Software Testing and Analysis (ISSTA), (2018).
[11] L. Floridi, J. Cowls, M. Beltrametti, R. Chatila, P. Chazerand, V. Dignum, C. Luetge, R. Madelin, U. Pagallo, F. Rossi, et al., 'AI4People: An ethical framework for a good AI society: Opportunities, risks, principles, and recommendations', Minds and Machines, 28(4), (2018).
[12] High-Level Expert Group on Artificial Intelligence, 'Ethics Guidelines for Trustworthy AI', Technical report, European Commission, (2019).
[13] Barbara A. Kitchenham, Tore Dybå, and Magne Jørgensen, 'Evidence-based software engineering', in International Conference on Software Engineering (ICSE), pp. 273–281, (2004).
[14] D. Richard Kuhn and Michael J. Reilly, 'An investigation of the applicability of design of experiments to software testing', in 27th Annual NASA Goddard/IEEE Software Engineering Workshop, (2002).
[15] Heidi Ledford, 'Millions of black people affected by racial bias in health-care algorithms', Nature, 574(7780), (2019).
[16] Laure A. Linn and Martha B. Koo, 'Blockchain for health data and its potential use in health IT and health care related research', in ONC/NIST Use of Blockchain for Healthcare and Research Workshop, (2016).
[17] Michael A. Madaio, Luke Stark, Jennifer Wortman Vaughan, and Hanna Wallach, 'Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI', in CHI, (2020).
[18] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru, 'Model cards for model reporting', in Conference on Fairness, Accountability, and Transparency (FAT*), (2019).
[19] Christian Murphy, Gail Kaiser, Lifeng Hu, and Leon Wu, 'Properties of Machine Learning Applications for Use in Metamorphic Testing', in 20th International Conference on Software Engineering and Knowledge Engineering (SEKE), (2008).
[20] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman, 'SoK: Security and Privacy in Machine Learning', in IEEE European Symposium on Security and Privacy (EuroS&P), (2018).
[21] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana, 'DeepXplore: Automated whitebox testing of deep learning systems', in Symposium on Operating Systems Principles (SOSP), (2017).
[22] Fabian Prasser and Florian Kohlmayer, 'Putting statistical disclosure control into practice: The ARX data anonymization tool', in Medical Data Privacy Handbook, (2015).
[23] Khaled Salah, M. Habib Ur Rehman, Nishara Nizamuddin, and Ala Al-Fuqaha, 'Blockchain for AI: Review and open research challenges', IEEE Access, 7, (2019).
[24] Sagar Sen, Dusica Marijan, Carlo Ieva, Astrid Grime, and Atle Sander, 'Modeling and verifying combinatorial interactions to test data intensive systems: Experience at the Norwegian Customs Directorate', IEEE Transactions on Reliability, 66(1), (2016).
[25] Sagar Sen, Manoel Horta Ribeiro, Racquel C. De Melo Minardi, Wagner Meira, and Mari Nygård, 'Portinari: A data exploration tool to personalize cervical cancer screening', in International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), (2017).
[26] Caterina Urban, 'Static analysis of data science software', in International Static Analysis Symposium (SAS), pp. 17–23, (2019).
[27] Giske Ursin, Sagar Sen, Jean-Marie Mottu, and Mari Nygård, 'Protecting privacy in large datasets: First we assess the risk; then we fuzzy the data', Cancer Epidemiology, Biomarkers & Prevention, 26(8), (2017).
[28] S. Verma and J. Rubin, 'Fairness definitions explained', in IEEE/ACM International Workshop on Software Fairness (FairWare), (2018).
[29] Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu, 'Machine Learning Testing: Survey, Landscapes and Horizons', IEEE Transactions on Software Engineering, (2020).
[30] Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid, 'DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems', in International Conference on Automated Software Engineering (ASE), (2018).