AI-Blueprint for Deep Neural Networks

Ernest Wozniak1, Henrik Putzer1,2, Carmen Cârlan1
1 fortiss GmbH, Guerickestr. 25, 80807 Munich, Germany, lastname@fortiss.org
2 cogitron GmbH, Stefaniweg 4, 85652 Pliening, Germany, henrik.putzer@cogitron.de

Abstract

Development of trustworthy (e.g., safety and/or security critical) hardware/software-based systems needs to rely on well-defined process models. However, engineering trustworthy systems implemented with artificial intelligence (AI) is still poorly discussed. This is, to a large extent, due to the standpoint in which AI is a technique applied within software engineering. This work follows a different viewpoint, in which AI represents a 3rd kind of technology (next to software and hardware), with close connections to software. Consequently, the contribution of this paper is the presentation of a process model tailored to AI engineering. Its objective is to support the development of trustworthy systems for which parts of their safety and/or security critical functionality are implemented with AI. As such, it considers methods and metrics at different AI development phases that shall be used to achieve higher confidence in the satisfaction of trustworthiness properties of a developed system.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

A common deficiency of safety standards like ISO 26262 [ISO 26262:2018], in the automotive domain, is that they do not account explicitly for Artificial Intelligence (AI) technology [Putzer, H.J.]. However, contrary to older rumors, safety standards do not prohibit the use of AI; they just do not provide any guidelines on how to use this technology. Lately, AI has attracted high attention in the automotive, healthcare, and defense industries. This is due to the capabilities of AI during design (shorter time to market; implementation of implicit requirements) and during operation (improved performance). To achieve the vision in which AI not only supports, but also provides safety and/or security critical functionality, systems must be assured by evidence for trustworthy behavior of AI components.

There is a tendency to regard AI as software. It is suggested that the application of existing standards in their current form is adequate and that already available processes, methods and metrics specific to software components can be reused for AI. This work advocates a different approach, where AI is considered a 3rd kind of technology (next to software (SW) and hardware (HW)), which requires its own process model to ensure trustworthiness. This is because AI, in particular Machine Learning (ML), is a new, data-driven technology, which requires a specific engineering process with specifically tailored methods for assuring trustworthiness. Such a structured engineering process is introduced in this paper, which also discusses the integration of this process into the overall system lifecycle, as presented in the VDE-AR-E 2842-61 standard [VDE-AR-E 2842-61:2020].

This paper is structured as follows. The Foundations section presents the VDE-AR-E 2842-61 standard, which constitutes the main context for this work. The next section presents a generic process model called the AI-Blueprint, upon which process models tailored to specific AI techniques (e.g., Deep Neural Networks (DNNs), Reinforcement Learning) can be built. After that, a specific instance of the AI-Blueprint for DNNs is provided, with the follow-up section showing its application to the development of a CNN (Convolutional Neural Network)-based pedestrian detection component. Finally, related work is presented, i.e., process models for the development of AI components, especially in the context of the development of safety and security critical systems. In the last section, we draw conclusions and discuss future work.

Foundations

Fig. 1 presents the reference lifecycle defined in the VDE-AR-E 2842-61 standard. The standard will consist of six parts, of which three are already published. The reference lifecycle can be used as a reference for a process model that supports the development and assurance of a concrete trustworthy system. Trustworthiness is considered a generic concept that combines a user-defined and potentially project-specific set of aspects. These aspects include but are not limited to (functional) safety, security, privacy, usability, ethical and legal compliance, reliability, availability, maintainability, and (intended) functionality.

Fig. 1: Reference Lifecycle of VDE-AR-E 2842-61

The reference lifecycle defines the logical flow of assurance activities and is inspired by the structure of the ISO 26262 safety lifecycle. However, it is domain-independent. A detailed description of the phases can be found in [Putzer, H.J.]. In this work we focus only on the development phase dedicated to the Technological Element (see Fig. 1). The scope of this phase is to provide guidance for the implementation of elements based on a single technology (e.g., SW, HW, and especially AI). With a suitable process interface, when using the VDE-AR-E 2842-61 standard, these process models can be borrowed from other suitable standards. For example, in automotive, SW- or HW-based components can be developed following the V models defined in ISO 26262. However, there is no standardized process model that can be used for AI components. Consequently, the following two sections present the concept of the AI-Blueprint, which acts as a template for constructing process models for specific AI technologies, such as the one presented later, i.e., a process model for DNNs.

The AI-Blueprint

The development of AI components does not fit into existing process models (e.g., those for classical software) due to the specific nature of AI's data-driven implementation. Even inside the field of AI, different methodologies and solution concepts can have very specific requirements towards the underlying process model. This calls for a new approach in which the specific characteristics of a certain AI technology are targeted by a dedicated process model. In this paper, we introduce the concept of the AI-Blueprint. It is a template process that shall be refined for a specific AI technique. The AI-Blueprint is characterized by Input and Output Interfaces, Structure (development phases) and Qualifications (e.g., used for the first time, or proven, i.e., used with success in many projects). The execution of an instance of the AI-Blueprint provides an AI element characterized by a predefined quality level, including guarantees to meet defined trustworthiness requirements.

Fig. 2: AI-Blueprint for DNN

The development phase at system level provides as inputs to the AI-Blueprint the system and trustworthiness requirements, together with the desired Trustworthiness Performance Level (TPL). TPL is a risk classification scheme similar to the Automotive Safety Integrity Level (ASIL), with the exception that it concerns trustworthiness, not only safety. It governs the selection of qualitative methods and metrics (M&M-s), in a systematic approach, to achieve a certain TPL level.

The AI-Blueprint outputs the AI element and the value of the UCI (Uncertainty-related Confidence Indicator), which are provided back to the development at system level. The UCI is a quantitative indicator that describes the achieved confidence in the trustworthiness of the AI component, which can be combined (in a statistically principled way) to calculate the failure rate at the system level [Zhao, X.]. It represents a quantitative guarantee that a component can deliver as part of the trustworthiness contract established with the rest of a system. The desired value of the UCI is defined via the assigned TPL. Conceptually, the UCI can be compared to the idea of expressing random hardware failures (ISO 26262 part 5).

AI-Blueprint for DNN

In this section, we instantiate the concept of the AI-Blueprint for a certain technology, namely DNNs. For each phase in the process model, we discuss the objectives to be achieved and the methods and metrics (M&M-s) that can be used to ensure higher trustworthiness. A significant area of research is devoted to the identification of methods (apart from standard testing on validation and test sets) and metrics that can be used to reason about the trustworthiness of DNNs. Their usage (or not) is part of the systematic approach to achieve a certain TPL level. Additionally, a subset of them will contribute to the estimation of the UCI. This, however, requires the provision of a "bridge" between M&M-s and estimates of the UCI, similarly as it could possibly be done for SW (see [Rushby, J.]).

Fig. 3: Left DNN Blueprint Branch M&M-s for Trustworthiness Assurance

Fig. 4: Right DNN Blueprint Branch M&M-s for Trustworthiness Assurance

Fig. 3 and Fig. 4 present examples of M&M-s grouped along the development phases. Still an open problem are the requirements imposed on their usage, depending on the assigned TPL. For example, ISO 26262 part 6 provides a corresponding set of methods to assure confidence in the fulfillment of assurance objectives for a SW component. These methods (e.g., usage of strongly typed programming languages, formal verification, etc.), in the context of a particular ASIL level (A, B, C or D), are highly recommended, recommended, or have no recommendation for/against their usage. A similar set of rules shall also be worked out for this DNN blueprint. This is currently left for future work.

Initiation Phase

During this phase, the team that will develop the DNN component is assembled. Then, all requirements allocated to the DNN are collected and harmonized. These are product-related and trustworthiness requirements specified during the system-level development phase. Further, the acceptance criteria for the DNN are defined. M&M-s that can be used at this phase refer to requirements engineering.

Data Preparation Phase

The first objective of this phase is to derive data-related requirements from system-level requirements. The second objective is to gather proper data (according to the requirements) and to group it into training, validation, and test sets. The validation set is used during the training, with the purpose of assessing model convergence after each epoch. This set can be further used during the design phase to verify a design. The set of possible M&M-s that shall increase confidence in the trustworthiness of an AI element (i.e., a DNN) and which regard data preparation has been intensively researched. For instance, data shall account for corner cases or adversarial examples. Specific use-cases may be obtained through synthetic data generated using, for instance, Generative Adversarial Networks (GANs) [Esteban, C.]. Further, Variational Autoencoders (VAEs) may be applied to see whether data falls within the ODD (Operational Design Domain) [Vernekar, S.]. M&M-s can also be used to improve the quality of data labeling. For example, labeling inaccuracies may be circumvented by providing datasets labeled by different teams and/or technologies.

NN Design Phase

This phase outputs as its main artefact the DNN design. The main difference between the DNN design and the DNN model is that the latter also contains information about trained weights. Consequently, the main objectives of this phase are the specification of design-related requirements and the development of a model that satisfies them. The M&M-s at this phase shall contribute to the higher trustworthiness of the AI element by making the design robust to failures (e.g., usage of redundancy) or to noisy data (e.g., uncertainty calculation with MC-dropout), contributing to the generalization property (e.g., design guidelines to select an appropriate activation function), and to other non-functional properties which impact trustworthiness.

Implementation & Training Phase

This phase considers as input the DNN design and the training dataset. In order to assure higher confidence in the DNN training, the NN developer shall follow good coding practices (e.g., for high-criticality functionality, only strongly typed languages may have to be permitted). Other M&M-s can be used to optimize the training. These include but are not limited to: cropping, subsampling, scaling, etc. Higher trust can also be achieved by defining and following criteria for the training platform (e.g., level of coherence with the final execution platform). The output artefact of this phase is a DNN model, which is the main input to the verification and validation phases.

Training Verification Phase

This phase is part of the training procedure, with the main purpose of controlling and verifying it. Based on the validation dataset and predefined validation metrics, the evolving DNN model is verified after each epoch. A negative outcome of the verification may require changes in the training (e.g., resignation from subsampling), or may require changes of hyperparameters defined during the design phase.

Design Verification Phase

This phase aims at the verification of the DNN by investigating possible problems related to the NN design. The NN developer shall evaluate the result of the training, using the validation dataset, and assess possible problems that may have arisen due to bad design decisions (e.g., an inappropriate activation function). The NN developer may request follow-up iterations over the epochs or, if needed, the redesign of NN hyperparameters (return to the Design Phase).

NN Verification Phase

The IV&V (Independent Verification and Validation) engineer shall execute a set of tests in order to judge the success or failure of the NN with respect to generalization, brittleness, robustness or efficiency. The judgement should be primarily based on the measured accuracy of the NN over the test dataset and/or identified and/or generated adversarial examples and/or corner cases. Tests may also involve fault injection or endurance tests to measure the sustainability of an NN.

NN Validation, Deployment, and Release

During this final phase the IV&V engineer shall assess whether the AI element complies with all product and trustworthiness requirements. The NN shall then be integrated with hardware and/or software libraries in order to be deployed in the overall system. The resulting AI element shall be validated while running on the target platform. The final objective of this phase is to calculate the UCI in order to assess compliance of the AI element with the initially assigned TPL. If the obtained UCI value does not correspond to the TPL level, redesign decisions either at the AI level (e.g., design changes, collection of additional data) or at the system level (e.g., introduction of a redundant element to decrease the assigned TPL level) shall be planned and executed.

Practical Example

The objective of this section is to showcase the traversal of the proposed DNN process model for developing a Convolutional Neural Network (CNN) for pedestrian detection. The overall context for this use-case, i.e., the System of Interest (SoI), is a pedestrian collision avoidance system. This system entails several components, SW- or HW-based, among which there is the AI component with the main requirement to detect pedestrians (i.e., 2D bounding box detection of pedestrians) based on the analysis of video data acquired from a single camera.

Input from the System-level

From the system-level requirements, we derive AI functional requirements, such as: AIR01: relevant (defined via reachability zone) pedestrians (any person who is afoot or who is using a wheelchair or a means of conveyance propelled by human power other than a bicycle) are properly detected; and AIR02: the ODD is defined through the European roads. Further, we also derive trustworthiness AI requirements, purposed mainly to counteract identified hazards, e.g.: AITR01: the DNN shall output for each detected relevant pedestrian a bounding box with accurately estimated size and position in the velocity-dependent detection zone, in all situations the Ego Vehicle may encounter while being in the ODD; AITR02: pedestrians occluded up to 95% shall be properly identified; and AITR03: the DNN component shall not output false positives in the detection zone more than once in a sequence of 5 video frames. Next to these requirements, there is also the value of the TPL, which is assigned at the system level. On a scale from A to D (highest trustworthiness criticality level), in case there are no redundant components to the pedestrian detection component, the assigned TPL value is D, due to the high criticality of the functionality that it provides.

Example of Activities to fulfill Phases Objectives

This subsection presents examples of activities that can be executed over the different phases of the process model for a CNN in order to meet the objectives of the phases and provide as a final outcome an AI element together with a trustworthiness guarantee expressed by the value of the UCI.

During the Initiation Phase, system-level requirements are refined. In the example of pedestrian recognition, examples of refined product requirements are: AIR01 → AIR01.01: pedestrians of min. width (20 pixels) and min. height (20 pixels) shall be classified; AIR02 → AIR02.01: the ODD shall consider right-lane and left-lane traffic. Further, we refine the AI trustworthiness requirements as follows: AITR01 → AITR01.01: the value of mean average precision shall be greater than or equal to 97.9%; and AITR02 → AITR02.01: the data samples shall include examples with a sufficient range of levels of occlusion giving a partial view of pedestrians at crossings. Next, to fulfill further objectives of this phase, the team developing the AI element has to be assembled and, if necessary, the DNN blueprint needs to be adjusted to reflect further identified needs expressed by the team.

The first activity during the Data Preparation Phase is to look over the requirements and process those that impact the data gathering and labeling activities. For instance, AITR02 has an implicit impact on the data because, to properly train and test the model, the data shall contain pedestrians with different levels of occlusion (up to 95%). There could also exist explicitly expressed data-related concerns, such as AIR03: the data samples shall include a sufficient range of examples reflecting the effects of identified system failure modes; AIR04: the format of each data sample shall be representative of that which is captured using sensors deployed on the ego vehicle; or AIR05: the data samples shall include a sufficient range of pedestrians within the scope of the ODD. The data shall then be gathered, labeled and properly stored, according to the identified requirements.

The Design Phase shall first analyze requirements which may have implications for the CNN design. Examples of such requirements are: AIR06: the DNN shall be robust against all types of foreseeable noise; AIR07: a diagnosis function shall exist in order to detect distributional shift in the environment (out-of-ODD detection); or AIR08: plausibility checks of detected bounding boxes are necessary (e.g., pedestrians usually do not fly). The designer shall then specify the CNN in terms of the number of convolutional layers, kernel size, number of fully connected layers, neurons in each layer, loss function, and other hyperparameters, to provide a design that will best serve the intended purpose. The analyzed requirements shall also steer design activities. For instance, AIR06 requests robustness to various types of noise which can occur in the input data. The presence of noise may also be problematic even during the training, as it may lead to overfitting. Further, AIR07 may be handled either by introducing an additional component (in such a case the solution would affect the system level) based on a VAE, which can identify distributional shift, or the CNN itself may use MC-dropout to calculate uncertainties, where high uncertainty may result from out-of-ODD input. AIR08 may be accommodated by design through additional knowledge injected into the CNN (neural symbolic integration) that rejects labels that, based on human knowledge, make no sense.

The first activity of the Implementation and Training Phase is to provide the code for the CNN. The coding activity can follow standard SW development practices recommended in ISO 26262 part 6. However, certain differences, such as tooling, libraries or available programming languages (Python or preferably C++), create additional challenges for this activity. Next, the training platform needs to be selected with justification, and training-related parameters shall be provided, e.g., batch size = 1024, number of epochs = 4, number of iterations = 10, learning rate = 0.001 or decay factor = 0.9. Then, according to the predefined parameters, the training shall be assessed during the Training Verification Phase. The parameters may be tuned if necessary (e.g., the model does not converge), or early stopping may be triggered if the model has learned to extract all the meaningful relationships from the data before starting to model the noise.

Having the trained model, verification and validation activities can be performed. Verification shall primarily investigate key trustworthiness concerns which are specific to CNNs. These are robustness, brittleness, efficiency, ODD coverage, distributional shift, and unknown behavior in rare situations (corner cases or adversarial examples). Verification starts at the Design Verification Phase, in which the performed activities shall uncover possible problems regarding the mentioned properties in relation to the design. For instance, one could perform a design investigation to analyze neuron activations. This contributes to the explainability of how pedestrians are identified and may allow pruning those neurons which do not play any role in the decision process. Pruning may also be used to limit the number of neurons, to possibly eliminate problems of overfitting. The M&M-s that shall be applied can be classified as white-box, because they refer to internal characteristics of CNNs. NN Verification Phase activities shall also target verification of key trustworthiness concerns, however more from the grey/black-box perspective. Here not only the CNN itself is verified, but also the data. For example, if the CNN does not detect pedestrians in a wheelchair, most likely the CNN was never fed with such training examples. Explainability could be further enhanced by using attention-based methods. For example, heat maps (a grey-box method) may reveal those features of the image which are used to identify pedestrians. These may be different body parts, or maybe just vertical lines. The result highly depends on the level of feature annotation performed during the labeling. If it is not detailed enough, the NN tester may request extended feature annotation to be performed at the data preparation phase.

The validation is executed in the last phase, i.e., NN Validation, Deployment, and Release. Its main activity centers on the validation of the requirements provided as input to the initiation phase, and of their refined versions elaborated at that phase. For instance, to validate AIR01.01 one has to identify input images within the test set in which there are pedestrians with height and width close to 20 pixels and see whether these are properly detected. In case they are not, either the requirements shall be changed, or the training data shall be checked to identify whether enough samples were there to train the model for this requirement.

Output provided to the System-level

The artefact output by the DNN process model is a trained and verified CNN model for pedestrian detection. Next to it, the UCI value is computed, which accounts for the M&M-s used throughout the blueprint and their efficiency in minimizing the risks of possible hazards which may occur.

Related Work

The fact that there is a need for a dedicated process model for the development of AI components within safety and/or security critical systems was underlined more than 20 years ago by Rodvold, who proposed a formal development methodology and a validation technique for AI trustworthiness assurance [Rodvold, D.M.]. While the phases in the process model proposed by Rodvold resemble the phases of our proposed AI-Blueprint, Rodvold does not discuss the metrics and corresponding methods that can be used for the implementation and the verification of the considered trustworthiness requirements.

Microsoft presents a nine-stage ML workflow for the development of AI-based applications [Amershi, S.]. The workflow is claimed to be used by multiple teams inside Microsoft, for diverse applications, and to have been integrated into overall, preexisting agile software engineering processes. Amershi et al. categorize the workflow stages as data-oriented (e.g., collection, cleaning, and labeling) and model-oriented (e.g., model requirements, feature engineering, training, evaluation, deployment, and monitoring). While these stages are similar to the ones in our proposed AI-Blueprint, the Microsoft workflow does not consider activities specific to trustworthiness assurance, as their workflow is only intended to be used for the implementation of non-critical functionality.

Ashmore et al. present a process model for ML components in critical systems, consisting of four phases: Data Management, Model Learning, Model Verification and Model Deployment [Ashmore, R.]. For each phase in the model, they define the assurance-related desiderata (i.e., objectives) and discuss how state-of-the-art methods may contribute to the achievement of those desiderata. The work presented in this paper is complementary to the work of Ashmore et al. First, the AI-Blueprint for DNNs elaborates more on the validation and verification of AI components, having separate phases for design verification, verification of functional requirements and validation of the AI component w.r.t. trustworthiness requirements. Second, instead of proposing a general AI process model, we advocate the need for both a higher-level template for process models guiding the development of AI components (i.e., the AI-Blueprint) and more concrete process models for particular AI technologies (e.g., the AI-Blueprint for DNNs). Third, we discuss the process models in the context of the overall system lifecycle.

Toreini et al. examine the qualities technologies should have to support trust in AI-based systems, but from the perspective of social sciences [Toreini, E.]. They present an interesting, but abstract, machine learning pipeline, whose phases could be aligned to those in the AI-Blueprint for DNNs. Nevertheless, they do not offer a detailed description of each of the phases, nor do they deliberate on specific M&M-s and how these shall be used to increase the confidence in a trustworthy solution.

Conclusions and Future Work

This paper presented the concept of the AI-Blueprint and an example of how to use this blueprint for tailoring a process model for a certain AI technology (i.e., DNNs), with the scope of supporting trustworthiness assurance. We also discussed how the proposed AI-Blueprint fits into an overall framework for the development of trustworthy autonomous/cognitive systems, regulated in the upcoming VDE-AR-E 2842-61 standard. This work, however, is still in its early phases. As future work, recommendations for specific metrics and methods, advertised along the DNN blueprint, should be established, based on the TPL levels assigned to the AI element. Also, the current research status regarding the feasibility or performance of these methods should be investigated, to eliminate those which cannot be used while developing industry-size DNNs. Next, further research on how to calculate the value of the newly introduced UCI concept is necessary. Finally, having the concept of the AI-Blueprint, new blueprints, such as for reinforcement learning or neural symbolic integration, could be derived.

References

Putzer, H.J. and Wozniak, E., 2020. A Structured Approach to Trustworthy Autonomous/Cognitive Systems. arXiv preprint arXiv:2002.08210.
ISO 26262 Road vehicles – Functional safety, 2018.
VDE-AR-E 2842-61 – Design and Trustworthiness of autonomous/cognitive systems, 2020.
Esteban, C., Hyland, S.L. and Rätsch, G., 2017. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633.
Vernekar, S., Gaurav, A., Abdelzad, V., Denouden, T., Salay, R. and Czarnecki, K., 2019. Out-of-distribution detection in classifiers via generation. arXiv preprint arXiv:1910.04241.
Rodvold, D.M., 1999, July. A software development process model for artificial neural networks in critical applications. In IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No. 99CH36339) (Vol. 5, pp. 3317-3322). IEEE.
Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T., 2019, May. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 291-300). IEEE.
Ashmore, R., Calinescu, R. and Paterson, C., 2019. Assuring the machine learning lifecycle: Desiderata, methods, and challenges. arXiv preprint arXiv:1905.04223.
Toreini, E., Aitken, M., Coopamootoo, K., Elliott, K., Zelaya, C.G. and van Moorsel, A., 2020, January. The relationship between trust in AI and trustworthy machine learning technologies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 272-283).
Rushby, J., 2009, November. Software verification and system assurance. In 2009 Seventh IEEE International Conference on Software Engineering and Formal Methods (pp. 3-10). IEEE.
Zhao, X., Robu, V., Flynn, D., Salako, K. and Strigini, L., 2019, October. Assessing the safety and reliability of autonomous vehicles from road testing. In 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE) (pp. 13-23). IEEE.
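The check of an achieved UCI against the assigned TPL can be sketched as a simple acceptance gate. Note that the paper leaves the actual UCI calculation and the TPL-to-UCI mapping open; the threshold values below are purely hypothetical placeholders, not taken from VDE-AR-E 2842-61.

```python
# Hypothetical acceptance gate relating an achieved UCI value to the
# TPL assigned at system level. The thresholds are illustrative
# placeholders only; the real mapping is still open research.
REQUIRED_UCI = {"A": 0.90, "B": 0.95, "C": 0.99, "D": 0.999}

def uci_meets_tpl(uci: float, tpl: str) -> bool:
    """Return True if the achieved UCI satisfies the assigned TPL."""
    return uci >= REQUIRED_UCI[tpl]
```

If the gate fails, the process loops back into redesign decisions at the AI level or at the system level (e.g., adding redundancy to lower the assigned TPL).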
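The cross-labeling idea mentioned for data preparation (datasets labeled by different teams and/or technologies to surface labeling inaccuracies) can be sketched as a disagreement check. The annotation format (sample id mapped to a single class label) is a simplifying assumption for illustration.

```python
# Sketch: cross-checking labels produced by two independent labeling
# teams, one way to surface labeling inaccuracies. The annotation
# format (sample id -> class label) is an assumed simplification.
def label_disagreements(team_a: dict, team_b: dict) -> list:
    """Return ids labeled by both teams but with conflicting classes."""
    return sorted(k for k in team_a.keys() & team_b.keys()
                  if team_a[k] != team_b[k])

def disagreement_rate(team_a: dict, team_b: dict) -> float:
    """Fraction of commonly labeled samples on which the teams disagree."""
    common = team_a.keys() & team_b.keys()
    if not common:
        return 0.0
    return len(label_disagreements(team_a, team_b)) / len(common)
```

Samples with conflicting labels would then be routed to adjudication before entering the training, validation, or test sets.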
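The epoch-wise checking performed in the Training Verification Phase can be sketched as a monitor over the validation-metric history. The stagnation criterion (`patience` epochs without improvement) is one common choice, assumed here for illustration; the standardized phase would use the predefined validation metrics.

```python
# Minimal sketch of the Training Verification Phase: the evolving
# model is checked after each epoch against a validation metric, and
# a negative verification outcome is flagged when the metric has not
# improved for `patience` consecutive epochs (assumed criterion).
def verify_training(val_losses: list, patience: int = 3) -> dict:
    best, best_epoch, failed_at = float("inf"), -1, None
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience and failed_at is None:
            failed_at = epoch  # negative outcome: training needs changes
    return {"best_epoch": best_epoch, "best_loss": best,
            "verification_failed_at": failed_at}
```

A flagged epoch would trigger the reactions described above: changing the training procedure (e.g., dropping subsampling) or revisiting hyperparameters from the design phase.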
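A requirement such as AITR03 (no more than one false positive in any sequence of 5 video frames) is mechanically checkable over test sequences. The per-frame false-positive counts are assumed to come from an upstream evaluation step.

```python
# Sketch of a mechanical check for an AITR03-style requirement: the
# component shall not output false positives more than `max_fp` times
# within any `window` consecutive frames. `fp_per_frame` is an assumed
# per-frame false-positive count from test evaluation.
def violates_aitr03(fp_per_frame: list, window: int = 5, max_fp: int = 1) -> bool:
    return any(sum(fp_per_frame[i:i + window]) > max_fp
               for i in range(max(1, len(fp_per_frame) - window + 1)))
```

Running this over all recorded test drives gives a pass/fail verdict that can feed the NN Verification Phase evidence.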
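Occlusion-related data requirements (cf. AITR02/AITR02.01) can be checked by binning the occlusion levels present in the gathered samples. Representing a sample by its bare occlusion fraction, and the particular bin edges used, are simplifying assumptions.

```python
# Sketch: verifying that the gathered dataset covers pedestrian
# occlusion levels up to 0.95 (cf. AITR02) by requiring at least
# `min_per_bin` samples per occlusion interval. Representing each
# sample as a bare occlusion fraction is a simplifying assumption.
def occlusion_coverage_ok(occlusions, bin_edges=(0.25, 0.5, 0.75, 0.95),
                          min_per_bin=1):
    counts = [0] * len(bin_edges)
    for occ in occlusions:
        for i, edge in enumerate(bin_edges):
            if occ <= edge:          # assign sample to its first bin
                counts[i] += 1
                break
    return all(c >= min_per_bin for c in counts)
```

A failed check would send the process back to data gathering (or synthetic data generation) for the underrepresented occlusion ranges.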
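The MC-dropout uncertainty calculation mentioned for AIR07 keeps dropout active at inference time and uses the spread over repeated stochastic forward passes as an uncertainty estimate. The following numpy toy uses a single linear layer rather than the paper's CNN, purely to illustrate the mechanism.

```python
import numpy as np

# Toy sketch of MC-dropout uncertainty estimation (cf. AIR07):
# dropout stays ACTIVE at inference and the model is run `t` times;
# the standard deviation over passes serves as an uncertainty proxy,
# where high uncertainty may indicate out-of-ODD input. The one-layer
# "network" below is purely illustrative.
def mc_dropout_predict(x, w, t=100, p_drop=0.5, seed=0):
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(t):
        mask = rng.random(w.shape) >= p_drop        # drop weights randomly
        outs.append(float(x @ (w * mask) / (1.0 - p_drop)))
    outs = np.asarray(outs)
    return outs.mean(), outs.std()                  # prediction, uncertainty
```

In a deployed diagnosis function, inputs whose uncertainty exceeds a calibrated threshold would be rejected as potentially outside the ODD.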
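The validation of an AIR01.01-style requirement (minimum-size pedestrians shall be detected) can be sketched as an overlap check between ground-truth and predicted boxes. The `(x1, y1, x2, y2)` pixel box format and the IoU threshold of 0.5 are assumptions for illustration.

```python
# Sketch of validating an AIR01.01-style requirement: select the
# ground-truth pedestrians whose box is close to the 20x20 px minimum
# and check that each is matched by a detection with sufficient
# overlap. Boxes are assumed to be (x1, y1, x2, y2) tuples in pixels.
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def small_pedestrians_detected(gt_boxes, det_boxes,
                               max_size=25, iou_thr=0.5):
    small = [g for g in gt_boxes
             if g[2] - g[0] <= max_size and g[3] - g[1] <= max_size]
    return all(any(iou(g, d) >= iou_thr for d in det_boxes) for g in small)
```

A failed check then triggers the reactions described in the text: revising the requirement, or inspecting the training data for a lack of small-pedestrian samples.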