A Hierarchical HAZOP-Like Safety Analysis for Learning-Enabled Systems

Yi Qi1,*, Philippa Ryan Conmy2, Wei Huang1, Xingyu Zhao1 and Xiaowei Huang1

1 Department of Computer Science, University of Liverpool, Liverpool, L69 3BX, U.K.
2 Adelard Part of NCC Group, London, N1 7UX, U.K.

Abstract
Hazard and Operability Analysis (HAZOP) is a powerful safety analysis technique with a long history in the industrial process control domain. With the increasing use of Machine Learning (ML) components in cyber physical systems, so-called Learning-Enabled Systems (LESs), there is a recent trend of applying HAZOP-like analysis to LESs. While this shows great potential to preserve the capability of doing sufficient and systematic safety analysis, the novel characteristics of ML raise new technical challenges that require retrofitting the conventional HAZOP technique. In this regard, we present a new Hierarchical HAZOP-Like method for LESs (HILLS). To deal with the complexity of LESs, HILLS first does “divide and conquer” by stratifying the whole system into three levels, and then performs HAZOP on each level to identify (latent-)hazards, causes, security threats and mitigation (with new nodes and guide words). Finally, HILLS attempts to link and propagate the causal relationships among those identified elements within and across the three levels via both qualitative and quantitative methods. We examine and illustrate the utility of HILLS by a case study on Autonomous Underwater Vehicles, with discussions on assumptions and extensions to real-world applications. HILLS, as a first HAZOP-like attempt on LESs that explicitly considers ML internal behaviours and its interactions with other components, not only uncovers the inherent difficulties of doing safety analysis for LESs, but also demonstrates good potential to tackle them.
Keywords
Safety analysis, HAZOP, learning-enabled system, trustworthy AI, AI safety, hazard identification, autonomous underwater vehicle, machine learning security, deviation analysis, robotics and autonomous system, cyber physical system

1. Introduction
Initially developed to support the chemical process industries (by Lawley [1]), Hazard and Operability Analysis (HAZOP) has been successfully and widely applied over the past 50 years. It is generally acknowledged to be an effective yet simple method to systematically identify safety hazards. HAZOP is a prescriptive analysis procedure designed to study the system operability by analysing the effects of any deviation from its design intent [2]. A HAZOP performs a semi-formal, systematic, and critical examination of the process and engineering intentions of the process design. The potential for hazards or operability problems is thus assessed, and malfunction of individual components and associated consequences for the whole system can be identified [3].

In recent years, increasingly sophisticated mathematical modelling processes from Machine Learning (ML) are being used to analyse complex data and then embedded into cyber physical systems, so-called Learning-Enabled Systems (LESs). How to ensure the safety of LESs has become an enormous challenge [4, 5, 6]. As LESs are disruptively novel, they require new and advanced analysis for the complex requirements on their safe and reliable function [7]. Such analysis needs to be tailored to fully evaluate the new characteristics of ML [8, 9], making conventional methods including HAZOP and HAZOP-like variants (e.g., CHAZOP [10] and PES-HAZOP [11] that are respectively introduced for computer-based and programmable electronic systems) obsolete.
Moreover, LESs exhibit unprecedented complexity, while past experience suggests that HAZOP should be continuously retrofitted to accommodate more complex systems [12], considering quantitative analysis frameworks [13, 14] and human factors [15]. To the best of our knowledge, there is no HAZOP-like safety analysis dedicated to LESs that takes into account ML characteristics while preserving the simplicity and effectiveness of HAZOP (compared to other conventional safety analysis methods [16]), which motivates this research.

In this paper, we introduce a new Hierarchical HAZOP-Like method for LESs (HILLS). HILLS first stratifies the complex LESs into three levels (System Level, ML-Lifecycle Level and Inner-ML Level), then applies HAZOP separately on each level to identify safety elements of interest, namely causes, mitigation, hazards (or latent-hazards for latent levels that cannot directly lead to mishaps) and security threats. When applying HAZOP on the ML related levels, we revise HAZOP to cope with ML characteristics, e.g., by introducing new ways of defining nodes and new sets of guide words. We also identify causes of hazards from the ML development process (modelled by the ML-Lifecycle level) to reflect its data-driven nature (e.g., how data is collected, processed, etc). Furthermore, we attempt to address the challenge of how to link and propagate those identified safety elements within and across the three levels, then propose both qualitative and quantitative (an initial Bayesian Belief Network (BN) solution) methods to model the causal relationships. To examine the effectiveness and demonstrate the use case of HILLS, we finally conduct a case study on Autonomous Underwater Vehicles (AUVs), with discussions on assumptions adopted and extensions to real-world applications.

The key contributions of this work include:
a) A first HAZOP-like safety analysis for LESs that explicitly considers ML characteristics (including security threats and the data-driven nature in the development process) and reduces the complexity by hierarchical design.
b) New considerations of dividing nodes in the system representation and a set of new guide words that adapt the traditional HAZOP for levels regarding ML models.
c) A first attempt at linking/propagating identified causes, mitigation, (latent-)hazards and security threats across ML levels.
d) Key challenges identified as research questions that are generic to safety analysis for LESs in future research.

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24-25, 2022, Vienna, Austria
* Corresponding author.
Email: yiqi@liverpool.ac.uk (Y. Qi); pmrc@adelard.com (P. R. Conmy); huang23@liverpool.ac.uk (W. Huang); xingyu.zhao@liverpool.ac.uk (X. Zhao); xiaowei.huang@liverpool.ac.uk (X. Huang)
URL: https://github.com/YiQi0318 (Y. Qi); https://www.adelard.com/people/philippa-ryan.html (P. R. Conmy); https://intranet.csc.liv.ac.uk/~wh1923/ (W. Huang); https://www.xzhao.me/ (X. Zhao); https://cgi.csc.liv.ac.uk/~xiaowei/ (X. Huang)
ORCID: 0000-0002-3474-349X (X. Zhao); 0000-0001-6267-0366 (X. Huang)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Preliminaries: HAZOP
HAZOP is an inductive hazard assessment method that is conducted by an expert team. It systematically investigates each element in the system with the goal of finding potential situations that could cause the element to pose hazards or limit the system’s normal operations. There are four basic steps to perform the HAZOP:

• Define the project scope and aims, and form an expert team.
• Identify system elements and model the system as a system representation.
• Consider possible deviation of operational parameters.
• Identify hazards, causes and mitigation solutions.

Once the four steps are completed, team members may generate additional safety requirements if necessary to mitigate or prevent the identified issues, leading to improvement of the system. More details on each step of HAZOP are given as follows.

Form HAZOP team. To perform HAZOP, a team of specialists is formed according to the project scope and aims. These experts have extensive experience and expert knowledge, and deeply understand the overall procedures of the system, such as operations, maintenance and engineering design.

Identify system elements. The HAZOP team will formally represent the system under study by identifying the elements. Each element is called a Node, representing an operational function. Then, nodes and interactions between nodes (e.g., data/control flows) collectively form the system representation under analysis.

Consider deviations of operational parameters. HAZOP assumes that a problem can only arise when there are some Deviations from the design intent. HAZOP searches for deviations in the system representation. The deviation on a node is expressed as the combination of Guide Words and process Attributes. Each guide word is a short word that prompts the imagination of a deviation from the design/process intent. The most commonly used guide words are: no, more, less, as well as, part of, other than, and so on. Guide words provide a systematic and consistent means of brainstorming potential deviations from normal operations. Each guide word has a specific meaning, e.g., no means the complete negation of the design intention, and early means something occurred earlier than the intended time. Attributes are closely related to nodes, and are usually the subject of the action being performed. The definition of attributes relies on expert knowledge.

Identify hazards, causes and mitigation. Where the result of a deviation would be a danger to workers or to the production process, a potential problem is found. A Hazard (H) is a source of potential damage, harm or adverse health effects on something/someone, while mishaps are damages or harms on something/someone. A Cause (C) is a reason why the deviation could occur. It is possible that several causes are identified for one deviation. A Mitigation (M) helps to reduce the occurrence frequency of the deviations or to mitigate their consequences. Hazards, causes, and mitigation are usually assigned their respective IDs.

3. Problem Statement
Given that HAZOP was not originally designed for LESs, new problems inevitably arise when attempting to apply HAZOP to LESs. These problems are formalized as a set of research questions (RQs) proposed in this section. We first present the rationale behind those RQs (i.e., justification of how we have come to the RQs) and then articulate what would be the expected solution to each RQ.

RQ1: How to reduce the complexity of LESs so that HAZOP can be effectively applied? HAZOP is a semi-formalised analytical method, used to identify the hazard scenarios of a defined process, and it has been successfully used on relatively simple systems. When facing a complex system, HAZOP often cannot play its role well. LESs exhibit unprecedented complexity, rendering the direct application of HAZOP to LESs infeasible. Therefore, we need to reduce the complexity in the system representation. A simple yet effective solution is “divide and conquer”, e.g., stratifying a complex system into multiple levels. In this regard, a promising solution to RQ1 is to propose a hierarchical system representation, so that HAZOP can be effectively applied.

RQ2: How to define nodes in each level, especially for novel levels regarding ML? We assume that HAZOP can effectively handle a single level system representation, as we expect to introduce a hierarchical structure in the RQ1 solution. The second step of HAZOP is to divide nodes at each level (presuming we already have a group of experts as the HAZOP team). Past experience shows that division of nodes can be based on the functionalities of components in the system [17], so we may continue using this traditional method for the non-ML related levels. However, when there are ML components in the system under analysis, it is difficult for the traditional division method of nodes to be directly applied. Therefore, RQ2 is raised to explore the novel definition of “functionalities” at ML-related levels.

RQ3: Will there be any new guide words related to ML? A guide word is one of the key components of a deviation. The team of experts is responsible for identifying guide words that fit the scope of their analysis, while commonly used guide words are No, Less/More, Slower/Faster, Early/Late, etc. However, the existing set of guide words is unproven for use in ML applications, so this RQ aims at determining the effectiveness and new meanings of known guide words for ML related levels, and checking whether there might be missing guide words. Although we expect that most of the known guide words can still be applicable, they might miss some deviations given the new characteristics of ML. Thus, prospective new guide words may be introduced.

RQ4: How to establish the relationship between identified safety elements across levels? For simplicity, HAZOP is expected to be applied separately to each level of a hierarchical system representation. Therefore, to get the safety analysis of the whole complex system, it is necessary to study the relationship between identified safety elements, namely causes, mitigation, hazards (and latent-hazards), across different levels. Then, based on the nature of the relationship (e.g., causal or not, quantitative or qualitative, probabilistic or deterministic), a proper formalism should be used to establish and express the relationships among the hazard analysis results collected from each level.

4. Running Example
We present a running example from the SOLITUDE project (https://github.com/Solitude-SAMR/UWV_RAM), which conducts safety analysis on an AUV that autonomously finds a dock and performs the docking task. The workflow of the scenario is given in Figure 1. The robot starts when it receives the user’s command. Once started, it uses sensors (e.g., cameras) to receive data. Data is transmitted and preprocessed before being fed into the YOLO model for object detection and localisation. The localisation result is further utilised for path planning. In addition, the above normal workflow may suffer from external attacks at some stages, including data transmission, data preparation, and path planning. We remark that the scenario in the project is more complex, including utilising deep reinforcement learning for motion planning, but due to the space limit, this paper only focuses on the perception component.

Figure 1: Workflow diagram of the running example

5. Proposed Method
In this section, we present the HILLS method, and compare it with HAZOP. HILLS inherits from HAZOP the basic structure composition and definitions of elements, with extensions that are suitable for LESs. The tables and figures presented in this section are partial and for illustrative purposes only; cf. the complete HILLS analysis results based on the SOLITUDE project at the GitHub repository (https://github.com/Solitude-SAMR/UWV_RAM).

5.1. Hierarchical HAZOP
As shown in Figure 2, HILLS has a three-level structure, including system level, ML-lifecycle level and inner-ML level. We analyse each level individually in this subsection, and discuss their relations in Section 5.2. Note that the HILLS structure discussed here is generic (for illustration purposes), and may be subject to adaptation when working with specific systems.

Figure 2: The 3 level hierarchical structure of HILLS

Table 1
Nodes in each level in SOLITUDE example

Level | Node | Description
System level | Node 1 | User
System level | Node 2 | Hardware components
System level | Node 3 | Data transmission
ML-lifecycle level | Node 4 | Data collection
ML-lifecycle level | Node 5 | Labeling
ML-lifecycle level | Node 6 | Data preprocessing
ML-lifecycle level | Node 7 | Hyperparameter setting
ML-lifecycle level | Node 8 | Model deployment
Inner-ML level | Node 9 | Feature Extracting
Inner-ML level | Node 10 | Object Detection
ML-lifecycle level | Node 11 | Localisation

5.1.1. System level
HILLS at the system level largely follows HAZOP. Hardware, software, and ML components of an LES represent different functions, and they will be categorized as different Nodes. Consider the running example in Figure 1: the “blue blocks” represent the functional areas of the running example, which means that our nodes can be set according to these blocks. An example of setting nodes is provided in Table 1. We note that the setting of nodes is specific to the system under investigation. E.g., the node “Labeling” was not included in Figure 1.

Some guide words originating from, e.g., the chemical industry can still be used in LESs. Attributes related to the LES are used together with the guide words to express deviations.

Example 1. At system level, we discovered several hazards from the running example; some of them are summarised in Table 2. E.g., one of the hazards is “erratic trajectory”, suggesting that the robot moves into an unsafe area. This hazard is associated with a deviation “no action” where “no” is the guide word and “action” is the attribute (when the AUV takes no actions in the water, the disturbance of current makes it difficult for the robot to maintain a stable trajectory). One of the causes of the hazard is “no data from sensor”, which can be mitigated by, e.g., the use of an acoustic guidance system as a duplicated perception component based on another sensor.

Example 2. Some hazards, such as “erratic trajectory”, may appear in different nodes, which suggests that they may occur more often, and thus may have a higher priority to be mitigated after considering the severity of consequences as well.

Example 3. One hazard can be mitigated in different ways. For example, we identified several mitigation solutions for the “erratic trajectory”, most of which focus on early prevention, such as “maximum safe distance maintained if uncertain” and “camera health monitor”.

HILLS aims to exhaustively cover all potential hazards. In the running example, the possible causes of crashes or failing to turn directions when facing obstacles may include “no data from sensors (instantaneous or permanent)”, and “misclassification”, corresponding to the errors in hardware and software components, respectively. However, the hazards, causes or mitigation may not be fully identifiable at this level. For example, there are other mitigation solutions for the cause “misclassification” that need to consider how the ML component is trained and constructed. However, the system level alone cannot naturally include relevant nodes for this purpose. This motivates us to consider other levels (as discussed below).

Table 2
System level analysis (partial)

Node | Deviation | Hazard | Cause | Mitigation
Data transmission (Flow from camera to classifier) | No action | Erratic trajectory | No data from sensor (transient) | Acoustic guidance system
Data transmission (Flow from camera to classifier) | No action | Erratic trajectory | No data from sensor (transient) | Situational awareness (route mapped and planned in advance)
Data transmission (Flow from camera to classifier) | No action | Erratic trajectory | No data from sensor (transient) | Maximum safe distance maintained if uncertain
Data transmission (Flow from camera to classifier) | No action | Insufficient energy/power | No data from sensor (permanent) | Camera health monitor (e.g. sanity check for blank images)
Data transmission (Data flow) | Part of action | Erratic trajectory | Corrupted sensor data | Reliable camera (robust to environment etc.)
Data transmission (Data value) | Wrong value | Loss of communication | Hardware breakdown | Hardware monitor
Data transmission (Data value) | Wrong value | Loss of communication | Information conflict/lag | Maximum safe distance maintained if uncertain

5.1.2. ML-Lifecycle level
The key motivation for the ML-lifecycle level is to handle the complexity arising from the integration of ML components into an LES, considering mainly the human factors and security threats involved in the development process of ML models. Thus, deviations from this level cannot be identified if analysis was only conducted at the system level. On the other hand, the hazards at system level may be attributed to the hazards at ML-lifecycle level, e.g., the low prediction accuracy of an ML component may be caused by polluted data in the data collection or insufficient epochs of training. For the running example, through the analysis at the ML-lifecycle level, we know that the low accuracy of the results may be caused by inaccurate labeling. We remark that deviations identified at non-system levels are called Latent-hazards (LH), as they pose indirect hazards from latent levels with no hardware components being interacted with and thus cannot directly lead to mishaps.

Table 3 presents a set of guide words that are required at this level. These guide words are redefined from the existing guide words in HAZOP. Table 3 includes both their original meanings (in HAZOP) and new meanings (in HILLS). “Part of” represents a qualitative modification in the original meaning, and in HILLS it may mean the incompleteness of the structures, definitions, or settings. For “Less” and “More”, considering that we are concerned about data flow and data value, their new meanings refer to the amount of data rather than, e.g., the water volume.

Table 3
Redefined guide words in the ML-Lifecycle level

Guide word | Original Meaning | New Meaning
Part of | Qualitative modification | Incomplete definition or setting
Less | Too little additive volume added | A smaller amount of data
More | Too much additive volume added | A larger amount of data

Safety analysis at the ML-lifecycle level can exhibit new latent-hazards, as shown in Table 4. While ML models are subject to security issues, we believe malicious attacking behaviors should also be considered as security Threats (T). Human factors are considered because ML development is a human-centered process, which makes possible some human related errors such as labelling errors, forgotten operations and the omission of data preparation. The aforementioned mistakes are direct human errors. There are also adversarial attacks that can lead to a significant drop in performance, which are classified as security threats. Some examples are shown in Table 4.

Example 4. On the node “data collection”, there is a threat “data poisoning”, which occurs because the input data is contaminated. A suggested mitigation is to deploy a detector based on data provenance.

Example 5. For ML components, we identified mitigation, e.g., “classifier reliability for critical objects >X” [18], to reduce misclassifications with safety impacts.

Example 6. For the latent-hazard “low prediction accuracy”, its causes include “users make mistakes on labelling”, “data itself is missing”, and “data itself is incomplete”, each of which has its suggested mitigation (cf. Table 4).

Example 7. There is a deviation “attack”, whose threats are various attacks, e.g., evasion attack, backdoor attack, and data poisoning attack. Their respective cause is usually that a certain entity in the training or inference of an ML model (e.g., input instance, model structure, training, dataset) is perturbed, modified, or contaminated. Their respective mitigation can be very specific (cf. Table 4), e.g., the backdoor detector in [19] for tree ensemble classifiers.

5.1.3. Inner-ML level
ML components such as YOLO are composed of one or more ML models, each of which is formed of a set of functional layers. Even after a thorough analysis of all possible deviations (with mitigation solutions) in the ML development process modelled by our ML-lifecycle level, the ML components may not perform as expected, e.g., the convolutional layers fail to extract features accurately, and the fully connected layers fail to make reliable classifications. Thus, safety analysis on the internal structure of an ML component is required. At the inner-ML level, HILLS takes the method of extracting basic layers of an ML component to form a model for analysis. To cater for different complexity of the ML component, two extraction methods are proposed. The first one deals with simple models with up to 5 layers. It follows the layer structure and considers each layer to represent a separate functionality. Consequently, each layer is defined as a node in the system representation. The second one
The second one rors, part of operations were forgotten and the omission deals with more complex, larger models by abstracting a Table 4 ML-lifecycle level analysis (partial) Node Deviation Latent-hazard & Threat Cause Mitigation Labeling (Manually label data) Wrong label Low prediction accuracy Users make mistake with labeling Keep classifier accuracy/reliability for critical objects >X Labeling (Manually label data) Wrong label Low prediction accuracy Users make mistake with labeling Sanity check for ground truth and label attribute Labeling (Manually label data) Incapable label Low prediction accuracy Data itself is incomplete Keep classifier accuracy/reliability for critical objects >X Labeling (Manually label data) Incapable label Low prediction accuracy Data itself is incomplete Sanity check for ground truth and label attribute Data collection Attacked Data Poisoning Input data is contaminated Detection based on data provenance Data preprocessing Part of data washing Incorrect data ranges Data washing incomplete Consistency Check (e.g. 
Value range) Hyperparameter setting Wrong setting Inappropriate hyperparameter User make mistake with setting Sanity check to hyperparameter Hyperparameter setting Wrong setting Inappropriate hyperparameter Unsuitable hyperparameter for setting Continuing monitor to hyperparameter Model deployment Attacked Robustness Attacks Insert a calculated disturbance into the input data Defensive Distillation Model deployment Attacked Backdoor Insert disturbance into the input data XAI explain to input Localisation No Localisation Lose estimation of position Hardware (sensors) breakdown Situational awareness (route mapped and planned in advance) Localisation No Localisation Lose estimation of position Hardware mismatch Common time to synchronise data and results Localisation Wrong Localisation Misposition Slip rate too large Situational awareness (route mapped and planned in advance) Localisation Wrong Localisation Misposition Combination miss between hardware and ML Common time to synchronise data and results model into several functional blocks and every block may 5.1.4. Further Considerations on Use Cases of contain a number of layers. Our analysis in the running HILLS example follows the second method. HAZOP is to provide a systematic, critical examination of the process (and engineering intent) of a new or existing Table 5 facility, and should normally be done before the system is New guide words of ML-Lifecycle and inner-ML levels officially put into service [22]. Nevertheless, we believe Guide words Meaning that HILLS can still be applied after the occurrence of an accident, in particular the recent technologies have Wrong Wrong setting or data value enabled the recording of system executions through, e.g., Invalid Invalid data value or data flow, possibly direct observation, recorded video, or snapshot images. conflicting with other components HILLS may use the recordings to identify related causes Incomplete Incomplete data value and hazards. 
Perturbed Data was perturbed by external attackers Moreover, we note the following points when using Incapable Part of data can not be labeled HILLS. First, when dealing with an LES, we focus on the workflow or the pipeline diagram of the entire system, to identify nodes according to the method we explained ear- We identified several new guide words, as shown in lier. The analysis at the system level can help us identify Table 5, which are highly relevant to the setup of the the hazards sourced from the ML components, to enable ML component and data flow. It is worth noting that the the analysis at the lower levels. “Perturbed” is a special guide word that is needed when Second, guide words will be combined with the at- considering the existence of an external attacker. tributes of each node to form deviations. This will pro- Example 8 Deviations containing “perturbed” are usu- ceed sequentially following the level structure of HILLS, ally proprietary attacks, e.g., we record “perturbed dataset” i.e., the deviations at the system level will be identified as “attack” and the threat as “data poisoning” (cf. Table 4). first, followed by the ML-lifecycle level, and the inner-ML As shown in Table 6, HILLS performs analysis inside level. an ML model, which in general is closely related to the Third, before looking for (latent-)hazards, causes, and internal structure of the model. mitigation at each level, we are based on a reasonable Example 9 When the ML component has wrong output, assumption that mitigation solutions of higher levels are we can get from the inner-ML level analysis that this may easier than lower levels. That said, HILLS may not need be related to the setting of the hyperparameter. Explainable to be conducted at the inner-ML level, and can stop when AI (XAI) methods may help users to, e.g., locate which layer all hazards are found and mitigated at other levels. of neurons contribute the most to the wrong ML behaviours [20] and detect backdoors [21]. 
Example 10 At the inner-ML level, we focus on the ML 5.2. Relations Between Levels model structure itself. E.g., unsuitable parameter setting Up to now, we have identified the nodes, attributes, guide in activation functions or pooling layers also make specific words, (latent-)hazards, threats, causes, and mitigation latent-hazards. It also leads to wrong outputs or losing part solutions for individual levels in the HILLS framework. of information of figures (cf. Table 6). We also notice that the relations between these elements can be very complicated. This calls for a formal analysis of the relations. While formalising the relations between levels is a significant challenge, and there might not be Table 6 Inner-ML level analysis (partial) Node Deviation Latent-hazard & Threat Cause Mitigation Feature extracting Imprecise extracting Wrong outputs Less layers Using deeper layers Feature extracting Wrong extracting Wrong outputs Wrong hyperparameter setting Using Explainable AI (XAI) to locate Feature extracting Wrong extracting Wrong outputs Unsuitable kernel size setting Kernel size need to match dataset size Feature extracting Wrong extracting Dying ReLU problem Learning rate setting too large Choosing suitable learning rate for ReLU (activation function) Feature extracting Wrong extracting Losing information of figures Unsuitable parameter setting in pooling layer Evaluate whether need pooling layer Feature extracting Wrong extracting Losing information of figures Unsuitable parameter setting in pooling layer Choose an appropriate pooling type one best way, we propose to study them both qualitatively 5.2.2. Quantitative Analysis and quantitatively. A BN is a graphical model that presents probabilistic re- lationships between a set of variables by determining 5.2.1. Qualitative Analysis causal relationships between them [23]. 
It is also a pow- Qualitative analysis studies the connections between lev- erful tool for knowledge representation and reasoning els, with the guide words as entry points. The guide under uncertainty, visually presenting probabilistic rela- words and the deviations may have the following con- tionships between a set of variables [24]. Actually, BN has nections. already been used to study the relation between latent First of all, the same guide words at a level have strong features learned by a deep neural network [25]. While associations, even if they are combined with different using BN to express relationship of elements is not a new attributes. Second, if a guide word is the same between idea in traditional safety analysis [26, 27, 28]. We take the different levels, the one in the higher level may contribute relationship between several elements at the ML-lifecycle as the main reason for the latent-hazard of the lower level. level and the inner-ML level as an example to explore the Example 11 We use “no” as an example. We can get possibility of using BN to represent it. This is an idea of a deviation “no action” at the system level, and have the quantitatively expressing relationships, since the higher deviation “no localisation” in the ML-lifecycle level. Given level contains some abstract concepts, it is difficult to they share the same guide word, we should consider whether represent in variables. Even if we assume that abstract the “no localisation” has a causality relation with the “no concepts are represented using variables, it is hard to action”. present Conditional Probability Tables (CPTs) as a pre- Moreover, it is assumed that there is an inclusive re- requisite for BN to start. All parameters used to quantify lationship between the guide words of the higher level BN must be obtained based on system background and and lower level, such as “no” and “part of”, or there are expert knowledge. similar meanings, such as “invalid” or “incompatible”. 
The existence of a guide word with an inclusive rela- tionship suggests that for the latent-hazard found in the lower level, its cause may belong to the higher level. Example 12 If we choose “No action” at system level and “Part of definition” at the ML-lifecycle level (e.g., images without defined labels), then we may establish an inclusive relationship between “No” and “Part of”. Example 13 We use “invalid data value” and “incom- Figure 3: A BN fragment (with illustrative probabilities) patible data value” as examples, “incompatible data value” may lead to the low accuracy of output or no results, it has Figure 3 shows a fragment of the BN model for the a similar meaning with “invalid data value”. running example, considering several security threats Selecting guide words is arguably a quite subjective between the ML-lifecycle level and the inner-ML level. activity that experts may use different guide words with The nodes of a BN can represent threats (𝑇 𝑙.𝑖), causes similar semantics to identify the same cause. To this end, (𝐶𝑙.𝑖), or mitigation (𝑀 𝑙.𝑖), where variable 𝑙 ∈ {1, 2, 3} the proposed way of establishing relationships across ranges over the levels in HILLS and 𝑖 is the index of the levels can only cope with the ideal case in which identi- threat/cause/mitigation at a level. E.g., 𝑇 2.𝑖 is the 𝑖-th cal guide words are used. Alternative methods are still threat at ML-lifecycle level. needed for other cases, which forms our future work. Besides, we need to assign CPT to each non-leaf node of the BN, and assign a prior probability to the leaf or set the observed evidence probability node. It is noted that the expert knowledge is needed for both the construction of the basic structure and the assignment of CPTs. The probabilities used in Figure 3 are for illustrative purposes, process [39] or consider the direct application of the HA- while more enlightening examples can be found in [25]. 
ZOP to the hierarchical structure of traditional systems Example 14 For threat nodes with no incoming arrows, with no ML components [40]. A hierarchical structure is such as 𝑇 2.𝑖 and 𝑇 3.𝑖, we may set the probability of their needed for its suitability to work with ML components occurrence to 100 percent. (black-box in general, and inside the black-box, it is a Once constructed, we can make probabilistic inference layer-structure with each layer being a simple mathe- on the BN to ensure that the construction is correct w.r.t. matical function). In HILLS, we innovatively consider expert knowledge. The following are two typical exam- the interaction between humans and ML components ples, by applying the d-separation algorithm [29] (for and the internal structure of the ML components. More- determining dependencies of variables in a BN). over, inspired by [41], we investigate how to link and Example 15 There may be multiple children nodes at propagate identified safety elements at different levels. different levels for a parent node. In Figure 3, the threat 𝑇 2.𝑖 has two causes, 𝐶2.𝑎 and 𝐶3.𝑎, at the ML-lifecycle STPA STAMP (Systems-Theoretic Accident Model and level and inner-ML level, respectively. While the two causes Processes) is also a very popular safety analysis method. may be mitigated separately as they belong to different STAMP uses three fundamental concepts from sys- levels, the effectiveness of their respective mitigation might tem theory: Emergence and hierarchy, communication affect the probabilistic inference based on each other’s CPT and control, and process models [42]. STPA (System- (under the condition that the probability for 𝑇 2.1 is not Theoretic Process Analysis) uses such techniques, being observable). based on the STAMP model. STPA pays more attention Example 16 There may be multiple parent nodes for to the overall control loop and process analysis of the a child node. 
In Figure 3, the mitigation 𝑀 2.𝑎, has two system, and focuses on unsafe control actions and causal causes, 𝐶2.𝑎 and 𝐶2.𝑏, representing that one mitigation factors in a control structure. It is widely used in rail- may support two causes. By observing the effectiveness of way safety assurances [43], cyber safety and security the mitigation (i.e., the CPT of 𝑀 2.𝑎), we will infer how [44], robotics [45] and driver-vehicle interactions [46]. one cause 𝐶2.𝑎 may influence the other cause 𝐶2.𝑏 and STPA is also used to explore a hierarchical structural vice versa. safety analysis framework in [47]. Comparing to STPA, We note, the construction of the BN structure and HAZOP is relatively easier to conduct and clearer to com- CPTs, as well as the above probabilistic inference, should municate, supported by structural decomposition of the be discussed and accepted by domain experts and all system functions [16]. We start with retrofitting HA- stakeholders. We believe BN is potentially a powerful ZOP for LESs, while STPA offers a new perspective to tool for the purpose of modelling probabilistic causality consider the feasibility of hierarchical safety analysis on relationship between elements of ML related levels, while LESs which is our planed future work. how to apply BN in practice in the context of HILLS remains an open challenge. 7. Conclusion 6. Related Work We propose a hierarchical HAZOP-like method, HILLS, for the safety analysis of LESs. Being different from the HAZOP HAZOP is widely used in industrial domains, traditional HAZOP, HILLS analyses LESs in a hierarchical such as nuclear power [30] and chemical industry [31]. way, disentangling the complexity by working with three In recent years, there has been efforts on integrating separate levels first and then establishing their relations HAZOP with other methods [32, 33] to analyse com- via both qualitative and quantitative methods, e.g., BNs. mon causes and system scenarios [34]. 
A comprehensive HILLS is applied to a practical example of AUVs, with review of those techniques may refer to recent survey the discovery of new guide words as well as new causes papers, e.g. [35]. The application of HAZOP on computer- and mitigation related to ML. based systems first appears in [36]. After that, the expe- In conclusion, HILLS complements HAZOP when rience gained from application of HAZOP and related working with LESs, and is able to identify safety hazards techniques to computer-based systems was summarised and security threats related to ML components through in [37]. There is a recent trend of applying HAZOP-like its structural advantages. analysis to LESs, e.g., in autonomous car context [38]. Hierarchical structure The concept of hierarchy is Acknowledgments not new, but existing papers either focus on the hierarchi- This work is supported by U.K. DSTL through the project cal priority of the analysis order in the HAZOP analysis of Safety Argument for Learning-enabled Autonomous Underwater Vehicles and U.K. EPSRC through End- [12] H. Pasman, W. Rogers, How can we improve hazop, to-End Conceptual Guarding of Neural Architectures our old work horse, and do more with its results? [EP/T026995/1]. This project has received funding an overview of recent developments, Chemical from the European Union’s Horizon 2020 research and in- Engineering Transactions 48 (2016) 829–834. novation programme under grant agreement No 956123. [13] H. Ozog, Hazard identification and quantification, XZ’s contribution to the work is partially supported Chem. Eng. Prog. 83 (1987) 55–64. through Fellowships at the Assuring Autonomy Inter- [14] V. Cozzani, S. Bonvicini, G. Spadoni, S. Zanelli, Haz- national Programme. YQ’s contribution to the work is mat transport: A methodological framework for the supported through Chinese Scholarship Council (CSC). risk analysis of marshalling yards, Journal of Haz- ardous Materials 147 (2007) 412–423. [15] P. 
Aspinall, Hazops and human factors, in: Insti- References tution of Chemical Engineers Symposium Series, volume 151, 2006, p. 820. [1] H. Lawley, Operability studies and hazard analysis, [16] L. Sun, Y.-F. Li, E. Zio, Comparison of the ha- Chem. Eng. Prog. 70 (1974) 45–56. zop, fmea, fram, and stpa methods for the hazard [2] F. Crawley, B. Tyler, Chapter 3 - the hazop study analysis of automatic emergency brake systems, method, in: F. Crawley, B. Tyler (Eds.), HAZOP: ASCE-ASME Journal of Risk and Uncertainty in En- Guide to Best Practice (3rd Edition), Elsevier, 2015. gineering Systems, Part B: Mechanical Engineering [3] J. Dunjó, V. Fthenakis, J. A. Vílchez, J. Arnaldos, 8 (2022). Hazard and operability (hazop) analysis. a literature [17] D. Slater, The Hazop methodology, 2015. review, Journal of Hazardous Materials 173 (2010) [18] X. Zhao, W. Huang, A. Banks, V. Cox, D. Flynn, 19–32. S. Schewe, X. Huang, Assessing the reliability of [4] D. Lane, D. Bisset, R. Buckingham, G. Pegman, deep learning classifiers through robustness eval- T. Prescott, New foresight review on robotics and uation and operational profiles, in: AISafety’21 autonomous systems, Technical Report No. 2016.1, Workshop at IJCAI’21, volume 2916, ceur-ws.org, LRF, 2016. 2021. [5] X. Zhao, A. Banks, J. Sharp, V. Robu, D. Flynn, [19] W. Huang, X. Zhao, X. Huang, Embedding and M. Fisher, X. Huang, A Safety Framework for Crit- extraction of knowledge in tree ensemble classifiers, ical Systems Utilising Deep Neural Networks, in: Machine Learning 111 (2022) 1925–1958. Computer Safety, Reliability, and Security (Safe- [20] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Comp’20), volume 12234 of LNCS, Springer, Cham, Müller, W. Samek, On pixel-wise explanations for 2020, pp. 244–259. non-linear classifier decisions by layer-wise rele- [6] E. Asaadi, E. Denney, G. Pai, Quantifying assur- vance propagation, PloS one 10 (2015) e0130140. ance in learning-enabled systems, in: SafeComp’20, [21] X. Zhao, W. 
Huang, X. Huang, V. Robu, D. Flynn, volume 12234 of LNCS, Springer, Cham, 2020, pp. BayLIME: Bayesian local interpretable model- 270–286. agnostic explanations, in: Proc. of the 37th Conf. [7] R. Bloomfield, H. Khlaaf, P. R. Conmy, G. Fletcher, on Uncertainty in Artificial Intelligence, UAI’21, Disruptive innovations and disruptive assurance: PMLR, 2021, pp. 887–896. Assuring machine learning and autonomy, Com- [22] J. Jurkiewicz, J. Nawrocki, M. Ochodek, T. Głowacki, puter 52 (2019) 82–89. Hazop-based identification of events in use cases: [8] E. Alves, D. Bhatt, B. Hall, K. Driscoll, A. Muruge- An empirical study, Empir Software Eng 20 (2015) san, J. Rushby, Considerations in assuring safety 82–109. of increasingly autonomous systems, Technical Re- [23] E. Lee, Y. Park, J. G. Shin, Large engineering project port NASA/CR-2018-220080, NASA, 2018. risk management using a bayesian belief network, [9] S. Burton, I. Habli, T. Lawton, J. McDermid, P. Mor- Expert Systems with Applications 36 (2009) 5880– gan, Z. Porter, Mind the gaps: Assuring the safety 5887. of autonomous systems from an engineering, eth- [24] J. Cheng, R. Greiner, J. Kelly, D. Bell, W. Liu, Learn- ical, and legal perspective, Artificial Intelligence ing bayesian networks from data: An information- 279 (2020) 103201. theory based approach, Artificial intelligence 137 [10] P. Andow, H. G. Britain, E. Safety, Guidance on (2002) 43–90. HAZOP procedures for computer-controlled plants, [25] N. Berthier, A. Alshareef, J. Sharp, S. Schewe, Great Britain, Health and Safety Executive, 1991. X. Huang, Abstraction and symbolic execution of [11] D. J. Burns, R. M. Pitblado, A Modified Hazop deep neural networks with bayesian approximation Methodology For Safety Critical, Springer London, of hidden features (2021). London, 1993. [26] S. Thomas, K. Groth, Toward a hybrid causal framework for autonomous vehicle safety analy- Control Laboratory SCL-009/2003 (2003). 
sis, Proceedings of the Institution of Mechanical [41] M. Wallace, Modular architectural representation Engineers, Part O: Journal of Risk and Reliability and analysis of fault propagation and transforma- (2021) 1748006X2110433. tion, Electronic Notes in Theoretical Computer [27] E. Denney, G. Pai, I. Habli, Towards measurement Science 141 (2005) 53–71. of confidence in safety cases, in: Int. Symp. on [42] N. Leveson, Engineering a Safer World: Systems Empirical Software Engin. and Measurement, 2011, Thinking Applied to Safety, Engineering systems, pp. 380–383. MIT Press, 2011. [28] X. Zhao, D. Zhang, M. Lu, F. Zeng, A new approach [43] P. Yang, R. Karashima, K. Okano, S. Ogata, Auto- to assessment of confidence in assurance cases, in: mated inspection method for an stamp/stpa - fallen Computer Safety, Reliability, and Security (Safe- barrier trap at railroad crossing -, Procedia Com- Comp’12), volume 7613 of LNCS, Springer, 2012, pp. puter Science 159 (2019) 1165–1174. 79–91. [44] T. Kaneko, Y. Takahashi, T. Okubo, R. Sasaki, Threat [29] D. Koller, N. Friedman, Probabilistic Graphical Mod- analysis using stride with stamp/stpa, in: Proc. of els: Principles and Techniques, Adaptive computa- the Int. Workshop on Evidence-based Security and tion and machine learning, MIT Press, 2009. Privacy in the Wild, 2018. [30] S. Rimkevičius, M. Vaišnoras, E. Babilas, E. Ušpuras, [45] A. Adriaensen, L. Pintelon, F. Costantino, G. D. Hazop application for the nuclear power plants de- Gravio, R. Patriarca, An stpa safety analysis case commissioning projects, Annals of Nuclear Energy study of a collaborative robot application, IFAC- (2016). PapersOnLine 54 (2021) 534–539. 17th IFAC Sym- [31] W. Tian, T. Du, S. Mu, Hazop analysis-based dy- posium on Information Control Problems in Manu- namic simulation and its application in chemical facturing INCOM 2021. processes, Asia-Pacific Journal of Chemical Engi- [46] S. Chen, S. Khastgir, I. Babaev, P. 
Jennings, Identi- neering 10 (2015) 923–935. fying accident causes of driver-vehicle interactions [32] P. K. Marhavilas, M. Filippidis, G. K. Koulinas, D. E. using system theoretic process analysis (stpa), in: Koulouriotis, An expanded hazop-study with fuzzy- 2020 IEEE Int. Conf. on Systems, Man, and Cyber- ahp (xpa-hazop technique): Application in a sour netics (SMC), 2020, pp. 3247–3253. crude-oil processing plant, Safety science 124 (2020) [47] M. Chaal, O. A. Valdez Banda, J. A. Glomsrud, S. Bas- 104590. net, S. Hirdaris, P. Kujala, A framework to model [33] M. Danko, J. Janošovskỳ, J. Labovskỳ, L. Jelemenskỳ, the stpa hierarchical control structure of an au- Integration of process control protection layer into tonomous ship, Safety Science 132 (2020) 104939. a simulation-based hazop tool, Journal of Loss Pre- vention in the Process Industries 57 (2019) 291–303. [34] E. Roche, W. Dupont, A. Summers, Beyond hazop: Analyzing common cause and system scenarios, Process Safety Progress 38 (2019) e11997. [35] F. Crawley, B. Tyler, HAZOP: Guide to Best Practice, Elsevier Science, 2015. [36] M. Chudleigh, J. Catmur, Safety assessment of com- puter systems using HAZOP and audit techniques, in: SafeComp’92, Elsevier, 1992, pp. 285–292. [37] T. A. Kletz, Hazop–past and future, Reliability Engineering & System Safety 55 (1997) 263–266. [38] B. Kramer, C. Neurohr, M. Büker, E. Böde, M. Frän- zle, W. Damm, Identification and quantification of hazardous scenarios for automated driving, in: International Symposium on Model-Based Safety and Assessment, Springer, 2020, pp. 163–178. [39] M. R. Othman, R. Idris, M. H. Hassim, W. H. W. Ibrahim, Prioritizing HAZOP analysis using ana- lytic hierarchy process (AHP), Clean Technologies and Environmental Policy 18 (2016) 1345–1360. [40] E. Németh, R. Lakner, K. Hangos, I. Cameron, Hier- archical cpn model-based diagnosis using HAZOP knowledge, Technical report of the Systems and
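As a supplementary illustration of the quantitative analysis in Section 5.2.2, the dependency patterns described in Examples 15 and 16 can be sketched with a small hand-coded BN. This is a minimal sketch under stated assumptions, not the model of Figure 3: the five-node structure (T2i → C2a, T2i → C3a, C2a → M2a, C2b → M2a), the treatment of C2b as a node with a prior, the reading of every node as a binary event, and all probabilities are hypothetical. Inference is done by brute-force enumeration of the joint distribution rather than with a BN library.

```python
from itertools import product

# Hypothetical BN fragment (structure and numbers are assumptions):
#   T2i -> C2a, T2i -> C3a   (one threat, two causes; cf. Example 15)
#   C2a -> M2a, C2b -> M2a   (one mitigation node, two causes; cf. Example 16)
P_T2I = 0.5                                   # prior P(T2i = True)
P_C2B = 0.2                                   # prior P(C2b = True)
P_C2A_GIVEN_T = {True: 0.8, False: 0.1}       # P(C2a = True | T2i)
P_C3A_GIVEN_T = {True: 0.7, False: 0.05}      # P(C3a = True | T2i)
P_M2A_GIVEN_C = {                             # P(M2a = True | C2a, C2b)
    (True, True): 0.95, (True, False): 0.8,
    (False, True): 0.7, (False, False): 0.01,
}

NAMES = ("T2i", "C2a", "C3a", "C2b", "M2a")


def bern(p_true, value):
    """Probability that a binary variable with P(True) = p_true takes `value`."""
    return p_true if value else 1.0 - p_true


def joint(t2i, c2a, c3a, c2b, m2a):
    """Joint probability of one full assignment, factorised along the BN."""
    return (bern(P_T2I, t2i)
            * bern(P_C2A_GIVEN_T[t2i], c2a)
            * bern(P_C3A_GIVEN_T[t2i], c3a)
            * bern(P_C2B, c2b)
            * bern(P_M2A_GIVEN_C[(c2a, c2b)], m2a))


def prob(query, evidence=None):
    """P(query | evidence) by summing the joint over all consistent worlds."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for values in product((True, False), repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != val for name, val in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if all(world[name] == val for name, val in query.items()):
            numerator += p
    return numerator / denominator


# Example 15 pattern: with T2i unobserved, learning C2a changes the belief in C3a.
print(prob({"C3a": True}), prob({"C3a": True}, {"C2a": True}))
# Once T2i is observed, C2a carries no further information about C3a.
print(prob({"C3a": True}, {"T2i": True}),
      prob({"C3a": True}, {"T2i": True, "C2a": True}))
# Example 16 pattern: after observing M2a, learning C2b changes the belief in C2a.
print(prob({"C2a": True}, {"M2a": True}),
      prob({"C2a": True}, {"M2a": True, "C2b": True}))
```

Enumeration is exponential in the number of nodes and is only suitable for fragments of this size; it is used here so that the factorisation and the conditioning step stay visible.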