A Hierarchical HAZOP-Like Safety Analysis for
Learning-Enabled Systems
Yi Qi¹,*, Philippa Ryan Conmy², Wei Huang¹, Xingyu Zhao¹ and Xiaowei Huang¹
¹ Department of Computer Science, University of Liverpool, Liverpool, L69 3BX, U.K.
² Adelard Part of NCC Group, London, N1 7UX, U.K.


Abstract
Hazard and Operability Analysis (HAZOP) is a powerful safety analysis technique with a long history in the industrial process
control domain. With the increasing use of Machine Learning (ML) components in cyber physical systems, so-called
Learning-Enabled Systems (LESs), there is a recent trend of applying HAZOP-like analysis to LESs. While it shows great potential to
preserve the capability of doing sufficient and systematic safety analysis, there are new technical challenges raised by the
novel characteristics of ML that require a retrofit of the conventional HAZOP technique. In this regard, we present a new
Hierarchical HAZOP-Like method for LESs (HILLS). To deal with the complexity of LESs, HILLS first does “divide and conquer”
by stratifying the whole system into three levels, and then performs HAZOP on each level to identify (latent-)hazards, causes,
security threats and mitigation (with new nodes and guide words). Finally, HILLS attempts to link and propagate the
causal relationships among those identified elements within and across the three levels via both qualitative and quantitative
methods. We examine and illustrate the utility of HILLS through a case study on Autonomous Underwater Vehicles, with discussions
on assumptions and extensions to real-world applications. HILLS, as a first HAZOP-like attempt on LESs that explicitly
considers ML internal behaviours and their interactions with other components, not only uncovers the inherent difficulties of
doing safety analysis for LESs, but also demonstrates good potential to tackle them.

Keywords
Safety analysis, HAZOP, learning-enabled system, trustworthy AI, AI safety, hazard identification, autonomous underwater
vehicle, machine learning security, deviation analysis, robotics and autonomous system, cyber physical system


The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24-25, 2022, Vienna, Austria
* Corresponding author.
Email: yiqi@liverpool.ac.uk (Y. Qi); pmrc@adelard.com (P. R. Conmy); huang23@liverpool.ac.uk (W. Huang);
xingyu.zhao@liverpool.ac.uk (X. Zhao); xiaowei.huang@liverpool.ac.uk (X. Huang)
URL: https://github.com/YiQi0318 (Y. Qi); https://www.adelard.com/people/philippa-ryan.html (P. R. Conmy);
https://intranet.csc.liv.ac.uk/~wh1923/ (W. Huang); https://www.xzhao.me/ (X. Zhao);
https://cgi.csc.liv.ac.uk/~xiaowei/ (X. Huang)
ORCID: 0000-0002-3474-349X (X. Zhao); 0000-0001-6267-0366 (X. Huang)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


1. Introduction

Since it was initially developed to support the chemical process industries (by Lawley [1]), Hazard and Operability
Analysis (HAZOP) has been successfully and widely applied over the past 50 years. It is generally acknowledged to be
an effective yet simple method to systematically identify safety hazards. HAZOP is a prescriptive analysis
procedure designed to study the system operability by analysing the effects of any deviation from its design
intent [2]. A HAZOP performs a semi-formal, systematic, and critical examination of the process and engineering
intentions of the process design. The potential for hazards or operability problems is thus assessed, and malfunctions
of individual components and associated consequences for the whole system can be identified [3].
   In recent years, increasingly sophisticated mathematical modelling processes from Machine Learning (ML) are
being used to analyse complex data and then embedded into cyber physical systems, so-called Learning-Enabled
Systems (LESs). How to ensure the safety of LESs has become an enormous challenge [4, 5, 6]. As LESs are
disruptively novel, they require new and advanced analysis for the complex requirements on their safe and reliable
function [7]. Such analysis needs to be tailored to fully evaluate the new characteristics of ML [8, 9], making
conventional methods including HAZOP and HAZOP-like variants (e.g., CHAZOP [10] and PES-HAZOP [11], which
were respectively introduced for computer-based and programmable electronic systems) obsolete. Moreover, LESs
exhibit unprecedented complexity, while past experience suggests that HAZOP should be continuously retrofitted
to accommodate more complex systems [12], considering quantitative analysis frameworks [13, 14] and human
factors [15]. To the best of our knowledge, there is no HAZOP-like safety analysis dedicated to LESs that takes
into account ML characteristics while preserving the simplicity and effectiveness of HAZOP (compared to other
conventional safety analysis methods [16]), which motivates this research.
   In this paper, we introduce a new Hierarchical HAZOP-Like method for LESs (HILLS). HILLS first stratifies
the complex LES into three levels (System Level, ML-Lifecycle Level and Inner-ML Level), then applies HAZOP
separately on each level to identify safety elements of interest, namely causes, mitigation, hazards (or
latent-hazards for latent levels that cannot directly lead to mishaps) and security threats. When applying HAZOP on
the ML-related levels, we revise HAZOP to cope with ML characteristics, e.g., by introducing new ways of defining
nodes and new sets of guide words. We also identify causes of hazards from the ML development process
(modelled by the ML-Lifecycle level) to reflect its data-driven nature (e.g., how data is collected, processed, etc.).
Furthermore, we attempt to address the challenge of how to link and propagate those identified safety elements
within and across the three levels, then propose both qualitative and quantitative (an initial Bayesian Belief Network
(BN) solution) methods to model the causal relationships. To examine the effectiveness and demonstrate the use
case of HILLS, we finally conduct a case study on Autonomous Underwater Vehicles (AUVs), with discussions
on assumptions adopted and extensions to real-world applications.
   The key contributions of this work include:
   a) A first HAZOP-like safety analysis for LESs that explicitly considers ML characteristics (including security
threats and the data-driven nature of the development process) and reduces the complexity by hierarchical design.
   b) New considerations of dividing nodes in the system representation and a set of new guide words that adapt
the traditional HAZOP for levels regarding ML models.
   c) A first attempt at linking/propagating identified causes, mitigation, (latent-)hazards and security threats
across ML levels.
   d) Key challenges identified as research questions that are generic to safety analysis for LESs in future research.

2. Preliminaries: HAZOP

HAZOP is an inductive hazard assessment method that is conducted by an expert team. It systematically
investigates each element in the system with the goal of finding the potential situations that could cause the element to
pose hazards or limit the system’s normal operations.
   There are four basic steps to perform the HAZOP:

     • Define the project scope and aims, and form an expert team.
     • Identify system elements and model the system as a system representation.
     • Consider possible deviations of operational parameters.
     • Identify hazards, causes and mitigation solutions.

   Once the four steps are completed, team members may generate additional safety requirements if necessary
to mitigate or prevent the identified issues, leading to improvement of the system. More details of each step
of HAZOP are given as follows.

Form HAZOP team. To perform HAZOP, a team of specialists is formed according to the project scope and
aims. These experts have extensive experience and expert knowledge, and deeply understand the overall procedures
of the system, such as operations, maintenance and engineering design.

Identify system elements. The HAZOP team will formally represent the system under study by identifying
the elements. Each element is called a Node, representing an operational function. Then, nodes and interactions
between nodes (e.g., data/control flows) collectively form the system representation under analysis.

Consider deviations of operational parameters. HAZOP assumes that a problem can only arise when
there are some Deviations from the design intent. HAZOP searches for deviations in the system representation.
The deviation on a node is expressed as the combination of Guide Words and process Attributes.
   Each guide word is a short word that prompts the imagination of a deviation from the design/process intent. The most
commonly used guide words are: no, more, less, as well as, part of, other than, and so on. Guide words provide a
systematic and consistent means of brainstorming potential deviations from normal operations. Each guide word has
a specific meaning, e.g., no means the complete negation of the design intention, and early means something occurred
earlier than the intended time. Attributes are closely related to nodes, and are usually the subject of the action being
performed. The definition of attributes relies on expert knowledge.

Identify hazards, causes and mitigation. Where the result of a deviation would be a danger to workers
or to the production process, a potential problem is found. A Hazard (H) is a source of potential damage, harm
or adverse health effects on something/someone, while mishaps are damages or harms on something/someone.
A Cause (C) is a reason why the deviation could occur. It is possible that several causes are identified for one
deviation. Mitigation (M) helps to reduce the occurrence frequency of the deviations or to mitigate their
consequences. Hazards, causes, and mitigation are usually assigned their respective IDs.

3. Problem Statement

Given that HAZOP was not originally designed for LESs, new problems inevitably arise when attempting to apply
HAZOP to LESs. These problems are formalized as a set
of research questions (RQs) proposed in this section. We first present the rationale behind those RQs (i.e.,
justification of how we have come to the RQs) and then articulate what would be the expected solution to each RQ.

RQ1: How to reduce the complexity of LESs so that HAZOP can be effectively applied? HAZOP is a
semi-formalised analytical method, used to identify the hazard scenarios of a defined process, and it has been
successfully used on relatively simple systems. When facing a complex system, HAZOP often cannot play its role
well. LESs exhibit unprecedented complexity, rendering the direct application of HAZOP to LESs infeasible. Therefore,
we need to reduce the complexity in the system representation. A simple yet effective solution is “divide and
conquer”, e.g., stratifying a complex system into multiple levels. In this regard, a promising solution to RQ1 is
to propose a hierarchical system representation, so that HAZOP can be effectively applied.

RQ2: How to define nodes in each level, especially for novel levels regarding ML? We assume that
HAZOP can effectively handle a single level system representation, as we expect to introduce a hierarchical structure
in the RQ1 solution. The second step of HAZOP is to divide nodes at each level (presuming we already have a
group of experts as the HAZOP team). Past experience shows that division of nodes can be based on the
functionalities of components in the system [17], so we may continue using this traditional method for the non-ML
related levels. However, when there are ML components in the system under analysis, it is difficult for the
traditional division method of nodes to be directly applied. Therefore, RQ2 is raised to explore the novel definition
of “functionalities” at ML-related levels.

RQ3: Will there be any new guide words related to ML? A guide word is one of the key components of a
deviation. The team of experts is responsible for identifying guide words that fit the scope of their analysis,
while commonly used guide words are No, Less/More, Slower/Faster, Early/Late, etc. However, the existing
set of guide words is unproven for use in ML applications, so this RQ aims at determining the effectiveness
and new meanings of known guide words for ML-related levels, and checking whether there might be missing
guide words. Although we expect most of the known guide words can still be applicable, they might miss some
deviations given the new characteristics of ML. Thus, prospective new guide words may be introduced.

RQ4: How to establish the relationship between identified safety elements across levels? For
simplicity, HAZOP is expected to be applied separately to each level of a hierarchical system representation.
Therefore, to get the safety analysis of the whole complex system, it is necessary to study the relationship
between identified safety elements (namely causes, mitigation, hazards and latent-hazards) across different
levels. Then, based on the nature of the relationship (e.g., causal or not, quantitative or qualitative, probabilistic
or deterministic), a proper formalism should be used to establish and express the relationship among the hazard
analysis results collected from each level.

4. Running Example

Figure 1: Workflow diagram of the running example

   We present a running example from the SOLITUDE project¹, which conducts safety analysis on an AUV that
autonomously finds a dock and performs the docking task. The workflow of the scenario is given in Figure 1.
   The robot starts when it receives the user’s command. Once started, it uses sensors (e.g., cameras) to receive
data. Data is transmitted and preprocessed before being fed into the YOLO model for object detection and
localisation. The localisation result is further utilised for path planning. In addition, the above normal workflow may
suffer from external attacks on some stages, including data transmission, data preparation, and path planning.
We remark that the scenario in the project is more complex, including utilising deep reinforcement learning for
motion planning, but due to the space limit, this paper only focuses on the perception component.

¹ https://github.com/Solitude-SAMR/UWV_RAM

5. Proposed Method

In this section, we present the HILLS method, and compare it with HAZOP. HILLS inherits from HAZOP
the basic structure composition and definitions of elements, with extensions that are suitable for LESs. The
tables and figures presented in this section are partial and for illustrative purposes only; cf. the complete HILLS analysis
results based on the SOLITUDE project at the GitHub repository¹.

5.1. Hierarchical HAZOP

As shown in Figure 2, HILLS has a three-level structure, including system level, ML-lifecycle level and inner-ML
level. We analyse each level individually in this subsection, and discuss their relations in Section 5.2. Note, the
HILLS structure discussed here is generic (for illustration purposes), and may be subject to adaptation when
working with specific systems.

Figure 2: The 3 level hierarchical structure of HILLS

Table 1
Nodes in each level in SOLITUDE example

 Level                 Node        Description
 System level          Node 1      User
 System level          Node 2      Hardware components
 System level          Node 3      Data transmission
 ML-lifecycle level    Node 4      Data collection
 ML-lifecycle level    Node 5      Labeling
 ML-lifecycle level    Node 6      Data preprocessing
 ML-lifecycle level    Node 7      Hyperparameter setting
 ML-lifecycle level    Node 8      Model deployment
 Inner-ML level        Node 9      Feature Extracting
 Inner-ML level        Node 10     Object Detection
 ML-lifecycle level    Node 11     Localisation

5.1.1. System level

HILLS at the system level largely follows HAZOP. Hardware, software, and ML components of an LES represent
different functions, and they will be categorized as different Nodes. Consider the running example in Figure 1:
the “blue blocks” represent the functional areas of the running example, which means that our nodes can be set
according to these blocks. An example of setting nodes is provided in Table 1. We note that the setting of nodes is
specific to the system under investigation. E.g., the node “Labeling” was not included in Figure 1.
   Some guide words originating from, e.g., the chemical industry can still be used in LESs. Attributes related to
the LES are used together with the guide words to express deviations.
   Example 1 At system level, we discovered several hazards from the running example, some of which are
summarised in Table 2. E.g., one of the hazards is “erratic trajectory”, suggesting that the robot moves into an unsafe
area. This hazard is associated with a deviation “no action”, where “no” is the guide word and “action” is the attribute
(when the AUV takes no actions in the water, the disturbance of the current makes it difficult for the robot to maintain
a stable trajectory). One of the causes of the hazard is “no data from sensor”, which can be mitigated by, e.g., the use
of an acoustic guidance system as a duplicated perception component based on another sensor.
   Example 2 Some hazards, such as “erratic trajectory”, may appear in different nodes, which suggests that they
may occur more often, and thus may have a higher priority to be mitigated after considering the severity of
consequences as well.
   Example 3 One hazard can be mitigated in different ways. For example, we identified several mitigation
solutions for the “erratic trajectory”, most of which focus on early prevention, such as “maximum safe distance
maintained if uncertain” and “camera health monitor”.
   HILLS aims to exhaustively cover all potential hazards. In the running example, the possible causes of crashes
or failing to turn directions when facing obstacles may include “no data from sensors (instantaneous or
permanent)”, and “misclassification”, corresponding to the errors in hardware and software components, respectively.
However, the hazards, causes or mitigation may not be fully identifiable at this level. For example, there are
other mitigation solutions for the cause “misclassification” that need to consider how the ML component is
trained and constructed. However, the system level alone cannot naturally include relevant nodes for this purpose.
This motivates us to consider other levels (as discussed below).

5.1.2. ML-Lifecycle level

The key motivation for the ML-lifecycle level is to handle the complexity arising from the integration of ML
components into an LES, considering mainly the human factors and security threats involved in the development process
Table 2
System level analysis (partial)
Node                                                 Deviation        Hazard                      Cause                             Mitigation
Data transmission (Flow from camera to classifier)   No action        Erratic trajectory          No data from sensor (transient)   Acoustic guidance system
Data transmission (Flow from camera to classifier)   No action        Erratic trajectory          No data from sensor (transient)   Situational awareness (route mapped and planned in advance)
Data transmission (Flow from camera to classifier)   No action        Erratic trajectory          No data from sensor (transient)   Maximum safe distance maintained if uncertain
Data transmission (Flow from camera to classifier)   No action        Insufficient energy/power   No data from sensor (permanent)   Camera health monitor (e.g. sanity check for blank images)
Data transmission (Data flow)                        Part of action   Erratic trajectory          Corrupted sensor data             Reliable camera (robust to environment etc.)
Data transmission (Data value)                       Wrong value      Loss of communication       Hardware breakdown                Hardware monitor
Data transmission (Data value)                       Wrong value      Loss of communication       Information conflict/lag          Maximum safe distance maintained if uncertain
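
The rows of Table 2 can also be handled as plain records, which makes the observations in Examples 2 and 3 easy to check mechanically. The following is only a minimal sketch under our own assumptions: the names `WorksheetRow`, `nodes_per_hazard` and `mitigations_per_hazard` are hypothetical and not part of HILLS or the SOLITUDE repository, and counting nodes is just a rough proxy for how often a hazard recurs (the prioritisation discussed in Example 2 also weighs the severity of consequences, which is not modelled here).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WorksheetRow:
    """One row of a HAZOP worksheet; the field names are illustrative."""
    node: str
    deviation: str
    hazard: str
    cause: str
    mitigation: str


FLOW = "Data transmission (Flow from camera to classifier)"
VALUE = "Data transmission (Data value)"

# Rows transcribed from Table 2 (system level, partial).
ROWS = [
    WorksheetRow(FLOW, "No action", "Erratic trajectory",
                 "No data from sensor (transient)", "Acoustic guidance system"),
    WorksheetRow(FLOW, "No action", "Erratic trajectory",
                 "No data from sensor (transient)",
                 "Situational awareness (route mapped and planned in advance)"),
    WorksheetRow(FLOW, "No action", "Erratic trajectory",
                 "No data from sensor (transient)",
                 "Maximum safe distance maintained if uncertain"),
    WorksheetRow(FLOW, "No action", "Insufficient energy/power",
                 "No data from sensor (permanent)",
                 "Camera health monitor (e.g. sanity check for blank images)"),
    WorksheetRow("Data transmission (Data flow)", "Part of action",
                 "Erratic trajectory", "Corrupted sensor data",
                 "Reliable camera (robust to environment etc.)"),
    WorksheetRow(VALUE, "Wrong value", "Loss of communication",
                 "Hardware breakdown", "Hardware monitor"),
    WorksheetRow(VALUE, "Wrong value", "Loss of communication",
                 "Information conflict/lag",
                 "Maximum safe distance maintained if uncertain"),
]


def nodes_per_hazard(rows):
    """Map each hazard to the set of nodes in which it appears."""
    nodes = defaultdict(set)
    for row in rows:
        nodes[row.hazard].add(row.node)
    return dict(nodes)


def mitigations_per_hazard(rows):
    """Map each hazard to its distinct mitigations, in first-seen order."""
    found = defaultdict(list)
    for row in rows:
        if row.mitigation not in found[row.hazard]:
            found[row.hazard].append(row.mitigation)
    return dict(found)


if __name__ == "__main__":
    mitigations = mitigations_per_hazard(ROWS)
    for hazard, nodes in nodes_per_hazard(ROWS).items():
        print(f"{hazard}: {len(nodes)} node(s), "
              f"{len(mitigations[hazard])} mitigation(s)")
```

Keeping the worksheet as flat records, rather than nesting mitigations under hazards, mirrors the one-row-per-mitigation layout of Table 2 and lets the same rows be regrouped by node, cause or mitigation as needed.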


of ML models. Thus, deviations from this level cannot be identified if the analysis is only conducted at the system
level. On the other hand, the hazards at system level may be attributed to the hazards at ML-lifecycle level,
e.g., the low prediction accuracy of an ML component may be caused by polluted data in the data collection or
insufficient epochs of training. For the running example, through the analysis at the ML-lifecycle level, we know
that the low accuracy of the results may be caused by inaccurate labeling. We remark that deviations identified
at non-system levels are called Latent-hazards (LH), as they pose indirect hazards from latent levels that do not
interact with hardware components and thus cannot directly lead to mishaps.
   Table 3 presents a set of guide words that are required at this level. These guide words are redefined from the
existing guide words in HAZOP. Table 3 includes both their original meanings (in HAZOP) and new meanings
(in HILLS). “Part of” represents a qualitative modification in the original meaning, and in HILLS it may mean the
incompleteness of the structures, definitions, or settings. For “Less” and “More”, considering that we are concerned
about data flow and data value, their new meanings refer to the amount of data rather than, e.g., the water volume.

Table 3
Redefined guide words in the ML-Lifecycle level

  Guide word    Original Meaning                    New Meaning
  Part of       Qualitative modification            Incomplete definition or setting
  Less          Too little additive volume added    A smaller amount of data
  More          Too much additive volume added      A larger amount of data

   Safety analysis at the ML-lifecycle level can exhibit new latent-hazards, as shown in Table 4. While ML
models are subject to security issues, we believe malicious attacking behaviors should also be considered as security
Threats (T). Human factors are considered because ML development is a human-centered process, which makes
possible some human-related errors such as labelling errors, forgotten operations and the omission
of data preparation. The aforementioned mistakes are direct human errors. There are also adversarial attacks that
can lead to a significant drop in performance, which are classified as security threats. Some examples are shown
in Table 4.
   Example 4 On the node “data collection”, there is a threat “data poisoning”, which occurs because the input
data is contaminated. A suggested mitigation is to deploy a detector based on data provenance.
   Example 5 For ML components, we identified mitigation, e.g., “classifier reliability for critical objects >X” [18],
to reduce misclassifications with safety impacts.
   Example 6 For the latent-hazard “low prediction accuracy”, its causes include “users make mistakes on labelling”,
“data itself is missing”, and “data itself is incomplete”, each of which has its suggested mitigation (cf. Table 4).
   Example 7 There is a deviation “attack”, whose threats are various attacks, e.g., evasion attack, backdoor attack,
and data poisoning attack. Their respective cause is usually that a certain entity in the training or inference of an
ML model (e.g., input instance, model structure, training, dataset) is perturbed, modified, or contaminated. Their
respective mitigation can be very specific (cf. Table 4), e.g., the backdoor detector in [19] for tree ensemble classifiers.

5.1.3. Inner-ML level

ML components such as YOLO are composed of one or more ML models, each of which is formed of a set of
functional layers. Even after a thorough analysis of all possible deviations (with mitigation solutions) in the ML
development process modelled by our ML-lifecycle level, the ML components may not perform as expected, e.g.,
the convolutional layers fail to extract features accurately, and the fully connected layers fail to make reliable
classifications. Thus, safety analysis on the internal structure of an ML component is required. At the inner-ML level,
HILLS takes the method of extracting basic layers of an ML component to form a model for analysis. To cater
for different complexity of the ML component, two extraction methods are proposed. The first one deals with
simple models with up to 5 layers. It follows the layer structure and considers each layer to represent a
separate functionality. Consequently, each layer is defined as a node in the system representation. The second one
deals with more complex, larger models by abstracting a
Table 4
ML-lifecycle level analysis (partial)
Node                             Deviation               Latent-hazard & Threat         Cause                                                 Mitigation
Labeling (Manually label data)   Wrong label             Low prediction accuracy        Users make mistakes in labeling                       Keep classifier accuracy/reliability for critical objects >X
Labeling (Manually label data)   Wrong label             Low prediction accuracy        Users make mistakes in labeling                       Sanity check for ground truth and label attribute
Labeling (Manually label data)   Incapable label         Low prediction accuracy        Data itself is incomplete                             Keep classifier accuracy/reliability for critical objects >X
Labeling (Manually label data)   Incapable label         Low prediction accuracy        Data itself is incomplete                             Sanity check for ground truth and label attribute
Data collection                  Attacked                Data Poisoning                 Input data is contaminated                            Detection based on data provenance
Data preprocessing               Part of data washing    Incorrect data ranges          Data washing incomplete                               Consistency Check (e.g. Value range)
Hyperparameter setting           Wrong setting           Inappropriate hyperparameter   Users make mistakes in setting                        Sanity check of hyperparameter
Hyperparameter setting           Wrong setting           Inappropriate hyperparameter   Unsuitable hyperparameter for setting                 Continuous monitoring of hyperparameter
Model deployment                 Attacked                Robustness Attacks             Insert a calculated disturbance into the input data   Defensive Distillation
Model deployment                 Attacked                Backdoor                       Insert disturbance into the input data                XAI explain to input
Localisation                     No Localisation         Lose estimation of position    Hardware (sensors) breakdown                          Situational awareness (route mapped and planned in advance)
Localisation                     No Localisation         Lose estimation of position    Hardware mismatch                                     Common time to synchronise data and results
Localisation                     Wrong Localisation      Misposition                    Slip rate too large                                   Situational awareness (route mapped and planned in advance)
Localisation                     Wrong Localisation      Misposition                    Combination miss between hardware and ML              Common time to synchronise data and results
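
Each deviation in Table 4 pairs a guide word with an attribute of a node (e.g., “Wrong” with “label”). As a minimal sketch of this enumeration step, the snippet below forms candidate deviations for one node from a list of guide words. The class `Node`, the function `candidate_deviations` and the chosen attribute are illustrative assumptions rather than part of HILLS; analysts would still discard combinations that carry no meaning for the node.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    """A node of the analysed level and its attributes (hypothetical structure)."""
    name: str
    attributes: tuple


def candidate_deviations(node, guide_words):
    """Pair every guide word with every attribute of the node."""
    return [f"{gw} {attr}" for gw, attr in product(guide_words, node.attributes)]


# Guide words taken from Tables 3 and 5; the attribute choice is an assumption.
guide_words = ["Wrong", "Incapable", "Part of"]
labeling = Node(name="Labeling", attributes=("label",))

deviations = candidate_deviations(labeling, guide_words)
print(deviations)
```

Of the candidates produced here, “Wrong label” and “Incapable label” are the ones recorded for the Labeling node in Table 4.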


model into several functional blocks and every block may contain a number of layers. Our analysis in the
running example follows the second method.

Table 5
New guide words of ML-Lifecycle and inner-ML levels
Guide word   Meaning
Wrong        Wrong setting or data value
Invalid      Invalid data value or data flow, possibly conflicting with other components
Incomplete   Incomplete data value
Perturbed    Data was perturbed by external attackers
Incapable    Part of data can not be labeled

   We identified several new guide words, as shown in Table 5, which are highly relevant to the setup of the
ML component and data flow. It is worth noting that “Perturbed” is a special guide word that is needed when
considering the existence of an external attacker.
   Example 8 Deviations containing “perturbed” are usually proprietary attacks, e.g., we record “perturbed
dataset” as “attack” and the threat as “data poisoning” (cf. Table 4).
   As shown in Table 6, HILLS performs analysis inside an ML model, which in general is closely related to
the internal structure of the model.
   Example 9 When the ML component has wrong output, we can get from the inner-ML level analysis that this
may be related to the setting of the hyperparameter. Explainable AI (XAI) methods may help users to, e.g.,
locate which layer of neurons contributes the most to the wrong ML behaviours [20] and detect backdoors [21].
   Example 10 At the inner-ML level, we focus on the ML model structure itself. E.g., unsuitable parameter
settings in activation functions or pooling layers also create specific latent-hazards. They can lead to
wrong outputs or to losing part of the information of figures (cf. Table 6).

5.1.4. Further Considerations on Use Cases of HILLS
HAZOP is intended to provide a systematic, critical examination of the process (and engineering intent) of a
new or existing facility, and should normally be done before the system is officially put into service [22].
Nevertheless, we believe that HILLS can still be applied after the occurrence of an accident, in particular
because recent technologies have enabled the recording of system executions through, e.g., direct observation,
recorded video, or snapshot images. HILLS may use the recordings to identify related causes and hazards.
   Moreover, we note the following points when using HILLS. First, when dealing with an LES, we focus on the
workflow or the pipeline diagram of the entire system, to identify nodes according to the method we explained
earlier. The analysis at the system level can help us identify the hazards sourced from the ML components, to
enable the analysis at the lower levels.
   Second, guide words will be combined with the attributes of each node to form deviations. This will proceed
sequentially following the level structure of HILLS, i.e., the deviations at the system level will be
identified first, followed by the ML-lifecycle level, and the inner-ML level.
   Third, before looking for (latent-)hazards, causes, and mitigation at each level, we make the reasonable
assumption that mitigation solutions at higher levels are easier than those at lower levels. That said, HILLS
may not need to be conducted at the inner-ML level, and can stop when all hazards are found and mitigated at
other levels.

5.2. Relations Between Levels
Up to now, we have identified the nodes, attributes, guide words, (latent-)hazards, threats, causes, and
mitigation solutions for individual levels in the HILLS framework. We also notice that the relations between
these elements can be very complicated. This calls for a formal analysis of the relations. While formalising
the relations between levels is a significant challenge, and there might not be
Table 6
Inner-ML level analysis (partial)
Node                 Deviation              Latent-hazard & Threat          Cause                                           Mitigation
Feature extracting   Imprecise extracting   Wrong outputs                   Less layers                                     Using deeper layers
Feature extracting   Wrong extracting       Wrong outputs                   Wrong hyperparameter setting                    Using Explainable AI (XAI) to locate
Feature extracting   Wrong extracting       Wrong outputs                   Unsuitable kernel size setting                  Kernel size need to match dataset size
Feature extracting   Wrong extracting       Dying ReLU problem              Learning rate setting too large                 Choosing suitable learning rate for ReLU (activation function)
Feature extracting   Wrong extracting       Losing information of figures   Unsuitable parameter setting in pooling layer   Evaluate whether need pooling layer
Feature extracting   Wrong extracting       Losing information of figures   Unsuitable parameter setting in pooling layer   Choose an appropriate pooling type
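
Tables 4 and 6 share the guide word “Wrong” (e.g., “Wrong setting” at the ML-lifecycle level and “Wrong extracting” at the inner-ML level), and Section 5.2.1 uses shared guide words as entry points for relating levels. The following is a minimal sketch of that lookup, assuming deviations are stored as (level, node, deviation) tuples and that the guide word prefixes the deviation. The names `GUIDE_WORDS`, `guide_word_of` and `candidate_links` are hypothetical, and the pairs it returns are only candidates for expert review, not established causal links.

```python
from itertools import product

# Level indices follow the paper: 1 = system, 2 = ML-lifecycle, 3 = inner-ML.
# The rows are a small excerpt of the deviations listed in Tables 4 and 6.
DEVIATIONS = [
    (2, "Labeling", "Wrong label"),
    (2, "Hyperparameter setting", "Wrong setting"),
    (2, "Localisation", "No Localisation"),
    (3, "Feature extracting", "Wrong extracting"),
    (3, "Feature extracting", "Imprecise extracting"),
]

# Assumed list of guide words to recognise; extend as needed.
GUIDE_WORDS = ("Part of", "Wrong", "Incapable", "No")


def guide_word_of(deviation):
    """Return the guide word that prefixes a deviation, or None if none matches."""
    for gw in GUIDE_WORDS:
        if deviation.lower().startswith(gw.lower() + " "):
            return gw
    return None


def candidate_links(deviations):
    """Pair deviations of different levels that share a guide word (higher level first)."""
    links = []
    for (l1, _, d1), (l2, _, d2) in product(deviations, repeat=2):
        gw = guide_word_of(d1)
        if l1 < l2 and gw is not None and gw == guide_word_of(d2):
            links.append((gw, d1, d2))
    return links


links = candidate_links(DEVIATIONS)
for gw, higher, lower in links:
    print(f"{gw}: {higher} -> {lower}")
```

Matching on identical guide words mirrors the ideal case discussed in Section 5.2.1; guide words with merely similar semantics would need a different matching rule.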


one best way, we propose to study them both qualitatively and quantitatively.

5.2.1. Qualitative Analysis
Qualitative analysis studies the connections between levels, with the guide words as entry points. The guide
words and the deviations may have the following connections.
   First of all, the same guide words at a level have strong associations, even if they are combined with
different attributes. Second, if a guide word is the same between different levels, the one in the higher
level may contribute as the main reason for the latent-hazard of the lower level.
   Example 11 We use “no” as an example. We can get a deviation “no action” at the system level, and have the
deviation “no localisation” in the ML-lifecycle level. Given they share the same guide word, we should
consider whether the “no localisation” has a causality relation with the “no action”.
   Moreover, it is assumed that there is an inclusive relationship between the guide words of the higher
level and lower level, such as “no” and “part of”, or that they have similar meanings, such as “invalid” and
“incompatible”.
   The existence of a guide word with an inclusive relationship suggests that for the latent-hazard found in
the lower level, its cause may belong to the higher level.
   Example 12 If we choose “No action” at system level and “Part of definition” at the ML-lifecycle level
(e.g., images without defined labels), then we may establish an inclusive relationship between “No” and
“Part of”.
   Example 13 We use “invalid data value” and “incompatible data value” as examples. “Incompatible data
value” may lead to low accuracy of the output or to no results, so it has a similar meaning to “invalid data
value”.
   Selecting guide words is arguably a quite subjective activity, in that experts may use different guide
words with similar semantics to identify the same cause. To this end, the proposed way of establishing
relationships across levels can only cope with the ideal case in which identical guide words are used.
Alternative methods are still needed for other cases, which forms our future work.

5.2.2. Quantitative Analysis
A Bayesian Network (BN) is a graphical model that presents probabilistic relationships between a set of
variables by determining causal relationships between them [23]. It is also a powerful tool for knowledge
representation and reasoning under uncertainty, visually presenting probabilistic relationships between a set
of variables [24]. Actually, BN has already been used to study the relation between latent features learned
by a deep neural network [25]. Using BN to express the relationships between elements is also not a new idea
in traditional safety analysis [26, 27, 28]. We take the relationship between several elements at the
ML-lifecycle level and the inner-ML level as an example to explore the possibility of using BN to represent
it. This is only an idea for expressing relationships quantitatively: since the higher level contains some
abstract concepts, it is difficult to represent them as variables. Even if we assume that abstract concepts
are represented using variables, it is hard to provide the Conditional Probability Tables (CPTs) that a BN
requires as a prerequisite. All parameters used to quantify the BN must be obtained based on system background
and expert knowledge.

Figure 3: A BN fragment (with illustrative probabilities)

   Figure 3 shows a fragment of the BN model for the running example, considering several security threats
between the ML-lifecycle level and the inner-ML level.
   The nodes of a BN can represent threats (𝑇𝑙.𝑖), causes (𝐶𝑙.𝑖), or mitigation (𝑀𝑙.𝑖), where variable
𝑙 ∈ {1, 2, 3} ranges over the levels in HILLS and 𝑖 is the index of the threat/cause/mitigation at a level.
E.g., 𝑇2.𝑖 is the 𝑖-th threat at ML-lifecycle level.
   Besides, we need to assign a CPT to each non-leaf node of the BN, and either assign a prior probability to
each leaf node or set it as observed evidence. It is noted that expert knowledge is needed for both the
construction of the basic structure and the assignment of CPTs. The
probabilities used in Figure 3 are for illustrative purposes, while more enlightening examples can be found
in [25].
   Example 14 For threat nodes with no incoming arrows, such as 𝑇2.𝑖 and 𝑇3.𝑖, we may set the probability of
their occurrence to 100 percent.
   Once constructed, we can make probabilistic inference on the BN to ensure that the construction is correct
w.r.t. expert knowledge. The following are two typical examples, by applying the d-separation algorithm [29]
(for determining dependencies of variables in a BN).
   Example 15 There may be multiple children nodes at different levels for a parent node. In Figure 3, the
threat 𝑇2.𝑖 has two causes, 𝐶2.𝑎 and 𝐶3.𝑎, at the ML-lifecycle level and inner-ML level, respectively. While
the two causes may be mitigated separately as they belong to different levels, the effectiveness of their
respective mitigation might affect the probabilistic inference based on each other’s CPT (under the condition
that the probability for 𝑇2.1 is not observable).
   Example 16 There may be multiple parent nodes for a child node. In Figure 3, the mitigation 𝑀2.𝑎 has two
causes, 𝐶2.𝑎 and 𝐶2.𝑏, representing that one mitigation may support two causes. By observing the
effectiveness of the mitigation (i.e., the CPT of 𝑀2.𝑎), we will infer how one cause 𝐶2.𝑎 may influence the
other cause 𝐶2.𝑏 and vice versa.
   We note that the construction of the BN structure and CPTs, as well as the above probabilistic inference,
should be discussed and accepted by domain experts and all stakeholders. We believe BN is potentially a
powerful tool for the purpose of modelling probabilistic causality relationships between elements of ML
related levels, while how to apply BN in practice in the context of HILLS remains an open challenge.

6. Related Work

HAZOP HAZOP is widely used in industrial domains, such as nuclear power [30] and chemical industry [31]. In
recent years, there have been efforts on integrating HAZOP with other methods [32, 33] to analyse common
causes and system scenarios [34]. For a comprehensive review of those techniques, we refer to recent survey
papers, e.g. [35]. The application of HAZOP on computer-based systems first appears in [36]. After that, the
experience gained from application of HAZOP and related techniques to computer-based systems was summarised
in [37]. There is a recent trend of applying HAZOP-like analysis to LESs, e.g., in autonomous car context [38].

Hierarchical structure The concept of hierarchy is not new, but existing papers either focus on the
hierarchical priority of the analysis order in the HAZOP analysis process [39] or consider the direct
application of the HAZOP to the hierarchical structure of traditional systems with no ML components [40]. A
hierarchical structure is needed for its suitability to work with ML components (black-box in general, and
inside the black-box, it is a layer-structure with each layer being a simple mathematical function). In HILLS,
we innovatively consider the interaction between humans and ML components and the internal structure of the
ML components. Moreover, inspired by [41], we investigate how to link and propagate identified safety
elements at different levels.

STPA STAMP (Systems-Theoretic Accident Model and Processes) is also a very popular safety analysis method.
STAMP uses three fundamental concepts from system theory: emergence and hierarchy, communication and control,
and process models [42]. STPA (System-Theoretic Process Analysis) uses such techniques, being based on the
STAMP model. STPA pays more attention to the overall control loop and process analysis of the system, and
focuses on unsafe control actions and causal factors in a control structure. It is widely used in railway
safety assurances [43], cyber safety and security [44], robotics [45] and driver-vehicle interactions [46].
STPA is also used to explore a hierarchical structural safety analysis framework in [47]. Compared to STPA,
HAZOP is relatively easier to conduct and clearer to communicate, supported by structural decomposition of
the system functions [16]. We start with retrofitting HAZOP for LESs, while STPA offers a new perspective to
consider the feasibility of hierarchical safety analysis on LESs, which is our planned future work.

7. Conclusion
We propose a hierarchical HAZOP-like method, HILLS, for the safety analysis of LESs. Being different from the
traditional HAZOP, HILLS analyses LESs in a hierarchical way, disentangling the complexity by working with
three separate levels first and then establishing their relations via both qualitative and quantitative
methods, e.g., BNs. HILLS is applied to a practical example of AUVs, with the discovery of new guide words as
well as new causes and mitigation related to ML.
   In conclusion, HILLS complements HAZOP when working with LESs, and is able to identify safety hazards and
security threats related to ML components through its structural advantages.

Acknowledgments
This work is supported by U.K. DSTL through the project of Safety Argument for Learning-enabled Autonomous
Underwater Vehicles and U.K. EPSRC through End-to-End Conceptual Guarding of Neural Architectures
[EP/T026995/1]. This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 956123. XZ’s contribution to the work is partially supported
through Fellowships at the Assuring Autonomy International Programme. YQ’s contribution to the work is
supported through Chinese Scholarship Council (CSC).

References

 [1] H. Lawley, Operability studies and hazard analysis, Chem. Eng. Prog. 70 (1974) 45–56.
 [2] F. Crawley, B. Tyler, Chapter 3 - the hazop study method, in: F. Crawley, B. Tyler (Eds.), HAZOP:
     Guide to Best Practice (3rd Edition), Elsevier, 2015.
 [3] J. Dunjó, V. Fthenakis, J. A. Vílchez, J. Arnaldos, Hazard and operability (hazop) analysis. a
     literature review, Journal of Hazardous Materials 173 (2010) 19–32.
 [4] D. Lane, D. Bisset, R. Buckingham, G. Pegman, T. Prescott, New foresight review on robotics and
     autonomous systems, Technical Report No. 2016.1, LRF, 2016.
 [5] X. Zhao, A. Banks, J. Sharp, V. Robu, D. Flynn, M. Fisher, X. Huang, A Safety Framework for Critical
     Systems Utilising Deep Neural Networks, in: Computer Safety, Reliability, and Security (SafeComp’20),
     volume 12234 of LNCS, Springer, Cham, 2020, pp. 244–259.
 [6] E. Asaadi, E. Denney, G. Pai, Quantifying assurance in learning-enabled systems, in: SafeComp’20,
     volume 12234 of LNCS, Springer, Cham, 2020, pp. 270–286.
 [7] R. Bloomfield, H. Khlaaf, P. R. Conmy, G. Fletcher, Disruptive innovations and disruptive assurance:
     Assuring machine learning and autonomy, Computer 52 (2019) 82–89.
 [8] E. Alves, D. Bhatt, B. Hall, K. Driscoll, A. Murugesan, J. Rushby, Considerations in assuring safety
     of increasingly autonomous systems, Technical Report NASA/CR-2018-220080, NASA, 2018.
 [9] S. Burton, I. Habli, T. Lawton, J. McDermid, P. Morgan, Z. Porter, Mind the gaps: Assuring the safety
     of autonomous systems from an engineering, ethical, and legal perspective, Artificial Intelligence
     279 (2020) 103201.
[10] P. Andow, H. G. Britain, E. Safety, Guidance on HAZOP procedures for computer-controlled plants,
     Great Britain, Health and Safety Executive, 1991.
[11] D. J. Burns, R. M. Pitblado, A Modified Hazop Methodology For Safety Critical, Springer London,
     London, 1993.
[12] H. Pasman, W. Rogers, How can we improve hazop, our old work horse, and do more with its results?
     an overview of recent developments, Chemical Engineering Transactions 48 (2016) 829–834.
[13] H. Ozog, Hazard identification and quantification, Chem. Eng. Prog. 83 (1987) 55–64.
[14] V. Cozzani, S. Bonvicini, G. Spadoni, S. Zanelli, Hazmat transport: A methodological framework for
     the risk analysis of marshalling yards, Journal of Hazardous Materials 147 (2007) 412–423.
[15] P. Aspinall, Hazops and human factors, in: Institution of Chemical Engineers Symposium Series,
     volume 151, 2006, p. 820.
[16] L. Sun, Y.-F. Li, E. Zio, Comparison of the hazop, fmea, fram, and stpa methods for the hazard
     analysis of automatic emergency brake systems, ASCE-ASME Journal of Risk and Uncertainty in
     Engineering Systems, Part B: Mechanical Engineering 8 (2022).
[17] D. Slater, The Hazop methodology, 2015.
[18] X. Zhao, W. Huang, A. Banks, V. Cox, D. Flynn, S. Schewe, X. Huang, Assessing the reliability of
     deep learning classifiers through robustness evaluation and operational profiles, in: AISafety’21
     Workshop at IJCAI’21, volume 2916, ceur-ws.org, 2021.
[19] W. Huang, X. Zhao, X. Huang, Embedding and extraction of knowledge in tree ensemble classifiers,
     Machine Learning 111 (2022) 1925–1958.
[20] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations
     for non-linear classifier decisions by layer-wise relevance propagation, PloS one 10 (2015) e0130140.
[21] X. Zhao, W. Huang, X. Huang, V. Robu, D. Flynn, BayLIME: Bayesian local interpretable model-agnostic
     explanations, in: Proc. of the 37th Conf. on Uncertainty in Artificial Intelligence, UAI’21, PMLR,
     2021, pp. 887–896.
[22] J. Jurkiewicz, J. Nawrocki, M. Ochodek, T. Głowacki, Hazop-based identification of events in use
     cases: An empirical study, Empir Software Eng 20 (2015) 82–109.
[23] E. Lee, Y. Park, J. G. Shin, Large engineering project risk management using a bayesian belief
     network, Expert Systems with Applications 36 (2009) 5880–5887.
[24] J. Cheng, R. Greiner, J. Kelly, D. Bell, W. Liu, Learning bayesian networks from data: An
     information-theory based approach, Artificial intelligence 137 (2002) 43–90.
[25] N. Berthier, A. Alshareef, J. Sharp, S. Schewe, X. Huang, Abstraction and symbolic execution of
     deep neural networks with bayesian approximation of hidden features (2021).
[26] S. Thomas, K. Groth, Toward a hybrid causal
     framework for autonomous vehicle safety analysis, Proceedings of the Institution of Mechanical
     Engineers, Part O: Journal of Risk and Reliability (2021) 1748006X2110433.
[27] E. Denney, G. Pai, I. Habli, Towards measurement of confidence in safety cases, in: Int. Symp. on
     Empirical Software Engin. and Measurement, 2011, pp. 380–383.
[28] X. Zhao, D. Zhang, M. Lu, F. Zeng, A new approach to assessment of confidence in assurance cases,
     in: Computer Safety, Reliability, and Security (SafeComp’12), volume 7613 of LNCS, Springer, 2012,
     pp. 79–91.
[29] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques, Adaptive
     computation and machine learning, MIT Press, 2009.
[30] S. Rimkevičius, M. Vaišnoras, E. Babilas, E. Ušpuras, Hazop application for the nuclear power
     plants decommissioning projects, Annals of Nuclear Energy (2016).
[31] W. Tian, T. Du, S. Mu, Hazop analysis-based dynamic simulation and its application in chemical
     processes, Asia-Pacific Journal of Chemical Engineering 10 (2015) 923–935.
     Control Laboratory SCL-009/2003 (2003).
[41] M. Wallace, Modular architectural representation and analysis of fault propagation and
     transformation, Electronic Notes in Theoretical Computer Science 141 (2005) 53–71.
[42] N. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety, Engineering systems,
     MIT Press, 2011.
[43] P. Yang, R. Karashima, K. Okano, S. Ogata, Automated inspection method for an stamp/stpa - fallen
     barrier trap at railroad crossing -, Procedia Computer Science 159 (2019) 1165–1174.
[44] T. Kaneko, Y. Takahashi, T. Okubo, R. Sasaki, Threat analysis using stride with stamp/stpa, in:
     Proc. of the Int. Workshop on Evidence-based Security and Privacy in the Wild, 2018.
[45] A. Adriaensen, L. Pintelon, F. Costantino, G. D. Gravio, R. Patriarca, An stpa safety analysis case
     study of a collaborative robot application, IFAC-PapersOnLine 54 (2021) 534–539. 17th IFAC
     Symposium on Information Control Problems in Manufacturing INCOM 2021.
[46] S. Chen, S. Khastgir, I. Babaev, P. Jennings, Identifying accident causes of driver-vehicle interactions
[32] P. K. Marhavilas, M. Filippidis, G. K. Koulinas, D. E.         using system theoretic process analysis (stpa), in:
     Koulouriotis, An expanded hazop-study with fuzzy-              2020 IEEE Int. Conf. on Systems, Man, and Cyber-
     ahp (xpa-hazop technique): Application in a sour               netics (SMC), 2020, pp. 3247–3253.
     crude-oil processing plant, Safety science 124 (2020)     [47] M. Chaal, O. A. Valdez Banda, J. A. Glomsrud, S. Bas-
     104590.                                                        net, S. Hirdaris, P. Kujala, A framework to model
[33] M. Danko, J. Janošovskỳ, J. Labovskỳ, L. Jelemenskỳ,        the stpa hierarchical control structure of an au-
     Integration of process control protection layer into           tonomous ship, Safety Science 132 (2020) 104939.
     a simulation-based hazop tool, Journal of Loss Pre-
     vention in the Process Industries 57 (2019) 291–303.
[34] E. Roche, W. Dupont, A. Summers, Beyond hazop:
     Analyzing common cause and system scenarios,
     Process Safety Progress 38 (2019) e11997.
[35] F. Crawley, B. Tyler, HAZOP: Guide to Best Practice,
     Elsevier Science, 2015.
[36] M. Chudleigh, J. Catmur, Safety assessment of com-
     puter systems using HAZOP and audit techniques,
     in: SafeComp’92, Elsevier, 1992, pp. 285–292.
[37] T. A. Kletz, Hazop–past and future, Reliability
     Engineering & System Safety 55 (1997) 263–266.
[38] B. Kramer, C. Neurohr, M. Büker, E. Böde, M. Frän-
     zle, W. Damm, Identification and quantification
     of hazardous scenarios for automated driving, in:
     International Symposium on Model-Based Safety
     and Assessment, Springer, 2020, pp. 163–178.
[39] M. R. Othman, R. Idris, M. H. Hassim, W. H. W.
     Ibrahim, Prioritizing HAZOP analysis using ana-
     lytic hierarchy process (AHP), Clean Technologies
     and Environmental Policy 18 (2016) 1345–1360.
[40] E. Németh, R. Lakner, K. Hangos, I. Cameron, Hier-
     archical cpn model-based diagnosis using HAZOP
     knowledge, Technical report of the Systems and