=Paper= {{Paper |id=Vol-3249/paper3-RobOntics |storemode=property |title=Belief-Based Fault Recovery for Marine Robotics |pdfUrl=https://ceur-ws.org/Vol-3249/paper3-RobOntics.pdf |volume=Vol-3249 |authors=Jeremy Coffelt,Mahya Mohammadi Kashani,Andrzej Wąsowski,Peter Kampmann |dblpUrl=https://dblp.org/rec/conf/jowo/CoffeltKWK22 }} ==Belief-Based Fault Recovery for Marine Robotics== https://ceur-ws.org/Vol-3249/paper3-RobOntics.pdf
Belief-based fault recovery for marine robotics
Jeremy Paul Coffelt¹, Mahya Mohammadi Kashani², Andrzej Wąsowski² and Peter Kampmann¹
¹ ROSEN Technology and Research Center GmbH
² IT University of Copenhagen


Abstract
We propose a framework expanding the capabilities of underwater robots to autonomously recover
from anomalous situations. The framework is built around a knowledge model developed in three stages.
First, we create a deterministic knowledge base to describe the “health” of hardware, software, and
environment components involved in a mission. Next, we describe the same components probabilistically,
defining probabilities of failures, faults, and fixes. Finally, we combine the deterministic and probabilistic
knowledge into a minimal ROS package designed to detect failures, isolate the underlying faults, propose
fixes for the faults, and determine which is the most likely to help. We motivate the solution with a
camera fault scenario and demonstrate it with a thruster failure on a real AUV and a simulated ROV.

Keywords
ontology-based autonomy, probabilistic logic programming, fault detection and recovery, marine robotics




1. Introduction
Unmanned underwater vehicles (UUVs) are used for tasks that are impossible or too danger-
ous for humans: bathymetric and ecological surveys, offshore infrastructure inspection and
maintenance, or clearing underwater minefields. The required autonomy and reliability for
these missions grows along with their range and duration. When things go wrong, these robots
cannot ask a human for help like service robots or safely stop in place like ground robots.
  Remotely operated vehicles (ROVs) are UUVs that are designed to be neutrally buoyant and
remain tethered to a surface vessel. Despite this, notable ROVs, such as the Kaikō, have
been lost at the bottom of the ocean [1]. Autonomous underwater vehicles (AUVs) are UUVs
that operate with complete autonomy, untethered to any human operator or vessel, which
puts them at an even greater risk of being lost. Although designed to be positively buoyant
and float to the surface when things go wrong, AUVs can drift far away from their last known
location, be struck by a passing vessel, or be lost forever under an ice shelf [2, 3]. Recovery is
time-consuming and expensive, as a specialized ship and crew must be chartered to search
for and load the robot. Several noteworthy AUVs remain lost [4].
The Eighth Joint Ontology Workshops (JOWO’22), August 15-19, 2022, Jönköping University, Sweden
Email: jcoffelt@rosen-group.com (J. P. Coffelt); mahmo@itu.dk (M. Mohammadi Kashani); wasowski@itu.dk (A. Wąsowski); pkampmann@rosen-group.com (P. Kampmann)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

  Because of these risks and costs, it is especially important that UUV missions are not aborted
unnecessarily. To this end, we propose a solution enabling compromised marine robots to
autonomously recover and successfully complete their missions. Our main contributions include:
    • An ontology deterministically describing the “health” of hardware, software, and environment
      components involved in an underwater mission. The semantics and syntax of the ontology
      allow information to be shared between the systems on the vehicle, the engineers that
      develop it, and the domain experts that guide the knowledge base.
    • A probabilistic extension of the ontology to describe the same components in terms of likeli-
      hoods of failures, faults, and fixes. This guides which fixes are available and whether they
      are likely to resolve the problem or lead to new faults.
    • A knowledge service combining the deterministic and probabilistic capabilities above. To
      simplify and accelerate development, our solution relies only on existing,
      open-source resources, such as OWL, Protégé, Prolog, and ROS.
    • A demonstration of feasibility with experiments involving a catastrophic thruster failure
      during a real mission and various failures during a simulated shipwreck survey.


2. Deterministic knowledge
Before proceeding, we define two key terms. Following Laprie [5] (as extended in [6]) and ISO
standards [7], we consider a failure to be a deviation from some desired behavior and a fault to
be the defect or anomaly that caused the failure. For instance, blurry images could be a failure
due to numerous faults, including a damaged camera sensor.
   Now, consider an AUV several days and hundreds of kilometers into a week-long port-to-port
pipeline inspection. The AUV relies on a camera to avoid obstacles, track the pipeline, and
document findings. How should the AUV behave if it suddenly experiences a camera fault?
What knowledge and reasoning are required to turn a failed or aborted mission into a successful
one? We investigate these issues through the following competency questions.
   Q1: How is a fault detected? To detect a camera failure, the ontology must permit clear
definitions of metrics for nominal camera operation so that a “health monitor” can process all
camera-related data and determine whether current readings are within expected limits. For
instance, metrics like power draw and image brightness can assess the health of a camera.
   Q2: What faults can cause a failure? Unusual power draw can be caused by a camera
hardware fault, a loose connector, or a short circuit in a cable. Dark images could be caused by a
faulty sensor, or by environmental disturbances (“faults”) such as high turbidity. Consequently,
our ontology must capture all potential faults for each failure. For the AUV example, let
CamFault = {cam_hardware_fault, cam_driver_fault, ...} be the set of all camera faults, and
CamFail = {power_fail, raw_data_fail, processed_data_fail, ...} be the set of all camera-related
failures. Then for a failure fail ∈ CamFail, the model constrains the possible faults to:
           fault ∈ CamFault ∧ hasFault(cam0, fault) ∧ resultsIn(fault, fail)

   Q3: What failures could result from a fault? If the camera sensor is damaged (a fault),
multiple failures are expected to occur: unusually high or low power draw, raw camera data
with unexpected distributions, and suspiciously dark or noisy processed videos. The ontology
must fully describe the many-to-many relationships between faults and failures. In our example,
if a fault ∈ CamFault occurs, it produces failures satisfying the following constraint:
            fail ∈ CamFail ∧ hasFailure(cam0, fail) ∧ resultsIn(fault, fail)
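
The two constraints above are inverse lookups over the same resultsIn relation. As a minimal sketch, written in Python rather than in the paper's Prolog, the relation can be stored as a list of pairs and queried in both directions. The pairs are borrowed from the camera listing that appears later in the paper (Fig. 4); the container and the function names `faults_for` and `failures_for` are illustrative assumptions, and the hasFault/hasFailure conditions on cam0 are left out for brevity.

```python
# Hypothetical mirror of resultsIn/2 facts; the pairs follow Fig. 4 of the paper.
RESULTS_IN = [
    ("camera_hardware_fault", "camera_current_draw_fail"),
    ("camera_hardware_fault", "camera_img_brightness_fail"),
    ("camera_hardware_fault", "camera_img_noise_ratio_fail"),
    ("camera_firmware_fault", "camera_current_draw_fail"),
]


def faults_for(failure):
    """Q2: every fault that can result in the given failure."""
    return sorted({fault for fault, fail in RESULTS_IN if fail == failure})


def failures_for(fault):
    """Q3: every failure that the given fault can result in."""
    return sorted({fail for f, fail in RESULTS_IN if f == fault})
```

Calling `faults_for("camera_current_draw_fail")` returns both the hardware and the firmware fault, which illustrates why a single failure is not enough to isolate a fault without further knowledge.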
Figure 1: An overview of the REMARO ontology relating failures, faults, fixes, components, and health
metrics. The single and double lines represent deterministic and probabilistic relationships, respectively.

  % individual sensor capabilities (atoms that begin with a digit must be quoted)
  hasCapability(mono_camera,     '2D_vision_data').
  hasCapability(stereo_camera,   '3D_vision_data').
  hasCapability(side_scan_sonar, '2D_sonar_data').
  hasCapability(mbes_sonar,      '3D_sonar_data').


Figure 2: Deterministic knowledge relating components and capabilities.

  % ROS topics providing health metrics for camera "cam0"
  hasHealthTopic(cam0, "remaro_auv/cam0/power_stats").
  hasHealthTopic(cam0, "remaro_auv/cam0/processed_data/img_stats").
  % Specific fields in the ROS messages that correspond to health metrics
  hasHealthMetric("remaro_auv/cam0/power_stats", cam0_current_draw).
  hasHealthMetric("remaro_auv/cam0/processed_data/img_stats", cam0_img_brightness).
  hasHealthMetric("remaro_auv/cam0/processed_data/img_stats", cam0_img_noise_ratio).
  % Acceptable limits on each health metric
  hasHealthLimits(cam0_current_draw,        [0.0, 0.2]). % amps
  hasHealthLimits(cam0_img_brightness,      [0.1, 0.9]). % 0=black --> 1=white
  hasHealthLimits(cam0_img_noise_ratio, [5.0, 1e6]). % 40=excellent


Figure 3: Deterministic knowledge modeling health metrics for an AUV camera.

  Q4: Are any fixes available for a fault? Some faults, like a damaged camera sensor, are
unrecoverable during mission execution; others, like a buggy camera driver, could have one or
more fixes. The ontology should support mapping faults to potential fixes, which results in
constraints and queries similar to those relating failures and faults.
  % Potential faults common to any camera
  hasFault(Comp, camera_hardware_fault)        : 0.01 :- isCamera(Comp).
  hasFault(Comp, camera_firmware_fault)        : 0.05 :- isCamera(Comp).
  hasFault(Comp, camera_driver_fault)          : 0.10 :- isCamera(Comp).
  hasFault(Comp, camera_processor_fault) : 0.25 :- isCamera(Comp).
  hasFault(Comp, camera_power_fault)           : 0.05 :- isCamera(Comp).
  hasFault(Comp, camera_comm_fault)            : 0.10 :- isCamera(Comp).
  % Possible camera failures
  resultsIn(camera_hardware_fault, camera_current_draw_fail)          : 0.50.
  resultsIn(camera_hardware_fault, camera_img_brightness_fail)        : 0.85.
  resultsIn(camera_hardware_fault, camera_img_noise_ratio_fail) : 0.70.
  resultsIn(camera_firmware_fault, camera_current_draw_fail)          : 0.05.
  % Potential fixes for various faults
  hasFix(camera_hardware_fault, camera_cycle_power_fix)          : 0.10.
  hasFix(camera_firmware_fault, camera_cycle_power_fix)          : 0.30.
  hasFix(camera_driver_fault,        camera_cycle_power_fix)     : 0.30.
  hasFix(camera_driver_fault,        camera_restart_driver_fix) : 0.50.
  % Note the dependence on whether the fault already exists
  hasRisk(camera_cycle_power_fix, camera_hardware_fault) : 0.99
                     :- isCamera(Comp), hasFault(Comp, camera_hardware_fault).
  hasRisk(camera_cycle_power_fix, camera_hardware_fault) : 0.05
                     :- isCamera(Comp), \+hasFault(Comp, camera_hardware_fault).


Figure 4: Probabilistic knowledge of failures, faults, and fixes for an AUV camera.


   Q5: Are there any potential risks with the proposed fixes? Some fixes come with no risk.
For instance, if an image processing service has crashed, restarting it can do no further damage.
Other fixes come with substantial risk. As an example, consider a central communication service
that has become unresponsive due to a large message queue. If left alone, the service might
recover and resume normal operation. However, attempting to kill and reinitialize the service
could have disastrous consequences as obstacle avoidance processes lose critical sensor data.
For these reasons, it is important that the ontology include not only a list of fixes for each fault, but
also a list of risks (additional faults) associated with each fix.
  Motivated by the competency questions above, we propose the REliable MArine RObotics
(REMARO) ontology outlined in Fig. 1. We have elaborated it and translated it into OWL syntax [8]
using the Protégé editor [9]. The resulting OWL file, which is available in the REMARO Project
repository,¹ is used to define the basic classes and relationships in the knowledge base.
  The deterministic capabilities of the ontology are ideally suited for storing mission-agnostic
knowledge, such as relationships between goals, tasks, capabilities, and components. As an
example, consider the Prolog snippet in Fig. 2, which relates components and capabilities.
Another Prolog file includes the health information specific to a particular vehicle configuration.
The snippet in Fig. 3 models the reliability of an AUV camera. Among other things, Fig. 3
includes health data directly and indirectly measuring camera performance. The availability of
such data is a limiting factor in determining which failures (and, therefore, faults) are detectable.

    ¹ https://github.com/remaro-network/remaro_ontologies
3. Probabilistic knowledge
Suppose the AUV has an image quality failure. As previously discussed, this could be due to
numerous faults, such as a broken sensor (hardware), a crashed processing script (software), or
turbid water (environment). In general, a dead camera is less likely than a software bug. And,
depending on the environment, murky water could be near impossible or almost certain. For
these reasons, it is valuable to consider not only faults and failures, but also their likelihoods.
   In addition, some recovery “fixes” might resolve the original issue but produce new faults as un-
intended side effects. Again, probabilities could be attached to both the proposed fixes and their
potential risks. If a fix is likely to restore a key capability and has low probability of detrimental
side effects, it should likely be executed. However, if the fix involves cycling power to some
mission-critical system/section of the vehicle (compromising the safety of the entire vehicle if
power cannot be restored), then it should be avoided. With probabilities, these pros and cons can
be weighed to guide and explain efforts towards capability restoration and mission adaptation.
   As shown in Figs. 2 and 3, we represent deterministic knowledge in Prolog. To add probabilistic
knowledge while remaining in Prolog, we use the open-source cplint² framework. This allows
knowledge to be interpreted as a probability distribution over deterministic logic programs.
cplint includes the PITA and MCINTYRE modules for exact and approximate causal inference,
respectively. With them, queries to the knowledge base can be answered by considering a joint
distribution over possible worlds [10].
   These probabilistic capabilities are ideal for describing failures, faults, and fixes specific to
the vehicle and environment in the current mission. For instance, suppose priors such as
                            Pr(fault_i) = p_i      and     Pr(fail_j | fault_i) = q_{j,i}              (1)
are known for all fault_i ∈ Faults and fail_j ∈ Fails. Then cplint provides a reasoning
engine for the Bayes’ Theorem calculations required to compute posteriors of the form
                                        Pr(fault_i | fail_1, ..., fail_n),                      (2)
which can be used to determine the fault most likely to cause the detected failure(s). Fig. 4 lists
a few such probabilistic facts for an AUV camera. To clarify the notation in Fig. 4, note that
                       Pr(coin shows heads | coin is tossed ∧ ¬(coin is fair)) = 0.6
would be translated as:
                 showsHeads(coin) : 0.6 :- isTossed(coin), \+isFair(coin).

   Note that faults, such as those given in Fig. 4, are carefully defined in broad terms to cover
all internal, external, and interface aspects of each component. Also, unless stated otherwise,
probabilities are assumed independent of each other and based on some consistent time interval
(e.g., per mission, per hour of mission, per device lifetime).
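
To make the posterior in Equation 2 concrete, the following minimal sketch computes Pr(fault | failure) for a single observed failure by brute-force enumeration of possible worlds. It is written in Python for illustration and does not use cplint. It assumes that every fault and every resultsIn relation is an independent probabilistic choice, and that a failure is observed whenever at least one present fault has an active resultsIn relation to it. That combining rule is our assumption; the paper does not list it. The numbers are the two camera_current_draw_fail entries from Fig. 4, and the function name `fault_posteriors` is hypothetical.

```python
from itertools import product

# Priors Pr(fault_i) and likelihoods Pr(fail_j | fault_i), copied from Fig. 4.
PRIORS = {"camera_hardware_fault": 0.01, "camera_firmware_fault": 0.05}
RESULTS_IN = {
    ("camera_hardware_fault", "camera_current_draw_fail"): 0.50,
    ("camera_firmware_fault", "camera_current_draw_fail"): 0.05,
}


def fault_posteriors(failure):
    """Return Pr(fault | failure) for each candidate fault via world enumeration."""
    faults = [f for f in PRIORS if (f, failure) in RESULTS_IN]
    n = len(faults)
    joint = {f: 0.0 for f in faults}  # Pr(fault present AND failure observed)
    evidence = 0.0                    # Pr(failure observed)
    # A world fixes, for every fault, whether it is present and whether its
    # resultsIn relation fires; all choices are assumed independent.
    for world in product([False, True], repeat=2 * n):
        present, fires = world[:n], world[n:]
        weight = 1.0
        for f, is_present, edge_fires in zip(faults, present, fires):
            p, q = PRIORS[f], RESULTS_IN[(f, failure)]
            weight *= (p if is_present else 1.0 - p) * (q if edge_fires else 1.0 - q)
        if any(a and b for a, b in zip(present, fires)):
            evidence += weight
            for f, is_present in zip(faults, present):
                if is_present:
                    joint[f] += weight
    return {f: joint[f] / evidence for f in faults}
```

Under these assumptions, an unusual current draw points to the hardware fault (posterior about 0.67) ahead of the firmware fault (about 0.37), even though the firmware fault has the higher prior. The posteriors need not sum to one because the faults are not mutually exclusive.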
   As much of the power in the proposed solution lies in its probabilistic capabilities, it is worth
discussing potential sources of priors like those shown in Equations 1-2 and Fig. 4. We believe
the ideal approach involves a combination of the following:
   ² https://github.com/friguzzi/cplint
    • Device manufacturers: Data sheets and manufacturer engineers/technicians are perhaps
      the simplest source of information regarding component-related faults. Manufacturers
      can also provide details about power, communication, and environmental limitations that
      are otherwise difficult to determine without tedious and expensive experimentation.
    • Past missions: If available, results from past missions can provide improved estimates
      for a particular vehicle configuration or mission environment. For instance, suppose
      data sheets give reliability estimates that are contradicted by the observed frequency of
      faults for similar vehicles and environments. Then, these observed probabilities should
      be prioritized over the theoretical probabilities provided by the manufacturer.
    • Simulated missions: When a new vehicle configuration or mission environment is be-
      ing considered, the cheapest and quickest source of information is usually simulation.
      Unfortunately, some faults and failures are difficult or impossible to simulate.
    • Domain experts: With their unique knowledge and experience, domain experts can
      combine all of the above into improved estimates. Unfortunately, domain experts can be
      difficult to find, expensive to hire, and still susceptible to human error.


4. Health monitor implementation
As shown in Fig. 5, the proposed health monitor consists of a publisher/subscriber node
connected to a knowledge base. Before detailing the implementation, we offer the following
justifications for basing our solution in ROS: (1) it is the de facto standard for robotics
middleware, (2) real-world data is available in ROS "bags", (3) a ROS-based UUV simulator is
available for experimentation, and (4) a popular ROS-based knowledge base already exists.

4.1. Knowledge base
The knowledge base provides a means of storing existing knowledge and reasoning to generate
new knowledge. A priori knowledge like that shown in Figs. 2-4 is mostly static and reusable
between missions. For instance, a team will likely have an inventory of actuators and sensors that
it shares across multiple robots. Because the capabilities of these components are deterministic
and remain fixed, the team could have a single components_and_capabilities.pl file, like the
one shown in Fig. 2. Similarly, the potential faults, failures, and fixes for these components do not
change between missions. So, the team could also have a single fails_faults_and_fixes.pl
file containing probabilistic knowledge similar to that shown in Fig. 4. These files can be
extended as inventory and beliefs evolve, but should not change for regular day-to-day testing.
   Other knowledge will vary on a mission-by-mission basis. For instance, for each vehicle
configuration, there should be a corresponding health_topics.pl file, like the one shown in
Fig. 3. This information should match the current vehicle and environment. For instance, for an
AUV camera, nominal img_brightness could depend on whether the mission occurs near the
surface or at depth, during day or night, in a turbid river or in clear ocean water.
   Currently, information contained in these files is loaded during initialization of the ROS node
discussed in Sec. 4.2. In the future, we plan to fully integrate our solution with KnowRob³,
which is an open-source, ROS-based knowledge processing system for acquiring, grounding,
representing, and reasoning with heterogeneous knowledge [11]. For now, our solution relies
only on the rosprolog⁴ functionality within KnowRob, which provides an interface between
Prolog and other ROS nodes. Its primary usage is extracting knowledge from the Prolog files
mentioned above and using this knowledge to answer queries from other ROS nodes, such as
the proposed health monitor node.
    ³ https://github.com/knowrob/knowrob
    ⁴ https://github.com/knowrob/rosprolog

Figure 5: Overview of the proposed health monitor architecture.

Figure 6: Flowchart for the proposed health monitor ROS node.

4.2. ROS node
As shown in Fig. 6, the ROS node is both a subscriber and publisher. On initialization, the node
extracts health topics, metrics, and limits from health_topics.pl. It then subscribes to health
topics for all hardware, software, and environment components defined for the current mission.
   During nominal operations, the node simply awaits published messages on the monitored
topics. When a new message is received, a callback is prompted, the current health value is
extracted, and that value is compared to the predefined limits. If the value is within range, the
callback terminates and the subscriber resumes waiting for the next message.
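
The limit check performed in the callback can be sketched as follows. This is a plain Python illustration without any ROS dependency, not the node's actual code. The limits are copied from Fig. 3, while the dictionary layout and the names `is_nominal` and `on_message` are assumptions made for the example.

```python
# Limits copied from Fig. 3; storing them in a dictionary is an assumption.
HEALTH_LIMITS = {
    "cam0_current_draw": (0.0, 0.2),     # amps
    "cam0_img_brightness": (0.1, 0.9),   # 0=black --> 1=white
    "cam0_img_noise_ratio": (5.0, 1e6),
}


def is_nominal(metric, value):
    """True if the value lies within the predefined limits for the metric."""
    low, high = HEALTH_LIMITS[metric]
    return low <= value <= high


def on_message(metric, value, on_failure):
    """Hypothetical callback body: return quietly if nominal, else escalate."""
    if is_nominal(metric, value):
        return True
    on_failure(metric, value)  # e.g. hand over to the fault recovery routine
    return False
```

In a real node, `on_message` would be invoked from the subscriber callback after extracting the relevant field from the received message.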
   If a value is non-nominal, the failure triggers the FailFaultFix subroutine. First, a STOP
procedure is initiated to protect the affected component from further damage. These STOP
procedures consist of ROS messages that can, for example, unpower a sensor, return an actuator
to its default position, or kill a software service.
   Next, the FailFaultFix subroutine searches for faults capable of causing the detected fail-
ure. The most likely fault is determined (via cplint queries involving the knowledge in
fails_faults_and_fixes.pl). For this fault, its most likely fix (determined similarly) is then
attempted. Like the emergency STOP procedures, each FIX publishes a sequence of ROS mes-
sages. For instance, the messages could restart a camera driver, adjust perception algorithms
to account for visibility changes, or pulse the thruster to clear a blockage. The last step in the
FIX procedure essentially undoes the STOP procedure. If the failure is no longer detected, the
fix is assumed successful and the FailFaultFix subroutine exits. If the failure remains, the next
most likely fix is considered until all have been exhausted. Once all possible fixes for a fault
have been attempted, the next most likely fault is considered along with its fixes. This process
is repeated until either the failure has been resolved or no more potential fixes remain.
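
The control flow described above can be summarized in a short sketch. It is a plain Python illustration of the nested search over faults and fixes, not the authors' ROS implementation. All five callables are hypothetical placeholders: in the described system, the two ranking functions would be backed by cplint queries, and STOP and FIX would publish ROS messages.

```python
def fail_fault_fix(failure, ranked_faults, ranked_fixes, stop, apply_fix, failure_present):
    """Try fixes in order of likelihood until the failure clears.

    ranked_faults(failure)   -> candidate faults, most likely first
    ranked_fixes(fault)      -> candidate fixes for a fault, most likely first
    stop(failure)            -> runs the protective STOP procedure
    apply_fix(fix)           -> runs a FIX procedure, which also undoes STOP
    failure_present(failure) -> True while the failure is still detected
    Returns the (fault, fix) pair that resolved the failure, or None.
    """
    stop(failure)
    for fault in ranked_faults(failure):
        for fix in ranked_fixes(fault):
            apply_fix(fix)
            if not failure_present(failure):
                return fault, fix
    # Every fix for every candidate fault was tried without success.
    return None
```

A `None` result corresponds to the case where higher-level services have to take over. Note that this sketch may attempt the same fix twice if it is listed for more than one fault.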
   The current implementation can compute risks (likelihoods of side-effect faults) of
candidate fixes, but it can also attempt fixes for those faults should they be realized. As such, the
current priority is fixing present faults, not avoiding potential faults. In future work, we intend
to consider the upper half of Fig. 1 and how such risks might jeopardize mission success. For
now, should the FailFaultFix subroutine exit unsuccessfully, higher-level services must be
involved to adapt mission plans for any compromised capabilities. The most notable frameworks
for this include the Cognitive Robotics Abstract Machine (CRAM) [12] and Metacontrol for ROS
(MROS) [13]. We consider our current solution a complement to such systems, ensuring faults
are truly unrecoverable before resorting to component substitutions and controller adaptations.


5. Experiments
5.1. Real thruster fault on an AUV
In our first scenario, data is analyzed from a real field trial where an AUV suffered a catastrophic
thruster failure when a cable became tangled around the rotor blades. As Fig. 7 shows, several
failures occurred in quick succession. Almost simultaneously, the difference between
desired and measured speeds jumped, and the motor temperature began increasing rapidly.
Around 10 seconds later, three power issues occurred nearly synchronously.
Figure 7: Failure analysis for real AUV thruster fault.

  % Topic names must match between hasHealthTopic and hasHealthMetric
  hasHealthTopic(port_thruster, "remaro_auv/port_thruster/current_draw").
  hasHealthTopic(port_thruster, "remaro_auv/port_thruster/speed_diff").
  hasHealthTopic(port_thruster, "remaro_auv/port_thruster/temp_gain").
  hasHealthMetric("remaro_auv/port_thruster/current_draw", current_draw).
  hasHealthMetric("remaro_auv/port_thruster/speed_diff",   speed_diff).
  hasHealthMetric("remaro_auv/port_thruster/temp_gain",    temp_gain).
  hasHealthLimits(current_draw, [0, 5]).   % amps
  hasHealthLimits(speed_diff,   [0, 0.1]). % a unitless ratio
  hasHealthLimits(temp_gain,    [0, 2]).   % 2 C/sec (6 C over 3 sec)

Figure 8: Minimal health knowledge needed for proposed thruster health monitor.

   Since these power-related metrics/messages may have different publication rates, all of them
should be included in the health monitor to ensure the earliest possible failure indicator is caught.
Other particularly useful metrics can be derived from standard topics. For instance, monitoring
the relative difference between desired and measured values is usually more effective than
measuring the values individually. For this reason, we consider the scaled speed differential:
                speed_diff = |speed_measured − speed_desired| / speed_max
  In addition, we suggest limits on both temperature and its time derivative. More specifically,
we consider an averaged derivative over an interval of ∆𝑡 = 5 seconds, which provides a nice
compromise between slow response time and frequent false alarms from noisy readings:

          temp_gain(𝑡) = [motor_temperature(𝑡) − motor_temperature(𝑡 − ∆𝑡)] / ∆𝑡
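
Both derived metrics are simple to compute from raw readings. The sketch below is a minimal Python illustration of the two formulas above. Representing the temperature history as a dictionary keyed by timestamp is an assumption made for brevity; a real node would more likely keep a buffer of recent messages.

```python
def speed_diff(speed_measured, speed_desired, speed_max):
    """Scaled speed differential: |measured - desired| / max."""
    return abs(speed_measured - speed_desired) / speed_max


def temp_gain(samples, t, dt):
    """Averaged temperature derivative over the interval [t - dt, t].

    samples maps timestamps (seconds) to motor temperatures (degrees C);
    both t and t - dt are assumed to be present as keys.
    """
    return (samples[t] - samples[t - dt]) / dt
```

For example, a motor that warms from 20 C to 35 C over 5 seconds has a temp_gain of 3 C/sec, which exceeds the upper limit of 2 C/sec given in Fig. 8.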

To avoid similar future faults, we propose the settings shown in Fig. 8. With this configuration,
the health monitor (limited only by the 0.3 Hz publication rates of the desired_speed and
motor_temperature topics) detects within 5 seconds both the temperature gain and speed
discrepancy. This results in a STOP message (setting speed to zero) long before the current
spike that ultimately destroyed internal circuitry. Notice the catastrophic fault could have been
prevented thanks in part to the internal motor temperature sensor. A cheaper thruster might
not provide such data, in which case the same failure could go undetected until too late to save the device.
   Although not testable with the available ROS bag, we believe several fixes were possible. First,
a "wait-and-see" fix could have allowed the cable to drift out on its own. Other possibilities
include alternating pulses to loosen a jam or reversing to unwind an entanglement. Each of
these attempts would be immediately aborted if health metrics continued to worsen.

5.2. Simulated thruster faults on an ROV
Our second scenario considers a simulated ROV suffering a thruster fault while performing a
shipwreck survey. More specifically, we consider the REXROV surveying the Hercules wreckage
shown in the Gazebo simulator screenshot in Fig. 9(a). The ROV, ship, and terrain model are all
included in the uuv_simulator⁵ developed during the SWARMs Project [14].
   Because we do not have access to low-level hardware data in the simulation, like temperature,
power usage, and control signals, we instead use trajectory deviations as our primary health
metric. For demonstration purposes, we configured the vehicle to follow a tight helical path as
it descends around the Hercules. With the shipwreck at 60 meters depth and the ROV starting
15 meters above, we ran three trials, each with the ROV completing five loops of radius
14 meters and a vertical displacement of 2.6 meters per loop. During the first trial, no fault
was present and the ROV followed the desired trajectory. During the second, a fault was forced
at 𝑡 = 540 seconds by manually disabling a subset of the sternside thrusters. This result is
shown in RViz in Fig. 9(b). The final trial recreated the same fault, but followed it with a
successful thruster_cycle_power_fix at 𝑡 = 850 seconds. As can be seen in Fig. 10, the ROV
successfully recovered and reacquired the desired trajectory within 30 seconds.
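
One plausible way to compute a trajectory deviation for this helical path is sketched below in Python. The radius, descent per loop, and depths come from the description above. The loop period, the helix being centered on the origin, and the use of Euclidean distance to a time-indexed reference point are all assumptions, since the paper does not define the deviation metric or the loop duration.

```python
import math

RADIUS_M = 14.0            # loop radius, from the paper
DESCENT_PER_LOOP_M = 2.6   # vertical displacement per loop, from the paper
START_DEPTH_M = 45.0       # 60 m wreck depth minus the 15 m starting offset
LOOP_PERIOD_S = 240.0      # assumption: the loop duration is not stated


def desired_position(t):
    """Reference point (x, y, depth) on the helix at time t in seconds."""
    loops = t / LOOP_PERIOD_S
    angle = 2.0 * math.pi * loops
    return (
        RADIUS_M * math.cos(angle),
        RADIUS_M * math.sin(angle),
        START_DEPTH_M + DESCENT_PER_LOOP_M * loops,
    )


def trajectory_deviation(actual, t):
    """Euclidean distance between the actual position and the reference at t."""
    return math.dist(actual, desired_position(t))
```

A health limit on this value could then be declared in the same way as the limits in Fig. 8, with a threshold chosen from fault-free trials.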


Figure 9: Simulation of REXROV surveying the Hercules shipwreck. (a) Simulated scenario in Gazebo. (b) Faulty trajectory in RViz.

Figure 10: Comparison of trajectory deviations with and without faults and fixes.

   ⁵ https://github.com/uuvsimulator/uuv_simulator


6. Related work
A variety of approaches are common for fault detection, including analytical, knowledge-based,
and data-driven models [15]. Analytical models are often derived from first principles to describe
differences between observed and modeled behavior [16, 17]. Such systems can be accurate,
but also difficult to develop. Knowledge-based methods, which rely on domain experts and
ontological formulations, can offer simpler and more expressive solutions [18, 19]. Data-driven
methods have gained increasing popularity due to the rapid growth of AI-based techniques [20].
For example, in [21], a neural network was trained on empirical data to provide a soft-fault model
for marine thrusters. An interesting hybrid approach in [22] is used to develop an energy-aware
architecture for fault mitigation in AUVs from both data-driven actuator characterizations and
model-based anomaly diagnoses. We believe our proposed solution offers a complement to such
methods, while adding probabilistic capabilities to existing knowledge-based approaches.
   Several existing frameworks rely on semantic knowledge representation for collaborative
underwater robots [23]. In the SWARMs Project [14], a probabilistic ontology was combined
with a multi-entity Bayesian network (MEBN) to reason with uncertainty for chemical spill
monitoring. Cordova [24] developed an ontology to describe robotic devices and skills to enable
collaborative knowledge acquisition between real and virtual worlds. Our ontology is also
designed to be shareable between robots and environments, but at the component level.
   Diab et al. [25] presented an ontology for overcoming task execution failures in service
robotics. In [26], a framework was developed for fault-tolerant mission adaptations in underwa-
ter robotics by relating plans, actions, and capabilities. Such solutions attempt to overcome faults
by substituting components and capabilities, or altering plans. Our solution takes a different
approach by attempting to exhaust all possible fixes to allow tasks to be executed as planned.
   Self-awareness, self-adaptation, and metacontrol offer additional perspective for overcoming
faults and failures. Patrón et al. [27, 28] used Boyd’s Cycle [29] of Observe → Orient → De-
cide → Act to combine situational awareness with analytical planners to repair and modify
plans. In [13], TOMASys⁶ was used as part of a ROS-based metacontrol system for self-adapting
systems that combined multiple levels of abstraction and autonomy. TOMASys is also used
in [30] to enable self-diagnosis and autonomous reconfiguration to preserve mission-mandated
capabilities. Papadimitriou [31] demonstrated a semantic-based knowledge framework to sim-
ulate and overcome a hardware fault on a REMUS100 AUV by performing mission and plan
adaptation during mine countermeasure missions (MCM). In each of these cases, the solutions
require complete frameworks in order to achieve their higher-level functionalities. Our lower-
level solution is designed to be an add-on to existing frameworks, including these, ensuring
first that components cannot be repaired before they are replaced.

7. Conclusion
This paper introduces the REMARO ontology for detecting failures, isolating their underlying
faults, and attempting fixes for their recovery. The ontology supports both deterministic and
probabilistic knowledge representation and reasoning. These probabilities allow domain experts,
past and simulated missions, and manufacturer-provided specifications to guide the order in
which faults and fixes are considered.
   To demonstrate the ontology, we propose a minimal ROS node designed to add fault recovery
capabilities to existing UUV frameworks. This node subscribes to health topics specified in
a mission configuration file. Whenever a new message is received that reveals non-nominal
health values, a subroutine is triggered to consider the most probable faults and their most
likely fixes. If a fix is found, monitoring resumes. Otherwise, higher-level services must be used.
   In future work, we hope to extend the health monitor node to provide component substitutions
when fixes fail and mission reductions when substitutions are unavailable. For now, our solution
serves as a complement to mission adaptation frameworks like CRAM [12] and metacontrol
architectures like MROS [13]. Other future plans include field testing of the presented solution
and extension of the solution to include more complex probabilistic representations. Such
extensions could consider recurring faults and temporary fixes, updating probabilities based on
outcomes from recently attempted fixes, and standardization of STOP and FIX subroutines.

Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under the Marie Skłodowska-Curie grant agreement No. 956200. For
more information, please visit https://remaro.eu.
   6 TOMASys: Teleological and Ontological Model of an Autonomous System
References
 [1] S. Ishibashi, H. Yoshida, Developing a sediment sampling ROV for the deepest ocean, Sea
     Technology 49 (2008) 43–46.
 [2] P. Norgren, R. Lubbad, R. Skjetne, Unmanned underwater vehicles in Arctic operations,
     in: Proceedings of the 22nd IAHR International Symposium on Ice, Singapore, 2014, pp.
     89–101.
 [3] J. Strutt, Report of the inquiry into the loss of AutoSub2 under the Fimbulisen, 2006. URL:
     https://eprints.soton.ac.uk/41098/.
 [4] J. Copley, Onwards & downwards: when ROVs or AUVs are lost in ocean exploration,
     2014. URL: http://www.joncopley.com/blog_may14a.html.
 [5] J. Laprie, Dependable computing and fault-tolerance: concepts and terminology, in:
     Proceedings of the 25th International Symposium on Fault-Tolerant Computing, IEEE,
     1995, pp. 27–30.
 [6] J. Carlson, R. R. Murphy, How UGVs physically fail in the field, IEEE Transactions on
     Robotics 21 (2005) 423–437.
 [7] ISO 10303:2021(E), Industrial automation systems and integration — Product data repre-
     sentation and exchange, Standard, International Organization for Standardization, Geneva,
     CH, 2021.
 [8] G. Antoniou, F. v. Harmelen, Web Ontology Language: OWL, in: Handbook on Ontologies,
     Springer, 2004, pp. 67–92.
 [9] M. A. Musen, The Protégé Project: a look back and a look forward, AI Matters 1 (2015)
     4–12.
[10] F. Riguzzi, G. Cota, E. Bellodi, R. Zese, Causal inference in cplint, International Journal of
     Approximate Reasoning 91 (2017) 216–232.
[11] M. Beetz, D. Beßler, A. Haidu, M. Pomarlan, A. K. Bozcuoğlu, G. Bartels, KnowRob 2.0—a
     2nd generation knowledge processing framework for cognition-enabled robotic agents, in:
     2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp.
     512–519.
[12] M. Beetz, L. Mösenlechner, M. Tenorth, CRAM—A Cognitive Robot Abstract Machine
     for everyday manipulation in human environments, in: 2010 IEEE/RSJ International
     Conference on Intelligent Robots and Systems, IEEE, 2010, pp. 1012–1017.
[13] C. H. Corbato, D. Bozhinoski, M. G. Oviedo, G. van der Hoorn, N. H. Garcia, H. Deshpande,
     J. Tjerngren, A. Wąsowski, MROS: Runtime adaptation for robot control architectures,
     arXiv preprint arXiv:2010.09145 (2020).
[14] X. Li, S. Bilbao, T. Martín-Wanton, J. Bastos, J. Rodriguez, SWARMs Ontology: A common
     information model for the cooperation of underwater robots, Sensors 17 (2017) 569.
[15] L. H. Chiang, E. L. Russell, R. D. Braatz, Fault detection and diagnosis in industrial systems,
     Springer Science & Business Media, 2000.
[16] M. L. Leuschen, I. D. Walker, J. R. Cavallaro, Fault residual generation via nonlinear
     analytical redundancy, IEEE Transactions on Control Systems Technology 13 (2005) 452–
     458.
[17] A. Shumsky, A. Zhirabok, C. Hajiyev, Observer-based fault diagnosis in thrusters of
     autonomous underwater vehicles, in: 2010 Conference on Control and Fault-Tolerant
     Systems (SysTol), IEEE, 2010, pp. 11–16.
[18] B. Liu, M. Duan, G. Zhao, An object frame knowledge representation approach for fault
     diagnosis expert system, in: 2011 International Conference on Future Computer Sciences
     and Application, IEEE, 2011, pp. 74–77.
[19] X. Deng, R. Luo, J. Li, Similarity matching algorithm of equipment fault diagnosis based
     on CBR, in: 2015 6th IEEE International Conference on Software Engineering and Service
     Science (ICSESS), IEEE, 2015, pp. 998–1002.
[20] X. Dai, Z. Gao, From model, signal to knowledge: A data-driven perspective of fault
     detection and diagnosis, IEEE Transactions on Industrial Informatics 9 (2013) 2226–2238.
[21] S. Nascimento, M. Valdenegro-Toro, Modeling and soft-fault diagnosis of underwater
     thrusters with recurrent neural networks, IFAC-PapersOnLine 51 (2018) 80–85.
[22] V. De Carolis, F. Maurelli, K. E. Brown, D. M. Lane, Energy-aware fault-mitigation archi-
     tecture for underwater vehicles, Autonomous Robots 41 (2017) 1083–1105.
[23] P. Patrón, Y. Petillot, The underwater environment: a challenge for planning, UK PlanSIG
     Edinburgh 2008, 2008.
[24] A. J. Cordova, Semantic web for robots: applying semantic web technologies for
     interoperability between virtual worlds and real robots, Technische Universiteit Eindhoven,
     2012.
[25] M. Diab, M. Pomarlan, S. Borgo, D. Beßler, J. Rosell Gratacòs, J. Bateman, M. Beetz,
     FailRecOnt-an ontology-based framework for failure interpretation and recovery in plan-
     ning and execution, in: Proceedings of the 2nd International Workshop on Ontologies for
     Autonomous Robotics, 2021, pp. 1–14.
[26] P. Patrón, E. Miguelanez, J. Cartwright, Y. R. Petillot, Semantic knowledge-based repre-
     sentation for improving situation awareness in service oriented agents of autonomous
     underwater vehicles, OCEANS, 2008.
[27] P. Patrón, E. Miguelanez, Y. R. Petillot, D. M. Lane, Fault-tolerant adaptive mission planning
     with semantic knowledge representation for autonomous underwater vehicles, in: 2008
     IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2008, pp.
     2593–2598.
[28] P. Patrón, E. Miguelanez, Y. R. Petillot, D. M. Lane, J. Salvi, Adaptive mission plan diagnosis
     and repair for fault recovery in autonomous underwater vehicles, in: OCEANS 2008, IEEE,
     2008, pp. 1–9.
[29] H. Hillaker, Tribute to John R. Boyd, Code One Magazine 12 (1997).
[30] E. Aguado, Z. Milosevic, C. Hernández, R. Sanz, M. Garzon, D. Bozhinoski, C. Rossi,
     Functional self-awareness and metacontrol for underwater robot autonomy, Sensors 21
     (2021) 1210.
[31] G. Papadimitriou, D. Lane, Semantic-based knowledge representation and adaptive mission
     planning for MCM missions using AUVs, in: OCEANS 2014-TAIPEI, IEEE, 2014, pp. 1–8.