Explaining Critical Situations Over Sensor Data Streams
                         Using Proofs and Natural Language
                         Stefan Borgwardt1 , Vera Demberg2,3 , Mayank Jobanputra2 , Alisa Kovtunova1 and
                         Duy Nhu1
                         1
                           Institute of Theoretical Computer Science, TU Dresden, Germany
                         2
                           Department of Computer Science, Saarland University, Germany
                         3
                           Department of Language Science and Technology, Saarland University, Germany


                                    Abstract
                                    This paper describes ongoing interdisciplinary work on integrating logical reasoning and natural
                                    language generation, in order to support a drone operator in the task of supervising autonomous drones,
                                    e.g. for package delivery. In previous work, justifications were computed to explain description logic
                                    (DL) reasoning, and these justifications were then used as the basis for generating a verbal situation
                                    description. Here, we consider full proofs instead of justifications, as well as temporal and concrete
                                    domains, in order to generate more informative natural language explanations. Following a method for
                                    extracting DL proofs, we (i) implemented inference tracing for the DatalogMTL reasoner MeTeoR,
                                    and (ii) used the Evee library (a tool collection for explaining DL entailments) to extract small proofs
                                    for entailed facts. We provide an empirical evaluation of the tracing and proof extraction phases,
                                    which reveals only a minor increase in the computational costs compared to the original MeTeoR
                                    reasoner. Finally, we discuss our preliminary findings on using these proofs as input to a natural
                                    language generation module that explains the reasons behind entailed facts to a user.

                                    Keywords
                                    Proofs, DatalogMTL, Concrete domain reasoning, Temporal reasoning, Verbalization


                         1. Introduction
                         Drone technology and drone control have recently advanced rapidly, to the point that consumer
                         drones with advanced sensors and improved control algorithms are commonplace [1], e.g. in
                         aerial surveys, mapping, aerial movies or rescue missions. As drones are used for an increasing
                         range of tasks, it becomes more important to interact with them. To enable these interactions,
                         it is essential to devise a setup that can flexibly process a variety of data collected by the drone
                         during its flight, reason over it, and convey the important information to the user at runtime, as
                         well as provide a full report at inspection time. Here, we focus on extracting and processing
                         relevant information from sensor data records in order to perform a controlled handover from an
                         autonomous drone to a human drone pilot (see Figure 1). In this setup, the content selection
                         task is hard since only critical information, not all available information, should be mentioned
                         at handover time. Additionally, it is desirable that the system generalizes well to diverse as well
                         as unseen environments during its operation.
                            To this end, we propose that this internal reasoning is made using FOL theories (such as
                         description logic ontologies [2] and logic programs [3]) and their temporal extensions. This
                         provides us with many benefits. First, expert-designed logic theories allow for more flexibility

                             DL 2024: 37th International Workshop on Description Logics, June 18–21, 2024, Bergen, Norway
                          $ stefan.borgwardt@tu-dresden.de (S. Borgwardt); vera@coli.uni-saarland.de (V. Demberg);
                          mayank@lst.uni-saarland.de (M. Jobanputra); alisa.kovtunova@tu-dresden.de (A. Kovtunova);
                          duy.nhu@mailbox.tu-dresden.de (D. Nhu)
                           https://lat.inf.tu-dresden.de/~stefborg/ (S. Borgwardt);
                          https://www.uni-saarland.de/lehrstuhl/demberg/members/verademberg.html (V. Demberg);
                          https://www.uni-saarland.de/lehrstuhl/demberg/members/mayank-jobanputra.html (M. Jobanputra);
                          https://lat.inf.tu-dresden.de/~alisa/ (A. Kovtunova)
                           0000-0003-0924-8478 (S. Borgwardt); 0000-0002-8834-0020 (V. Demberg); 0000-0001-9936-0943 (A. Kovtunova)
                                   © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC
                                   BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
                                   A tree is in path within 0.3 meter and the
                                 battery is low. Please resume human control.

                                            Handover Message


                                 Drone Pilot                 Autonomous Drone

Figure 1: We focus on the drone handover as the main communicative function.


and system stability in various environments and rare critical situations. Second, temporal
ontologies can recognize complex drone states by taking into account also events in the past,
such as water contact, collisions, or battery states from several minutes ago. Third, many logical
reasoners, e.g. Elk [4, 5], Snorocket [6], Sequoia [7] and Metric Temporal Reasoner (MeTeoR) [8]
use consequence-based reasoning, which is well-suited to produce formal proofs. One can see
these formal proofs as good candidates for explanations of entailments at inspection time [9, 10].
At runtime, the proof and in particular intermediate derived predicates can be the input for
a natural language generation module to produce more flexible and high quality utterances
in different environments [11, 12]. Naturally, many sensor data records have numerical values,
e.g. temperature, altitude, and battery level. In description logics, concrete domains [13, 14]
have been introduced to enable reference to concrete objects (such as numbers) and predefined
predicates on these objects (such as numerical comparisons) when defining concepts. Recently, a
solution how to incorporate concrete domain reasoning into (atemporal) formal proofs has been
proposed [15].
   Our long-term goal is to observe and analyze sensor data records over time. Recent work [16]
also proposes a tool to answer temporal queries over time-stamped data records with background
knowledge, but does not consider explanation services, temporal or concrete domain axioms. In
this paper, we focus on reasoning over a temporal ontology, in particular using DatalogMTL [8].
This choice is supported by the fact that the DatalogMTL reasoner MeTeoR1 [8] is available under
an open source license. However, MeTeoR itself supports neither concrete domain reasoning
nor proof generation. Thus, we have extended2 its code to support these two functionalities.
Namely, following an approach for extracting description logic (DL) proofs, we implemented
proof generation for MeTeoR by tracing internal processes of the system and making the inference
steps transparent, and integrated the resulting system with the Evee library (a tool collection
for explaining DL entailments) [17] for finding small proofs for entailed facts. Note that we can
only deal with facts that are derived by MeTeoR in the materialization phase, i.e. without the
usage of automata, as automata do not naturally fit into a proof-based framework. However,
according to the experiments in [8], which are the basis for our evaluation, 99.2% of all facts in
a LUBM-based benchmark are indeed derived after finitely many rounds of materialization.
   Additionally, we built a DatalogMTL program [18] to describe complex situations that can
occur during drone flight. It uses temporal operators as well as concrete domain predicates.
Below we provide an example of a critical situation detected by this logic-based framework.

Example 1. Consider the following DatalogMTL rules with concrete domain predicates.

                                     ⊞(︀0,∞) drone(X) ∶– drone(X)                             (1)
                                               risk(X) ∶– riskofinternaldamage(X)             (2)
                     ⊞(︀0,∞) riskofinternaldamage(Y) ∶– hightemperature(Y), drone(Y)          (3)
                                hightemperature(X) ∶– temperature(X, S), >(S, 25)             (4)
1
    https://pypi.org/project/meteor-reasoner/
2
    https://github.com/de-tu-dresden-inf-lat/meteor-proofs
                                                                                    Asserted


                                                                                drone(d)@[0, 0]


                                           Asserted                        ⊞[0, ∞)drone(X) :- drone(X)


                                 temperature(d, 30)@[10, 10]                 ⊞[0, ∞)drone(d)@[0, 0]


                       hightemperature(X) :- temperature(X, S), >(S, 25)        reverse_boxplus


                                     hightemperature(d)@[10, 10]              drone(d)@[0, ∞)


                                  ⊞[0, ∞)riskofinternaldamage(Y) :- hightemperature(Y), drone(Y)


                                             ⊞[0, ∞)riskofinternaldamage(d)@[10, 10]


                                                          reverse_boxplus


                                                  riskofinternaldamage(d)@[10, ∞)


                                                 risk(X) :- riskofinternaldamage(X)


                                                          risk(d)@[10, ∞)


                                                               inclusion


                                                          risk(d)@[11, 11]


Figure 2: Formal proof generated by our extension of MeTeoR.


They express the following statements: (1) once an object is a drone, it keeps being a drone
forever; (2) a risk of internal damage is a risk; (3) if a drone experiences a high temperature,
then it has a permanent risk of internal damage; and (4) the temperature of an object is high if
it is above 25○ 𝐶.
   Then the dataset 𝐷 = {drone(d)@0, temperature(d, 30)@10} containing the two timestamped
facts that object d is a drone at time point 0, and at time point 10 the temperature of 𝑑 is 30○ C,
together with the rules above implies that the drone has experienced overheating and thereafter it
has a permanent risk of having internal damage. Figure 2 shows a proof for drone d still being
Table 1
Semantics of ground literals [8, Table 1], where I, 𝑡 ⊧ ⊺ holds for all 𝑡 and I, 𝑡 ⊧  for none.
           I, 𝑡 ⊧ x𝜚 𝐴        iff I, 𝑡′ ⊧ 𝐴 for some 𝑡′ with 𝑡 − 𝑡′ ∈ 𝜚
           I, 𝑡 ⊧ |𝜚 𝐴        iff I, 𝑡′ ⊧ 𝐴 for some 𝑡′ with 𝑡′ − 𝑡 ∈ 𝜚
           I, 𝑡 ⊧ ⊟𝜚 𝐴        I, 𝑡′ ⊧ 𝐴 for all 𝑡′ with 𝑡 − 𝑡′ ∈ 𝜚
           I, 𝑡 ⊧ ⊞𝜚 𝐴        I, 𝑡′ ⊧ 𝐴 for all 𝑡′ with 𝑡′ − 𝑡 ∈ 𝜚
           I, 𝑡 ⊧ 𝐴1 𝒮𝜚 𝐴2    I, 𝑡′ ⊧ 𝐴2 for some 𝑡′ with 𝑡 − 𝑡′ ∈ 𝜚 and I, 𝑡′′ ⊧ 𝐴1 for all 𝑡′′ ∈ (𝑡′ , 𝑡)
           I, 𝑡 ⊧ 𝐴1 𝒰𝜚 𝐴2    I, 𝑡′ ⊧ 𝐴2 for some 𝑡′ with 𝑡′ − 𝑡 ∈ 𝜚 and I, 𝑡′′ ⊧ 𝐴1 for all 𝑡′′ ∈ (𝑡, 𝑡′′ )


at risk at time point 11, i.e. that 𝐷 entails the fact 𝐹 = risk(𝑑)@11.

   For the runtime scenario, the intermediate facts involving hightemperature and
riskofinternaldamage can be used to create a warning message in natural language.


2. Preliminaries
DatalogMTL is an ontology-based framework for querying temporal data and reasoning over
data streams [19, 20]. It extends conventional Datalog rules with metric temporal operators to
represent events occurring in the real world at different points in time.
Syntax. In DatalogMTL, an atom has the form 𝑃 (𝜏 ) with 𝑃 a predicate and 𝜏 an 𝑛-ary tuple
consisting of constants and variables. A literal (a.k.a. metric atom) A takes one of the following
forms, where 𝜚 is a non-empty positive interval in Q:

                       𝐴 ∶= ⊺ ⋃︀  ⋃︀ 𝑃 (𝜏 ) ⋃︀ x𝜚 𝐴 ⋃︀ |𝜚 𝐴 ⋃︀ ⊟𝜚 𝐴 ⋃︀ ⊞𝜚 𝐴 ⋃︀ 𝐴 𝒮𝜚 𝐴 ⋃︀ 𝐴 𝒰𝜚 𝐴

A DatalogMTL rule 𝐵 ∶– 𝐴1 , . . . , 𝐴𝑛 consists of body literals 𝐴1 , . . . , 𝐴𝑛 , 𝑛 ≥ 1, and a head
literal 𝐵 that does not contain the operators x, |, 𝒮 and 𝒰; moreover, all variables in the head
must also occur in the body. If an atom, literal or rule contains no variable, it is ground. A fact
𝐹 is of the form 𝐴@𝜚, where 𝐴 is a ground atom and 𝜚 a rational interval. Moreover, a finite
set of facts is a dataset, and a finite set of rules a program. If 𝜚3 = 𝜚1 ∪ 𝜚2 is an interval, then the
coalescing of the two facts 𝐴@𝜚1 and 𝐴@𝜚2 is the fact 𝐴@𝜚3 . The grounding ground(Π, 𝒟) of
a program Π with respect to a dataset 𝒟 is the set of all ground rules obtained by assigning
constants from Π and 𝒟 to variables in Π.
Semantics. An interpretation I indicates which ground atoms 𝐴 hold at which time points
t ∈ Q, written as I, 𝑡 ⊧ 𝐴. Additionally, if I, 𝑡 ⊧ 𝐴 for all t ∈ 𝜚, we say that I satisfies the fact
𝐴@𝜚. The satisfaction relation for other ground literal is shown in Table 1. An interpretation
I satisfies a ground rule 𝑟 if, whenever I satisfies each body atom of 𝑟 at a time point 𝑡, then
I also satisfies the head of 𝑟 at 𝑡; it satisfies a (non-ground) rule 𝑟 if it satisfies each ground
instance of 𝑟. Finally, I is a model of a program Π if it satisfies each rule in Π, and it is a model
of a dataset 𝒟 if it satisfies each fact in 𝒟. A program Π and a dataset 𝒟 are consistent if they
have a common model, and they entail a fact 𝐹 , denoted as Π, 𝒟 ⊧ 𝐹 , if each model of both Π
and 𝒟 is also a model of 𝐹 .
Canonical Interpretation. The immediate consequence operator 𝑇Π for a program Π maps an
interpretation I to the least interpretation containing I and satisfying the following property for
each ground instance 𝑟 of a rule in Π: whenever I satisfies each body atom of 𝑟 at time point 𝑡,
then 𝑇Π (I) satisfies the head of 𝑟 at 𝑡. For a dataset 𝒟, let I𝒟 be the unique least model of 𝒟.
The successive application of 𝑇Π to I𝒟 defines a transfinite sequence of interpretations 𝑇Π𝛼 (I𝒟 )
for ordinals 𝛼 as follows:
   1. 𝑇Π0 (I𝒟 ) ∶= I𝒟 ,
   2. 𝑇Π𝛼 (I𝒟 ) ∶= 𝑇Π (𝑇Π𝛼−1 (I𝒟 )) for 𝛼 a successor ordinal, and
      3. 𝑇Π𝛼 (I𝒟 ) ∶= ⋃𝛽<𝛼 𝑇Π𝛽 (I𝒟 ) for 𝛼 a limit ordinal.
The canonical interpretation CΠ,𝒟 of Π and 𝒟 is the interpretation 𝑇Π𝜔1 (I𝒟 ), with 𝜔1 the first
uncountable ordinal. If Π and 𝒟 have a model, the canonical interpretation CΠ,𝒟 is the least
model of Π and 𝒟.
MeTeoR. The Metric Temporal Reasoner (MeTeoR) [8] is a scalable reasoner for full DatalogMTL
implemented in Python that supports fact entailment via materialization and automata-based
reasoning. Materialization (a.k.a. forward chaining) is a classical technique used in conventional
Datalog programs. It successively computes consequences of a program Π by applying its rules
to a dataset 𝒟, similarly to 𝑇Π . Although materialization of regular Datalog programs always
terminates, temporal operators can introduce non-terminating behavior, as they may require
infinitely many rounds of rule application. To deal with this, a more costly automata-based
technique is used in addition to materialization [20]. However, in this paper we consider only
facts that are entailed after finitely many materialization rounds, i.e. those that are satisfied by
𝑇Π𝑛 (I𝒟 ) for a finite 𝑛. For bounded DatalogMTL programs, in which all intervals 𝜌 are bounded,
materialization can be used as the basis for a decision procedure for fact entailment [21].
   MeTeoR implements the materialization using the Python function naive_join3 that deter-
mines all facts that can be derived by applying a given rule 𝑟 to the current dataset 𝒟. To do this,
it determines all groundings of the body of 𝑟 that are satisfied in I𝒟 , computes the intersections
of the temporal intervals associated to the ground body literals, and adds the resulting head
facts to 𝒟. To deal with metric body literals, e.g. x𝜌 𝐴, the apply function recursively translates
all facts of the form 𝐴@𝜌′ in 𝒟 into corresponding metric facts x𝜌 𝐴@𝜌′′ by shifting 𝜌′ according
to 𝜌. Similarly, the reverse_apply function uses the reverse calculations to translate a derived
metric head fact like ⊞𝜌 𝐵@𝜌′ into a ground fact 𝐵@𝜌′′ that can be added to 𝒟. After each
materialization round, facts in the extended dataset 𝒟 are merged by coalescing, which reduces
the space required to store 𝒟 and ensures the correctness of naive_join.
Proofs and Evee. We consider a consequence Π, 𝒟 ⊧ 𝐹 that is to be explained, where Π is a
DatalogMTL program, 𝒟 a dataset and 𝐹 a fact. Following [9, 23], we define proofs of Π, 𝒟 ⊧ 𝐹
as finite, acyclic, directed hypergraphs, where vertices 𝑣 are labeled with ground (possibly metric)
facts ℓ(𝑣) and hyperedges are of the form (𝑆, 𝑑), with 𝑆 a tuple of vertices and 𝑑 a vertex such
that {ℓ(𝑣) ⋃︀ 𝑣 ∈ 𝑆} ⊧ ℓ(𝑑); the leafs of a proof must be labeled by elements of 𝒟 and the root
by 𝐹 . In addition, an edge labeling function (see the blue boxes in Figure 2) indicates how
the conclusion ℓ(𝑑) was derived from the premises, e.g. by applying a DatalogMTL rule or a
temporal operator. In general, there can be several proofs for the same entailment. Here, we
focus on finding a proof which is minimal according to its size, i.e. the number of vertices. This
slightly differs from the proofs for temporal DL query answering in [23], where the root query
may contain temporal operators, and ontology axioms are also treated as vertices in the proof.
   The Java library Evee4 (EVincing Expressive Entailments) [17] implements various proof
generation methods for different DLs. In particular, Evee can extract size- or depth-minimal
proofs using a Dijkstra-like algorithm (e.g. from the output of the ℰℒ+ reasoner Elk) and
visualize them like in Figure 2.


3. Tracing Materialization Steps
Given a program Π and a dataset 𝒟, our extended version of MeTeoR traces all derived facts
and other relevant information throughout the materialization process. Afterward, we use Evee
to extract a proof for a target fact 𝐹 .
Fact Encoding and Tracing. We represent each inference as a simple Python dictionary called
connection. Each connection contains three keys, which we denote as preds, succ, and rule.
3
    We have not yet implemented tracing for the seminaive materialization strategy of MeTeoR [22].
4
    https://github.com/de-tu-dresden-inf-lat/evee
         {
             "preds": { "alpha": "drone(d)",
                        "interval": "[0,0]" },
             "rule": "Boxplus[0,inf)drone(X) :- drone(X)",
             "succ": "Boxplus[0,inf)drone(d)@[0,0]"
         }


Figure 3: A single connection represented as a Python dictionary


Intuitively, preds specifies the list of facts from which succ is derived, while rule represents
the inference label. Using native Python dictionaries is simple and fast, whereas initializing
custom data structures is often costly. Figure 3 shows the complete encoding of the inference of
⊞(︀0,∞) drone(d)@(︀0, 0⌋︀ as seen in Figure 2. The global hypergraph 𝒢 collects each connection
produced during the various stages of materialization. In addition to the DatalogMTL rules
themselves, we also trace the application of temporal operators in the apply and reverse_apply
functions.
Reverse Apply. For the reverse_apply function in particular, we had to do more extensive
modifications. In its original form, the source code was not suitable to connect a metric fact to
an underlying non-metric fact, in particular for multiple nested operators, e.g. as in ⊟𝜌 |𝜌′ 𝐴,
since the metric operators were processed in a different order. Thus, we adjusted the existing
reverse_apply function to enable step-wise recording of complex facts in an outside-inside
fashion. First, the metric fact is passed to reverse_apply along with an auxiliary list ℒ for
recording relevant information. Second, its outermost temporal operator is recursively popped
and the interval corresponding to the inner literal is recorded in ℒ. Last, we stop the tracing
process when the innermost atom and its interval are reached, and merge ℒ with 𝒢 by converting
each traced fact into a connection as described above.
Pre-processing, Coalescing and Post-processing. Since the initial dataset is required to
construct a full proof with Evee, each fact from 𝒟 is already added to 𝒢 with label “Asserted”
(see Figure 2) while loading the dataset. Furthermore, as mentioned in Section 2, time intervals
of ground facts are coalesced to represent the final dataset in a more human-readable way and
to ensure the correctness of the reasoner. Therefore, we extend MeTeoR accordingly to trace
the coalescing step whenever it occurs in the reasoning process and record the information in 𝒢.
After MeTeoR is finished, we run a simple conversion step to transform the inferences recorded
in 𝒢 into a JSON format compatible with Evee. Finally, a small Java program using Evee
reads the JSON file and outputs a size-minimal proof for the goal fact 𝐹 .


4. Tracing Evaluation
We conducted a small evaluation of the feasibility of the described approach.
Machine Configuration. We used a desktop machine with Windows 10 Home Edition (64-bit),
AMD Ryzen 5600 3.5 GHz CPU, 32 GB of RAM, and memory channels operating on 3200 MHz.
All results are produced from a single attempt.
LUBM Benchmark. We used the LUBM benchmark generator [24] and scripts from [8] to
construct several temporal datasets, ranging in size from 10 k to 100 k facts. Each fact is
randomly assigned a time interval. We use the same DatalogMTL program as in [8], which
contains 85 rules over the LUBM predicates. Due to randomization in the dataset generation,
we cannot use the same facts as in the experiments of [8]. Instead, we run the materialization
for 6 rounds, after which the majority of rules have been applied and only a few recursive rules
can be applied indefinitely. We then chose facts derived in the last materialization round to
                                140                          with tracing
                                                             without tracing
                                120

                                100

                                80
                     Time (s)
                                60

                                40

                                20

                                      10   20   30   40      50 60 70          80   90   100
                                                          Amount of data (k)

Figure 4: CPU time comparison between enabled and disabled tracing in MeTeoR


                                                               polynomial
                                800                            proof time


                                600
                     Time (s)


                                400


                                200


                                 0
                                      10   20   30   40      50 60 70          80   90   100
                                                          Amount of data (k)

Figure 5: Runtime of proof extraction with Evee, compared to a quadratic polynomial


generate proofs.
Tracing in MeTeoR. We hypothesize that, while tracing slows down the reasoning process of
MeTeoR, it does not significantly worsen the overall performance. To test this, we compare the
total CPU time of our implementation for computing the hypergraph 𝒢 to the runtime of the
unmodified reasoner, and report the results in Figure 4. As the size of the dataset grows, a more
significant performance gap is seen between the baseline and our implementation. However, on
average, there is only a 37% increase of runtime, and the tracing does not destroy the nice linear
scaling behavior of MeTeoR, which confirms our hypothesis.
Proof Extraction. We also measure the runtime of the Evee library for finding a minimal proof in
𝒢, with the results shown in Figure 5. Compared to the tracing, Evee takes considerably longer
to produce minimal-size proofs (e.g. 884 s vs. 140 s on the largest dataset). The reason behind
this blow-up is the polynomial time complexity of finding minimal proofs [10]. In particular,
the Dijkstra-like algorithm requires nested iteration over all inferences deriving a specific fact,
in order to find the ones resulting in minimal proofs. Figure 5 also shows a quadratic function
fitted on the first 5 data points (in red), which differs only minimally from the real runtimes.
Consequently, this shows that not only is the proof extraction with Evee theoretically tractable,
but it is also efficient. For the real-time drone use case, we expect much fewer than 10k input
facts, since we only have to consider measurements of around 20–40 sensors over the last 20–60
seconds.


5. Proof Verbalization
The overall goal of the larger system is to convey critical information during handover and
enable the drone operator to take control. Such critical information may be given as an auditory
message, as the operator will already be dealing with lots of other visual input. The generated
temporal proofs are represented as a Python dictionary or in JSON (see Figure 3) and contain a
long step-by-step logical proof. This makes it hard to present relevant information from proofs
comprehensibly to drone pilots.
   Recently emerging large language models (LLMs) such as GPT-4 [25], LLaMA-2 [26], Gem-
ini [27] can be prompted to perform a wide variety of generative tasks, including code completion
and documentation generation. This is ideal for our setup, because prompting strategies such
as few-shot prompting only need a few input examples, and do not require a large in-domain
dataset for training. We use Gemini-Pro [27] LLM for our experiments, as it performs better
than GPT-3.5-Turbo on particularly long and complex reasoning tasks [28]. We deploy Gemini
in a zero-shot manner to generate an actionable explanation using temporal proofs.
Approach. Generating explanations from proofs requires an LLM to identify the relevant proof
steps that provide reasoning for the critical situation and verbalize these steps in a coherent
manner. We first attempt a select-then-summarize prompting approach, where we prompt the
LLM to filter relevant proof steps and then verbalize them. Although the LLM could identify
relevant proof steps, it could not generate a coherent explanation only from the selected steps.
We speculate that this was due to the missing context of the entire situation. Hence, we propose
a verbalize-then-summarize (VeSum) prompting approach for generating explanations from the


            Input Prompt:                                               LLM Output:
                                                                        "Verbalization":
               Temporal Proof
                                                                        ["We can generate a warning for the
                    {"finalConclusion" : "risk(d)@[10,10]",             human drone pilot because at time 10,
                                "inferences" : [                        there is a risk of internal damage to the
                      {"conclusion" : "risk(d)@[10,10]",                drone (d).",
                            "ruleName" : "inclusion",
                      "premises" : [ "risk(d)@[10,+inf)"]}              "The risk of internal damage is because
                                         ,                              the drone (d) is experiencing a high
                     {"conclusion" : "risk(d)@[10,+inf)",               temperature of 30 degrees Celsius at
               "ruleName" : "risk(X):-riskofinternaldamage(X)",
                                                                        time 10.",
             "premises" : [ "riskofinternaldamage(d)@[10,+inf)" ]}
                                       , ...
                                                                        "The high temperature is likely because
                 Instructions                                           the drone (d) is flying.",

                   The JSON above contains a proof …                    "We know the drone (d) is flying
                                                                        because it was asserted to be true."],
            Symbols used in the JSON and their meanings: …
                                                                        "Summary":
                Only respond with a JSON containing: …
                                                                        "At time 10, there is a risk of internal
                    Summary Generation Instructions:                    damage to your drone (d) because it
                                                                        has a high temperature of 30 degrees
             Only respond with a JSON response containing               Celsius. This is likely because the
                     "Verbalization" and "Summary".                     drone is flying."


Figure 6: VeSum prompt design and a verbalization example using Gemini-Pro. The input prompt consists of
Temporal Proof and Instructions that guide LLMs in generating explanations. The instructions are divided into
two parts – data description instructions and task-specific instructions.
proofs. VeSum first instructs the LLM to produce a verbalization of the entire proof, and then
use the generated verbalization to generate an explanation summary. VeSum is inspired by
well-known prompting approaches – Chain-of-Thought (CoT) [29] and ReAct [30] prompting.
The prompt design of VeSum and the sample response for the example input (see Figure 2) are
shown in Figure 6.
Challenges. The major challenges of using LLMs for generating explanations from proof trees
are as follows:

   • LLMs are very sensitive to input instructions [31], and input templates [32]. Hence, finding
     the optimal instructions for long text generation is challenging.
   • For our task, the predicate names are mostly descriptive and verbose (e.g.
     hightemperature(X) ∶– temperature(X, S), >(S, 25)). In our preliminary experiments, when
     we shortened the names (e.g. hightemperature(X) to HT(X)), we found that models failed
     to understand the meaning of the rules. This suggests that the predicate names have
     a noticeable impact on the generated explanations, hence they should be chosen care-
     fully. Alternatively, one could provide additional descriptions of each predicate via string
     annotations and incorporate them into the prompt.
   • LLMs can write fluently and grammatically; their advantage over traditional template-
     based methods lies in their generalizability to new sensor input and better naturalness
     and variability of expression. However, the generated text is not always grounded in the
     given information. For instance, if we take a closer look at the generated summary in
     Figure 6, we can see that LLM generates the reason for the warning as “because it has a
     high temperature of 30 degrees Celsius. This is likely because the drone is flying.” While
     the proof mentions the high temperature, it does not specify any further reason for this
     measurement (see Figure 2).


6. Conclusions and Future Work
Currently, our MeTeoR-based system can explain facts entailed by a fixed dataset that involves
timestamps and numerical measurements. To achieve this, we implemented inference tracing for
MeTeoR and used Evee to extract small proofs. We showed that the additional components
have a minor impact on computational resources. We have started to explore how to generate
natural language explanations from these proofs.
   In the future, we want to improve our system in several directions: (i) consider real-time
operations in a streaming setting, in which new facts continuously come in and old facts are
removed, (ii) improve the quality of the generated verbal messages, and (iii) reduce cognitive
load for the drone operator in situations with multiple warnings.
   For (i), assuming that sensor measurements arrive incrementally every second, we should
avoid re-generating the same proof over and over again if it still holds. Using the streaming
version of MeTeoR for forward-propagating rules [33], our system could also be adapted to the
forgetting mechanism using sliding windows. This dramatically reduces the data size and makes
real-time processing feasible. Nevertheless, we can still improve the tracing and proof extraction
themselves, e.g. by goal-directed tracing of the inferences that lead to a target fact, or by using
optimized data structures for proof extraction.
   For (ii), we will first investigate better prompting techniques that leverage proofs to generate
more strongly grounded explanations. We will also experiment with more diverse proof formats
based on first-order logic or description logic to test the models’ abilities. We also aim to develop
a systematic understanding of how different phrasing choices impact explanation quality, which
will inform guidelines for designing new ontologies and improving prompting methods for existing
ones. Additionally, we will explore the latest open-source LLMs, evaluating their potential for
fine-tuning on the annotated data to improve task performance.
   For (iii), we must identify which situations are more critical than others, and should be
delivered to the drone operator with the highest priority. Here, the system should be able to
rank warnings and their explanations according to their severity, and also convey the severity
level verbally.

Acknowledgements This work was partially supported by the DFG in grant 389792660 as part
of TRR 248 (https://perspicuous-computing.science).


References
 [1] T. Fuhrman, D. Schneider, F. Altenberg, T. Nguyen, S. Blasen, S. Constantin, A. Waibe,
     An interactive indoor drone assistant, in: 2019 IEEE/RSJ International Conference on
     Intelligent Robots and Systems (IROS), 2019, pp. 6052–6057. doi:10.1109/IROS40897.
     2019.8967587.
 [2] F. Baader, I. Horrocks, C. Lutz, U. Sattler, An Introduction to Description Logic, Cambridge
     University Press, 2017. doi:10.1017/9781139025355.
 [3] C. Baral, M. Gelfond, Logic programming and knowledge representation, The Journal
     of Logic Programming 19-20 (1994) 73–148. doi:10.1016/0743-1066(94)90025-6, Special
     Issue: Ten Years of Logic Programming.
 [4] Y. Kazakov, M. Krötzsch, F. Simancik, The incredible ELK – from polynomial procedures
     to efficient reasoning with ℰℒ ontologies, J. Autom. Reasoning 53 (2014) 1–61. doi:10.
     1007/s10817-013-9296-3.
 [5] Y. Kazakov, P. Klinov, A. Stupnikov, Towards reusable explanation services in Protege,
     in: A. Artale, B. Glimm, R. Kontchakov (Eds.), Proc. of the 30th Int. Workshop on
     Description Logics (DL’17), volume 1879 of CEUR Workshop Proceedings, 2017. URL:
     http://www.ceur-ws.org/Vol-1879/paper31.pdf.
 [6] A. Metke-Jimenez, M. Lawley, Snorocket 2.0: Concrete domains and concurrent classification,
     in: S. Bail, B. Glimm, R. S. Gonçalves, E. Jiménez-Ruiz, Y. Kazakov, N. Matentzoglu,
     B. Parsia (Eds.), Informal Proceedings of the 2nd International Workshop on OWL Reasoner
     Evaluation (ORE-2013), volume 1015 of CEUR Workshop Proceedings, CEUR-WS.org,
     2013, pp. 32–38. URL: http://ceur-ws.org/Vol-1015/paper_3.pdf.
 [7] A. Bate, B. Motik, B. C. Grau, F. Simancik, I. Horrocks, Extending consequence-based
     reasoning to SRIQ, in: C. Baral, J. P. Delgrande, F. Wolter (Eds.), Principles of Knowledge
     Representation and Reasoning: Proceedings of the Fifteenth International Conference, KR,
     AAAI Press, 2016, pp. 187–196. URL: http://www.aaai.org/ocs/index.php/KR/KR16/
     paper/view/12882.
 [8] D. Wang, P. Hu, P. A. Walega, B. C. Grau, MeTeoR: Practical reasoning in Datalog with
     metric temporal operators, in: Thirty-Sixth AAAI Conference on Artificial Intelligence,
     AAAI 2022, AAAI Press, 2022, pp. 5906–5913. doi:10.1609/AAAI.V36I5.20535.
 [9] C. Alrabbaa, F. Baader, S. Borgwardt, P. Koopmann, A. Kovtunova, Finding small proofs
     for description logic entailments: Theory and practice, in: E. Albert, L. Kovacs (Eds.),
     LPAR-23: 23rd International Conference on Logic for Programming, Artificial Intelligence
     and Reasoning, volume 73 of EPiC Series in Computing, EasyChair, 2020, pp. 32–67.
     doi:10.29007/nhpp.
[10] C. Alrabbaa, F. Baader, S. Borgwardt, P. Koopmann, A. Kovtunova, Finding good proofs
     for description logic entailments using recursive quality measures, in: A. Platzer, G. Sutcliffe
     (Eds.), Automated Deduction - CADE 28 - 28th International Conference on Automated
     Deduction, Proceedings, volume 12699 of Lecture Notes in Computer Science, Springer,
     2021, pp. 291–308. doi:10.1007/978-3-030-79876-5_17.
[11] S. Borgwardt, E. Chang, K. Chapman, V. Demberg, A. Kovtunova, H. Yeh, Logic-guided
     neural utterance generation from drone sensory data (extended abstract), in: M. Homola,
     V. Ryzhikov, R. A. Schmidt (Eds.), Proceedings of the 34th International Workshop on
     Description Logics (DL 2021), volume 2954 of CEUR Workshop Proceedings, CEUR-WS.org,
     2021. URL: https://ceur-ws.org/Vol-2954/abstract-11.pdf.
[12] E. Chang, A. Kovtunova, S. Borgwardt, V. Demberg, K. Chapman, H. Yeh, Logic-guided
     message generation from raw real-time sensor data, in: N. Calzolari, F. Béchet, P. Blache,
     K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo,
     J. Odijk, S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and
     Evaluation Conference, LREC 2022, European Language Resources Association, 2022, pp.
     6899–6908. URL: https://aclanthology.org/2022.lrec-1.745.
[13] F. Baader, P. Hanschke, A scheme for integrating concrete domains into concept languages,
     in: J. Mylopoulos, R. Reiter (Eds.), Proc. IJCAI Conference, Morgan Kaufmann, 1991, pp.
     452–457. URL: http://ijcai.org/Proceedings/91-1/Papers/070.pdf.
[14] C. Lutz, Description logics with concrete domains—a survey, in: Advances in Modal Logics
     Volume 4, World Scientific Publishing Co. Pte. Ltd., 2003.
[15] C. Alrabbaa, F. Baader, S. Borgwardt, P. Koopmann, A. Kovtunova, Combining proofs
     for description logic and concrete domain reasoning, in: A. Fensel, A. Ozaki, D. Roman,
     A. Soylu (Eds.), Rules and Reasoning - 7th International Joint Conference, RuleML+RR
     2023, Proceedings, volume 14244 of Lecture Notes in Computer Science, Springer, 2023, pp.
     54–69. doi:10.1007/978-3-031-45072-3_4.
[16] L. Westhofen, C. Neurohr, J. C. Jung, D. Neider, Answering temporal conjunctive queries
     over description logic ontologies for situation recognition in complex operational domains, in:
     B. Finkbeiner, L. Kovács (Eds.), Tools and Algorithms for the Construction and Analysis
     of Systems - 30th International Conference, TACAS, volume 14570 of Lecture Notes in
     Computer Science, Springer, 2024, pp. 167–187. doi:10.1007/978-3-031-57246-3_10.
[17] C. Alrabbaa, S. Borgwardt, T. Friese, P. Koopmann, J. Méndez, A. Popovič, On the eve of
     true explainability for OWL ontologies: Description logic proofs with Evee and Evonne, in:
     O. Arieli, M. Homola, J. C. Jung, M. Mugnier (Eds.), Proceedings of the 35th International
     Workshop on Description Logics (DL 2022), volume 3263 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3263/paper-2.pdf.
[18] A. Kovtunova, S. Borgwardt, Drone program for critical situation detection over sensor
     data streams, 2024. doi:10.5281/zenodo.10822865.
[19] S. Brandt, E. G. Kalayci, V. Ryzhikov, G. Xiao, M. Zakharyaschev, Querying log data with
     metric temporal logic, J. Artif. Intell. Res. 62 (2018) 829–877. doi:10.1613/jair.1.11229.
[20] P. A. Walega, B. C. Grau, M. Kaminski, E. V. Kostylev, DatalogMTL: Computational
     complexity and expressive power, in: S. Kraus (Ed.), Proceedings of the Twenty-Eighth
     International Joint Conference on Artificial Intelligence, IJCAI, ijcai.org, 2019, pp. 1886–
     1892. doi:10.24963/ijcai.2019/261.
[21] P. A. Walega, M. Zawidzki, D. Wang, B. C. Grau, Materialisation-based reasoning in
     DatalogMTL with bounded intervals, in: B. Williams, Y. Chen, J. Neville (Eds.), Thirty-
     Seventh AAAI Conference on Artificial Intelligence (AAAI), AAAI Press, 2023, pp. 6566–
     6574. doi:10.1609/AAAI.V37I5.25807.
[22] D. Wang, P. A. Walega, B. C. Grau, Seminaïve materialisation in DatalogMTL, in:
     G. Governatori, A. Turhan (Eds.), Rules and Reasoning - 6th International Joint Conference
     on Rules and Reasoning, RuleML+RR, volume 13752 of Lecture Notes in Computer Science,
     Springer, 2022, pp. 183–197. doi:10.1007/978-3-031-21541-4_12.
[23] C. Alrabbaa, S. Borgwardt, P. Koopmann, A. Kovtunova, Finding good proofs for answers
     to conjunctive queries mediated by lightweight ontologies, in: O. Arieli, M. Homola, J. C.
     Jung, M. Mugnier (Eds.), Proceedings of the 35th International Workshop on Description
     Logics (DL 2022), volume 3263 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL:
     https://ceur-ws.org/Vol-3263/paper-3.pdf.
[24] Y. Guo, Z. Pan, J. Heflin, LUBM: A benchmark for OWL knowledge base systems, J. Web
     Semant. 3 (2005) 158–182. doi:10.1016/j.websem.2005.06.005.
[25] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida,
     J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint
     arXiv:2303.08774 (2023). doi:10.48550/arXiv.2303.08774.
[26] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
     P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models,
     arXiv preprint arXiv:2307.09288 (2023). doi:10.48550/arXiv.2307.09288.
[27] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk,
     A. M. Dai, A. Hauth, et al., Gemini: A family of highly capable multimodal models, arXiv
     preprint arXiv:2312.11805 (2023). doi:10.48550/arXiv.2312.11805.
[28] S. N. Akter, Z. Yu, A. Muhamed, T. Ou, A. Bäuerle, Á. A. Cabrera, K. Dholakia, C. Xiong,
     G. Neubig, An in-depth look at Gemini’s language abilities, arXiv preprint arXiv:2312.11444
     (2023). doi:10.48550/arXiv.2312.11444.
[29] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al.,
     Chain-of-thought prompting elicits reasoning in large language models, Advances in neural
     information processing systems 35 (2022) 24824–24837. URL: https://openreview.net/
     forum?id=_VjQlMeSB_J.
[30] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, Y. Cao, ReAct: Synergizing
     reasoning and acting in language models, in: The Eleventh International Conference on
     Learning Representations, 2023. URL: https://openreview.net/forum?id=WE_vluYUL-X.
[31] M. Loya, D. Sinha, R. Futrell, Exploring the sensitivity of LLMs’ decision-making capabilities:
     Insights from prompt variations and hyperparameters, in: H. Bouamor, J. Pino, K. Bali
     (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association
     for Computational Linguistics, Singapore, 2023, pp. 3711–3716. doi:10.18653/v1/2023.
     findings-emnlp.241.
[32] M. Sclar, Y. Choi, Y. Tsvetkov, A. Suhr, Quantifying language models’ sensitivity to
     spurious features in prompt design or: How I learned to start worrying about prompt
     formatting, arXiv preprint arXiv:2310.11324 (2023). doi:10.48550/arXiv.2310.11324.
[33] P. A. Walega, M. Kaminski, D. Wang, B. C. Grau, Stream reasoning with DatalogMTL, J.
     Web Semant. 76 (2023) 100776. doi:10.1016/j.websem.2023.100776.