1. Introduction

Computing Compact Answers to Temporal Path Queries Using SQL

Muhammad Adnan

Diego Calvanese

diego.calvanese@unibz.it 0

Julien Corman

JulienLouisMichel.Corman@unibz.it 0

Anton Dignös

anton.dignoes@unibz.it 0

Werner Nutt

werner.nutt@unibz.it 0

Ognjen Savković

Ognjen.Savkovic@unibz.it 0 0 Free University of Bozen-Bolzano , Bolzano , Italy

Temporal Regular Path Queries (TRPQs) are a recent extension of regular path queries over a graph where facts are annotated with time intervals. They enable navigation both in time and over the structure of the graph. TRPQs return pairs of entities, each associated with a binary temporal relation, which relates the two entities through time. This allows modelling phenomena phenomena such as virus propagation or mapping possible trip departures to arrival times when there is uncertainty about trafic. A key challenge of TRPQs is representing binary temporal relations in a compact way, and ensuring that these compact representations can be computed eficiently. While these problems have been recently investigated from the theoretical side, little attention has been paid to corresponding implementation techniques. In this work, we address this gap by introducing the first SQL-based implementation of TRPQ answering that produces compact answers. We investigate two alternative formats for compact answers. For each format, we first lay the foundations for an eficient implementation by translating TRPQ operations into operations over compact answers, thus preserving compactness during the evaluation process. In addition, we apply state-of-the-art interval coalescing techniques to reduce the cost of temporal joins and ensure that our results have minimal cardinality. We also present a dedicated benchmark and parameterized experiments that illustrate the trade-ofs between the two compact representations, depending on the length of intervals in the input data and query. Our empirical ifndings also reveal the critical role of coalescing for eficient query answering.

eol>Temporal Knowledge Graphs Regular Path Queries Graph Databases Property Graphs SQL

1. Introduction

With the growing popularity of graph database (DB) engines, several proposals have been made recently to extend graphs with temporal properties, in order to store and access information about the evolution of data over time [ 1, 2, 3, 4, 5, 6 ]. We focus here on Temporal Graphs (TGs), where each fact is labeled with a set of time intervals that specify its validity. Equivalently, a TG can be viewed as a sequence of “snapshot” graphs, one for each time point, which consists of all facts that hold at that time point. Figure 1 represents a TG with time unit one hour. For conciseness, we represent it as a so-called Property Graph, one of the most popular graph data models [7]. In such a graph, both vertices (like n1) and edges (like e1) can carry attributes. However, without loss of expressivity, the same data could be represented as a (less concise) edge-labelled graph (e.g. an RDF graph) with time intervals associated to each edge.

In order to query such a graph, a sensible approach consists in extending a graph query language with temporal operators. Graph query languages, such as Cypher [7] or SPARQL [8], are based on navigational queries, whose basic form are so-called Regular Path Queries (RPQs). An RPQ is a regular expression, and a pair ⟨1, 2⟩ of objects in a graph is in the answer to if there exists a path from 1 to 2 whose concatenated labels match this regular expression.

e1 meets [200, 204]

n2 name = Bob [0, 600] temp = 39 [110, 120] temp = 38 [230, 270]

e2 meets [320, 330]

e3 meets [100, 101]

n4 name = David [0, 600] temp = 38 [320, 330]

A natural extension of such queries consists in allowing navigation not only through the graph, but also through time. To this end, we consider Temporal RPQs (TRPQs), originally proposed by [ 1 ], which extend RPQs with a temporal navigation operator, allowing navigation from one object in a snapshot graph to the same object in a past or future snapshot graph. Hence, the answer to a TRPQ is a set of pairs ⟨⟨1, 1⟩, ⟨2, 2⟩⟩, where 1 and 2 are associated with a time point each, respectively 1 and 2.

As an illustration, consider the TRPQ 1 below that retrieves all pairs ⟨︀ ⟨1, 1⟩, ⟨2, 2⟩⟩︀ such that person 1 tested positive at time 1, 1 met 2 within a week prior to 1, and 2 had high temperature at time 2, less than two days after the meeting:

1 := (pos = true)/T[−168,0] /F/meets/F/T[0,48]/(temp ≥ 38 ) The expressions pos = true, meets and temp ≥ 38 “locally” check whether a node or edge satisfies a certain property. The operator F stands for (atemporal) forward navigation, either from a node to an edge or conversely. The symbol “/” represents a join that connects navigation steps. The operator T[−168,0] stands for temporal navigation in the past by at most a week (168 hours), and T[0,48] for temporal navigation in the future by at most two days. There are 23 answers to 1 over the TPG of Figure 1, precisely one tuple ⟨︀ ⟨1, 300⟩, ⟨2, 2⟩⟩︀ for each integer 2 in the interval [230, 252].

Surprisingly, this simple idea is an important departure from the way query answers are traditionally represented in temporal databases, where each tuple is instead associated with a single time point or interval for validity.1 In particular, a central problem for traditional temporal query answering is producing answers in a compact form, using time intervals. A natural solution to this problem consists in computing answers in so-called coalesced form, using time intervals. This approach has long been adopted by temporal DB engines (e.g., [11, 12]) and also adapted for graph query languages such as a T-GQL [ 2 ]. In those approaches, the time points assigned to each tuple are coalesced into intervals, so that temporal joins only require computing intersections of intervals. However, the analogous problem for the case where each answer is associated with a pair of time points (for validity) is significantly more involved. And to our knowledge, it has only been investigated very recently, in our previous work [13].

Another interesting feature of TRPQs is the transitive closure operator (written [, ]), inherited from RPQs. When applied to a temporal domain, this operator ofers a natural way to express reachability under certain temporal constraints. For instance, let us assume that our virus may be carried at most one week by the same person. Then the query

2 := (T[0,168]/F/meets/F)[ 1, ] 1Bitemporal databases [9, 10] do associate two timepoints (or intervals) to each tuple, but only one of these stands for validity, while the other one represents the (orthogonal) notion of transaction time. returns all pairs ⟨︀ ⟨1, 1⟩, ⟨2, 2⟩⟩︀ such that if 1 was carrying the virus at time 1, then it may have transitively transmitted it to person 2 at time 2. So the query

3 := (pos = true)/T[−168,0] /F/meets/F/2 identifies people at risk (namely Alice, Bob, and Carol).

As we showed above, the 23 answers to the TRPQ 1 can be coalesced with a single time interval for 2. And similarly, for 3, the 16 answers can be coalesced with only two time intervals. It is easy to see that computing all answers before summarizing them may be ineficient. For instance, a naive evaluation of query 1 over the TPG of Figure 1 may join the 169 answers to the subquery (pos = true)/T[0,−168] with the 20 answers to the subquery F/meets. Worse, a change of time granularity may have a dramatic impact on performance. E.g., the TPG of Figure 1 does not allow representing meetings shorter than an hour. But adopting minutes as a time unit instead of hours would multiply by 60 the cardinality of the operands of each join. So it is essential to not only represent answers in a compact way, but also to maintain compactness during query evaluation.

Contributions. In our recent work [13], we defined and studied four alternative compact representations of answers to a TRPQ, which can be viewed as alternative formats for (relational) tuples. We focused on the worst-case compactness of query answers and primarily addressed computational cost and the uniqueness of query answering. However, practical implementations of any of those representations were left open.

In this paper, we focus on the first two of these four representations. Our contributions are the following: • We provide a detailed analysis of TRPQ operations over tuples in each of these two formats, which serves as a basis for our SQL implementation. • Based on this analysis, we present the first implementation of these two compact representations for a set of test queries developed using the PostgreSQL database system. For interval coalescing, we apply state-of-the-art techniques. • We describe a set of parameterized experiments that are meant to illustrate the trade-of between our two representations, depending on whether intervals are longer in the input graph or in the input query. These experiments also highlight the importance of temporal coalescing for eficient query answering.

Our SQL implementations and guidelines to reproduce our experiments are available in the repository: https://github.com/osavkovic/CompactTRPQ.

Organization. The remainder of the paper is organized as follows. In Section 1.1, we provide an informal overview of the two representations that we study, via a running example. Then, Section 2 formalizes TGs and TRPQs, and Section 3 characterizes how to compute compact answers in our two representations. In Section 4, we discuss our implementation technique, and in Section 5, we present our experimental evaluation. In Section 6, we review related work, and in Section 7, we present our conclusions and discuss directions for future work.

1.1. Running Example

This section illustrates the two compact representations of answers to TRPQs investigated in this article. A key insight to understand these representations is the trade-of between folding either (pairs of) time points, or distances between time points. Let us consider the query

4 = (name = Alice)/F/meets/T[ 2,3 ]/F The answers to this query over the graph of Figure 1 (assuming discrete time) are listed in Figure 2, upper left.

Start point

End point Object Time

Object Time 1 1 1 1 1 Repr.

The first compact representation is obtained by folding start time points into intervals, while grouping answers by objects and distance between start and end point. We use to denote this format, which yields in our example the tuples in Figure 2, upper right. As we will see, this solution is better-suited to inputs where time intervals in the graph are larger than the ones present in the query (such as the interval [ 2, 3 ] in Query 4). For instance, even if one extends the duration of the meeting between Alice and Bob to 10 hours, from time 200 to 210, there are still only two compact answers under , one for each distance in the interval [ 2, 3 ], namely ⟨1, [200, 208], 2, 2⟩ and ⟨1, [200, 207], 3, 2⟩. In contrast, the number of tuples may grow linearly in the length of the distance interval in the query.

A second, symmetric solution consists in folding distances, while grouping tuples by objects and starting time (or alternatively, end time). We call this format , which yields in our example the tuples of Figure 2, bottom left. In contrast to , this format is better-suited to the case where time intervals in the query are larger than those in the input graph. In our example, increasing the distance interval in Query 4 to [ 0, 3 ] would not afect the number of tuples. However, this number grows linearly in the duration of the meeting between Alice and Bob.

A natural question is whether one can combine these two solutions, i.e., fold both time points and distances. We call this format . In our example, this yields the tuples of Figure 2, bottom right. However, in [13], we show that the number of tuples in this format is still linear in the length of the input intervals. Besides, uniqueness of representation is lost, in the sense that there may exist several (cardinality) minimal sets of tuples under this view that represent the set of answers to a query. More importantly, for practical purposes, computing one of these minimal sets of tuples becomes intractable. In contrast, as we will see producing a minimal set of answers in or (out of a non minimal one) remains in in ( log ). In [13] we also define a fourth, more complex representation, where the number of answer tuples is independent of the size of the input (graph and query) intervals. However, minimization in this format remains intractable. This is why in this paper we focus on and only.

2. Preliminaries

Temporal Graphs. We adopt the same data model as in [ 1 ], with only a slight modification in order to generalize the approach beyond Property Graphs (PGs). Precisely, we abstract away from the specific representation of classes, labels, and attributes in PGs. Instead, we use a generic set Pred of boolean predicates whose validity for a given node (or edge) and time point can be checked locally, meaning that this verification is independent of the topology of the graph. E.g., over the graph of Figure 1, such predicates may be {name = Bob}, {temp = 38}, or meets (i.e., whether an edge has label meets).

To simplify definitions, we are going to use a more convenient format ⟨1, 2, 1, 1⟩ that describes time per distance instead of ⟨⟨1, 1⟩, ⟨2, 2⟩⟩, which represents time per time. Here, 2 = 1 + 1.

Jpred K =

JT K =

JFK =

JBK =

Jpath1/path2K = Jpath1 + path2K =

Jpath[, _]K =

J⋃p︀aJtpha1KthK∪Jpath2K ≥ {⟨, , , 0⟩ | ∈ for some ∈ val (, pred )} {⟨, , , ⟩ | ∈ ( ∪ ), ∈ , ∈ } {⟨, , , 0⟩ | src() = and ∈ } ∪ {⟨, , , 0⟩ | tgt() = and ∈ } {⟨, , , 0⟩ | tgt() = and ∈ } ∪ {⟨, , , 0⟩ | src() = and ∈ } {⟨1, 3, , 1 + 2⟩ | ⟨1, 2, , 1⟩ ∈ Jpath1K and ⟨2, 3, + 1, 2⟩ ∈ Jpath2K for some 2}

Further, as in [ 1 ], we assume discrete time. For simplicity, we chose Z as our underlying temporal domain, and we use intv(Z) (resp. intv( )) for the set of all nonempty intervals over Z (resp. over some ∈ intv(Z)). A Temporal Graph (TG) is a tuple = ⟨, , conn, , val⟩, where: • and are finite sets of nodes and edges respectively, with ∩ = ∅, • conn : → × maps an edge to its source and target, • is a closed-closed interval over Z, called the active temporal domain, and • val : ( ∪ ) × Pred → 2 intv() assigns a finite set of disjoint and pairwise non-adjacent intervals to each object and predicate , indicating when holds for .

If conn() = (1, 2), we use src() for 1 and tgt() for 2.

Temporal Regular Path Queries. We focus on a fragment of the query language introduced in [ 1 ], with minor modifications that allow us abstract away from Cypher and Property Graphs, so that our approach may be applied to other graph data model (with time intervals) and other (RPQ-based) graph query languages (such as RDF).

A Temporal Regular Path Query (TRPQ) is an expression for the symbol “path” in the following grammar:

path ::= pred | F | B | T | (path/path) | (path + path) | path[, _] with ∈ intv(Z) and ∈ N.

The operator F (resp. B) stands for forward (resp., backward) atemporal navigation within a graph, either from a node to an edge or from an edge to a node, whereas the temporal navigation operator T allows navigation in time by any distance in the interval . The terminal symbol pred stands for any element of Pred , i.e., a Boolean predicate that can be evaluated locally for one object and time point, as explained above. The other operators are standard RPQ (a.k.a. regular expression) operators. The formal semantics of TRPQs is provided in Figure 3, where JK is the evaluation of a TRPQ over a TG . In this definition, we use for the TRPQ defined inductively by 1 = and +1 = /. For convenience, we represent (w.l.o.g.) an answer as two objects, one time point and a distance, rather than two objects and two time points, i.e. we use tuples of the form ⟨1, 2, , ⟩ rather than ⟨︀ ⟨1, ⟩, ⟨2, + ⟩⟩︀ .

Operations on Intervals. For two intervals , ∈ intv(Z) , we use ⊕

{ + | ∈ , ∈ }. We also use + (resp. − ) for ⊕ [, ] (resp. ⊕ [−, −]). to denote the interval

3. Formal Characterization

In this section, we present a formal characterization of how to compute compact answers in the and formats. We begin by describing these two representation formats in Section 3.1. Next, in Section 3.2, we show how the operations of our query language can be translated into corresponding operations over sets of tuples in each format. This will allow us (in Section 4.2) to implement queries that operate on (resp. ), rather than .

+ ∈ to exactly one target time point + . (a) A tuple ⟨1, 2, , ⟩ ∈ associates each source time point

+ (b) A tuple ⟨1, 2, , ⟩

∈ associates the source time point to every target time point in the interval + .

3.1. Representations

We use denote the universe of all tuples that may be output by TRPQs, i.e.,

= ( ∪ ) × ( ∪ ) × Z × Z of a set ⊆

(resp. ) of such tuples is the union of the unfoldings of the elements of .

Our two representations and can be viewed as alternative formats to encode subsets of . A tuple u in (resp. ) represents a subset of , which we call the unfolding of u. And the unfolding

Folding Time Points ( ).

Tuples under this view are identical to elements of , but where the time points associated to source objects are represented as intervals. Accordingly, the universe of tuples is = ( ∪ ) × ( ∪ ) × intv(Z) × Z and the tuple ⟨1, 2, , ⟩ ∈ unfolds to {⟨1, 2, , ⟩ | ∈ } . Intuitively, this representation associates each source time point ∈ with a unique target time point + , as illustrated with

Folding Distances ( ).

for distances (rather than time points), i.e.,

This representation is symmetric to the previous one, using now intervals

= ( ∪ ) × ( ∪ ) × Z × intv(Z) and the tuple ⟨1, 2, , ⟩ ∈

unfolds to {⟨1, 2, , ⟩ | ∈ }.

A source time point is now associated with multiple target time points, namely each + such that ∈ , as illustrated with Figure 4b. 3.2. Operations in and

We show how to translate TRPQ operations (which are defined over ) into operations over (resp. ), while preserving their semantics. These two translations (together with proofs of correctness) are already available online [14]. We reproduce them here to show how they lay the foundation for an implementation. Besides, for some operators, we show how departing from a literal implementation of these formal definitions may yield more eficient queries.

In .

The most interesting operator in this translation is the temporal join 1/2, illustrated with Figure 5,

For a TRPQ and TG , we define by induction on a subset LM of that unfolds to JK. and defined as follows:

Lpath1/path2M = {︀ ⟨1, 3, (( 1 + 1) ∩ 2) − 1, 1 + 2⟩ | ⟨1, 2, 1, 1⟩ ∈ Lpath1M and ⟨2, 3, 2, 2⟩ ∈ Lpath2M and ( 1 + 1) ∩ 2 ̸= ∅ for some 2}︀ (1) Example 1. As a simple illustration, consider the output tuple ⟨1, 2, [200, 203], 2⟩ from our running example, and assume that we want to compute the join with ⟨2, 3, [203, 206], 2⟩. First, we need to determine which time points for the object 2 are common to both tuples. These are exactly 203, 204, and 205. More generally, such points are obtained via the intersection ( 1 + 1) ∩ 2 (the olive-colored interval in Fig. 5). Next, we need to project this restriction back onto 1 to determine the joinable source interval. This results in [201, 203], which is computed as (( 1 + 1) ∩ 2) − 1. Similarly, we apply a corresponding restriction to 2. Finally, the new distance is 5, computed as 1 + 2, and the resulting joined tuple is ⟨1, 3, [201, 203], 5⟩.

We complete the definition of LM with the other (more straightforward) operators of the language: Lpred M

LFM LBM

LT M = {⟨, , , 0⟩ | ∈ ∪ and ∈ val (, pred )} = {⟨, , , 0⟩ | src() = } ∪ {⟨, , , 0⟩ | tgt() = } = {⟨, , , 0⟩ | tgt() = } ∪ {⟨, , , 0⟩ | src() = } = {⟨, , − ∩ , ⟩ | ∈ ∪ and ∈ } Ltrpq1 + trpq2M

= Ltrpq[, _]M =

Ltrpq1M ∪ Ltrpq2M ⋃︀ trpq ≥ L M

A key observation can be made from this definition: the size of LM does not depend on the size of the intervals used to label data in . However, it is linear (in the worst case) in the size of the intervals used in (for temporal navigation), as can be seen in the definition of LT M. Unfortunately, this is unavoidable, as we already observed in introduction.

In . Similarly to what we did above for , we define a subset LM of that unfolds to JK. We start once again with the temporal join, illustrated with Figure 6, and defined as follows: Lpath1/path2M = {︀ ⟨1, 3, 1, 2 + 2 − 1⟩ | ⟨1, 2, 1, 1⟩ ∈ Lpath1M and 2 ∈ 1 + 1 for some 2}︀ and ⟨2, 3, 2, 2⟩ ∈ Lpath2M (2) Example 2. Consider the tuple ⟨1, 2, 200, [ 2, 3 ]⟩ from our running example, and assume that we want to join it with ⟨2, 3, 202, [ 3, 4 ]⟩. First, we observe that 202 is in the interval 200 + [ 2, 3 ], which makes the join possible. Then we just need to calculate the output interval of distances, which is 202 − 200 + [ 3, 4 ] = [ 5, 6 ]. Hence, the resulting tuple is ⟨ 1, 3, 200, [ 5, 6 ]⟩.

A second interesting operator for is temporal navigation (T ), with:

LT M = {︀ ⟨, , , (( + ) ∩ ) − ⟩ | ∈ ∪ , ∈ and ( + ) ∩ ̸= ∅}︀ As can be seen from this definition, the size of LT M depends on the size of the active temporal domain . And this is unavoidable if JT K is represented in . However, a (sub)query of the form /T can be evaluated more eficiently, thanks to the following observation:

L/T M = {︀ ⟨1, 2, , ( ′ ⊕ ) ∩ ⟩ | ⟨1, 2, , ′⟩ ∈ LM and ( + ( ′ ⊕ )) ∩ ̸= ∅}︀

As can be seen from this equation, the cardinality of L/T M is bounded by the cardinality of LM, which makes this query evaluation strategy significantly more eficient (compared to evaluating and T independently, and then joining the results). And the same (symmetric) property holds for (sub)queries of the form T /.

An analogous observation can be made for the operators F and B. Evaluated independently, the size of their output is linear in the size if :

LLBFMM = {︀ ⟨, , , [0, 0]⟩ | src() = and ∈ } ∪ {⟨, , , [0, 0]⟩ | tgt() = and ∈ }︀ = {︀ ⟨, , , [0, 0]⟩ | tgt() = and ∈ } ∪ {⟨, , , [0, 0]⟩ | src() = and ∈ }︀ In contrast, the size of L/FM, L/BM, LF/M or LB/M only depends on LM and the topology of the graph (regardless of its intervals), as can been seen for instance with the following equality (the other three cases are symmetric):

L/FM = {⟨, , , ⟩ | src() = and ⟨, , , ⟩ ∈ LM } ∪

{⟨, , , ⟩ | tgt() = and ⟨, , , ⟩ ∈ LM } For the other operators, LM is defined as expected:

Ltrpq1 + Ltrpprqed2MM == L{t⟨rp,q1,M,[0∪, 0Lt]⟩rp|q2M∈ for some ∈ val (, pred )}

Jtrpq[, _]K = ⋃≥︀ LtrpqM

Contrary to what we observed for , we note that the size of LM does not depend on the size of the intervals used in . However, it is now linear (in the worst case) in the size of the intervals used to label data in , due to the definition of Lpred M . And once again, this is unavoidable.

4. Implementation in SQL

We are now ready to present our implementation technique. First, we observe that the characterization introduced in Section 3 may produce query answers that are not minimal in terms of cardinality. To overcome this, we introduce coalescing, a key operation that merges overlapping or redundant tuples in our two representations. This is discussed in Section 4.1. Finally, in Section 4.2, we demonstrate how these characterizations, including coalescing, can be eficiently implemented in SQL.

4.1. Coalescing Answers

We say that a set of tuples ⊆ (resp. ) is compact if it is finite and if no strictly smaller (w.r.t. cardinality) subset of (resp. ) has the same unfolding.

Unfortunately, the operations LM and LM defined above may produce a set that is not compact. However, compactness can be regained by applying a so-called coalescing operation to . Coalescing [15] intuitively consists in merging intervals that can be represented as a single one. In our representations and , a tuple consists of two objects, an interval and an integer. Two such tuples can be represented 1 2 3 as a single one if their intervals overlap (or are contiguous) and they agree on all three other values. A simple illustration is provided in Figure 7 for (where we omit the two objects, for conciseness).

Eficient interval coalescing relies on the use of window functions in SQL, which allow for detecting and merging contiguous or overlapping intervals within partitions (based on non-temporal attributes) of tuples. In contrast, graph query languages like Cypher do not support expressive window functions, making them unsuitable for implementing coalescing in an eficient way. As a result, SQL remains the preferred choice for such temporal operations, and we decided to use it as well for or experiments.

The state-of-the-art technique for temporal coalescing in SQL based on window functions, which we adopt in this work, was originally proposed by Zhou et.al. [16]. It performs coalescing in ( log ) (which is optimal), and was adopted by all recent approaches in temporal databases that require coalescing or cumulative aggregates [17, 18, 12, 19, 20]. The general approach to coalescing using this technique is very similar for both representations and , with the only diference that we coalesce time intervals in , against distance intervals in . For the detailed implementation, we refer to [16] and our repository, which contains our SQL queries.

4.2. Implementing Query Answering in and

We now show how to implement query answering in our two representations (folded time points) and (folded distances) in SQL (specifically PostgreSQL), based on state-of-the-art techniques from the field of temporal databases. The resulting SQL queries are long and complex, so we only provide the full queries in our repository (https://github.com/osavkovic/CompactTRPQ). Instead, we describe here step by step how these queries compute answers to a TRPQ.

We represent data and query outputs as sets of records, each labeled with a time interval [21]. For the data, we store nodes and edges in a format similar to the one used by Arenas et al. [ 1 ]. Figure 8 shows the base tables that correspond to the graph of Figure 1.

In what follows, as an illustration, we show how compact answers to the following query 5 over can be produced in and , starting from the base tables of Figure 8:

5 = (pos = true)/T[−168,0] /F/meets In particular, we illustrate the efect of temporal joins (a.k.a. the path1/path2 operator) and coalescing. Folding Time Points ( ). In this representation, time points are folded into intervals, while distances consist of scalar values. We start from base data in tables with an interval-based representation (cf. Figure 8). First, we transform our base data into tuples in .

For nodes, we duplicate identifiers, i.e. we produce tuples of the form ⟨, , , ⟩ . Since our temporal data already consists of intervals over time points, no operation on time intervals ( ) is needed, so we can extract them from the base tables. Finally, we initialize the distance in each tuple with the value 0.

For edges, we now have a composite identifier with two attributes (source 1 and destination 2). Then similarly to nodes, no operations on time intervals needs to be performed, and we initialize the distance with 0.

nodes 1 1 1 1 1 2 2 2 2 2 3 4 4 4 name

pos meets

Finding nodes that match pos = true (in query 5) corresponds to a Boolean condition in a SQL WHERE clause, and similarly for edges with label “meets” (which match meets in 5). The result is shown in Figure 9.

1 1 2 1 pos = true in pos true meets

Consider now the operator T[−168,0] in our query 5. Because we are in , we have to introduce to 0 to each record in the answers to pos = true. We do so in SQL using PostgreSQL’s generate_series function.2 Each record is replicated 169 times, once for each integer in [-168,0], and its initial distance (which was 0 in this case) is added to this number. This yields the output to the subquery (pos = true)/T[−168,0] , shown in Figure 10.

1 1 . . . 1 1 2 1 1 1

Finally, we perform a temporal join between the subqueries (pos = true)/T[−168,0] and meets. Following Equation (1) (illustrated with Figure 5), a temporal join in is computed as a join where the second object 2 of the left input is equal to the first object 1 of the right input , and the time interval in shifted by its distance overlaps with the time interval in . This is a so-called overlap join in temporal databases [22, 23, 24, 25], but can also be executed using traditional hash or merge joins in traditional database systems. After this operation, we can apply coalescing on the time intervals. The ifnal output of query 5 is shown in Figure 11. 2https://www.postgresql.org/docs/current/functions-srf.html 1 1 2 1 pos = true in pos true

. . . 1 2 . . . 2 3 3 2 2 2 3 3 4 4 meets in meets 200 [0, 0] label meets meets meets meets meets

204 320 330 100 101

Folding Distances ( ). In this representation, distances are folded into intervals, while (starting) time points are scalar values. We start from the same base data as in the previous case and transform these base records into tuples in . For nodes, similarly to what we did for , we duplicate identifiers. But this time, we initialize distances to a singleton interval [0, 0], and we use the generate_series function to replicate each base record, once for each time point within its original interval. We proceed analogously for edges, and filter records matching pos = true or meets like we did for . The result in each record of the output to pos = true. The result is shown in Figure 13.

To apply the T[−168,0] operator, we shift the start of the distances interval by −168 and its end by 0 1 1 2 1 pos true

(cf. Equation (2) and Figure 6) is computed as a join where the second object 2 of the left input is equal to the first object 1 of the right input , and the time point of shifted by its distance interval contains the time point of . This is a so-called range join in temporal databases [25], but can also be executed using traditional hash or merge joins. After this operation, we can apply coalescing on the distance intervals. The result is shown in Figure 14.

1 1 2 2

5. Data Set and Experiments

We conducted experiments to investigate (i) how compact query answers can be in , and and, for the two latter representations, () how the size of input intervals in graph and query afect the size of compact answers, and () to what extent coalescing LM and LM afects compactness.

We used the dataset provided in [ 1 ], which represents a graph of people and rooms with meetings for contact tracing. The TG consists of 24, 990 nodes and 2, 638, 623 edges over a domain of 52 time points. The minimum, average, and maximum interval duration in nodes (resp., edges) are 1, 19.7, and 52 (resp., 1, 2.2, and 5). The number of diferent time points in this dataset (52) is extremely small (and arguably unrealistic) when compared to the size of the graph. This is why we used in our experiments a factor (described below) that scales the size of all intervals. This allowed us to test the impact of large graph intervals, which in theory should penalize more than .

We used the SQL implementation described in Section 4.2, and PostgreSQL as a backend. As a query, we retrieve all people that had a positive contact up to a certain time period in the past, with duration , i.e.,

6 := (pos = true)/T[−,0] /F/meets

Our experiments have two parameters: , which increases the distance interval [−, 0] in 6, and the scaling factor that multiplies the size of all intervals in . In our first experiment, we compared the size of J6K to its compact representations in and , denoted with [6] and [6] below. The results for varying distances in the query (i.e., values for ) are shown in Figure 15, left. We see that [6] and [6] provide a compression ratio of about 2.2 for this dataset, which has a very small temporal domain (the curve becomes flat for ≥ 52 because the interval [−, 0] is longer than the whole temporal domain). The results for varying durations of time intervals in the graph (i.e., values for ) are shown in Figure 15, right. With more time points in the graph, the diferences between J6K and the compact representations [6] and [6] increases dramatically, with a compression factor of 22 for [6] and 12.6 for [6] when = 10.

] 6 15 0 [e1× 10 z i lts 5 u s eR 0 0 | | 6 ]]5 6 6 4 [| [| 3 ito 2 aR 1

From both plots, we see that the diference between the sizes of [6] and [6] is relatively small. However, for smaller values of , [6] is more compact, while [6] is more compact for a larger . To illustrate this behavior, we plot the ratio |[6]|/|[6]|, in Figure 16 (left), for a fixed = 5 and varying , and in Figure 16 (right) for a fixed = 25 and varying .

We also performed experiments to show the impact of coalescing on compactness. Figure 17 shows result sizes with (left) and without coalescing (right), for a fixed = 5 and diferent values for . In the coalesced representation, we see that L6M and [6] have comparable sizes, much smaller than the size of J6K. For the non-coalesced result, or prior to coalescing (right), [6] maintains a compact result during computation, while [6] produces a result similar to J6K (approx. 10 times larger than 10× 100 [ e z its 50 l u seR 0 ]]|1.5 | 6 6 [| [| 1 o i ta 0.5 R 0 L6M). This can be explained by the fact that in this dataset the number of edges (approx. 2.6M) is much larger than the number of nodes (approx. 25k) and even more if we focus on nodes that match pos = true (approx. 2k). While [6] only increases the number of nodes by applying T[−,0] by a factor of (cf. Figure 10), [6] increases the (already large) number of edges when replicating records for each time point (cf. Figure 12, right) by a factor of 10, which is the average duration of time intervals of edges for = 5.

To see the impact on runtime of these large intermediate results, we also ran experiments where we measured the processing time. We used = 5 and = 300, i.e., the right-most data point in Figure 17, and measure the wall-clock time on an Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz machine running Ubuntu Linux. The runtime for L6M was 154 seconds, while the runtime for [6] was 1, 390 seconds due to the large intermediate result that needs to be processed (cf. Figure 17 (right)), which also emphasizes the fact that compactness has a large impact on query performance. 6] 40 J6K [6] [6]

L6M

L6M 01× 30 [ ze 20 i s ltu 10 s eR 0 0

To summarize, we can see from the experiments that both and , when coalesced, provide a much more compact representation than . When intervals in the query are smaller than in the graph, is the most compact representation, while is the most compact in the opposite case. Besides, coalesced representations are substantially more compact than their non coalesced counterparts, which afects not only data storage, but also the performance of join operations, because the cardinality of their operands is reduced.

6. Related Work

Temporal Relational DBs. In temporal relational DBs, tuples are most commonly associated with a single time interval, viewed as a compact representations of time points at which the tuple holds [21]. The coalescing operator, which merges value-equivalent tuples over consecutive or overlapping time intervals, has received a lot of attention. Böhlen et al. [15] showed that coalescing can be implemented in SQL, and provided a comprehensive analysis of various coalescing algorithms and their performance. Later on, Al-Kateb et al. [26] investigated coalescing in the attribute timestamped CME temporal relational model, before Zhou et al. [16] exploited SQL:2003’s analytical functions for the computation of coalescing. Their technique is the state-of-the-art, requiring a single scan over the ordered input and can be computed in ( log ). Also relevant to our work is the eficient computation of temporal joins over intervals. There has been a long line of research on temporal joins [27], ranging from partition-based [22, 28], index-based [29, 30], and sorting based [23, 24] techniques. Recently, in [25] it has been shown that a temporal join with the overlap predicate can be transformed into a sequence of two range joins. Our inductive representations of answers require overlap joins and range joins (cf. Section 4.2) that could potentially benefit from these approaches.

Temporal Graphs. Temporal graph models vary in terms of temporal semantics, time representation (time point, interval), timestamped entities (graphs, nodes, edges, or attribute-value assignments), and whether they represent evolution of topology alone, or also of attributes. A sequence of snapshots is the simplest representation, in which a state of a graph is associated with either a time point or an interval during which it was in that state [31, 32]. Among recent proposals (and aside from [ 1 ]), Byun et al. [ 4 ] developed ChronoGraph, which is both a temporal graph model and a graph traversal language, with dedicated aggregation techniques; Johnson et al. [ 5 ] developed Nepal, a query language scalable for large networks; Debrouvier et al. [ 2 ] introduced T-GQL, a Cypher-like query language for TPGs; Mofitt et al. [ 3 ] suggested an algebraic framework for analyzing temporal graphs, and Labouseur et al. [ 6 ] developed the graph DB system G* for storing and managing distributed dynamic graphs. To our knowledge, the problem we address, namely producing compact answers to a TRPQ, is new.

7. Conclusions and Future Work

We provided in this paper implementation techniques to compute compact answers to a TRPQ over a TG, using two alternative compact representations. In theory, the first technique is better-suited to large intervals in the TG, and the second for large intervals in the TRPQ. We put this hypothesis into practice and observed that it was partially verified. Our experiments also reveal the importance of temporal coalescing (i.e. merging time intervals when possible).

As a continuation of this work, we want to investigate implementation techniques for the two more complex representations that we defined in [ 14]. These may require techniques that go beyond SQL, due to the intractability of coalescing. Alternatively, tractability can be regained if minimality is not a requirement, or if one disallows overlapping compact representations. But in SQL, this would still require developing techniques that have not been investigated yet in the field of temporal databases. Another interesting open question is the eficient implementation of the “star” operator trpq[, _]. A natural candidate here would be SQL CTEs (or possibly Datalog engines). Finally, we would like to test answering TRPQs over real-world datasets, potentially extracted from general purpose knowledge graphs, such as Wikidata.

Acknowledgments

This research has been partially supported by the HEU project CyclOps (GA n. 101135513), by the Province of Bolzano and EU through project ERDF-FESR 1047 AI-Lab, by MUR through the PRIN project 2022XERWK9 S-PIC4CHU, and by the EU and MUR through the PNRR project PE0000013-FAIR.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools when writing this paper. [7] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, A. Taylor, Cypher: An evolving query language for property graphs, in: Proc. ACM SIGMOD, 2018, pp. 1433–1445. [8] S. Harris, A. Seaborne, SPARQL 1.1 Query Language, W3C Recommendation, W3C, 2013. URL: https://www.w3.org/TR/sparql11-query. [9] C. S. Jensen, R. T. Snodgrass, Bitemporal relation, in: Encyclopedia of Database Systems, 2nd ed.,

Springer, 2018. [10] K. G. Kulkarni, J.-E. Michels, Temporal features in SQL: 2011, SIGMOD Rec. 41 (2012) 34–43. [11] R. T. Snodgrass (Ed.), The TSQL2 Temporal Query Language, Kluwer, 1995. [12] A. Dignös, B. Glavic, X. Niu, J. Gamper, M. H. Böhlen, Snapshot semantics for temporal multiset relations, PVLDB 12 (2019) 639–652. [13] M. Adnan, D. Calvanese, J. Gamper, J. Corman, A. Dignös, W. Nutt, O. Savković, Compact answers to temporal path queries, in: Proc. ISWC, LNCS, Springer, 2025. [14] M. Adnan, D. Calvanese, J. Corman, A. Dignös, W. Nutt, O. Savković, Compact answers to temporal path queries, 2025. URL: https://arxiv.org/abs/2507.22143. [15] M. H. Böhlen, R. T. Snodgrass, M. D. Soo, Coalescing in temporal databases, in: Proc. VLDB, 1996. [16] X. Zhou, F. Wang, C. Zaniolo, Eficient temporal coalescing query support in relational database systems, in: Proc. DEXA, volume 4080 of LNCS, Springer, 2006, pp. 676–686. [17] M. Al-Kateb, A. Ghazal, A. Crolotte, An eficient SQL rewrite approach for temporal coalescing in the teradata RDBMS, in: Proc. DEXA, volume 7447 of LNCS, Springer, 2012, pp. 375–383. [18] S. Brandt, E. G. Kalayci, R. Kontchakov, V. Ryzhikov, G. Xiao, M. Zakharyaschev, Ontology-based data access with a Horn fragment of metric temporal logic, in: Proc. AAAI, 2017, pp. 1070–1076. [19] D. Wang, P. Hu, P. A. Walega, B. C. Grau, MeTeoR: Practical reasoning in Datalog with metric temporal operators, in: Proc. AAAI, 2022, pp. 5906–5913. [20] C. Khnaisser, H. Hamrouni, D. B. Blumenthal, A. Dignös, J. Gamper, Eficiently labeling and retrieving temporal anomalies in relational databases, Inf. Syst. Frontiers (2025). [21] M. H. Böhlen, A. Dignös, J. Gamper, C. S. Jensen, Temporal data management – An overview, in: Tutorial Lectures of the 7th European Summer School on Business Intelligence and Big Data (eBISS), volume 324 of LNBIP, Springer, 2017, pp. 51–83. [22] A. Dignös, M. H. Böhlen, J. Gamper, Overlap interval partition join, in: Proc. ACM SIGMOD, 2014, pp. 1459–1470. [23] D. Piatov, S. Helmer, A. Dignös, An interval join optimized for modern hardware, in: Proc. ICDE, 2016, pp. 1098–1109. [24] P. Bouros, N. Mamoulis, D. Tsitsigkos, M. Terrovitis, In-memory interval joins, VLDB J. 30 (2021) 667–691. [25] A. Dignös, M. H. Böhlen, J. Gamper, C. S. Jensen, P. Moser, Leveraging range joins for the computation of overlap joins, VLDB J. 31 (2022) 75–99. [26] M. Al-Kateb, E. Mansour, M. E. El-Sharkawi, CME: A temporal relational model for eficient coalescing, in: Proc. TIME, IEEE, 2005, pp. 83–90. [27] D. Gao, C. S. Jensen, R. T. Snodgrass, M. D. Soo, Join operations in temporal databases, VLDB J. 14 (2005) 2–29. [28] F. Cafagna, M. H. Böhlen, Disjoint interval partitioning, VLDB J. 26 (2017) 447–466. [29] J. Enderle, M. Hampel, T. Seidl, Joining interval data in relational databases, in: Proc. ACM

SIGMOD, 2004, pp. 683–694. [30] M. Kaufmann, A. A. Manjili, P. Vagenas, P. M. Fischer, D. Kossmann, F. Färber, N. May, Timeline index: a unified data structure for processing queries on temporal data in SAP HANA, in: Proc.

ACM SIGMOD, 2013, pp. 1173–1184. [31] A. Fard, A. Abdolrashidi, L. Ramaswamy, J. A. Miller, Towards eficient query processing on massive time-evolving graphs, in: Proc. of the 8th Int. Conf. on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), IEEE, 2012, pp. 567–574. [32] C. Ren, E. Lo, B. Kao, X. Zhu, R. Cheng, On querying historical evolving graph sequences, PVLDB 4 (2011) 726–737.

[1]

Arenas ,

Bahamondes ,

Aghasadeghi ,

Stoyanovich , Temporal regular path queries , in: Proc. ICDE , 2022 , pp. 2412 - 2425 .

[2]

Debrouvier , E. Parodi,

Perazzo ,

Soliani ,

Vaisman , A model and query language for temporal graph databases , VLDB J . 30 ( 2021 ) 825 - 858 .

[3]

V. Z.

Mofitt ,

Stoyanovich , Temporal graph algebra , in: Proc. DBPL , 2017 , pp. 1 - 12 .

[4]

Byun ,

Woo ,

Kim , Chronograph: Enabling temporal graph traversals for eficient information difusion analysis over time , IEEE TKDE 32 ( 2020 ) 424 - 437 .

[5]

Johnson ,

Kanza ,

L. V.

Lakshmanan ,

Shkapenyuk , Nepal: a path query language for communication networks , in: Proc. of the 1st ACM SIGMOD Workshop on Network Data Analytics , 2016 , pp. 1 - 8 .

[6]

A. G.

Labouseur ,

Birnbaum ,

P. W.

Olsen ,

S. R.

Spillane ,

Vijayan ,

J.-H.

Hwang , W.-S. Han, The G* graph database: Eficiently managing large distributed dynamic graphs , Distributed and Parallel Databases 33 ( 2015 ) 479 - 514 .