=Paper= {{Paper |id=None |storemode=property |title=Query reasoning on data trees with counting |pdfUrl=https://ceur-ws.org/Vol-1659/paper5.pdf |volume=Vol-1659 |authors=Everardo Barcenas,Edgard Benítez-Guerrero,Jesús Lavalle |dblpUrl=https://dblp.org/rec/conf/lanmr/BarcenasBL16 }} ==Query reasoning on data trees with counting== https://ceur-ws.org/Vol-1659/paper5.pdf
 Query Reasoning on Data Trees with Counting

    Everardo Bárcenas¹,², Edgard Benítez-Guerrero², and Jesús Lavalle³,⁴
                 ¹ CONACYT
                 ² Universidad Veracruzana
                 ³ Benemérita Universidad Autónoma de Puebla
                 ⁴ Instituto Nacional de Astrofísica, Óptica y Electrónica
        iebarcenaspa@conacyt.mx, edbenitez@uv.mx, jlavalle@cs.buap.mx



      Abstract. Regular path expressions form the navigation core of
      the XPath query language for semi-structured data (XML), which has
      been characterized by First Order Logic with Two Variables (FO2).
      Data tests are (dis)equality comparisons on data tree models, which
      are unranked trees with two kinds of labels: propositions from a finite
      alphabet, and data values from a possibly infinite alphabet. Node
      occurrences in tree models can be constrained by counting/arithmetic
      constructors. In this paper, we identify an EXPTIME extension of regular
      paths with data tests and counting operators. This extension is
      characterized in terms of a Presburger tree logic closed under negation.
      As a consequence, the EXPTIME bound also applies to standard query
      reasoning (emptiness, containment and equivalence).


1   Introduction

XPath is a W3C standard query language for semi-structured data (XML), and
it plays an important role in many XML technologies, such as XProc,
XSLT, and XQuery [1, 2]. The navigation core of XPath, also known as regular
path queries, has been recently characterized by the First Order Logic of Two
Variables (FO2) [1]. Models for this logic are unranked trees, where nodes are
labeled by propositions from a finite alphabet. Data tests, also known as data
joins in the database community, are XPath expressions of the forms
ρ1 ≡ ρ2 and ρ1 ≢ ρ2. These expressions hold whenever the data values (labels
from an infinite alphabet) contained in path ρ1 are equal to, respectively
different from, the data values contained in path ρ2. Another important class
of constructors in XPath queries concerns counting: ρ1 # ρ2, where
# ∈ {≤, >, =, ≠}. These expressions hold whenever the numbers of nodes denoted
by ρ1 and ρ2 satisfy constraint #. Several recent works study regular path
extensions with either data tests or counting [3–6, 2]. However, as far as we
know, the current work represents the first study of regular path extensions
with both constructors, data tests and counting. More precisely, we give a
characterization of regular paths with data tests with respect to constants,
ρ ≡ k (ρ ≢ k), and with counting operators on children paths. For this
characterization we use a modal tree logic equipped with a fixed point
operator, converse modalities and Presburger arithmetic constraints [7]. Due
to this characterization, the EXPTIME bound of the logic carries over to
standard query reasoning (emptiness, containment, and equivalence) with
counting and data tests.
    There are several extensions of FO2 with data tests [8–11]. In [8],
FO2(<, +1, ≡) for data trees is introduced: < stands for the descendant and
following sibling relations, +1 refers to the child and next sibling relations,
and ≡ is a binary predicate for data tests. Decidability of FO2(<, +1, ≡) on
data trees, without any complexity analysis, is first shown there by a
reduction to the reachability problem of a counter tree automata model.
Previously, in [10], the same result was obtained for data words (trees with a
single branch). Even earlier, in [11], FO2(+1, ≡) for trees was introduced and
shown decidable in 3NEXPTIME. In another direction, regarding regular paths
(XPath navigation core), it is well known that regular paths with full
navigation and data tests are undecidable [5]. Several fragments (downward,
forward, transitive) of regular path expressions with data tests have been
studied [12, 13, 6, 5, 3], with complexities ranging from EXPTIME to
non-elementary. In contrast, in this paper, instead of restricting navigation
in queries, we study regular path expressions with full navigation (children,
parents, following and previous siblings, descendants and ancestors), but we
restrict data tests to constants only.
    Regarding regular paths with counting, there are several recent studies
[1, 2, 4]. In [1], it was shown that the extension of regular paths with
counting is in general undecidable. EXPTIME fragments (counting with respect
to constants) were later identified in [2, 4]. Several other logics with
counting have been proposed in the setting of unranked trees [14–16, 7, 17].
In [17], the EXPTIME bound was further developed for a set of coalgebraic
modal logics via a type elimination algorithm. Except for [18], where the
emptiness problem for ranked tree automata with equality and counting
constraints was shown decidable without further complexity analysis, all the
above works study data tests and counting separately. In the current work, we
identify an EXPTIME extension of regular paths with both data tests and
counting.
    We describe a counting and data tests extension of regular paths in Section 2.
In Section 3, we describe a modal tree logic with a fixed point, converse modali-
ties and Presburger arithmetic constructors. The main result of this paper, which
is a characterization of the regular path extension with counting and data tests
in terms of the logic, is described in Section 4. We conclude with a summary
of this work, together with a brief discussion of further research perspectives in
Section 5.


2   Regular Path Queries with Counting and Data Tests

For the languages described in the current work, unless otherwise stated, we
use a fixed alphabet composed of a set of propositions PROPS and a set of
modalities MODS = {↓, ↑, →, ←}. Intuitively, propositions are used in tree
models to label nodes, and modalities are interpreted as the children ↓,
parent ↑, right siblings →, and left siblings ← relations.




   We now introduce the notion of a tree, which can be seen as a tree-shaped
Kripke structure (transition system).


Definition 1 (Tree). A tree T is defined as a tuple (N, R, L), such that: N is a
finite set of nodes; R : N × MODS × N is a transition relation among nodes and
modalities forming a tree (we often write n ∈ R(n′, m) instead of (n′, m, n) ∈ R);
and L : N × PROPS is a left-total labeling relation (we often write p ∈ L(n)
instead of (n, p) ∈ L).


   The set of data values is the set of natural numbers ℕ. Data trees can be
seen as an extension of trees (Definition 1), where nodes are labeled with data
values and propositions.


Definition 2 (Data tree). A data tree Γ is defined as a tuple (N, R, L, D),
such that: (N, R, L) is a tree; and D : N ↦ ℕ is a total function.
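
As an illustration of Definitions 1 and 2, a data tree can be stored as four plain collections. The following is only a minimal sketch: the class name `DataTree`, the ASCII modality names ("down", "up", "right", "left") and the helper methods are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class DataTree:
    nodes: set    # N: finite set of nodes
    edges: set    # R: triples (n1, m, n2), read "n2 is in R(n1, m)"
    labels: set   # L: pairs (n, p)
    data: dict    # D: total function from nodes to natural numbers

    def succ(self, n, m):
        """R(n, m): nodes reached from n through modality m."""
        return {n2 for (n1, mod, n2) in self.edges if n1 == n and mod == m}

    def props(self, n):
        """L(n): propositions labeling node n."""
        return {p for (node, p) in self.labels if node == n}

    def is_well_formed(self):
        """Check that L is left-total and D is total with natural values.

        The tree shape of R is not checked in this sketch.
        """
        return all(self.props(n) for n in self.nodes) and all(
            isinstance(self.data.get(n), int) and self.data[n] >= 0
            for n in self.nodes
        )


# Example: a root labeled a with value 2 and two children labeled b.
tree = DataTree(
    nodes={0, 1, 2},
    edges={(0, "down", 1), (0, "down", 2), (1, "up", 0), (2, "up", 0)},
    labels={(0, "a"), (1, "b"), (2, "b")},
    data={0: 2, 1: 0, 2: 5},
)
print(tree.succ(0, "down"))  # {1, 2}
```

The edges are given explicitly as triples, so the sketch takes no position on how the sibling relations are populated.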


   We now give a precise syntax of regular paths with counting and data tests.


Definition 3 (Syntax). We define the RPQCD expressions (queries) by the
following grammar:


                       ρ := ⊤ | α | p | α:p | ρ/ρ | ρ[β]
                       β := ρ | ρ − ρ # k | ρ ≡ k | ¬β | β ∨ β


where p ∈ PROPS, k ∈ ℕ, # ∈ {>, ≤, =} and α ∈ {↓, ↑, →, ←, ↓⋆, ↑⋆}.


In the case of ρ1 − ρ2 # k, both ρi (i = 1, 2) are restricted to be children paths,
that is, they have one of the following forms: ↓, ↓:p, ↓[β] or ↓:p[β].
    RPQCD expressions are interpreted over data trees: ⊤ selects the entire set
of nodes; α:p navigates through α and selects the p nodes; ρ1/ρ2 is the
composition of paths; and ρ[β] selects the nodes denoted by ρ satisfying
condition β. In particular, when β is ρ ≡ k, it holds whenever there is a node
denoted by ρ whose data value is equal to k. ρ1 − ρ2 # k is true if and only if
the number of nodes selected by ρ1 minus the number of nodes selected by ρ2
satisfies constraint # k. Note that syntactic sugar, such as ρ1 # ρ2 for
ρ1 − ρ2 # 0, can also be defined. Negation and disjunction are interpreted as
expected.
   We now give a precise description of how RPQCD expressions are interpreted
over data trees.




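Definition 4 (Semantics). Given a data tree Γ = (N, R, L, D), RPQCD
expressions are interpreted as follows:

$$
\begin{aligned}
[\![\top]\!]^{\Gamma} &= N \times N \\
[\![p]\!]^{\Gamma} &= \{(n, n) \mid p \in L(n)\} \\
[\![\alpha]\!]^{\Gamma} &= \{(n_1, n_2) \mid n_1 \xrightarrow{\alpha} n_2\} \\
[\![\alpha{:}p]\!]^{\Gamma} &= \{(n_1, n_2) \in [\![\alpha]\!]^{\Gamma} \mid p \in L(n_2)\} \\
[\![\rho_1/\rho_2]\!]^{\Gamma} &= [\![\rho_1]\!]^{\Gamma} \circ [\![\rho_2]\!]^{\Gamma} \\
[\![\rho[\beta]]\!]^{\Gamma} &= \{(n_1, n_2) \in [\![\rho]\!]^{\Gamma} \mid n_2 \in [\![[\beta]]\!]^{\Gamma}\} \\
[\![[\rho]]\!]^{\Gamma} &= \{n \mid (n, n') \in [\![\rho]\!]^{\Gamma}\} \\
[\![[\rho_1 - \rho_2 \,\#\, k]]\!]^{\Gamma} &= \{n \mid |\{n_1 \mid (n, n_1) \in [\![\rho_1]\!]^{\Gamma}\}| - |\{n_2 \mid (n, n_2) \in [\![\rho_2]\!]^{\Gamma}\}| \,\#\, k\} \\
[\![[\rho \equiv k]]\!]^{\Gamma} &= \{n \mid (n, n') \in [\![\rho]\!]^{\Gamma},\ D(n') = k\} \\
[\![[\neg\beta]]\!]^{\Gamma} &= N \setminus [\![[\beta]]\!]^{\Gamma} \\
[\![[\beta_1 \vee \beta_2]]\!]^{\Gamma} &= [\![[\beta_1]]\!]^{\Gamma} \cup [\![[\beta_2]]\!]^{\Gamma}
\end{aligned}
$$

where $n_1 \xrightarrow{\alpha} n_2$ holds if and only if n1 is related to n2 through α in Γ.
We also interpret RPQCD expressions with respect to a context; more precisely,
the interpretation of an RPQCD expression ρ on a data tree Γ from a subset of
nodes N′ (of Γ) is defined as follows:
$[\![\rho]\!]^{\Gamma}_{N'} = \{n' \mid (n, n') \in [\![\rho]\!]^{\Gamma},\ n \in N'\}$.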
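
The interpretation of Definition 4 can be prototyped on small finite data trees. The sketch below is illustrative only and rests on several assumptions that are not part of the paper: expressions are encoded as nested tuples with hypothetical tags ("top", "prop", "axis", "axis_prop", "seq", "filter", "exists", "count", "data", "not", "or"), modalities use the ASCII names "down", "up", "right" and "left", a trailing "*" denotes one or more steps, and the data test ρ ≡ k follows the prose reading (some node reached through ρ has value k).

```python
import operator

# Comparison constraints "#" allowed in counting qualifiers.
OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}


def step(R, axis):
    """Pairs related by an axis; a trailing '*' means one or more steps."""
    base = axis.rstrip("*")
    pairs = {(n1, n2) for (n1, m, n2) in R if m == base}
    if not axis.endswith("*"):
        return pairs
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in pairs if b == c} - closure
        if not new:
            return closure
        closure |= new


def path(t, r):
    """Set of node pairs denoted by path expression r."""
    N, R, L, D = t
    kind = r[0]
    if kind == "top":
        return {(n, m) for n in N for m in N}
    if kind == "prop":
        return {(n, n) for n in N if r[1] in L[n]}
    if kind == "axis":
        return step(R, r[1])
    if kind == "axis_prop":
        return {(n1, n2) for (n1, n2) in step(R, r[1]) if r[2] in L[n2]}
    if kind == "seq":
        left, right = path(t, r[1]), path(t, r[2])
        return {(a, d) for (a, b) in left for (c, d) in right if b == c}
    if kind == "filter":
        ok = qual(t, r[2])
        return {(n1, n2) for (n1, n2) in path(t, r[1]) if n2 in ok}
    raise ValueError(f"unknown path constructor: {kind}")


def qual(t, b):
    """Set of nodes satisfying qualifier b."""
    N, R, L, D = t
    kind = b[0]
    if kind == "exists":
        return {n for (n, _) in path(t, b[1])}
    if kind == "count":
        _, r1, r2, op, k = b
        p1, p2 = path(t, r1), path(t, r2)

        def size(pairs, n):
            return len({m for (s, m) in pairs if s == n})

        return {n for n in N if OPS[op](size(p1, n) - size(p2, n), k)}
    if kind == "data":
        return {n for (n, m) in path(t, b[1]) if D[m] == b[2]}
    if kind == "not":
        return set(N) - qual(t, b[1])
    if kind == "or":
        return qual(t, b[1]) | qual(t, b[2])
    raise ValueError(f"unknown qualifier constructor: {kind}")


def select(t, r, context):
    """Nodes reached through r when starting from the context set."""
    return {n2 for (n1, n2) in path(t, r) if n1 in context}


# Example: root 0 (a, value 7) with children 1 (b, 3), 2 (b, 3) and 3 (c, 0).
N = {0, 1, 2, 3}
R = {(0, "down", c) for c in (1, 2, 3)} | {(c, "up", 0) for c in (1, 2, 3)}
L = {0: {"a"}, 1: {"b"}, 2: {"b"}, 3: {"c"}}
D = {0: 7, 1: 3, 2: 3, 3: 0}
tree = (N, R, L, D)

more_b_than_c = ("count", ("axis_prop", "down", "b"), ("axis_prop", "down", "c"), ">", 0)
child_with_3 = ("data", ("axis", "down"), 3)
print(qual(tree, more_b_than_c))                      # {0}
print(qual(tree, child_with_3))                       # {0}
print(select(tree, ("axis_prop", "down", "b"), {0}))  # {1, 2}
```

This evaluates a query on one given tree only; the reasoning problems discussed next quantify over all data trees and are not addressed by such an evaluator.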
   We now define the standard query reasoning problems for RPQCD: empti-
ness, containment and equivalence.
Definition 5 (Reasoning).
 – We say an RPQCD expression ρ is empty, if and only if, for any data tree Γ,
   we have that $[\![\rho]\!]^{\Gamma} = \emptyset$.
 – Given two RPQCD expressions ρ1 and ρ2, we say ρ1 is contained in ρ2,
   written ρ1 ⊆ ρ2, if and only if, for any data tree Γ, we have that
   $[\![\rho_1]\!]^{\Gamma} \subseteq [\![\rho_2]\!]^{\Gamma}$.
 – Given two RPQCD expressions ρ1 and ρ2 , we say ρ1 is equivalent to ρ2 , if
   and only if, ρ1 ⊆ ρ2 and ρ2 ⊆ ρ1 .


3    A Presburger Tree Logic
We now describe a modal tree logic, as originally introduced in [7], with a fixed
point, converse modalities and Presburger arithmetic operators.
Definition 6 (Syntax). We inductively define the set of µTLIC formulas by
the following grammar: φ := p | ¬φ | φ ∨ φ | ⟨m⟩φ | µx.φ | φ − φ # k, where
p ∈ PROPS, m ∈ MODS, # ∈ {>, ≤, =}, and k ∈ ℕ coded in binary form.




    µTLIC formulas are interpreted as subsets of tree nodes: propositions are
used as node labels; negation is interpreted as set complement; disjunction as
set union; a modal formula ⟨m⟩φ holds in nodes where there is at least one m
transition to a node supporting φ; the fixed point operator µx.φ is interpreted
as a recursion operator; and a Presburger formula φ − ψ # k selects the nodes
whose number of φ children minus their number of ψ children satisfies
constraint # k.
    Before formally introducing the interpretation of µTLIC formulas, we first
define a valuation V as a function assigning to each variable x of a set of
variables X a set of nodes of a given tree.
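Definition 7 (Semantics). Given a tree T = (N, R, L) and a valuation V,
µTLIC formulas are interpreted as follows:

$$
\begin{aligned}
[\![p]\!]^{T}_{V} &= \{n \mid p \in L(n)\} \\
[\![\neg\varphi]\!]^{T}_{V} &= N \setminus [\![\varphi]\!]^{T}_{V} \\
[\![\varphi \vee \psi]\!]^{T}_{V} &= [\![\varphi]\!]^{T}_{V} \cup [\![\psi]\!]^{T}_{V} \\
[\![\langle m \rangle \varphi]\!]^{T}_{V} &= \{n \mid R(n, m) \cap [\![\varphi]\!]^{T}_{V} \neq \emptyset\} \\
[\![\mu x.\varphi]\!]^{T}_{V} &= \bigcap \{M \mid [\![\varphi]\!]^{T}_{V[M/x]} \subseteq M\} \\
[\![\varphi - \psi \,\#\, k]\!]^{T}_{V} &= \{n \mid |R(n, \downarrow) \cap [\![\varphi]\!]^{T}_{V}| - |R(n, \downarrow) \cap [\![\psi]\!]^{T}_{V}| \,\#\, k\}
\end{aligned}
$$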
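
On a finite tree, the interpretation of Definition 7 can be computed directly. The following sketch makes a few assumptions of its own: formulas are nested tuples with hypothetical tags ("prop", "var", "not", "or", "dia", "mu", "count"), a variable is evaluated to its valuation V(x) (a case the grammar leaves implicit), and the least fixed point is computed by iterating from the empty set. For bodies that are monotone in x this iteration yields the intersection of all pre-fixed points (Knaster-Tarski), but it need not terminate for non-monotone bodies.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}


def succ(R, n, m):
    """R(n, m): nodes reached from n through modality m."""
    return {n2 for (n1, mod, n2) in R if n1 == n and mod == m}


def sat(tree, f, V=None):
    """Set of nodes of tree = (N, R, L) satisfying formula f under valuation V."""
    N, R, L = tree
    V = V or {}
    kind = f[0]
    if kind == "prop":
        return {n for n in N if f[1] in L[n]}
    if kind == "var":
        return set(V[f[1]])
    if kind == "not":
        return set(N) - sat(tree, f[1], V)
    if kind == "or":
        return sat(tree, f[1], V) | sat(tree, f[2], V)
    if kind == "dia":
        target = sat(tree, f[2], V)
        return {n for n in N if succ(R, n, f[1]) & target}
    if kind == "mu":
        M = set()
        while True:  # iteration from the empty set, assuming a monotone body
            nxt = sat(tree, f[2], {**V, f[1]: M})
            if nxt == M:
                return M
            M = nxt
    if kind == "count":
        _, g, h, op, k = f
        sg, sh = sat(tree, g, V), sat(tree, h, V)
        return {
            n for n in N
            if OPS[op](len(succ(R, n, "down") & sg) - len(succ(R, n, "down") & sh), k)
        }
    raise ValueError(f"unknown constructor: {kind}")


# Example: a chain 0 -> 1 -> 2 where 0 is labeled a and the others b.
N = {0, 1, 2}
R = {(0, "down", 1), (1, "down", 2), (1, "up", 0), (2, "up", 1)}
L = {0: {"a"}, 1: {"b"}, 2: {"b"}}
tree = (N, R, L)

# mu x.<up>(a or x): nodes having a proper ancestor labeled a.
below_a = ("mu", "x", ("dia", "up", ("or", ("prop", "a"), ("var", "x"))))
# b - a = 1: nodes with exactly one more b child than a children.
one_more_b = ("count", ("prop", "b"), ("prop", "a"), "=", 1)
print(sat(tree, below_a))     # {1, 2}
print(sat(tree, one_more_b))  # {0, 1}
```

This is a naive model checker for illustration; it says nothing about satisfiability, which is the problem the complexity bound refers to.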

   Without loss of generality, we assume variables can only occur bound,
and in the scope of modal or counting formulas [7]. Furthermore, equivalent
negation normal forms can be obtained by the traditional De Morgan and
modal rules: ¬⟨m⟩φ := [m]¬φ, ¬(φ ∨ ψ) := ¬φ ∧ ¬ψ, ¬µx.φ := νx.¬φ[x/¬x],
¬(φ − ψ > k) := φ − ψ ≤ k, ¬(φ − ψ ≤ k) := φ − ψ > k, ¬(φ − ψ = k) := φ − ψ ≠ k,
and ¬(φ − ψ ≠ k) := φ − ψ = k.
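
The rewriting rules above can be applied mechanically. The sketch below pushes negations inwards over a hypothetical tuple encoding of formulas (tags "prop", "var", "not", "or", "and", "dia", "box", "mu", "nu", "count"; these names are assumptions of the example). The dual cases for ∧, [m] and ν are not listed in the text and are included here as the standard mirror images of the stated rules.

```python
# Dual comparison for negated counting formulas.
DUAL = {">": "<=", "<=": ">", "=": "!=", "!=": "="}


def nnf(f, flipped=frozenset()):
    """Rewrite f so that negation is pushed inwards.

    `flipped` holds variables bound by a fixed point that was dualized,
    i.e. variables x currently standing for "not x" (the substitution
    [x / not x] in the rule for negated fixed points).
    """
    kind = f[0]
    if kind == "prop":
        return f
    if kind == "var":
        return ("not", f) if f[1] in flipped else f
    if kind == "not":
        return neg(f[1], flipped)
    if kind in ("or", "and"):
        return (kind, nnf(f[1], flipped), nnf(f[2], flipped))
    if kind in ("dia", "box"):
        return (kind, f[1], nnf(f[2], flipped))
    if kind in ("mu", "nu"):
        return (kind, f[1], nnf(f[2], flipped - {f[1]}))
    if kind == "count":
        _, g, h, op, k = f
        return ("count", nnf(g, flipped), nnf(h, flipped), op, k)
    raise ValueError(f"unknown constructor: {kind}")


def neg(f, flipped):
    """Normal form of the negation of f."""
    kind = f[0]
    if kind == "prop":
        return ("not", f)
    if kind == "var":
        return f if f[1] in flipped else ("not", f)
    if kind == "not":
        return nnf(f[1], flipped)
    if kind in ("or", "and"):
        dual = "and" if kind == "or" else "or"
        return (dual, neg(f[1], flipped), neg(f[2], flipped))
    if kind in ("dia", "box"):
        dual = "box" if kind == "dia" else "dia"
        return (dual, f[1], neg(f[2], flipped))
    if kind in ("mu", "nu"):
        dual = "nu" if kind == "mu" else "mu"
        return (dual, f[1], neg(f[2], flipped | {f[1]}))
    if kind == "count":
        _, g, h, op, k = f
        return ("count", nnf(g, flipped), nnf(h, flipped), DUAL[op], k)
    raise ValueError(f"unknown constructor: {kind}")


# not mu x.<up>(a or x)  becomes  nu x.[up](not a and x)
f = ("not", ("mu", "x", ("dia", "up", ("or", ("prop", "a"), ("var", "x")))))
print(nnf(f))
```

As in the text, only the comparison of a counting formula is dualized; its two operands are normalized but not negated.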
   We conclude this Section recalling the complexity of µTLIC.
Theorem 1 ([7]). The satisfiability problem of µTLIC is EXPTIME-complete.


4   Logic characterization
In this Section we give a characterization of RPQCD expressions in terms of
µTLIC formulas.
    First we define a non-data version of data trees. Intuitively, data values in
data trees are represented by children nodes labeled by a fresh proposition δ. For
instance, if a node has value k, then its non-data version has k children labeled
by δ. Then, Presburger formulas can be used to test values in non-data trees.
Definition 8. Provided a data tree Γ = (N, R, L, D), we define the tree
T(Γ) = (N′, R′, L′) as follows:
 – let $N_i$ be a set of $k_i$ new nodes ($N \cap N_i = \emptyset$) induced by data values of
   nodes in N, that is, for each $n_i \in N$ with $D(n_i) = k_i$, then
   $N' = N \cup \bigcup_{i=1}^{|N|} N_i$;
 – let $R_i = \{n_i\} \times \{\downarrow\} \times N_i$, then $R' = R \cup \bigcup_{i=1}^{|N|} R_i$;
 – and let $L_i : N_i \times \{\delta\}$ be left total, then $L' = L \cup \bigcup_{i=1}^{|N|} L_i$, provided δ is a
   proposition not occurring in L, that is, for each n ∈ N, if (n, p) ∈ L, then
   δ ≠ p.
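
The encoding of Definition 8 can be sketched in a few lines. The function name `encode`, the proposition name "delta", the ASCII modality name "down" and the use of integer node identifiers are assumptions of this example; following the definition literally, only children edges are added for the fresh nodes.

```python
from itertools import count


def encode(nodes, edges, labels, data, delta="delta"):
    """Build the non-data tree (N', R', L') of a data tree (N, R, L, D).

    nodes: set of integer node ids, edges: triples (n1, m, n2),
    labels: pairs (n, p), data: dict from nodes to natural numbers.
    """
    if any(p == delta for (_, p) in labels):
        raise ValueError("delta must not occur among the labels")
    fresh = count(start=max(nodes, default=-1) + 1)  # ids of the new nodes
    new_nodes, new_edges, new_labels = set(nodes), set(edges), set(labels)
    for n in sorted(nodes):
        for _ in range(data[n]):  # D(n) fresh children labeled delta
            d = next(fresh)
            new_nodes.add(d)
            new_edges.add((n, "down", d))
            new_labels.add((d, delta))
    return new_nodes, new_edges, new_labels


def delta_children(edges, labels, n, delta="delta"):
    """Number of delta-labeled children of n, i.e. the encoded data value."""
    children = {n2 for (n1, m, n2) in edges if n1 == n and m == "down"}
    return len({c for c in children if (c, delta) in labels})


# Example: node 0 carries value 2 and node 1 carries value 0.
N2, R2, L2 = encode(
    nodes={0, 1},
    edges={(0, "down", 1), (1, "up", 0)},
    labels={(0, "a"), (1, "b")},
    data={0: 2, 1: 0},
)
print(delta_children(R2, L2, 0))  # 2
print(delta_children(R2, L2, 1))  # 0
```

Counting the δ children of a node recovers its data value, which is what allows Presburger formulas to express data tests against constants.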
    We now give a precise translation of regular paths with counting and data
tests in terms of the logic.
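Definition 9. We define a translation function F of RPQCD expressions in
terms of the logic as follows:

$$
\begin{aligned}
F(\top, C) &:= C \wedge \neg\delta & F(p, C) &:= p \wedge \neg\delta \wedge C \\
F(\downarrow, C) &:= \neg\delta \wedge \langle\uparrow\rangle C & F(\uparrow, C) &:= \neg\delta \wedge \langle\downarrow\rangle C \\
F(\rightarrow, C) &:= \neg\delta \wedge \langle\leftarrow\rangle C & F(\leftarrow, C) &:= \neg\delta \wedge \langle\rightarrow\rangle C \\
F(\downarrow^{\star}, C) &:= \neg\delta \wedge \mu x.\langle\uparrow\rangle (C \vee x) & F(\uparrow^{\star}, C) &:= \neg\delta \wedge \mu x.\langle\downarrow\rangle (C \vee x) \\
F(\alpha{:}p, C) &:= F(\alpha, C) \wedge F(p, \top) & F(\rho_1/\rho_2, C) &:= F(\rho_2, F(\rho_1, C)) \\
F(\rho[\beta], C) &:= F(\rho, C) \wedge G(\beta, \top)
\end{aligned}
$$

where δ is a fresh proposition and G is a translation of qualifiers (Definition 10).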
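Definition 10. We define a translation of qualifiers in terms of the logic as
follows:

$$
\begin{aligned}
G(\top, C) &:= C \wedge \neg\delta & G(\alpha, C) &:= F(\overline{\alpha}, C) \\
G(p, C) &:= p \wedge \neg\delta \wedge C & G(\alpha{:}p, C) &:= F(\overline{\alpha}, C \wedge p \wedge \neg\delta) \\
G(\rho_1/\rho_2, C) &:= G(\rho_1, G(\rho_2, C)) & G(\rho[\beta], C) &:= G(\rho, G(\beta, \top) \wedge C) \\
G(\neg\beta, C) &:= \neg G(\beta, C) & G(\beta_1 \vee \beta_2, C) &:= G(\beta_1, C) \vee G(\beta_2, C) \\
G(\rho \equiv k, C) &:= G^{\equiv}(\rho \equiv k, C) & G(\rho_1 - \rho_2 \,\#\, k, C) &:= G^{\#}(\rho_1, C) - G^{\#}(\rho_2, C) \,\#\, k
\end{aligned}
$$

$$
\begin{aligned}
G^{\equiv}(\top \equiv k, C) &:= (\varphi_{\delta} \wedge \langle\uparrow\rangle G(\top, C)) = k \\
G^{\equiv}(p \equiv k, C) &:= (\varphi_{\delta} \wedge \langle\uparrow\rangle G(p, C)) = k \\
G^{\equiv}(\alpha \equiv k, C) &:= G(\alpha, (\varphi_{\delta} \wedge \langle\uparrow\rangle C) = k) \\
G^{\equiv}(\alpha{:}p \equiv k, C) &:= G(\alpha, (\varphi_{\delta} \wedge \langle\uparrow\rangle (C \wedge p)) = k) \\
G^{\equiv}(\rho_1/\rho_2 \equiv k, C) &:= G(\rho_1, G^{\equiv}(\rho_2 \equiv k, C)) \\
G^{\equiv}(\rho[\beta] \equiv k, C) &:= G^{\equiv}(\rho \equiv k, G(\beta, \top) \wedge C) \\
\varphi_{\delta} &:= \delta \wedge \neg p' \wedge \neg\langle\downarrow\rangle\top
\end{aligned}
$$

$$
\begin{aligned}
G^{\#}(\downarrow, C) &:= C \wedge \neg\delta & G^{\#}(\downarrow{:}p, C) &:= p \wedge \neg\delta \wedge C \\
G^{\#}(\downarrow[\beta], C) &:= G(\beta, \top) \wedge C & G^{\#}(\downarrow{:}p[\beta], C) &:= p \wedge \neg\delta \wedge G(\beta, \top) \wedge C
\end{aligned}
$$

provided that $\overline{\alpha}$ is the dual relation of α, more precisely,
$\overline{\downarrow} = \uparrow$, $\overline{\rightarrow} = \leftarrow$,
$\overline{\downarrow^{\star}} = \uparrow^{\star}$, and $\overline{\overline{\alpha}} = \alpha$;
and where p′ represents all other propositions distinct from δ (recall
the set of propositions is finite).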




    Since the translation of paths considers a context represented by formulas,
we now give a non-data version of formulas. Intuitively, context formulas are
interpreted indistinguishably over data and non-data trees.
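Definition 11 (Context formula). Given a formula φ in negation normal
form, its corresponding context formula $\varphi^C$ is inductively defined as follows:

$$
\begin{aligned}
p^C &:= p & (\neg p)^C &:= \neg\delta \wedge \neg p \\
(\varphi \vee \psi)^C &:= \varphi^C \vee \psi^C & (\varphi \wedge \psi)^C &:= \varphi^C \wedge \psi^C \\
(\langle m \rangle \varphi)^C &:= \neg\delta \wedge \langle m \rangle \varphi^C & ([m] \varphi)^C &:= \neg\delta \wedge [m] \varphi^C \\
(\mu x.\varphi)^C &:= \left(\varphi[\mu x.\varphi / x]\right)^C & (\nu x.\varphi)^C &:= \left(\varphi[\nu x.\varphi / x]\right)^C \\
(\varphi - \psi \,\#\, k)^C &:= \varphi^C - \psi^C \,\#\, k
\end{aligned}
$$

Lemma 1. Given any data tree Γ, for any formula φ and any valuation V, we
have that $[\![\varphi^C]\!]^{\Gamma}_{V} = [\![\varphi^C]\!]^{T(\Gamma)}_{V}$.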
   We now describe the main result of this paper: a characterization of regular
paths with counting and data tests in terms of Presburger formulas.
Theorem 2 (Logic characterization of data queries). For any RPQCD
expression ρ, data tree Γ, µTLIC context formula $\varphi^C$, and any valuation V, we
have the following:
 – $[\![\rho]\!]^{\Gamma}_{[\![\varphi^C]\!]^{T(\Gamma)}_{V}} = [\![F(\rho, \varphi^C)]\!]^{T(\Gamma)}_{V}$; and
 – $F(\rho, \varphi^C)$ is of polynomial size with respect to ρ and $\varphi^C$.
  An immediate consequence of Theorems 1 and 2 is an EXPTIME bound for
RPQCD reasoning.
Corollary 1. Reasoning (emptiness, containment and equivalence) on regular
path queries with counting and data tests (RPQCD) is in EXPTIME.

5   Discussion
We introduced an extension of regular path expressions with counting and data
tests. Counting operators express occurrence restrictions on children path
expressions, whereas data tests express (dis)equality relations among paths
with respect to their data values. We gave a characterization of this extension
of regular paths in terms of a Presburger logic originally introduced in [7].
Since the characterization is polynomial and the logic is closed under
negation, the EXPTIME bound of the logic carries over to the emptiness,
containment and equivalence of paths with counting and data tests. As a first
further research perspective, we propose the study of the model checking
problem of the Presburger logic (a quadratic-time model checking algorithm is
known for the logic without converse modalities [19]). This would imply a
complexity bound for the query evaluation of paths with counting and data
tests. As further future work, we propose the study of additional data test
extensions of regular paths, in the setting of expressive modal logics with
efficient Fischer-Ladner reasoning algorithms, as in [7].




Acknowledgment. This work was partially developed under the support of
the Mexican National Science Council (CONACYT) in the scope of the Cátedras
CONACYT project “Infraestructura para Agilizar el Desarrollo de Sistemas
Centrados en el Usuario” (Ref 3053).

References
 1. ten Cate, B., Litak, T., Marx, M.: Complete axiomatizations for XPath fragments.
    J. Applied Logic 8(2) (2010) 153–172
 2. Bárcenas, E., Genevès, P., Layaïda, N., Schmitt, A.: Query reasoning on trees
    with types, interleaving, and counting. In: IJCAI, International Joint Conference
    on Artificial Intelligence. (2011)
 3. Figueira, D., Figueira, S., Areces, C.: Model theory of XPath on data trees. Part I:
    bisimulation and characterization. J. Artif. Intell. Res. (JAIR) 53 (2015) 271–314
 4. Bárcenas, E., Lavalle, J.: Global numerical constraints on trees. Logical Methods
    in Computer Science 10(2) (2014)
 5. Figueira, D.: On XPath with transitive axes and data tests. In: Symposium on
    Principles of Database Systems, PODS. (2013) 249–260
 6. Figueira, D.: Decidability of downward XPath. ACM Trans. Comput. Log. 13(4)
    (2012) 34
 7. Bárcenas, E., Lavalle, J.: Expressive reasoning on tree structures: Recursion,
    inverse programs, Presburger constraints and nominals. In: Mexican International
    Conference on Artificial Intelligence, MICAI. (2013) 80–91
 8. Jacquemard, F., Segoufin, L., Dimino, J.: FO2(<, +1, ∼) on data trees, data tree
    automata and branching vector addition systems. Logical Methods in Computer
    Science 12(2) (2016)
 9. Bojańczyk, M., Place, T.: Toward model theory with data values. In: Automata,
    Languages, and Programming - International Colloquium, ICALP. (2012)
10. Bojańczyk, M., David, C., Muscholl, A., Schwentick, T., Segoufin, L.: Two-variable
    logic on data words. ACM Trans. Comput. Log. 12(4) (2011) 27
11. Bojańczyk, M., Muscholl, A., Schwentick, T., Segoufin, L.: Two-variable logic on
    data trees and XML reasoning. J. ACM 56(3) (2009)
12. Figueira, D.: Forward-XPath and extended register automata on data-trees. In:
    Database Theory - ICDT, International Conference. (2010)
13. Figueira, D., Segoufin, L.: Bottom-up automata on data trees and vertical XPath.
    In: Symposium on Theoretical Aspects of Computer Science, STACS. (2011)
14. Dal-Zilio, S., Lugiez, D., Meyssonnier, C.: A logic you can count on. In: Symposium
    on Principles of Programming Languages, POPL. (2004)
15. Seidl, H., Schwentick, T., Muscholl, A.: Counting in trees. In: Logic and Automata.
    (2008)
16. Demri, S., Lugiez, D.: Complexity of modal logics with Presburger constraints. J.
    Applied Logic 8(3) (2010) 233–252
17. Kupke, C., Pattinson, D., Schröder, L.: Reasoning with global assumptions in
    arithmetic modal logics. In: Fundamentals of Computation Theory, FCT. (2015)
18. Barguñó, L., Creus, C., Godoy, G., Jacquemard, F., Vacher, C.: Decidable classes
    of tree automata mixing local and global constraints modulo flat theories. Logical
    Methods in Computer Science 9(2) (2013)
19. Bárcenas, E., Benítez-Guerrero, E., Lavalle, J.: On the model checking of the
    graded µ-calculus on trees. In: Mexican International Conference on Artificial
    Intelligence, MICAI. (2015)



