FO Queries Strongly Distributing over
        Components in Arbitrary Cardinality

                               Francesco Di Cosmo

                           dicosmo.francesco@gmail.com


      Abstract. In previous works a coordination-free strategy to compute
      Datalog with negation queries over databases distributed over many com-
      putational nodes has been studied, providing a syntactic characterization
      and proving the undecidability of the problem of deciding whether a
      query distributes. In view of a recast for FO queries, we report about
      a work in progress, namely the study of FO queries strongly distribut-
      ing over components. We prove that the decision problem of establishing
      whether a FO query strongly distributes over components is undecidable
      and highlight how some syntactic bonds typical of Datalog with negation,
      namely safeness, play a crucial role in specifying strongly distributive FO
      queries.

      Keywords: Strong distribution · Undecidability · Safeness


1   Introduction
In [1] the problem of establishing which Datalog¬ queries distribute over com-
ponents has been studied, i.e. to determine those queries q such that, for any
database D, the following holds:
                                        [
                              q(D) =         q(C)
                                     C∈cc(D)

Here by cc (D) we denote the set of connected components of D (formally in-
troduced in sec. 2.1). Informally, for example, if the database is a set of family
trees, then each tree is a component, where its memebers are connected by family
relationships.
    The aforementioned problem arises in the context of looking for a parallelized
coordination-free query computation strategy over databases distributed across
many computational nodes [2]. Indeed, we can interpret the right-hand term as
the result of the following strategy:
1. Facts of the whole database are stored in a scattered fashion over many
   computational nodes. Hence, for each node there is a local database.
2. During a preliminary phase, nodes can communcate with each other to up-
   date local databases asking and getting all and only facts about individuals
   from the local database. In the end, each node will host some connected
   components of the global database, say one.
 3. Each node computes the query over its local database, neglecting coordina-
    tion with other nodes, getting a set of local answers.
 4. All the sets of local answers are merged through the union set operator. This
    is the result of the computation.
   By Datalog¬ we refer to a variant of standard Datalog¬ [3] whose queries
are expressible through a program P and a goal g, such that:
 – P is a set of rules like:
                                        H←B
   where H, the head, is an asserted literal, over an intentional vocabulary,
   satisfying safeness, in the sense that its variables occurs also in B, the body,
   and B is a conjunction of literals, over the same vocabulary extended with
   an extensional one, also satisfying safeness, in the sense that every variable
   occurring in a negated literal in a rule occurs also in an asserted literal in
   the body of the same rule.
 – g is a rule without head, like:
                                       ⊥←B
In both program and goal, neither constants nor equality are allowed. We say
that a specification (P, g) is connected if, for each body B in (P, g), the graph
(V, E) is connected, where V is the set of variables occurring in B and {x, y} ∈ E
iff x and y occurs in the same asserted literal. For example the rule:
                           E 0 (x, w) ← E(x, y) ∧ E(z, w)
is not connected, but the following one is it:
                           E 0 (x, w) ← E(x, y) ∧ E(y, w)
     Two results have been established, namely that:
 1. A Datalog¬ query distributes over components iff it is specifiable by a con-
    nected specification;
 2. The problem of deciding whether, given a Datalog¬ query q, q distributes
    over components is undecidable.
These results are proved exploiting recursive specifications. With a view to re-
casting these results in absence of recursion, in this paper we report on a work
in progress, namely the study of first order (FO) queries distributing over com-
ponents.1 Moreover, in this preliminary discussion, we abstract databases with
purely relational structures of arbitrary cardinality and consider a stronger ver-
sion of distribution over components, named strong distribution over compo-
nents, i.e. those queries such that, for any structure S:
                                  [                 [
                         q(S) =         q(S 0 ) =         q(C)
                                 S 0 ⊂S           C∈cc(S)

The questions we want to answer are:
1
    We consider FO queries because, except for some details, they have the same ex-
    pressive power as Datalog¬ without recursion (see Codd’s theorems [3]).
 1. Is it possible to characterize FO queries strongly distributing over compo-
    nents by syntactics means as in [1]?
 2. Is the problem of deciding whether a FO query strongly distributes over
    components decidable?


2     Preliminaries

2.1    Structures and connected components

Let L be a relational vocabulary, i.e. a finite non empty set of relation symbols
R/n, with positive arity n > 0, and constant symbols c.2 A structure S over L
is a not empty set Dom(S), called the domain of S, enriched by interpretations
RS ⊂ Dom(S)n , for each relation symbol R/n ∈ L, and cS ∈ Dom(S), for each
constant symbol c ∈ L.
    We say that a structure S 0 is a substructure of S (S is a superstructure of
S ), S 0 ⊂ S, iff:
  0


 1. Dom(S 0 ) ⊂ Dom(S);
                                        0
 2. for each relation symbol R/n ∈ L, RS = RS ∩ Dom(S 0 )n ;
                                      0
 3. for each constant symbol c ∈ L, cS = cS .3

Given a subset A ⊂ Dom(S) such that, for each constant symbol c ∈ L, cS ∈ A ,
A generates a substructure A of S, the only one such that Dom(A) = A .
   The underlying graph of a structure S over L is the graph (V, E), where
V = Dom(S) and {a, b} ∈ E iff there is a relation symbol R/n ∈ L and a
n-tuple τ such that a, b ∈ τ ∈ RS . A connected component of S is a substruc-
ture generated by a connected component of its underlying graph. The set of
connected components of S is denoted cc (S). If S admits only one connected
component, then we say that S is connected.
   Directed graphs are simple examples of relational structures. For instance
the graph with vertex set V = {v1 , v2 , v3 } and edge set E = {(v1 , v2 ), (v3 , v3 )}
can be considered as a structure S over the purely relational language L =
{F/2}, where Dom(S) = V and F S = E. A substructure of S could be S 0 with
                               0
Dom(S 0 ) = {v1 , v3 } and F S = {(v3 , v3 )}, while the connected components of S
are the structures ({v1 , v2 }, {(v1 , v2 )}) and ({v3 , v3 }, {(v3 , v3 )}).


2.2    Preservation theorems

A FO formula ϕ (x1 , . . . , xn ) over a vocabulary L is preserved under superstruc-
tures iff, for any two structures S 0 ⊂ S over L:

        ∀a1 , . . . , an ∈ Dom(S 0 ) S 0 |= ϕ (a1 , . . . , an ) ⇒ S |= ϕ (a1 , . . . , an )
2
    Note that no function symbols are allowed.
3
    Therefore, cS ∈ Dom(S 0 ) holds.
whereas, ϕ is preserved under substructures iff the vice versa is valid, i.e.:

        ∀a1 , . . . , an ∈ Dom(S 0 ) S |= ϕ (a1 , . . . , an ) ⇒ S 0 |= ϕ (a1 , . . . , an )

The following theorems [4] hold also for vocabularies involving function symbols.
Theorem 1 (Preservation theorems).
1. A formula ϕ is preserved under superstructures iff it is equivalent to a for-
   mula in Σ1 , the set of prenex existential formulas.
2. A formula ϕ is preserved under substructures iff it is equivalent to a formula
   in Π1 , the set of prenex universal formulas.
For example, ∃x x = x is a valid Σ1 sentence.4 Since it is valid, it is preserved
under substructures and, by preservation theorem, it must be equivalent to a
(valid) Π1 sentence, like ∀x x = x. Finally, note that a quantifier-free formula is
both Σ1 and Π1 .

2.3    FO queries
Let L be a purely relational vocabulary, i.e. a relational vocabulary also without
constant symbols. A FO specification over L is a couple (ϕ, Ξ), where ϕ is a
FO formula over L and Ξ is a sequence of all elements in Var(ϕ),5 the set of
free variables in ϕ. A FO specification (ϕ (x1 , . . . , xn ) , Ξ) over L specify the FO
query q(ϕ,Ξ) such that, for any structure S over L:

                    q(S) = {(h(x))x∈Ξ |S |= ϕ (h(x1 ), . . . , h(xn ))}

Given a FO query q we say that:
 1. q is monotonic iff, for any two structures S 0 ⊂ S:

                                            q(S 0 ) ⊂ q(S)

 2. q is local iff, for any structure S and a ∈ q(S):

                                     ∃C ∈ cc (S)        a ∈ q(C)

It is straightforward to prove that a FO query strongly distributes (as defined
in sec. 1) iff it is both monotonic and local. Hence, we will split the study of
strongly distributive queries in the study of monotonicity and locality. Lastly, it
will be useful the notion of closure by constants:
Definition 1. Given a FO formula ϕ (x1 , . . . , xn ) over a vocabulary L, let L0 be
the extension of L with n new constant symbols c1 , . . . , cn . The closure by con-
stants ϕc of ϕ is the sentence over L0 obtained replacing in ϕ each free variables
xi with the constant ci , for any i ∈ {1, . . . , n}.
4
    It is valid because the domain of a structure cannot be empty.
5
    Eventually with repetitions.
3    Monotonicity
We now characterize monotonic FO queries as those queries specifiable by a Σ1
formula.
Proposition 1. Let ϕ (x1 , . . . , xn ) be a FO formula over a purely relational
vocabulary L. A FO query q(ϕ,Ξ) is monotonic iff ϕ is preserved under super-
structures.

Proof.
⇒: Let S 0 ⊂ S and a1 , . . . , an ∈ Dom(S 0 ) such that S 0 |= ϕ (a1 , . . . , an ). Hence,
   there is at least a valuation h such that h(xi ) = ai for any i ∈ {1, . . . , n},
   (h(x))x∈Ξ ∈ q(S 0 ) and, by hypothesis, (h(x))x∈Ξ ∈ q(S). So S |= ϕ (a1 , . . . , an ).
⇐: Similar to previous.

Applying the preservation theorem 1, we obtain the following theorem.
Theorem 2. A FO query q(ϕ,Ξ) is monotonic iff ϕ (ϕc ) is equivalent to a Σ1
formula (sentence).
Therefore, to check whether a FO formula q(ϕ,Ξ) is monotonic amounts to check
if the closure by constants ϕc is equivalent to a Σ1 sentence. We now prove that
this last check is not algorithmically possible.
Definition 2. Let F1 , F2 be two fragments of FO sentences over a vocabulary L.
With Eq(F1 , F2 ) we denote the decision problem of establishing whether, given
a sentence ϕ1 ∈ F1 ,
                              ∃ϕ2 ∈ F2 ϕ1 ↔ ϕ2

Lemma 1. Let F1 , F2 be two fragments of FO over a vocabulary L such that:
 1. SAT (F1 ), the decision problem of satisfiability of a sentence in F1 , is unde-
    cidable;
 2. SAT (F2 ) is decidable;
 3. F1 contains all contradictory sentences.
Then, Eq(F1 , F2 ) is undecidable.

Proof. If Eq(F1 , F2 ) were decidable through an algorithm A, then, given a sen-
tence ϕ ∈ F1 as input to A,
 – If the output is negative, then ϕ is not contradictory, namely satisfiable;
 – If the output is positive, then there is at least one ϕ0 ∈ F2 such that ϕ ↔ ϕ0
   is valid and, by completeness of FO, also derivable. Since the set of derivable
   sentences is recursively enumerable, it is possible to algorithmically build up
   at least one such ϕ0 . Recalling that SAT (F2 ) is decidable, it is possible to
   decide if ϕ0 , so ϕ, is satisfiable.
In both cases it would be possible to decide whether ϕ is satisfiable and SAT (F1 )
would be decidable, which contradicts the hypothesis 1.
Theorem 3. Let L be a relational vocabulary sufficiently expressive, i.e. with at
least one relational symbol R/n with n ≥ 2. Let also σ1 and π1 be, respectively,
the set of existential sentences and universal sentences over L. Denoting with
F O the set of all FO sentences over L, Eq(F O, σ1 ) and Eq(F O, π1 ) are both
undecidable.

Proof. It is well-known that the Bernays-Schönfinkel-Ramsey fragment (BSR),
i.e. the set of FO prenex sentences (without function symbols)6 and with a prefix
like ∃∗ ∀∗ , is such that SAT (BSR) is decidable [5]. Clearly, BSR contains both σ1
and π1 and they both contain all contradictory sentences.7 Since L is sufficiently
expressive, SAT (F O) is undecidable [6]. Thereby, we can apply the previous
lemma and obtain the thesis.

    We can summarize previous results in the following corollary:

Corollary 1. The decision problem of establishing whether, given a FO query
q(ϕ,Ξ) over a sufficiently expressive vocabulary, q(ϕ,Ξ) is monotonic is undecid-
able.

Since any contradictory formula specifies a local FO query,8 the same argument
used in lemma 1 can be reused for the decision problem of establishing whether
a FO formula ϕ specifies a strongly distributive query, i.e. if ϕ is such that:

 1. ϕ specifies a local FO query;
 2. ϕc is equivalent to a Σ1 sentence.

So, we can conclude also the following corollary:

Corollary 2. The decision problem of establishing wheter, given a FO query
q(ϕ,Ξ) over a sufficiently expressive vocabulary, q(ϕ,Ξ) strongly distributes is un-
decidable.


4    Locality

Due to the previous section, we focus only on Σ1 formulas. Moreover, here we
consider only quantifier-free disjunctive normal form (DNF) formulas without
=.9 However, we report only about two syntactic bonds over conjunctions and
6
  Recall that L is a purely relational vocabulary, hence function simbols would not
  occour anyway.
7
  All contradictions are equivalent and ∃x x 6= x and ∀x x 6= x are two of them,
  the first in σ1 and the second in π1 .
8
  Because, for any structures S, q(S) = ∅ holds.
9
  Anyway, we are confident that adding equality would not change the core ideas
  of what follows, but would only require some more technicality, like an equality
  propagation procedure as the union-find algorithm in [7]. Yet, we do not prove it
  here.
disjunctions necessary to admit locality, referable to safeness of Datalog¬ rules
through the process of rectification and unfolding [3].10
     We say that a formula ϕ (x1 , . . . , xn ) is local iff it specifies only local queries,
i.e. iff, for any structure S and any a1 , . . . , an ∈ Dom(S):11

              S |= ϕ (a1 , . . . , an ) ⇒ ∃C ∈ cc (S)   C |= ϕ (a1 , . . . , an )


4.1     Conjunctions

Since a contradictory conjunction of literals is local, we will consider only sat-
isfiable formulas. First, we focus on negative conjunctions, i.e. where all literals
are negated, then, we take into account the remaining.

Theorem 4. Let L be a purely relational vocabulary and ϕ (x1 , . . . , xn ) be a
satisfiable negative conjunction of literals over L. Then ϕ is local iff n = 1.

Proof. Since ϕ is satisfiable, it is not possible that in ϕ occur both an asserted
literal and its negation.

⇒: Let S be a structure such that:
    • Dom(S) = {a1 , . . . , an }, where ai 6= aj if i 6= j;
    • for any relation symbol R/m ∈ L, S R = ∅.12
   Therefore, each a ∈ Dom(S) forms a connected component and, since ϕ is
   negative:
                                   S |= ϕ(a1 , . . . , an )
   Since ϕ is local, (a1 , . . . , an ) must lay on a single connected component. This
   is possible only if n = 1.
⇐: By hypothesis, ϕ is of the form ϕ(x). Then, let S be a structure and a ∈
   Dom(S) such that:
                                            S |= ϕ(a)
      Clearly, ∃C ∈ cc (S) such that a ∈ Dom(C) and, by preservation theorem 2,
      also:13
                                       C |= ϕ(a)
      By arbitrariness of S and a, ϕ is local.

Theorem 5. Let L be a purely relational vocabulary and ϕ (x1 , . . . , xn ) a not
negative satisfiable conjunction of literals over L. If ϕ is local, then ϕ is safe,
i.e. any variable occurring in a negated literal ocours also in an asserted literal.
10
   Rectification and unfolding are those processes that allow to translate Datalog¬
   without recursion in FO.
11
   Clearly, it follows also that a1 , . . . , an ∈ C.
12
   I.e. S can be considered a plain set.
13
   Each quantifier-free formula is also a Π1 formula.
Proof.
V       Let ϕ (x1 , . . . , xn ) be a not negative satisfiable conjunction of literals
  i∈I Li , where x ∈ Var(ϕ) occurs only in negated literals. Say that x is x1 . Since
ϕ is not contradictory, then there is a structure S and a1 , . . . , an ∈ Dom(S) such
that: S |= ϕ (a1 , . . . , an ). Now consider the structure S 0 , obtained from S adding
a new element a to Dom(S). Therefore, a does not take part in the interpretative
part of S 0 and, clearly:
                                    S 0 |= ϕ (a, a2 , . . . , an )
Since {a} is the domain of a connected component of S 0 , the sequence
(a, a2 , . . . , an ) does not lay on a single connected component and so ϕ is not
local.

4.2   Disjunctions
Theorem    W 6. Let L be a purely relational vocabulary and ϕ (x1 , . . . , xn ) a dis-
junction i∈I ϕi , where ϕi is a satisfiable Σ1 formula over L for each i ∈ I. If ϕ
is local, then ϕ is regular, i.e. all disjuncts share the same set of free variables.
                                                            W
Proof. Let ϕ (x1 , . . . , xn ) be a satisfiable disjunction i∈I ϕi , where ϕi ∈ Σ1 for
each i ∈ I. Suppose there are indices i, j ∈ I such that Var(ϕi ) 6= Var(ϕj ),
say x ∈ Var(ϕi ) \ Var(ϕj ) and suppose that x is x1 . Since ϕj (y1 , . . . , ym ) is
satisfiable, there is a structure S and a1 , . . . , am ∈ Dom(S) such that:
                                 S |= ϕj (a1 , . . . , am )
Let h be a valuation such that h(yi ) = ai for each i ∈ {1, . . . , m}. Now, as in the
previous proof, consider a structure S 0 , obtained from S adding an element a to
Dom(S). Then, the valuation k, obtained from h putting k(x) = a, still satisfies
ϕj in S 0 (because x 6∈ Var(ϕj )). So, by semantic of disjunction:
                              S 0 |= ϕ (k(x1 ), . . . , k(xn ))
Since {k(x1 )} = {a} is the domain of a connected component of S 0 ,
(k(x1 ), . . . , k(xn )) does not lay on a single connected component and so ϕ is
not local.

5     Conclusions and further work
We have proved that, for sufficiently expressive vocabularies, the decision prob-
lem of establishing if a FO query strongly distributes over components is unde-
cidable. This has been possible through a contrast with the Entscheidungsprob-
lem of satisfiability of F O formulas. However, we remark that we considered
FO in its full expressive power, ignoring those syntactical bonds stemming from
rectification and unfolding, i.e. reflecting Datalog¬ safeness. Would something
change if those bonds were considered? In fact, tackling the problem of local-
ity, we showed that those bonds, in the form of safe conjunctions and regular
disjunctions, are necessary conditions to allow locality of DNF quantifier-free
formulas. This preliminary result should be extended to a full classification of
FO formulas specifying local FO queries.
References
1. Ameloot, T.J., Ketsman, B., Neven, F., Zinn, D.: Datalog queries distributing over
   components. ACM TOCL 18(1), 1–35 (2017)
2. Ameloot, T.J., Ketsman, B., Neven, F., Zinn, D.: Weaker forms on monotonicity for
   declarative networking: a more fine-grained answer to the CALM-conjecture. ACM
   TODS 40(4), 1–45 (2016)
3. Ullman, J.D.: Principles of Database and Knowledge-Base Systems, Volume I. Com-
   puter Science Press, Rockville, Maryland (1989)
4. Chang, C.C., Keisler, H.: Model theory. Elsevier (1990)
5. Ramsey, F.P..: On a problem of formal logic. In: Gessel, I., Rota, G., CLASSIC PA-
   PERS IN COMBINATORICS 2009, pp. 1–24. Birkhäuser Boston (Springer) (2009)
6. Börger, E., Grädel, E., Gurevich, Y.: The classical decision problem.2nd edn.
   Springer Verlag (Springer Science & Business Media) (2001)
7. Aho, A.V., Hopcroft, J. E.: The design and analysis of computer algorithms.
   Addison-Wesley Longman Publishing Company, Inc., Boston, MA, USA (1974)