    Answer set computation of negative two-literal
     programs based on graph neural networks:
                 Preliminary results

                         Antonio Ielo1 and Francesco Ricca1
                             1 University of Calabria, Italy
                      ai6chr@gmail.com - ricca@mat.unical.it




        Abstract. Graph neural network architectures have been successfully
        applied to combinatorial optimization problems such as SAT and MAX-
        CSP. In this work in progress we attempt to adapt similar techniques to
        Answer Set Programming, in particular to negative two-literal programs,
        since they represent a class of programs of practical and theoretical
        interest whose answer sets have a full graph-theoretical characterization.

        Keywords: Answer Set Programming · Graph Neural Networks



1     Introduction

In recent years, deep learning techniques have been successfully applied to the
combinatorial optimization domain [3]. This has been possible mainly thanks to
graph neural network architectures [12], whose inductive biases (mainly per-
mutation invariance) are well suited to exploit the graph encodings of problem
instances that frequently and naturally occur in this field. Graph neural net-
works were recently applied to SAT [14, 13] and CSP [15], providing interesting
results for the approximate solution of these problems. As far as we know, such
approaches have not been attempted on Answer Set Programming (ASP) [2].
ASP is an established logic-based programming paradigm which has been suc-
cessfully applied for solving complex problems arising in Artificial Intelligence.
    In this paper we approach the application of graph neural networks to ASP.
We focus on negative two-literal ASP programs [11], a relevant subclass of ASP
programs that is able to model all problems in NP. Negative two-literal programs
have a natural correspondence with graphs, and thus fit well the graph neural
network architecture. In particular, we build on the approach of RUNCSP [15]
and adapt its graph neural network architecture to handle ASP programs. Pre-
liminary experimental results show the strengths and weaknesses of the approach
and pave the way for future improvements and applications.

    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
2    Message Passing Graph Neural Networks
In recent years there has been a surge of interest in machine learning tech-
niques able to process graph data [4]; hence, an abundance of graph neural
network architectures and frameworks have been defined.
    We briefly recall the message passing framework [6, 7], as it is the most
relevant for our work. We refer the reader to [1] for an up-to-date reference on
graph neural network architectures and their applications.
    Let G be a directed graph. We denote the set of its vertices and the set of
its edges as V(G) and E(G), respectively. Let {{A}} denote the set of multisets
over the elements of a set A. Let N(v) ⊆ V(G) denote the neighbourhood of the
node v ∈ V(G), that is, the set of nodes w ∈ V(G) such that (v, w) ∈ E(G). Let


                                  M : R^k × R^k → R^k
                                  U : R^k × R^k → R^k
                                  C : {{R^k}} → R^k
                                  R : R^k → R^d


be functions, and let h_v ∈ R^k be a real-valued vector associated with the vertex
v ∈ V(G). A message passing round consists of the following computation:

                   h_v^{i+1} = U(h_v^i, C({{M(h_v^i, h_{v'}^i) : v' ∈ N(v)}}))            (1)
Here h_v^i represents the state associated with each node at the i-th iteration of mes-
sage passing. Neighbouring nodes exchange messages, generated by the message
function M. Each node aggregates incoming messages through the aggregating
function C and updates its own state through the update function U. Finally,
the readout function R can be used to project each node state from R^k to R^d and
represents the output of the neural network on a given node. This computation
is iterated up to T times. If the above-mentioned functions are differentiable
(as those implemented through feedforward neural networks or recurrent neural
networks), then the whole system, which we will refer to as a message passing
neural network, can be trained through standard deep learning optimization
techniques such as backpropagation [7].
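    To make the message passing scheme of Equation (1) concrete, the following Python sketch performs one round over a graph stored as adjacency lists. The concrete choices for M (a linear map of the neighbour's state), C (the mean) and U (a tanh-activated linear update), as well as the weight matrices, are illustrative placeholders of our own, not the functions used later in this paper.

# Minimal sketch of one message passing round (Equation 1).
# W_msg, W_self, W_nb are placeholder weight matrices standing in for M and U.
import numpy as np

def message_passing_round(h, neighbours, W_msg, W_self, W_nb):
    # h: dict node -> state vector in R^k.
    # neighbours: dict node -> list of neighbours N(v).
    new_h = {}
    for v, h_v in h.items():
        # M(h_v, h_w): here simply a linear function of the neighbour's state.
        messages = [W_msg @ h[w] for w in neighbours.get(v, [])]
        # C: aggregate the multiset of incoming messages, here by averaging.
        aggregated = np.mean(messages, axis=0) if messages else np.zeros_like(h_v)
        # U: update the node state from its previous state and the aggregate.
        new_h[v] = np.tanh(W_self @ h_v + W_nb @ aggregated)
    return new_h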


3    Answer Set Programming
We now recall some basic notions regarding ASP.

Syntax. Let C be a set of constants. A predicate p is a subset of C^k, where k is
known as the arity of p. A ground atom is an expression p(c_1, ..., c_k) where p is
a predicate of arity k and (c_1, ..., c_k) ∈ p. If a is an atom, ¬a is its opposite atom
and we say ¬a is the negation of a.
   A rule r is an expression:

                            a_0 ← a_1, ..., a_k, ¬a_{k+1}, ..., ¬a_n                       (2)
where the a_i are ground atoms. a_0 is the head of the rule, which we denote by H(r),
while a_1, ..., a_k, ¬a_{k+1}, ..., ¬a_n is the body of the rule, which we denote by B(r). We
can partition B(r) into two sets B^+(r) = {a_1, ..., a_k} and B^-(r) = {a_{k+1}, ..., a_n},
respectively the set of positive atoms and the set of negative atoms in the rule's body.
   A rule is positive if B^-(r) is empty, and negative if B^+(r) is empty.
   A (normal) logic program is a set of rules. A program is positive
(negative) if all its rules are positive (negative).

Semantics. Let A be the set of atoms occurring in a program P and let X ⊆ A
be a set of atoms. We say that X satisfies a rule r if, whenever B^+(r) ⊆ X and
B^-(r) ∩ X = ∅, then H(r) ∈ X. If X satisfies all the rules of a program P then
X is a model of P. A model X is said to be minimal if there does not exist a
model X' such that X' ⊂ X. Positive logic programs admit a unique minimal model.
    Let X ⊆ A. The reduct of a program P with respect to X is the program P^X
that contains a rule H(r) ← B^+(r) for each rule r of P such that B^-(r) ∩ X = ∅.
An answer set of P is a set X of atoms such that X is a minimal model of P^X.
A program is consistent if it admits at least one answer set, and incoherent otherwise.
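    As a concrete illustration, the following Python sketch checks whether a set of atoms is an answer set of a normal program by computing the reduct and then the least model of the resulting positive program through the usual immediate-consequence fixpoint; the encoding of rules as (head, positive_body, negative_body) triples is our own convention.

# Sketch: answer set check via the reduct. A rule is a triple
# (head, positive_body, negative_body), with bodies given as sets of atoms.
def reduct(program, X):
    # Keep H(r) <- B+(r) for every rule whose negative body is disjoint from X.
    return [(h, bp) for (h, bp, bn) in program if not (bn & X)]

def least_model(positive_program):
    # Least model of a positive program, obtained by iterating the T_P operator.
    model, changed = set(), True
    while changed:
        changed = False
        for head, body in positive_program:
            if body <= model and head not in model:
                model.add(head)
                changed = True
    return model

def is_answer_set(program, X):
    # For normal programs, X is an answer set iff it is the least model of P^X.
    return least_model(reduct(program, X)) == X

# Example: the negative two-literal program {a <- not b, b <- not a}
# has exactly two answer sets, {a} and {b}.
P = [("a", set(), {"b"}), ("b", set(), {"a"})]
assert is_answer_set(P, {"a"}) and is_answer_set(P, {"b"})
assert not is_answer_set(P, {"a", "b"})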

Supportedness. Let P be a logic program over a set of atoms A. Given an
answer set S of P, we say that an atom a ∈ S is supported in S if there exists a
rule r in P such that H(r) = a, B^+(r) ⊆ S and B^-(r) ∩ S = ∅. It is folklore that
all atoms in an answer set are supported.
    It is known that negative two-literal programs [11, 16] are expressive enough
to encode combinatorial problems over graphs of practical interest [8]. Indeed, de-
ciding whether a negative two-literal program is consistent is known to be an NP-
complete problem [10]. Moreover, as will become clearer in the following, answer
sets of negative two-literal programs have a natural correspondence with graphs.
For this reason we concentrate our attention on this subclass of ASP programs.

Negative two-literal programs [11, 16]. A negative two-literal program (N2LP)
is a negative logic program whose rules are all of the form a ← ¬b.

Properties of N2LP. We now recall some relevant properties of N2LP [16].

Proposition 1. Each normal logic program is equivalent to a negative two-
literal program under answer set semantics.

Proposition 2. Let P be a negative two-literal program on (a set of atoms) A.
Then S is an answer set of P if and only if the following two conditions are satisfied:

 – If b_1, b_2 ∈ A \ S then b_1 ← ¬b_2 is not a rule of P.
 – If a ∈ S then there exists b ∈ A \ S such that a ← ¬b is a rule of P.
    Let P be a negative two-literal program. We can associate with each negative
two-literal program a graph G(P), whose nodes are the atoms of P and where there
exists an edge (α, β) whenever P contains a rule of the form β ← ¬α.
    Recall that W ⊆ V(G) is an independent set of a directed graph G if E(G)
does not contain an edge (w_1, w_2) with w_1, w_2 ∈ W. If W is an independent set of
G and, for all v ∈ V(G) \ W, W ∪ {v} is not an independent set of G, then we say W
is a maximal independent set. If for all edges (a, b) ∈ E(G) we have that a ∈ W
or b ∈ W, we say that W dominates G. A kernel of G is a maximal independent set
of G that dominates G. Thus, one can restate the latter proposition as follows:

Proposition 3. Let P be a negative two-literal program, G(P ) its associated
graph. Then S is an answer set of P if and only if V (G(P )) \ S is a kernel of
G(P ).

    Thus, one can tackle the computation of answer sets of N2LP by devising
an approach able to compute kernels of a graph. This is the key observation
underlying our approach, in which we extend a graph neural network solving the
independent set problem so as to identify kernels.
    Since the answer sets of negative two-literal programs can be characterized
in a graph-theoretical way, they represent a natural starting point for exploring
the feasibility of techniques based on graph neural networks. Indeed, no encoding
or feature extraction is necessary: directed graphs are negative two-literal pro-
grams, and we will talk about graphs/programs and nodes/atoms as is more
convenient, with no ambiguity.
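    Concretely, this correspondence can be exercised in a few lines of Python: the sketch below builds G(P) as a directed graph from the rules b ← ¬a and tests a candidate set of atoms against the two conditions of Proposition 2 (equivalently, against the kernel condition of Proposition 3 on the complement). The encoding of rules as (head, negated_atom) pairs is our own illustrative convention.

# Sketch: the graph G(P) of a negative two-literal program and an answer set
# test based on Proposition 2. A rule b <- not a is encoded as the pair (b, a).
import networkx as nx

def program_graph(atoms, rules):
    G = nx.DiGraph()
    G.add_nodes_from(atoms)
    # Edge (a, b) whenever the program contains the rule b <- not a.
    G.add_edges_from((a, b) for (b, a) in rules)
    return G

def is_answer_set_n2lp(atoms, rules, S):
    outside = set(atoms) - set(S)
    # Condition 1: no rule b1 <- not b2 with both b1 and b2 outside S.
    cond1 = all(not (b in outside and a in outside) for (b, a) in rules)
    # Condition 2: every atom of S has a rule a <- not b with b outside S.
    cond2 = all(any(h == a and b in outside for (h, b) in rules) for a in S)
    return cond1 and cond2

# Example: P = {a <- not b, b <- not a, c <- not a}.
atoms = {"a", "b", "c"}
rules = [("a", "b"), ("b", "a"), ("c", "a")]
G = program_graph(atoms, rules)                         # edges (b,a), (a,b), (a,c)
assert is_answer_set_n2lp(atoms, rules, {"b", "c"})     # V(G(P)) \ S = {a} is a kernel
assert is_answer_set_n2lp(atoms, rules, {"a"})
assert not is_answer_set_n2lp(atoms, rules, {"a", "c"}) # c is not supported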


4   A Graph Neural Network for N2LP

The architecture of the network extends RUNCSP [15]. RUNCSP is a message
passing neural network architecture that finds, in an unsupervised fashion, ap-
proximate solutions to MAX-CSP problems by attempting to minimize a loss
function that penalizes solutions which violate constraints. In the simplified
context of our problem we need to describe the functions M, U, C, R (see Section 2).
    The message function (M) is a linear function of the nodes' hidden states.
The aggregating function (C) is the average function. The update function (U)
is an LSTM cell [17]. The readout function (R) is a linear function followed by a
sigmoid activation, so as to obtain outputs in [0, 1] for node classification. An
output close to 1 for a node v means that the corresponding atom is part of the
candidate answer set. The loss function can be defined as:
              L(P, t) = (1/|R|) ∑_{(a,b)∈R} −log(1 − (1 − ψ^t(a)) · (1 − ψ^t(b)))           (3)

where R is the set of rules b ← ¬a of a logic program P and ψ^t represents the
neural network's readout function at the t-th message passing round. The readout
function output on an atom embedding is interpreted as the probability that
the atom belongs to the candidate answer set of P. In the original architecture,
the loss ("constraint loss", as it is referred to in the original article [15]) of all the T
message passing iterations is combined with a discount factor in the following
way:

                            L_tot(P) = ∑_{t=1}^{T} λ^{T−t} L(P, t)                       (4)

    Intuitively, this penalizes the network for violating constraints in the later
message passing rounds, while violations are less relevant in the earlier rounds. The
term (1 − ψ^t(a)) · (1 − ψ^t(b)) amounts to the probability that the rule b ← ¬a is
violated in the candidate answer set.
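    A compact PyTorch sketch of Equations (3) and (4) is given below, assuming the per-round readouts ψ^t are stacked in a tensor and the rules b ← ¬a are stored as pairs of atom indices; the tensor names, shapes and the small numerical-stability constant are our own assumptions, not the original RUNCSP implementation.

import torch

def constraint_loss(psi_t, rules, eps=1e-8):
    # Equation (3). psi_t: tensor of shape (num_atoms,) with readout
    # probabilities at round t; rules: LongTensor of shape (num_rules, 2),
    # one row (a, b) per rule b <- not a (indices of the atoms).
    a, b = rules[:, 0], rules[:, 1]
    violated = (1.0 - psi_t[a]) * (1.0 - psi_t[b])  # P(rule b <- not a is violated)
    return -torch.log(1.0 - violated + eps).mean()  # eps only for numerical stability

def total_loss(psi, rules, lam=0.95):
    # Equation (4). psi: tensor of shape (T, num_atoms), one readout per
    # message passing round; lam is the discount factor (a placeholder value).
    T = psi.shape[0]
    return sum(lam ** (T - t) * constraint_loss(psi[t - 1], rules)
               for t in range(1, T + 1))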
    The main difference between RUNCSP and our neural network lies in the
loss function: we are not looking for an independent set, as in the RUNCSP
Maximal Independent Set experiment [15], but for a kernel of the graph. What
we are missing is enforcing the supportedness of atoms (in place of the "size" of
the candidate answer set, as in the original approach). In some preliminary ex-
periments we observed that adding extra terms to the loss function makes the
results of training unstable, hence we proceed as follows. Let ψ(P) ∈ [0, 1]^n be
the output of our neural network on a logic program P; its i-th component rep-
resents the probability that the i-th atom belongs to the candidate answer set.
We denote such probability by p_i. Thus, we are interested in computing a real
vector S ∈ [0, 1]^n where s_i represents the probability that the i-th atom is
supported by an atom outside the answer set.
    We can express such computation as a message passing operation:
                                 s_i = 1 − ∏_{k ∈ N(v_i)} p_k                        (5)

    The product term expresses the probability that all the neighbours of v_i
are inside the answer set, that is, unable to support v_i. Hence its complement
represents the probability that at least one of its neighbours is outside the answer
set.
    Such a computation is based solely on ψ(P) and does not involve any learning.
The product s_i · p_i represents the probability that v_i belongs to the answer set
and at least one of its neighbours does not belong to the answer set, that is,
that it is supported. We train unsupervisedly on the constraint loss, while we enforce
supportedness by modulating ψ(P) via S, using their coordinate-wise product
for our predictions. Intuitively, we bias the probability of choosing atoms by the
likelihood of them being supported.
    In practice, the notion of atom supportedness in an answer set does not take
into account by “how many” rules an atom is supported, hence the above com-
putation is biased towards atoms with a high number of neighbours. For ex-
ample, suppose v is an atom with 3 neighbours, and each one of them has a
probability p = 0.5 of being outside the answer set. Then, by the above equation,
s_v = 1 − 0.5^3 = 0.875. In order to solve this issue, we compute S as follows:
                               s_i = max_{k ∈ N(v_i)} (1 − p_k)

   That is, we compute the likelihood of v_i being supported as the maximum
probability of one of its neighbours being outside the answer set. In the above
numerical example, the probability of v being supported would be only 0.5
rather than 0.875.
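    Both variants of the support score, the product form of Equation (5) and the max form actually used, are a purely post-hoc computation on the network's output; the plain Python sketch below, with a hypothetical adjacency-list encoding, also reproduces the numerical example above.

from math import prod

def support_product(p, supporters):
    # Equation (5): s_i = 1 - product of p_k over the potential supporters of
    # v_i, i.e. the atoms b such that a_i <- not b is a rule of the program.
    return [1.0 - prod(p[k] for k in nbrs) for nbrs in supporters]

def support_max(p, supporters):
    # Max variant: the most likely single supporting atom outside the answer set.
    return [max((1.0 - p[k] for k in nbrs), default=0.0) for nbrs in supporters]

# Atom 0 has three potential supporters, each outside the candidate answer
# set with probability 0.5 (i.e. p = 0.5 of being inside).
p = [0.9, 0.5, 0.5, 0.5]
supporters = [[1, 2, 3], [], [], []]
print(support_product(p, supporters)[0])  # 0.875
print(support_max(p, supporters)[0])      # 0.5
# Final prediction for atom i: the coordinate-wise product p[i] * s[i].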


5     Experiment

5.1   Hyperparameters and training

The network is trained on a random stream of Erdős-Rényi [5] graphs with 20 to
50 nodes and average degree between 2.0 and 5.0. The number of nodes and the
average degree are sampled independently and uniformly for each graph required
during the network's training.
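    The following sketch shows how such a stream could be generated with networkx; the conversion of the sampled average degree into an edge probability (interpreting it as the expected out-degree, p = d/(n − 1)) is our own assumption, and the exact sampling procedure of the original experiments may differ.

import random
import networkx as nx

def training_stream(seed=0):
    # Infinite stream of random directed graphs, i.e. N2LP training instances.
    # Node count uniform in [20, 50], average degree uniform in [2.0, 5.0].
    rng = random.Random(seed)
    while True:
        n = rng.randint(20, 50)
        avg_degree = rng.uniform(2.0, 5.0)
        p = avg_degree / (n - 1)  # expected out-degree approximately avg_degree
        yield nx.gnp_random_graph(n, p, directed=True, seed=rng.randint(0, 2**31 - 1))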
    All the neural network's hyperparameters (batch size, inference message pass-
ing rounds, loss discount factor and training message passing rounds) have been
kept the same as described in the reproducibility section of the original RUNCSP
article [15], with the exception of the learning rate and the node embedding size,
which have been set to 2 · 10^{-5} and 256, respectively.
    The neural network is trained with stochastic weight averaging and an early-
stopping policy based on the loss on a validation set composed of Erdős-Rényi
graphs of 150 nodes with average degree spanning from 2.0 to 12.0. The reported
experimental results were obtained with the model that performed best on the
fixed validation set across all epochs.


5.2   Evaluation

Our neural network is a binary node classifier that, given a negative two-literal
logic program P, maps its atoms to their probability of being contained in a
candidate answer set S'. We can obtain S' by choosing the nodes above a threshold t,
which we fix to 0.5 in our experiments. If P is a consistent program we can compare
S' to an answer set S of P and compute binary classification metrics such as accuracy
and F1 score, in order to evaluate "how far", in terms of misclassified atoms, our
candidate answer set S' is from the real answer set S. P might have multiple
answer sets, and we compare our solution with the closest answer set (w.r.t.
the cardinality of the symmetric set difference) among those obtained by running
the ASP system clingo. Hence, to evaluate our neural network on a consistent
program P we proceed as follows (a sketch of this pipeline is given after the list):

1. Build the graph G(P).
2. Evaluate the neural network on G(P) to obtain S'.
3. Compute the answer set S closest to S'.
4. Compute accuracy and F1 score for the predictions S' with respect to S.
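    The pipeline can be sketched as follows; the enumeration of the answer sets of P (e.g., with clingo) is left abstract, and the metrics are computed with scikit-learn for brevity.

from sklearn.metrics import accuracy_score, f1_score

def evaluate(atoms, probabilities, answer_sets, threshold=0.5):
    # atoms: ordered list of atoms of P; probabilities: the network's output on
    # G(P), one probability per atom; answer_sets: the answer sets of P, e.g.
    # enumerated with an ASP system such as clingo (not shown here).
    S_pred = {a for a, pr in zip(atoms, probabilities) if pr > threshold}
    # Closest answer set: minimal symmetric difference with the prediction.
    S = min(answer_sets, key=lambda X: len(set(X) ^ S_pred))
    y_pred = [int(a in S_pred) for a in atoms]
    y_true = [int(a in S) for a in atoms]
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)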
    We evaluate our network on random graphs of 150 nodes and average degree
spanning from 2.0 to 9.0. Note that in this interval one should expect to find
hard N2LP programs [11]. Each measure in Table 1 is obtained by averaging
binary classification metrics over a sample of 1000 random programs of the given
average degree, following the steps described above. We report our results in
terms of F1 score and accuracy.
    The Neural columns report the results of the experiment using our neural
network, while the Random columns report the results of the experiment in which
each atom is inserted in the answer set with probability 0.5.
    Our approach outperforms balanced random choices and is very accurate on
low-density graphs; however, the performance degrades as the underlying graphs
become denser, that is, when the logic programs have more rules.


6   Conclusion and future work
In this paper we proposed the first graph neural network approach for ASP pro-
grams. The approach builds on the ideas underlying the RUNCSP architecture and
demonstrates a promising behaviour on random programs. Indeed, it is
very accurate on low-density graphs, and thus might already be considered a
useful tool in this setting. However, the network needs improvements to be used
in more general settings. In the future we plan to investigate ways of improving
the performance of the network on "dense" graphs, and we will study how to
apply it to improve the performance of ASP solvers (e.g., using our approach as
a heuristic).


     Table 1. Results of the experiment on Erdős-Rényi graphs of 150 nodes.

                   Avg. Degree      F1 Score           Accuracy

                                 Neural  Random     Neural  Random
                       2.0       0.996   0.372      0.996   0.473
                       2.5       0.982   0.379      0.978   0.452
                       3.0       0.974   0.381      0.967   0.432
                       3.5       0.949   0.386      0.932   0.419
                       4.0       0.927   0.392      0.899   0.412
                       4.5       0.906   0.393      0.868   0.401
                       5.0       0.890   0.396      0.842   0.394
                       5.5       0.862   0.395      0.811   0.386
                       6.0       0.837   0.397      0.787   0.379
                       6.5       0.795   0.396      0.748   0.371
                       7.0       0.749   0.401      0.709   0.370
                       7.5       0.667   0.402      0.651   0.365
                       8.0       0.622   0.404      0.615   0.363
                       8.5       0.549   0.404      0.564   0.359
                       9.0       0.498   0.405      0.527   0.356
Acknowledgments. We would like to thank the PyTorch Geometric Tutorial com-
munity [9]: without it, all of this would have been far less pleasant than it was.
A special thanks to Prof. Davide Bacciu for the discussions on RUNCSP and for
the hints in devising graph neural networks.

References
 1. Bacciu, D., Errica, F., Micheli, A., Podda, M.: A gentle introduction to deep learn-
    ing for graphs. Neural Networks (2020)
 2. Brewka, G., Eiter, T., Truszczynski, M.: Answer set programming at a glance.
    Commun. ACM 54(12), 92–103 (2011)
 3. Cappart, Q., Chételat, D., Khalil, E., Lodi, A., Morris, C., Veličković, P.: Com-
    binatorial optimization and reasoning with graph neural networks. arXiv preprint
    arXiv:2102.09544 (2021)
 4. Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., Murphy, K.: Machine learning on
    graphs: A model and comprehensive taxonomy. arXiv preprint arXiv:2005.03675
    (2020)
 5. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst.
    Hung. Acad. Sci. 5(1), 17–60 (1960)
 6. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message
    passing for quantum chemistry. In: International conference on machine learning.
    pp. 1263–1272. PMLR (2017)
 7. Hamilton, W.L.: Graph representation learning. Synthesis Lectures on Artifical
    Intelligence and Machine Learning 14(3), 1–159 (2020)
 8. Huang, G.S., Jia, X., Liau, C.J., You, J.H.: Two-literal logic programs and sat-
    isfiability representation of stable models: A comparison. In: Conference of the
    Canadian Society for Computational Studies of Intelligence. pp. 119–131. Springer
    (2002)
 9. Longa, A., Santin, G., Pellegrini, G.: Pytorch geometric tutorial (2020)
10. Marek, W., Truszczyński, M.: Autoepistemic logic. Journal of the ACM (JACM)
    38(3), 587–618 (1991)
11. Namasivayam, G., Truszczyński, M.: Simple random logic programs. In: Inter-
    national Conference on Logic Programming and Nonmonotonic Reasoning. pp.
    223–235. Springer (2009)
12. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph
    neural network model. IEEE Trans. Neural Networks 20(1), 61–80 (2009).
    https://doi.org/10.1109/TNN.2008.2005605
13. Selsam, D., Bjørner, N.: Neurocore: Guiding high-performance sat solvers with
    unsat-core predictions (2019)
14. Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., Dill, D.L.: Learning a
    SAT solver from single-bit supervision. In: 7th International Conference on Learn-
    ing Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenRe-
    view.net (2019), https://openreview.net/forum?id=HJMC_iA5tm
15. Toenshoff, J., Ritzert, M., Wolf, H., Grohe, M.: Run-csp: unsupervised learning of
    message passing networks for binary constraint satisfaction problems (2019)
16. Wang, K., Wen, L., Mu, K.: Random logic programs: linear model. Theory and
    Practice of Logic Programming 15(6), 818–853 (2015)
17. Yu, Y., Si, X., Hu, C., Zhang, J.: A Review of Recurrent Neural Networks: LSTM
    Cells and Network Architectures. Neural Computation 31(7), 1235–1270 (2019)