Constrained optimisation over massive databases

Toni Mancini1, Pierre Flener2, Amir Hossein Monshi2, and Justin Pearson2

1 Dipartimento di Informatica, Sapienza Università di Roma, Italy
tmancini@di.uniroma1.it
2 Department of Information Technology, Uppsala University, Sweden
Pierre.Flener,Justin.Pearson@it.uu.se; AmirHossein.Monshi.1351@student.uu.se

Abstract

Constrained optimisation is increasingly considered by industry as a major candidate for coping with hard problems of practical relevance. However, exploiting current Constraint or Mathematical Programming solvers in those scenarios has severe limitations, which clearly demand new methods: data is usually stored in potentially very large databases, and building a problem model in central memory suitable for current solvers may be very challenging or impossible. In this paper, we extend the approach followed in [3] by presenting a declarative language for constrained optimisation based on sql, together with novel techniques for local-search algorithms explicitly designed to handle massive data-sets. We also discuss and experiment with a solver implementation that, working on top of any DBMS, exploits such algorithms transparently to the user, allowing smooth integration of constrained optimisation into industrial environments.

1 Introduction

Constrained combinatorial optimisation is increasingly considered by industry as a major candidate for coping with hard problems of practical relevance, like scheduling, rostering, resource allocation, security, bio-informatics, etc.
However, exploiting current Constraint Programming (CP) or Mathematical Programming (MP) solvers in those scenarios has severe limitations, which clearly demand new methods:

• Problems of industrial relevance may easily exhibit sizes that are way out of the reach of current solvers: data usually resides on information systems, in the form of potentially very large, possibly distributed relational databases with complex integrity constraints, and cannot easily be transferred to central memory for processing.

• Moving data into central memory also potentially affects data integrity. Solving very large combinatorial problems is likely to need a huge amount of computing time, during which the original data must be kept locked, thus freezing the normal activities of the information system.

• In several scenarios, the structure of both the data and the problem is very articulated and complex. Although it is always possible to map them into the constructs offered by current constraint modelling languages, this further step must be carefully designed and implemented, being a real engineering activity. This is usually perceived as a major obstacle by industry programmers, who usually do not have CP/MP and abstraction skills.

Proceedings of the 16th International RCRA workshop (RCRA 2009): Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion. Reggio Emilia, Italy, 11–12 December 2009.

Several attempts to join combinatorial optimisation and databases have already been carried out in the literature, especially in the contexts of Deductive Database Systems (see, e.g., [9]) and Logic Programming (see, e.g., the system DLVDB described in [11]). As a different approach to cope with these issues and to provide effective means to spread combinatorial optimisation into industrial environments, in previous work [3] a non-deterministic extension of Relational Algebra (RA) has been proposed.
This language, called NP-Alg, is able to express all problems in the complexity class NP, where (the decision versions of) many problems of practical relevance lie. NP-Alg has been the formal basis for deriving an extension of the well-known database querying language sql, with similar characteristics, but also able to handle optimisation problems. This second language, called ConSQL (i.e., sql with constraints), proposes a new paradigm for modelling constrained optimisation problems, that of modelling as querying: a problem specification is given as a set of views devoted to encoding solutions, together with constraints that the contents of such views must satisfy. Of course, since we face optimisation problems whose decision versions belong to the complexity class NP, the “views” we are talking about are not first-order definable, hence not expressible in standard sql.

The advantages of this approach wrt the ones above are two-fold: (i) the modelling language is essentially sql, plus a few new and very intuitive constructs: industrial programmers do not have to master CP/MP to formalise their problems; (ii) the paradigm allows solving approaches (like the one we consider in this paper) that interact transparently with the DBMS and its mechanisms for data transfers, access policies, and data integrity, since the solving process, which is seen as a normal transaction by the DBMS, may easily recover from data updates.

In this paper, after recalling NP-Alg (Section 2), we make the following contributions:

1. We re-design our ConSQL language in order to obtain a language with very few deviations from standard sql (Section 3).

2. We reconsider the local-search [6] approach suggested in [3], and argue that, to succeed in solving large problems, classical local-search has to be re-thought, with an explicit focus on the handling of massive data.
In particular, we isolate two major issues in classical local-search and propose two methods to boost scalability: dynamic neighbourhood reduction (DNR) and incremental evaluation of constraints (IEC) (Section 4).

3. We provide a new implementation of the system, based on the ideas above and on a much wider portfolio of local-search algorithms, with full support for declaratively choosing the search strategy to follow, and experimentally evaluate it against the original one in terms of performance and scalability (Section 5).

2 The language NP-Alg

NP-Alg [3] is a non-deterministic extension of RA, suitable for modelling combinatorial decision problems.

Syntax. Let R be a relational schema, i.e., a set of relations of arbitrary finite arities. An NP-Alg expression is a pair ⟨S, fail⟩ where:

1. S = {S_1^(a_1), …, S_n^(a_n)} is a set of new relations of arbitrary finite arities a_1, …, a_n. These relations are called “guessed”. R and S are disjoint.

2. fail is an expression of plain RA on the new database vocabulary R ∪ S.

Semantics. Given a finite database D over schema R, let DOM_D be the unary relation representing the set of all constants occurring in D. The semantics of an NP-Alg expression ⟨S, fail⟩ is as follows:

1. For each possible extension of the relations in S with elements in DOM_D, the expression fail is evaluated, using the ordinary rules of RA.

2. If there exists one such extension for which fail evaluates to the empty relation ∅ (a situation denoted by fail = ∅), the answer to the decision problem is “yes”. Otherwise, the answer is “no”. When the answer is “yes”, such an extension of the relations in S is a solution of the problem instance.

Intuitively, fail encodes the constraints that extensions of S need to satisfy in order to be considered solutions. These are represented in terms of violations, hence the goal is to guess an extension of S that makes fail = ∅ (a dual approach, where constraints are defined in a ‘positive’ way, can however be followed).
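The guess-and-check semantics above can be made concrete with a small brute-force evaluator. The following Python sketch is a hypothetical toy illustration (useful only on tiny instances; scaling beyond them is exactly what Section 4 is about): it enumerates every extension of the unary guessed relations over DOM_D and accepts the first one on which fail evaluates to the empty relation. The instance data (N, E) and the fail expression mirror the Graph 3-colouring encoding of Example 1 below.

```python
from itertools import combinations, product

def powerset(dom):
    """All subsets of dom: the possible extensions of a unary guessed relation."""
    return [frozenset(c) for r in range(len(dom) + 1)
            for c in combinations(dom, r)]

def solve(dom, fail, n_guessed):
    """Semantics of an NP-Alg expression <S, fail>: try every extension of the
    unary guessed relations S1..Sn over DOM_D; the answer is "yes" iff some
    extension makes fail evaluate to the empty relation."""
    for ext in product(powerset(dom), repeat=n_guessed):
        if not fail(*ext):          # fail = empty set: ext is a solution
            return ext
    return None

# Example 1 (Graph 3-colouring): guessed relations R, G, B; fail as in the text.
N = {1, 2, 3}
E = {(1, 2), (2, 3)}

def fail_3col(R, G, B):
    cover = N - (R | G | B)                              # fail_cover
    disj = (R & G) | (R & B) | (G & B)                   # fail_disj
    col = {(u, v) for C in (R, G, B)
           for (u, v) in E if u in C and v in C}         # fail_col, joined with E
    return cover | disj | col

sol = solve(sorted(N), fail_3col, 3)
```

On this tiny graph the evaluator finds a proper 3-colouring; note that the enumeration is exponential in |DOM_D|, which is precisely why the paper turns to local search.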
NP-Alg has strong relationships with existential second-order logic (ESO) [8] and first-order model-expansion (FO-MX) [7], with guessed relations playing the role of quantified predicates in ESO and of expansion vocabulary symbols in FO-MX. It can be shown [3] that NP-Alg can express all (and only) the queries that specify decision problems belonging to NP.

In the following, we assume the standard version of RA [1] with the usual operators {σ, π, ×, −, ∪, ∩}, and positional notation ($1, $2, etc.) for the attributes of the various relations. Also, given a tuple τ with n components, we denote its i-th component by τ[$i], i ∈ [1..n], and the sub-tuple obtained from τ by considering only the components with indices from i to j (i ≤ j) by τ[$i..$j]. Finally, to ease notation, especially when expressing conditions over tuples defined over the Cartesian product of some relations, we may denote tuple indices by the name of the original relation they come from, followed by the relative column number. Hence, e.g., σ_{$1=$3}(R1 × R2) is equivalent to σ_{R1.$1=R2.$1}(R1 × R2) if R1 is binary.

Example 1 (Graph 3-colouring). Let D be a database with two relations N and E, encoding a graph: N is unary and encodes the set of nodes, while E is binary and encodes the set of edges of the graph, as pairs of nodes. Hence DOM_D = N. The NP-complete Graph 3-colouring problem [5] can be expressed in NP-Alg by the following expression:

1. S = {R, G, B}, all unary, encoding the sets of nodes to be coloured in red, green, and blue, respectively;

2. fail = fail_cover ∪ fail_disj ∪ π_{$1}(fail_col), where:

   fail_cover = N − (R ∪ G ∪ B),
   fail_disj = (R ∩ G) ∪ (R ∩ B) ∪ (G ∩ B),
   fail_col = ⋃_{C ∈ {R,G,B}} σ_{$1≠$2}(C × C) ∩ E.

For any guessed extension of R, G, B, fail_cover = ∅ iff every node is assigned at least one colour; similarly, fail_disj = ∅ iff every node is assigned at most one colour, and fail_col = ∅ iff no two adjacent nodes are assigned the same colour.
Hence, the answer is “yes” (fail = ∅ for some extension) iff there exist extensions for R, G, B encoding an assignment of colours to nodes such that no two adjacent nodes have the same colour. Such extensions are solutions to the problem instance.

3 ConSQL: modelling as querying

In this section we present a new and improved version of our ConSQL language [3]. ConSQL is a non-deterministic extension of sql, whose optimisation-free subset has the same expressive power as NP-Alg. ConSQL is a super-set of sql: users can hence safely exploit the rich set of language features of sql during problem modelling. The current version of the language has only very few deviations from standard sql, with the aim of greatly easing the modelling task of constrained optimisation problems for the average industry programmer. A detailed description of the syntax and semantics of ConSQL is given in the appendix (available at http://www.dis.uniroma1.it/~tmancini). Here, we introduce the language by an example.

Example 2 (University timetabling). Assume a university wants to solve the following course timetabling problem (a variant of the ones presented in [4], in order to show the flexibility of the modelling language), namely finding a schedule for the lectures of a set of courses, in a set of classrooms, minimising the total number of students that are enrolled in courses having overlapping lectures, while satisfying constraints about lecture allocation in time and rooms (con1, con2, con6), room equipment (con3), teachers' unavailability (con4) and conflicts (con7), and course lengths (con5).

Data reside in a relational database with the tables listed in Table 1 (primary keys are underlined; foreign key constraints are omitted for brevity). A ConSQL specification for this problem is as follows:

create SPECIFICATION Timetabling (

  // A guessed view (see new construct CHOOSE), encoding a 'guessed' timetable, as an
  // assignment of courses to room/period pairs (with
some possibly unassigned, i.e., null)
  create view TT as
    select p.id as p, r.id as r,
           CHOOSE(select id as c from Course) CAN BE NULL
    from Period p, Room r

  // A view defined in terms of TT (dependent guessed view)
  create view ConflCourses as
    select c1.id as c1, c2.id as c2
    from Course c1, Course c2, TT t1, TT t2
    where c1.id < c2.id and t1.c = c1.id and t2.c = c2.id and t1.p = t2.p

  // Objective function, relying on view ConflCourses
  MINIMIZE select count(*)
           from Student s, ConflCourses cc, Enrolled e1, Enrolled e2
           where e1.student = s.id and e1.course = cc.c1
             and e2.student = s.id and e2.course = cc.c2

Student(id, name, …)              Set of students, with id and other attributes
Teacher(id, name, …)              Set of teachers
Course(id, teacher, num_lect)     Set of courses to be scheduled, with teacher and number of lectures planned
Period(id, day, week, start, end) Set of time-periods, plus information on day, start and finish times
Room(id, capacity)                Available rooms with their capacity
Enrolled(student, course)         Enrolment of students in the various courses
Equip(id, name)                   Equipment available for teaching (e.g., VGA projector)
EquipAvail(equip, room)           Presence of equipment in the various rooms
EquipNeeded(course, equip)        Equipment needed for the various courses
TeacherUnav(teacher, period)      Unavailability constraints for teachers: lectures of their courses cannot be scheduled in these periods

Table 1: Relational database schema for the University timetabling problem.

  // Constraints:
  // con1. No two lectures of the same course on the same day
  check "con1" (not exists (
    select * from TT t1, TT t2, Period p1, Period p2
    where t1.c = t2.c and t1.c is not null and t1.p = p1.id and t2.p = p2.id
      and p1.day = p2.day and t1.p != t2.p ))

  // con2.
  // Capacity constraint, in terms of an ordinary SQL view
  create view Audience as
    select e.course as c, count(*) as nb_stud
    from Enrolled e group by e.course

  check "con2" ( not exists (
    select * from TT t, Room r, Audience a
    where t.r = r.id and t.c = a.c and r.capacity < a.nb_stud ))

  // con3–con5 (room equipment, teachers' unavailability, course lengths): …

  // con6. At most 3 lectures of the same course per week
  check "con6" (3 >= all(
    select count(*) from TT t, Period p
    where t.p = p.id and t.c is not null group by t.c, p.week ))

  // con7. Courses taught by the same teacher not in the same period
  check "con7" ( not exists (
    select * from TT tt1, TT tt2, Course c1, Course c2
    where c1.id < c2.id and c1.teacher = c2.teacher
      and tt1.c = c1.id and tt2.c = c2.id and tt1.p = tt2.p ))
);

As can be seen, a problem specification is defined in terms of the new construct create SPECIFICATION, embedding the definition of a set of views (via the standard sql create view construct), an optional objective function (via the new constructs MINIMIZE and MAXIMIZE), and a set of constraints (via the standard sql check keyword). Some of the views are guessed, in that they have special columns, called guessed column-sets. Such columns are defined by the new construct CHOOSE, which is the main enabler of the non-deterministic mechanism added to sql in order to increase its expressive power. The semantics is as follows:

• The solver is asked to populate non-deterministically the guessed column-sets of views, choosing tuples from the query argument of CHOOSE.

• This choice must be such that all constraints are satisfied and the objective function, if present, takes an optimal value.

In the example, our goal is to assign the lectures of some course to room/period pairs (guessed view TT), with some pairs possibly being empty (CAN BE NULL), in such a way that the number of students enrolled in courses with conflicting lectures is minimised and that all the constraints con1–con7 are satisfied.

p  r  c        p  r  c
p1 ra ?        p1 ra c3
p1 rb ?        p1 rb c1
p2 ra ?        p2 ra -
p2 rb ?        p2 rb c2
p3 ra ?        p3 ra -
p3 rb ?        p3 rb c2
   (a)            (b)

Figure 1: (a) Initialisation of guessed view TT on a sample database with 3 periods and 2 rooms. (b) A possible extension of TT if the database has ≥ 3 courses ('-' denotes null).

The 'residual' (i.e., the pure sql) part of view TT has schema (p, r), storing all period/room pairs. This is extended 'to the right' with a guessed column-set of width 1 having the schema (c) of the argument query. The overall view has schema (p, r, c). While the values in the first two columns are fixed (the period/room pairs), the entries of the third column are non-deterministically picked from the set of values occurring as ids of courses. Some of them can be assigned null, allowing us to model situations, like ours, where we need partial functions. Figure 1(a) shows guessed view TT at the initialisation stage, when the database holds three periods with ids p1, p2, p3, and two rooms with ids ra and rb. Figure 1(b) shows one possible extension of the guessed view, when the database stores courses having ids c1, c2, c3 (plus, possibly, others). Note that, if the modifier distinct were used, it would be impossible for two period/room pairs to be assigned the same course. This is analogous to the all-different constraint used in CP.

Extensions of guessed views need to satisfy all constraints in order to be considered solutions to the problem. Constraints are expressed by using the standard sql keyword check applied to an arbitrary sql Boolean expression (the various forms allowed by standard SQL to specify conditions, hence constraints, are described in the appendix). Finally, an objective function can be defined by using one of the two new keywords MINIMIZE and MAXIMIZE, applied to an arbitrary aggregate query returning a single scalar, defined over database tables, ordinary views, and guessed views.
4 Solving NP-Alg expressions over massive data

In principle, in order to solve a constrained optimisation problem, different competing techniques could be adopted, e.g., CP, Propositional Satisfiability (SAT), or MP. In the presence of massive data-sets, such complete approaches may show severe limitations, since they require the systematic exploration of a huge search space. In these cases, incomplete approaches like local-search [6] may help. These algorithms sacrifice completeness for efficiency, and may be advantageous especially on large instances, and in scenarios where computing a solution of sufficiently high quality is satisfactory. Local-search has been applied successfully to different formalisms, like SAT (cf. bgwalksat [13]) and CP (cf., e.g., comet [12]), and to the resolution of large scheduling and timetabling problems [10] and other large constrained optimisation problems [12]. In this section, after discussing some convenient syntactic extensions to NP-Alg, and recalling the main notions of classical local-search, we present novel techniques to make local-search applicable to massive data-sets.

NP-Alg revisited. The language described in Section 2 can be extended, without affecting its expressive power, with means to support bounded integers and arithmetic, typed guessed relations, guessed functions, and guessed permutations. As an example, although in the base version of the language the entries of guessed relations take values ranging in DOM_D (hence extensions of an r-ary relation range over subsets of DOM_D^r), in a (just syntactically) enriched version we could easily force them to be typed and also to model functions [3]. In this way we could easily model, e.g., the Graph k-colouring problem, where the set of the k available colours is given in the additional unary relation K, as follows:

Example 3 (Graph k-colouring).

1. Guessed relation: Col : N → K, a total function.

2. fail = σ_{$1≠$3 ∧ $2=$4}(Col × Col) ∩ E.

Guessed function Col is represented as a relation having two typed columns, $1 and $2, with entries in $1 ranging over N and those in $2 ranging over K. Being also forced to be a total function, Col is such that any node in N occurs exactly once in the first column of Col. The expression for fail evaluates to ∅ iff the guessed extension of Col is such that no two different nodes ($1 ≠ $3) assigned to the same colour ($2 = $4) share an edge.

It can be shown that the enriched language where guessed relations are always forced to be typed total functions between two unary database relations has the same expressive power as the base language. This is because: (i) tuples of non-unary relations can always be denoted by new constants which act as keys; and (ii) arbitrary guessed relations can always be encoded by characteristic functions. Also, the expression fail can always be rewritten in disjunctive form, as a union of sub-expressions. Since this variation is syntactically much closer to our target practical language ConSQL, in the remainder of this paper we will refer exclusively to it. Hence, from now on, we rely on the following definition:

Definition 1 (NP-Alg expression). An NP-Alg expression ⟨S, fail⟩ over a relational schema R is defined as:

1. A set S = {S_1, …, S_n} of new binary relations, with each S_i : D_i → C_i forced to be a total function with domain D_i and co-domain C_i, both these relations being unary and belonging to R.

2. An expression fail = ⋃_{i=1}^{k} fail_i of plain RA on the vocabulary R ∪ S.

The expressions fail_i encode the constraints to satisfy, with their tuples, for a given extension of the guessed functions, representing constraint violations.

The gap between NP-Alg and ConSQL is now much smaller (see Example 3): guessed functions in NP-Alg neatly correspond to guessed column-sets of ConSQL, and the expressions fail_i correspond to check not exists constraints. Extensions like having multiple guessed column-sets in the same view, null values, and non-unary relations serving as domains and co-domains are not substantial, and could be simulated by means of additional joins, special values in the co-domain relations, and the introduction of unary key fields.

Classical local-search. The following notions play key roles in all local-search algorithms. We recall them in terms of an NP-Alg expression ⟨S, fail⟩ evaluated over a finite database over schema R.

Definition 2 (States, search space, costs, solutions). A state, or candidate solution, is an extension S of the guessed functions S. The search space is the set of all possible states. The cost of a state S is Σ_{i=1}^{k} |fail_i|, i.e., the total number of tuples returned by the sub-expressions fail_i when evaluated on S. A solution is a state of cost 0.

Most of the local-search algorithms described in the literature [6] have, at their core, the following greedy behaviour:

  current-state ← Init-State()
  while not Termination-Condition() do
    move ← choose-improving-move(current-state)
    if move ≠ nil then current-state ← make-move(move)
    else break

The idea is, after a (usually random) initialisation, to iteratively evaluate and perform small changes (moves) to the current state, in order to reduce the cost of the current candidate solution, until some termination condition (in terms of, e.g., the cost of current-state or the amount of computational resources used) is met. The universe of possible moves is chosen in advance by the programmer from the structure of the problem. Although this choice could in principle be arbitrary, in local-search moves are mostly defined to involve a very small change to the current state. The simple structure of guessed functions in NP-Alg suggests a very natural definition of atomic moves, while any other possible definition could be given just by composition.

Definition 3 (Move). A triple ⟨S, d, c⟩ with S ∈ S (S : D → C), d ∈ D, c ∈ C.
Hence, given any state S of S and the corresponding extension S of guessed function S, a move δ = ⟨S, d, c⟩ changes the extension S of S by modifying to c the co-domain value assigned to domain value d. We denote the state reached from S after performing δ by S ⊕ δ. The new extension of S is denoted by S ⊕ δ. A move δ executed on state S is improving, neutral, or worsening iff the cost of state S ⊕ δ is, respectively, less than, equal to, or greater than the cost of state S. Also, given a constraint i, move δ executed on S is improving, neutral, or worsening wrt constraint i iff the share of the cost of state S ⊕ δ due to constraint i (i.e., |fail_i|) is, respectively, less than, equal to, or greater than that of state S.

The choice of the moves to perform usually varies, depending on the particular algorithm considered. As an example, gradient descent chooses an improving move randomly, while steepest descent chooses the move that, if performed, maximally reduces the cost of the new current state. Since steepest descent needs to consider all possible moves in order to choose the best one, it can be very inefficient on large-scale problems. A third very successful algorithm is min-conflicts, which, once adapted to the structure of the search space in the NP-Alg framework, randomly selects one guessed function S : D → C and one of its domain values d ∈ D, and then chooses the best improving move among all those regarding S and d. All these greedy algorithms return nil if no moves lead to a cost reduction, and hence are easily trapped in so-called local minima, i.e., states that have cost lower than all their neighbours (the states that could be reached by executing a single move). To avoid being stuck in local minima, and to be able to continue the search towards better states, these basic algorithms are usually enhanced with more sophisticated (not purely greedy) techniques.
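The greedy core recalled above, instantiated for the NP-Alg search space (a state is a total function D → C, a move reassigns a single domain value, cf. Definition 3), can be sketched in a few lines of Python. This is a hypothetical illustration in the min-conflicts flavour, not the system's implementation; the toy instance encodes graph 3-colouring as a guessed function Col : N → K, as in Example 3.

```python
import random

def local_search(domain, codomain, cost, max_steps=10_000, seed=0):
    """Greedy local search over a single guessed function S: domain -> codomain.
    A state is a dict; a move <S, d, c> reassigns one domain value.
    Min-conflicts flavour: pick a random d, take the best move regarding d."""
    rng = random.Random(seed)
    state = {d: rng.choice(codomain) for d in domain}   # random initialisation
    for _ in range(max_steps):
        if cost(state) == 0:                            # solution found
            return state
        d = rng.choice(domain)
        best = min(codomain, key=lambda c: cost({**state, d: c}))
        if cost({**state, d: best}) <= cost(state):     # accept non-worsening moves
            state[d] = best
    return state if cost(state) == 0 else None

# Toy instance: 3-colouring of a triangle plus a pendant node.
E = [(1, 2), (2, 3), (1, 3), (3, 4)]
def n_conflicts(col):              # the cost |fail|: monochromatic edges
    return sum(col[u] == col[v] for (u, v) in E)

sol = local_search([1, 2, 3, 4], ["red", "green", "blue"], n_conflicts)
```

Accepting neutral (sideways) moves lets the search walk across plateaus, which is one simple way to escape the local minima discussed above; the more robust escapes (annealing, tabu, restarts) follow in the text.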
Simulated annealing, for example, also executes worsening moves, with a probability decreasing over time, while tabu-search dynamically filters the set of available moves, by forbidding those that would bring the search back to recently visited states. Also, to increase robustness, such algorithms are enhanced with random restarts, joined together in batches, and/or wrapped by even more sophisticated meta-heuristics [6].

Classical local-search algorithms typically use the problem constraints only to evaluate the quality of a move, i.e., its impact on the cost of the search state reached if the move is performed. In the context of constrained optimisation over massive data, this approach quickly becomes impractical.

Example 4 (University timetabling, continued). Assume that the database relations have sizes |Room| = r, |Period| = p, |Course| = c. By Definition 3, a steepest descent algorithm needs, at any step, to evaluate a constant number prc of candidate moves (each period/room pair can be reassigned to one of the other c − 1 courses, plus null) to find and choose the one that maximally reduces the cost of the new current state. With a naive approach, for any such move, 7 sql queries, one for each constraint, would have to be evaluated on the new database obtained by changing the course assigned to one room/period pair. Even in not-so-large scenarios, e.g., p = 960 (8 periods/day, 5 days/week for 6 months; observe that the schedule is not periodic, given, e.g., teachers' unavailabilities), r = 30, c = 40, the number of candidate moves to consider at each step is over 10^6. Now, assuming that each query takes only 10^-3 seconds to run, and even ignoring the time for modifying and rolling back the database each time, a naive version of steepest descent would need about 10^6 · 7 · 10^-3 = 7 · 10^3 seconds to execute one single move!

Constraint-directed dynamic neighbourhood reduction (DNR).
To overcome these difficulties, the concept of constraint-directed neighbourhoods has recently been proposed [2]. The idea is to use the problem constraints also to isolate subsets of the moves that possess some useful properties, such as those that are guaranteed to be improving wrt one or more constraints. These ideas can be exploited and extended in our context in a very elegant and natural way, thanks to the great simplicity of NP-Alg.

Consider an NP-Alg expression ⟨S, fail⟩, where fail = ⋃_{i=1}^{k} fail_i and all fail_i are conjunctive queries, i.e., of the form fail_i = σ_{φ_i}(S_{i1} × ··· × S_{is} × R_{i1} × ··· × R_{ir}), with S_{ij} ∈ S (j ∈ [1..s]) and R_{ip} ∈ R (p ∈ [1..r]). In ConSQL this is equivalent to saying that constraints can be expressed as not exists select-from-where queries, as is often the case. Let us now focus on a single constraint i. Assume that, in a given state S, fail_i evaluates to a non-empty set of tuples V_i (which can be regarded as the violations of constraint i). In order to find a solution we need to resolve all violations of all constraints. Let us limit our attention to the extension S ∈ S of a distinguished guessed function S (a binary relation encoding a function S : D → C, with D, C unary relations in R) occurring m ≥ 1 times in the expression for fail_i. To emphasise the m occurrences of S, we can, wlog, rewrite fail_i as follows (after also changing the references to columns in φ_i accordingly):

fail_i = σ_{φ_i}(S^(1) × ··· × S^(m) × T),    (1)

with S^(1), …, S^(m) being the m occurrences of S, and T the Cartesian product of all the relations other than S in the constraint expression. Since a move over S improving wrt constraint i must resolve at least one violation in V_i ⊆ S^(1) × ··· × S^(m) × T, we give the following definition:

Definition 4 (Promising moves).
The set of promising moves over guessed function S wrt constraint i is defined by the following RA query:

Prom^S_i = π_{D,C}( σ_{χ ∧ ¬φ'_i}( V_i × D × C ) )    (2)

where χ = ⋁_{j=1}^{m} (D.$1 = V_i.$(2j−1)), and φ'_i is obtained from φ_i by replacing any atom a in which S^(x).$2 occurs (x ∈ [1..m]) with

((S^(x).$1 = D.$1) → a') ∧ ((S^(x).$1 ≠ D.$1) → a),

where a' is obtained from a by replacing any occurrence of S^(x).$2 with C.$1.

Despite its apparent complexity, the meaning of formula (2) is quite intuitive. A generic tuple λ in V_i has the form ⟨d_1, c_1, …, d_m, c_m, t⟩, with d_1, …, d_m ∈ D, c_1, …, c_m ∈ C, and t a sub-tuple over schema T. Prom^S_i computes moves over S (hence pairs δ = ⟨d, c⟩ ∈ D × C, cf. the projection operator) such that:

(i) The domain value d of δ is equal to at least one among d_1, …, d_m of at least one violation λ ∈ V_i (condition χ); and

(ii) For such a λ, the tuple λ', built by replacing in λ any component c_j with the new value c iff the corresponding d_j is equal to d, does not satisfy condition φ_i.

Since in RA we cannot modify tuples when evaluating a condition, in (2) we take the alternative approach of changing the condition itself into φ'_i. Requirement (i) guarantees that the synthesised moves involve only domain values that appear in violations λ ∈ V_i, hence, if performed, resolve at least λ. Requirement (ii), instead, ensures that, after performing δ, the newly introduced tuple in S, i.e., δ, does not lead to a new violation λ', at least with tuple t. The following result holds (proofs omitted for space reasons):

Theorem 1. In any state S, for any constraint i, and for any S ∈ S having extension S in S, all moves δ over S that, if executed from S, are improving wrt constraint i belong to Prom^S_i.
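To make Definition 4 concrete, the following hypothetical Python sketch instantiates it for the single constraint of Example 3 (fail = σ_{$1≠$3 ∧ $2=$4}(Col × Col) ∩ E, so m = 2 and T = E): a move ⟨d, c⟩ is promising iff d occurs in some violation (condition χ) and re-colouring d with c falsifies the selection condition on that violation (condition ¬φ'). As Theorem 1 states, the set contains all improving moves but possibly also non-improving ones: below, ⟨2, "g"⟩ resolves the displayed violation yet creates a new one on edge (2, 3).

```python
def violations(col, E):
    """fail = sigma_{$1!=$3 and $2=$4}(Col x Col), joined with E: the
    violations, as tuples <u, col(u), v, col(v)> (the two occurrences of Col)."""
    return [(u, col[u], v, col[v]) for (u, v) in E if u != v and col[u] == col[v]]

def promising_moves(col, E, colours):
    """Prom over Col: moves <d, c> whose domain value occurs in some
    violation (chi) and that falsify the condition on it (not phi')."""
    prom = set()
    for (u, cu, v, cv) in violations(col, E):
        for d in (u, v):                   # chi: d matches the $1 of an occurrence
            for c in colours:
                # phi' on this violation, with Col(d) replaced by c:
                new_cu = c if u == d else cu
                new_cv = c if v == d else cv
                if new_cu != new_cv:       # not phi': the violation is resolved
                    prom.add((d, c))
    return prom

E = [(1, 2), (2, 3)]
col = {1: "r", 2: "r", 3: "g"}             # one violation: edge (1, 2)
moves = promising_moves(col, E, ["r", "g", "b"])
```

Only 4 of the 12 possible moves survive: re-colouring node 1 or node 2 away from "r". A steepest-descent step now inspects this small set instead of the full neighbourhood.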
In other words, since a move is improving iff it is improving wrt at least one constraint, by computing, in the current state, Prom^S_i for all guessed functions S and all constraints i, we obtain a subset of the possible moves that contains all the improving moves, filtering out many (hopefully most) of the others. To better appreciate the advantages of this computation, let us reconsider the University timetabling problem.

View Audience (Figure 2(a)):
c   nb_stud
c1  27
c2  26
c3  37
c4  35
c5  48
c6  43
c7  67

Portion of V_con2 (Figure 2(c)):
t.p  t.r  t.c   r.id  r.capacity   a.c  a.nb_stud
p2   r2   c6    r2    40           c6   43
p3   r2   c7    r2    40           c7   67

Figure 2: (a) View Audience. (b) A portion of the extension of guessed view TT in the current state. (c) Portion of V_con2 due to the given portion of TT. (d) Promising moves over TT computed starting from the displayed portion of V_con2.

Example 5 (University timetabling, continued). Assume that the problem instance refers to periods p1–p40, to rooms r1, r2, r3 with capacities of 30, 40, 50 students, respectively, and to courses c1–c7. The number of students enrolled in the various courses is given by view Audience in Figure 2(a). Suppose now that, in the current state, whose extension for TT is partially given in Figure 2(b), the expression for con2 evaluates to the set of violations V_con2 of Figure 2(c). The tuples in V_con2 give us precious information to drive the search. As an example, the first tuple shows that the current candidate timetable has the problem that course c6 is given in period p2 in a room, r2, which is too small. Analogously for the other tuples.
The query for Prom^TT_con2 (once generalised to handle domain and co-domain relations of arity greater than 1) returns the set of moves that resolve at least one violation in V_con2 without producing a new one, at least with the same tuples of table Room and view Audience. Figure 2(d) shows the part of the result of Prom^TT_con2 containing the moves synthesised from the first violation. Note that, with the naive approach, prc = 40 · 3 · 7 = 840 moves would need to be considered. With the presented dynamic reduction, instead, at most 7 moves for each violation of con2 need to be considered. Hence, the more progressed the greedy search, and the closer we are to a solution, the fewer the moves that need to be considered.

DNR can be exploited also when we do not need knowledge of all the improving moves. Consider, e.g., min-conflicts: at any iteration this algorithm randomly selects one guessed function S : D → C and one of its domain values d ∈ D, and then chooses the best improving move among all those regarding S and d. Given that, for a move to be improving, it needs to resolve at least one violation of one constraint, we can design a violation-directed (and generalised) version of min-conflicts in such a way that it first selects a random violation λ ∈ ⋃_{i=1}^{k} V_i, and then, for each pair of components ⟨d_j, c_j⟩ ∈ λ caused by an occurrence of some guessed function S_j : D_j → C_j in the definition of the corresponding constraint, it considers as promising all moves of the kind ⟨S_j, d_j, c'⟩, with c' ∈ C_j, c' ≠ c_j, i.e., all moves that would make λ disappear in the target state. We omit a formal discussion of this variation for space reasons, but we will describe some preliminary experimental results in Section 5.

Incremental evaluation of constraints (IEC).
Although DNR allows us to reduce the set of moves that need to be considered during a greedy search, it tells us nothing about the quantitative impact (in terms of cost change) of the surviving moves over constraint i and the other constraints. To measure this, the naive algorithm would execute each such move, evaluate all problem constraints over the new database, and then roll back.

In the important and usual case of constraints of the form (1), the following results show that: (i) we can compute the exact cost variation of all moves in a given set by running only 2 queries for each constraint; and (ii) after one such move has been chosen, we can incrementally update its impact in terms of violations added and resolved for all the constraints.

Definition 5 (Resolved^S_i and Added^S_i). Given an NP-Alg expression ⟨S, fail⟩, one of the guessed functions S ∈ S, and one sub-expression fail_i having the form (1) and evaluating to V_i in state S, let M_S be a relation over the schema of S encoding an arbitrary set of moves over S. We define the queries:

  Resolved^S_i = σ_χ′(M_S × V_i),
  Added^S_i = Ψ_ξ(σ_{χ″ ∧ φ′_i}(M_S × (S^(1) × ··· × S^(m) × T))),

where χ′ = ⋁_{j=1..m} (M_S[$1] = V_i[$(2j − 1)]), χ″ = ⋁_{j=1..m} (M_S[$1] = S^(j)[$1]), and φ′_i is as in Definition 4.

The operator Ψ in Added^S_i is a non-standard replacement operator (which could easily be simulated in sql by means of functions and if statements in the target-list) that changes each tuple τ of the input expression into ξ(τ). Here, ξ(τ) produces tuple τ′ as follows: for all i ∈ [1..m],

  τ′[S^(i).$1, S^(i).$2] = τ[M_S.$1, M_S.$2]      if τ[S^(i).$1] = τ[M_S.$1],
  τ′[S^(i).$1, S^(i).$2] = τ[S^(i).$1, S^(i).$2]   otherwise,

while for any other component $j of τ, we have τ′[$j] = τ[$j].

These queries play a key role in the incremental evaluation of constraints:

Theorem 2 (Incremental evaluation of constraints).
Given S, V_i and M_S as in Definition 5, in the state S ⊕ δ reached after performing any move δ ∈ M_S from state S, the expression fail_i for each constraint i evaluates to:

  (V_i − π_{−M_S}(σ_{M_S=δ}(Resolved^S_i))) ∪ π_{−M_S}(σ_{M_S=δ}(Added^S_i)),

with π_{−M_S} denoting the projection operator that filters out the columns due to M_S. Also, Resolved^S_i ∩ Added^S_i = ∅.

Theorem 2 tells us that, given an arbitrary set of moves over a guessed function S that occurs in the expression fail_i, we can incrementally compute the result of fail_i in all the states reached by performing each of these moves, given the result in the current state. In particular, by taking M_S = Prom^S_i, we can compute the exact impact, both in terms of cost variation and in terms of the extension of V_i, of all promising moves.

Query Resolved^S_i returns a set of tuples, each representing a move δ = ⟨d, c⟩ in M_S concatenated with a violation in V_i in which value d occurs in the first column of at least one S^(j). Changing the co-domain value associated with d to c will make this violation disappear from V_i in the target state. As for Added^S_i, it computes the contributions, in terms of tuples added to the result of the query fail_i, of the execution of each move δ ∈ M_S. Such contributions are tuples of the Cartesian product in which value d occurs at least once (condition χ″), 'modified' by properly taking into account the new value c that the moves under consideration assign.

[Figure 3: Resolved^TT_con2 (a) and Added^TT_con2 (b) for some moves in Prom^TT_con2 of Figure 2(d).]
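A set-level reading of Definition 5 and Theorem 2, specialised to con2, can be sketched in Python. This is a simplification under assumed representations (a violation as a (slot, course) pair, a move as a (slot, new_course) pair); the `resolved`, `added`, and `violations_after` helpers are hypothetical stand-ins for the actual RA queries.

```python
# Toy data in the spirit of Example 6 (hypothetical names and values).
rooms = {"r1": 30, "r2": 40, "r3": 50}
audience = {"c1": 27, "c5": 48, "c6": 43, "c7": 67}

def resolved(moves, violations):
    # Resolved^S_i: each move paired with every current violation on the
    # same slot -- those violations disappear in the target state.
    return [(m, v) for m in moves for v in violations if m[0] == v[0]]

def added(moves):
    # Added^S_i: the violations each move introduces (the new course is
    # still too large for the room of the slot it is moved into).
    return [(m, (m[0], m[1])) for m in moves
            if m[1] is not None and audience[m[1]] > rooms[m[0][1]]]

def violations_after(V, delta, res, add):
    # Theorem 2: V_i in state S ⊕ δ equals (V_i minus what δ resolves)
    # union (what δ adds).
    return (V - {v for (m, v) in res if m == delta}) \
           | {v for (m, v) in add if m == delta}

V = {(("p2", "r2"), "c6"), (("p3", "r2"), "c7")}
moves = [(("p2", "r2"), "c1"), (("p2", "r2"), "c5")]
res, add = resolved(moves, V), added(moves)
# Move (p2, r2, c1) resolves one violation and adds none; (p2, r2, c5)
# resolves one but adds another, hence is neutral wrt con2.
print(violations_after(V, (("p2", "r2"), "c1"), res, add))
```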
Again, since in RA we cannot modify tuples under evaluation, we take the dual approach of changing the condition from φ_i to φ′_i. However, since we must also return the modified tuples, we need to use the non-standard Ψ operator, which changes the tuples in the input relations in the same way as done by φ′_i. Moreover, given that Resolved^S_i ∩ Added^S_i = ∅, just by grouping and counting tuples in those results, we can compute with one additional query the exact cost of all moves in set M_S over the constraints. Also, if all the V_i are materialised, we obtain incremental maintenance of the results of the corresponding fail_i by deleting and inserting the tuples returned by these queries.

Example 6 (University timetabling, continued). Figure 3 shows the results of computing queries Resolved^TT_con2 (part (a)) and Added^TT_con2 (part (b)) regarding the subset of moves in Prom^TT_con2 explicitly shown in Figure 2(d). As an example, move (p2, r2, c1) will reduce the cost of con2 by 1, since it occurs once in Resolved^TT_con2 (see Figure 3(a)) and does not occur in Added^TT_con2 (see Figure 3(b)). Analogously, move (p2, r2, c5) proves to be neutral wrt con2.

Some brief comments follow on the complexity of the three queries. For each guessed function S and constraint i, and assuming that V_i is materialised and incrementally updated thanks to the results of Theorem 2, we have that: (i) Prom^S_i can be computed with two joins, the first of which, given the tight condition χ, does not lead to an explosion of the number of tuples (any tuple in V_i can match at most m tuples of D, with m being usually very small); (ii) Resolved^S_i involves a single join operation; (iii) Added^S_i involves one more join operation than the expression of constraint i. Also in this case, the presence of the tight condition χ″ avoids in practice a combinatorial explosion of the number of tuples due to the added join.
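The "grouping and counting" step just mentioned might look as follows: the cost change of each move over one constraint is the number of violations it adds minus the number it resolves (in sql, a GROUP BY with COUNT over the two query results). The pair lists of shape (move, violation) and the toy values are assumptions mirroring Example 6.

```python
from collections import Counter

def move_costs(moves, resolved_pairs, added_pairs):
    # Cost change per move: #added violations minus #resolved violations.
    delta = Counter({m: 0 for m in moves})
    delta.update(m for (m, _) in added_pairs)
    delta.subtract(m for (m, _) in resolved_pairs)
    return dict(delta)

# Example 6: move (p2, r2, c1) resolves one violation and adds none,
# while (p2, r2, c5) resolves one and adds one.
moves = [(("p2", "r2"), "c1"), (("p2", "r2"), "c5")]
res = [(m, (("p2", "r2"), "c6")) for m in moves]
add = [((("p2", "r2"), "c5"), (("p2", "r2"), "c5"))]
costs = move_costs(moves, res, add)
print(costs[(("p2", "r2"), "c1")], costs[(("p2", "r2"), "c5")])  # -1 0
```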
5 Implementation, experiments, future work

We implemented a new ConSQL solver based on the ideas presented in Section 4, built on top of a relational DBMS. The system, written in Java, takes as input a problem specification in ConSQL and uses standard sql commands for choosing, evaluating, and performing moves, hence being independent of the DBMS used. The user can also annotate her specification with a declarative description of the search strategy to be used, given in terms of the local-search algorithms (and parameters) to run (gradient descent, steepest descent, tabu search, min-conflicts, simulated annealing) and their composition (via random restarts or batches). Details are given in the appendix.

In order to test the effectiveness of the techniques above on the scalability of the approach, we performed preliminary experiments, evaluating the performance gain of DNR+IEC with respect to the naive evaluation of moves and constraints. It is worth observing that our purpose was not to beat state-of-the-art solvers (and in fact we do not even comment on satisfaction), but rather to seek answers to the following questions: what is the impact of DNR+IEC on (i) the reduction of the number of moves to evaluate, and (ii) the overall performance of the greedy part of the search?

We experimented with two problems, Graph colouring and University timetabling, and two different greedy algorithms: steepest descent and the violation-directed version of min-conflicts mentioned in Section 4. The reasons behind our choices are as follows: steepest descent needs to evaluate, at every iteration, all possible moves, hence it is a good indicator of the impact of DNR+IEC in terms of effectiveness in neighbourhood reduction and the consequent performance gain. On the other hand, min-conflicts needs to evaluate a much smaller set of moves at any step: hence it can be much more efficient than the former, especially on large-scale instances.
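The steepest-descent regime described above (evaluate the whole promising neighbourhood at every iteration, take the best move, stop at a local minimum) can be sketched generically; the `neighbourhood`, `move_costs`, and `apply_move` callbacks are hypothetical stand-ins for the DBMS-side queries.

```python
def steepest_descent(state, neighbourhood, move_costs, apply_move):
    # Evaluate every candidate move, pick the best strictly improving
    # one, and stop when no move has a negative cost change.
    while True:
        moves = neighbourhood(state)
        costs = move_costs(state, moves)
        best = min(moves, key=costs.__getitem__)
        if costs[best] >= 0:
            return state            # local minimum reached
        state = apply_move(state, best)

# Smoke test on a one-dimensional toy problem: minimise f(x) = (x - 3)^2.
f = lambda x: (x - 3) ** 2
result = steepest_descent(
    0,
    neighbourhood=lambda x: [x - 1, x + 1],
    move_costs=lambda x, ms: {m: f(m) - f(x) for m in ms},
    apply_move=lambda x, m: m,
)
print(result)  # 3
```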
However, we will see that also in this case DNR+IEC may play an important role.

Experiments have been performed on an Athlon64 X2 Dual Core 3800+ PC with 4GB RAM, using the MySQL DBMS v. 5.0.75, with no particular optimisations. Instances of both problems have been solved running both algorithms with and without DNR+IEC. In order to perform fair comparisons, the two runs of the same algorithm on the same instance have been executed using the same random seed. For steepest descent, this implies that the two runs started from the same state, executed the same sequence of moves, and terminated at the same local minimum. This does not hold for min-conflicts, since, when IEC is not activated, constraints are not materialised, and the order in which tuples are retrieved cannot be controlled without paying a very high price in terms of efficiency.

As for the choice of problems (specifications are given in the appendix) and instances: the specification for graph colouring is very simple and consists of one guessed column-set and one constraint (see Example 3). Given its compactness, it gives clean information about the impact of our techniques on a per-constraint basis. We used some quite large benchmark instances taken from http://mat.gsia.cmu.edu/COLOR/instances.html, looking for minimal colourings, with the number of needed colours known in advance. As for University timetabling, in order to deal with real-world instances, we changed our formulation to that of [4], where schedules are periodic and some data preprocessing is already done (e.g., the constraint about equipment is already reduced to a relation listing the rooms suitable for the various courses). Also, we had to ignore some constraints which would need non-conjunctive queries, not yet supported by DNR+IEC.
The final formulation resulted in 8 ConSQL constraints (taking into account course lengths, room suitability, teachers' periodic unavailabilities, conflicts among courses with common students, and minimum course spread). As for the instances, we used some of those of Track 3 of the 2007 Intl. Timetabling Competition (http://www.cs.qub.ac.uk/itc2007), taken from http://tabu.diegm.uniud.it/ctt.

Some typical-case results are reported in Table 2. Enabling DNR+IEC considerably boosted performance: impressive speed-ups were experienced on all instances, especially when all moves need to be evaluated (steepest descent). Even in the cases where the DNR+IEC runs did not terminate within the 1-hour timeout, the number of iterations executed is always much higher than with DNR+IEC disabled. However, DNR+IEC brings advantages also when complete knowledge of the possible moves is not needed (min-conflicts), although the speed-ups are unsurprisingly lower. Finally, we observed (see Figure 4) the strong expected reductions of the size of the neighbourhood, thanks to the action of DNR (steepest descent). In particular, DNR often filters out more than 30% of the moves at the beginning of the search, and constantly more than 80% (with very few exceptions) when close to a local minimum.

Given the promising, even if preliminary, results, we are working on extending DNR and IEC to constraints defined by queries with aggregates and groupings, on evaluating them on other heuristics, and on investigating database optimisations, e.g., automated index generation, in order to execute queries as efficiently as possible and tackle much larger scenarios, such as those mentioned in Section 1.

Acknowledgements. This work has been partly carried out during visits of T. Mancini to Uppsala University, supported by the Swedish Foundation for Intl Cooperation in Research and Higher Education (STINT, grant KG2003-4723) and the Blanceflor Foundation. P. Flener and J.
Pearson are supported by grant 2007-6445 of the Swedish Research Council (VR). The authors thank the anonymous reviewers for their comments and suggestions.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] M. Ågren, P. Flener, and J. Pearson. Revisiting constraint-directed search. Information and Computation, 207(3):438–457, 2009.
[3] M. Cadoli and T. Mancini. Combining Relational Algebra, sql, Constraint Modelling, and Local Search. Theory and Practice of Logic Programming, 7(1&2):37–65, 2007.
[4] F. De Cesco, L. Di Gaspero, and A. Schaerf. Benchmarking curriculum-based course timetabling: Formulations, data formats, instances, validation, and results. In Proc. of PATAT'08, 2008.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.
[6] H. H. Hoos and T. Stützle. Stochastic Local Search: Foundations and Applications. Elsevier/Morgan Kaufmann, 2004.
[7] D. Mitchell and E. Ternovska. A framework for representing and solving NP search problems. In Proc. of AAAI 2005, pages 430–435. AAAI Press/MIT Press, 2005.
[8] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[9] R. Ramakrishnan and J. D. Ullman. A survey of deductive database systems. Journal of Logic Programming, 23(2):125–149, 1995.
[10] A. Schaerf. A survey of automated timetabling. Artificial Intelligence Review, 13(2):87–127, 1999.
[11] G. Terracina, N. Leone, V. Lio, and C. Panetta. Experimenting with recursive queries in database and logic programming systems. Theory and Practice of Logic Programming, 8(2):129–165, 2008.
[12] P. Van Hentenryck and L. Michel. Constraint-Based Local Search. MIT Press, 2005.
[13] W. Zhang, A. Rangan, and M. Looks. Backbone guided local search for maximum satisfiability. In Proc. of IJCAI 2003, pages 1179–1186. Morgan Kaufmann, 2003.
Graph colouring (sec (#iter); '–' denotes the 1-hour timeout)

                                 Steepest descent                   Min-conflicts
instance    nodes edges cols    on         off        speedup    on        off        speedup
anna         138   493   11    39 (21)    2682 (21)    69x       11 (14)    38 (33)    1.5x
games120     120   638    9   502 (40)       – (37)     8x       31 (48)    43 (48)    1.4x
homer        561  1629   13   247 (20)       – (2)    146x       36 (22)    53 (22)    1.5x
le450_5a     450  5714    5     – (140)      – (1)    140x      112 (22)  1667 (240)   1.4x
miles1000    128  3216   42   286 (38)       – (1)    478x      418 (25)  1266 (60)    1.3x
miles250     128   387    8    10 (28)    1734 (28)   173x        6 (10)    20 (36)    0.9x
miles500     128  1170   20    38 (28)       – (7)    379x       41 (14)   173 (46)    1.3x
miles750     128  2113   31   143 (36)       – (2)    453x      199 (25)   479 (46)    1.3x
mulsol.i.1   197  3925   49     – (4)        – (0)      ?       637 (36)  1083 (47)    1.3x
mulsol.i.2   188  3885   31  2683 (60)       – (1)     81x      235 (22)   879 (64)    1.3x
mulsol.i.3   184  3916   31  1506 (54)       – (1)    129x      296 (28)  1228 (91)    1.3x
mulsol.i.4   185  3946   31   974 (56)       – (1)    207x      288 (27)  1238 (91)    1.3x
mulsol.i.5   186  3973   31  1827 (53)       – (1)    104x      146 (13)   991 (72)    1.2x
myciel6       95   755    7    14 (42)    1214 (42)    87x       10 (39)    19 (47)    1.6x
queen11_11   121  3960   11   152 (52)       – (9)    137x       36 (14)   200 (63)    1.2x
queen13_13   169  6656   13     – (36)       – (2)     18x       58 (8)    896 (104)   1.2x
zeroin.i.2   211  3541   30   287 (53)       – (1)    665x      274 (26)   917 (68)    1.3x

University timetabling (min-conflicts only; sec (#iter))

instance   courses lectures rooms periods    on         off     speedup
comp01        30     160      6     30     410 (5)     – (6)      7x
comp02        82     283     16     25       – (8)     – (2)      4x
comp03        72     251     16     25       – (12)    – (2)      6x
comp04        79     286     18     25       – (10)    – (1)     10x
comp05        54     152      9     36     511 (8)     – (11)     5x
comp06       108     361     18     25       – (2)     – (0)      ?
comp07       131     434     20     25       – (1)     – (0)      ?
comp08        86     324     18     25       – (8)     – (1)      ?
comp09        76     279     18     25    3079 (15)    – (2)      9x
comp10       115     370     18     25       – (1)     – (0)      ?
comp11        30     162      5     45       – (23)    – (4)      6x
comp12        88     218     11     36       – (16)    – (3)      5x
comp13        82     308     19     25       – (17)    – (0)      ?
comp14        85     275     17     25       – (3)     – (1)      3x
comp15        72     251     16     25       – (10)    – (1)     10x
comp16       108     366     20     25       – (1)     – (0)      ?
comp17        99     339     17     25       – (3)     – (1)      3x
comp18        47     138      9     36     142 (40)    – (47)    21x
comp19        74     277     16     25    1621 (8)     – (2)      9x
comp20       121     390     19     25       – (1)     – (0)      ?
comp21        94     327     18     25       – (1)     – (0)      ?

Table 2: ConSQL results running steepest descent and min-conflicts on Graph colouring (top) and University timetabling (bottom), with and without DNR+IEC (1-hour timeout). Speedup = (iter(on)/time(on)) / (iter(off)/time(off)). Since the size of the university timetabling instances never allowed steepest descent to perform more than one iteration in 1 hour, such data has been omitted. With a longer timeout, we observed behaviours similar to those of Graph colouring.

[Figure 4: Ratio of the number of promising moves wrt the total number of moves, as a function of the state of the run (0% = start, 100% = local minimum reached), for all graph colouring instances in Table 2 (top) for which the steepest descent run with DNR+IEC terminated in 1 hour.]
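The speedup metric in Table 2's caption compares iteration throughputs rather than raw times, which keeps runs comparable even when one side hits the timeout. As a quick check on the 'anna' row (values taken from the table):

```python
def speedup(iter_on, time_on, iter_off, time_off):
    # Speedup = (iter(on)/time(on)) / (iter(off)/time(off)),
    # i.e. the ratio of iterations completed per second.
    return (iter_on / time_on) / (iter_off / time_off)

# Steepest descent on 'anna': 21 iterations in 39 s with DNR+IEC,
# 21 iterations in 2682 s without.
print(round(speedup(21, 39, 21, 2682)))  # 69, matching the reported 69x
```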