<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constrained optimisation over massive databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Toni Mancini</string-name>
          <email>tmancini@di.uniroma1.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Flener</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amir Hossein Monshi</string-name>
          <email>AmirHossein.Monshi.1351@student.uu.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justin Pearson</string-name>
          <email>Justin.Pearson@it.uu.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Technology, Uppsala University</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Informatica, Sapienza Università di Roma</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Constrained optimisation is increasingly considered by industry as a major candidate for coping with hard problems of practical relevance. However, the approach of exploiting, in those scenarios, current Constraint or Mathematical Programming solvers has severe limitations, which clearly demand new methods: data is usually stored in potentially very large databases, and building a problem model in central memory suitable for current solvers can be very challenging or impossible. In this paper, we extend the approach followed in [3], by presenting a declarative language for constrained optimisation based on sql, and novel techniques for local-search algorithms explicitly designed to handle massive data-sets. We also discuss and experiment with a solver implementation that, working on top of any DBMS, exploits such algorithms in a way that is transparent to the user, allowing smooth integration of constrained optimisation into industrial environments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Constrained combinatorial optimisation is increasingly considered by
industry as a major candidate for coping with hard problems of practical relevance,
such as scheduling, rostering, resource allocation, security, and bio-informatics.
However, the approach of exploiting, in those scenarios, current Constraint
Programming (CP) or Mathematical Programming (MP) solvers has severe limitations,
which clearly demand new methods:</p>
      <p>Problems of industrial relevance may easily exhibit sizes that are way out
of the reach of current solvers: data usually resides in information systems,
in the form of potentially very large, possibly distributed relational databases
with complex integrity constraints, and cannot easily be transferred to
central memory for processing.</p>
      <p>Moving data into central memory also potentially affects data integrity.
Solving very large combinatorial problems is likely to need a huge amount
of computing time, during which the original data must be kept locked, thus
freezing the normal activities of the information system.</p>
      <p>
        In several scenarios, the structure of both the data and the problem is
highly articulated and complex. Although it is always possible to map them
into the constructs offered by current constraint modelling languages, this
further step must be carefully designed and implemented, as it amounts to a real
engineering activity. This is usually perceived as a major obstacle by industry
programmers, who typically lack CP/MP and abstraction skills.
Several attempts to join combinatorial optimisation and databases have
already been carried out in the literature, especially in the contexts of
Deductive Database Systems (see, e.g., [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) and Logic Programming (see, e.g., the
system DLVDB described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
      <p>
        As a different approach to cope with these issues and to provide effective
means to spread combinatorial optimisation into industrial environments, in
previous work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] a non-deterministic extension of Relational Algebra (RA)
has been proposed. This language, called NP-Alg, can express
all problems in the complexity class NP, where (the decision versions of)
many problems of practical relevance lie. NP-Alg has been the formal basis
for deriving an extension of the well-known database querying language sql,
with similar characteristics, but also able to handle optimisation problems.
This second language, called ConSQL (i.e., sql with constraints), proposes
a new paradigm for modelling constrained optimisation problems, that of
modelling as querying: a problem specification is given as a set of views
devoted to encoding solutions, together with constraints that the contents of such
views must satisfy. Of course, since we deal with optimisation problems whose
decision versions belong to the complexity class NP, the “views” we are talking
about are not first-order definable, and hence not expressible in standard sql.
      </p>
      <p>The advantages of this approach wrt those above are two-fold: (i) the
modelling language is essentially sql, plus a few new and very intuitive
constructs, so industrial programmers do not have to master CP/MP to formalise
their problems; (ii) the paradigm allows solving approaches (like the one we
consider in this paper) that interact transparently with the DBMS and its
mechanisms for data transfers and access policies, and for maintaining data
integrity, since the solving process, which is seen as a normal transaction by
the DBMS, may easily recover from data updates.</p>
      <p>
        In this paper, after recalling NP-Alg (Section 2) we make the following
contributions:
1. We re-design our ConSQL language in order to obtain a language having
very few deviations from standard sql (Section 3).
2. We reconsider the local-search [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] approach suggested in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and argue
that, to succeed in solving large problems, classical local-search has to be
re-thought, with an explicit focus on the handling of massive data. In
particular, we isolate two major issues in classical local-search and
propose two methods to boost scalability: dynamic neighbourhood reduction
(DNR), and incremental evaluation of constraints (IEC) (Section 4).
3. We provide a new implementation of the system, based on the ideas above
and on a much wider portfolio of local-search algorithms, with full support
for the user to declaratively choose the search strategy to follow; we
experimentally evaluate it against the original implementation in terms of
performance and scalability (Section 5).
      </p>
    </sec>
    <sec id="sec-2">
      <title>The language NP-Alg</title>
      <p>
        NP-Alg [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a non-deterministic extension of RA, suitable for modelling
combinatorial decision problems.
      </p>
      <p>Syntax. Let R be a relational schema, i.e., a set of relations of arbitrary
finite arities. An NP-Alg expression is a pair ⟨S, fail⟩ where:
1. S = {S₁(a₁), …, Sₙ(aₙ)} is a set of new relations of arbitrary finite arities
a₁, …, aₙ. These relations are called “guessed”. R and S are disjoint.
2. fail is an expression of plain RA on the new database vocabulary R ∪ S.
Semantics. Given a finite database D over schema R, let DOM_D be the
unary relation representing the set of all constants occurring in D. The
semantics of an NP-Alg expression ⟨S, fail⟩ is as follows:
1. For each possible extension of the relations in S with elements in DOM_D,
the expression fail is evaluated, using the ordinary rules of RA.
2. If there exists one such extension for which fail evaluates to the empty
relation ∅ (denoted as fail = ∅), the answer to the decision problem is “yes”.
Otherwise, the answer is “no”. When the answer is “yes”, such an extension
of the relations in S is a solution of the problem instance.</p>
      <p>Intuitively, fail encodes the constraints that extensions of S need to satisfy to
be considered solutions. These are represented in terms of violations, hence
the goal is to guess an extension of S that makes fail = ∅ (a dual approach,
where constraints are defined in a ‘positive’ way, can however be followed).</p>
      <p>
        NP-Alg has strong relationships with existential second-order logic
(ESO) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and first-order model-expansion (FO-MX) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], with guessed
relations playing the role of quantified predicates in ESO and expansion
vocabulary symbols in FO-MX. It can be shown [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that NP-Alg can express
all (and only) the queries that specify decision problems belonging to NP.
      </p>
      <p>
        In the following, we assume the standard version of RA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with the
usual operators {σ, π, ×, −, ∪, ∩}, and positional notation ($1, $2, etc.) for
the attributes of the various relations. Also, given a tuple τ with n
components, we denote its i-th component by τ[$i], i ∈ [1..n], and the sub-tuple
obtained from τ by considering only the components with indices from i to j
(i ≤ j) by τ[$i..$j]. Finally, to ease notation, especially when expressing
conditions over tuples defined over the Cartesian product of some relations,
we may denote tuple indices by the name of the original relations they come
from, followed by the relative column number. Hence, e.g., σ_{$1=$3}(R1 × R2)
is equivalent to σ_{R1.$1=R2.$1}(R1 × R2) if R1 is binary.
      </p>
      <p>
        Example 1 (Graph 3-colouring). Let D be a database with two relations N
and E, encoding a graph: N is unary and encodes the set of nodes, while
E is binary and encodes the set of edges of the graph, as pairs of nodes.
Hence DOM_D = N. The NP-complete Graph 3-colouring problem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] can be
expressed in NP-Alg by the following expression:
1. S = {R, G, B}, all unary, encoding the sets of nodes to be coloured in red,
green, and blue, respectively;
2. fail = fail_cover ∪ fail_disj ∪ fail_col, where fail_cover = N − (R ∪ G ∪ B),
fail_disj = (R ∩ G) ∪ (R ∩ B) ∪ (G ∩ B), and fail_col = ⋃_{C ∈ {R,G,B}} ((C × C) ∩ E).
      </p>
      <p>For any guessed extension of R, G, B, fail_cover = ∅ iff every node is
assigned at least one colour; similarly, fail_disj = ∅ iff every node is assigned
at most one colour, and fail_col = ∅ iff no two adjacent nodes are assigned
the same colour. Hence, fail = ∅ iff there exist extensions for R, G, B
encoding an assignment of colours to nodes such that no two adjacent nodes have
the same colour. Such extensions are solutions to the problem instance.</p>
    </sec>
    <sec id="sec-3">
      <title>ConSQL: modelling as querying</title>
      <p>
        In this section we present a new and improved version of our ConSQL
language [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. ConSQL is a non-deterministic extension of sql, whose
optimisation-free subset has the same expressive power as NP-Alg.
ConSQL is a super-set of sql: users can hence safely exploit the rich set of
sql language features during problem modelling. The current version of
the language deviates only very little from standard sql, with the aim of
greatly easing the task of modelling constrained optimisation problems for
the average industry programmer.
      </p>
      <p>A detailed description of the syntax and semantics of ConSQL is given
in the appendix (available at http://www.dis.uniroma1.it/~tmancini).
Here, we introduce the language by an example.</p>
      <p>
        Example 2 (University timetabling). Assume a university wants to solve
the following course timetabling problem (a variant of the ones presented
in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in order to show the flexibility of the modelling language): find a
schedule for the lectures of a set of courses, in a set of classrooms,
minimising the total number of students that are enrolled in courses
having overlapping lectures, while satisfying constraints about lecture allocation in
time and rooms (con1, con2, con6), room equipment (con3), teachers’
unavailability (con4) and conflicts (con7), and course lengths (con5).
      </p>
      <p>Data reside in a relational database with the tables listed in Table 1
(primary keys are underlined; foreign key constraints are omitted for brevity). A
ConSQL specification for this problem is as follows:
create SPECIFICATION Timetabling (
// A guessed view (see new construct CHOOSE), encoding a ‘guessed’ timetable, as an
// assignment of courses to room/period pairs (w. some possibly unassigned, i.e., null)
create view TT as select p.id as p, r.id as r, CHOOSE(select id as c from Course) CAN BE NULL
from Period p, Room r
// A view defined in terms of TT (dependent guessed view)
create view ConflCourses as select c1.id as c1, c2.id as c2 from Course c1, Course c2, TT t1, TT t2
where c1.id &lt; c2.id and t1.c = c1.id and t2.c = c2.id and t1.p = t2.p
// Objective function, relying on view ConflCourses
MINIMIZE select count(*) from Student s, ConflCourses cc, Enrolled e1, Enrolled e2
where e1.student = s and e1.course = cc.c1 and e2.student = s and e2.course = cc.c2
Student(id, name, …): set of students, with id and other attributes
Teacher(id, name, …): set of teachers
Course(id, teacher, num_lect): set of courses to be scheduled, with teacher and number of planned lectures
Period(id, day, week, start, end): set of time-periods, plus information on day, start and finish times
Room(id, capacity): available rooms with their capacity
Enrolled(student, course): info on enrolment of students in the various courses
Equip(id, name): equipment available for teaching (e.g., VGA projector)
EquipAvail(equip, room): presence of equipment in the various rooms
EquipNeeded(course, equip): equipment needed for the various courses
TeacherUnav(teacher, period): unavailability constraints for teachers: lectures of their courses cannot be scheduled in these periods</p>
      <p>
As can be seen, a problem specification is defined in terms of the new
construct create SPECIFICATION, embedding the definition of a set of views
(via the standard sql create view construct), an optional objective function
(via the new constructs MINIMIZE and MAXIMIZE) and a set of constraints
(via the standard sql check keyword). Some of the views are guessed, in
that they have special columns, called guessed column-sets. Such columns
are defined by the new construct CHOOSE, which is the main enabler of the
non-deterministic mechanism added to sql in order to increase its expressive
power. The semantics is as follows:</p>
      <p>The solver is asked to non-deterministically populate the guessed
column-sets of the views, choosing tuples from the query argument of CHOOSE.
This choice must be such that all constraints are satisfied and the objective
function, if present, takes an optimal value.</p>
      <p>In the example, our goal is to assign the lectures of some course to room/period
pairs (guessed view TT), with some pairs possibly being empty (CAN BE
NULL), in such a way that the number of students enrolled in courses with
conflicting lectures is minimised and that all the constraints con1–con7 are
satisfied. The ‘residual’ (i.e., the pure sql) part of view TT has schema (p, r),
storing all period/room pairs. This is extended ‘to the right’ with a guessed
column-set of width 1 having the schema (c) of the argument query. The
overall view has schema (p, r, c). While the values in the first two columns are
fixed (period/room pairs), those in the entries of the third column are
nondeterministically picked from the sets of values occurring as ids for courses.
Some of them can be assigned to null, allowing us to model situations, like
ours, where we need partial functions. Figure 1(a) shows guessed view TT at
the initialisation stage, when the database holds three periods with ids p1,
p2, p3, and two rooms with ids ra and rb. Figure 1(b) shows one possible
extension of the guessed view, when the database stores courses having ids c1,
c2, c3 (plus, possibly, others). Note that, in case the modifier distinct were
used, it would be impossible for two period/room pairs to be assigned to the
same course. This is analogous to the all-different constraint used in CP.</p>
      <p>Extensions of guessed views need to satisfy all constraints in order to
be considered solutions to the problem. These are expressed by using the
standard sql keyword check applied to an arbitrary sql Boolean expression
(the various forms allowed by standard sql to specify conditions, hence
constraints, are described in the appendix). Finally, an objective function can
be defined by using one of the two new keywords MINIMIZE and MAXIMIZE,
applied to an arbitrary aggregate query returning a single scalar, defined over
database tables, ordinary views, and guessed views.</p>
    </sec>
    <sec id="sec-4">
      <title>Solving NP-Alg expressions over massive data</title>
      <p>
        In principle, in order to solve a constrained optimisation problem, different
competing techniques could be adopted, e.g., CP, Propositional Satisfiability
(SAT), or MP. In the presence of massive data-sets, such complete approaches may
show severe limitations, since they require the systematic exploration of the
huge search space. In these cases, incomplete approaches like local-search [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
may help. These algorithms sacrifice completeness for efficiency, and may
be advantageous especially on large instances, and in scenarios where
computing a solution of sufficiently high quality is satisfactory. Local-search
has been applied successfully to different formalisms, such as SAT (cf.
bgwalksat [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) and CP (cf., e.g., comet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), and to the successful resolution of
large scheduling and timetabling problems [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and other large constrained
optimisation problems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        In this section, after discussing some convenient syntactic extensions to
NP-Alg and recalling the main notions of classical local-search, we present
novel techniques to make local-search applicable to massive data-sets.
NP-Alg revisited. The language described in Section 2 can be extended,
without affecting its expressive power, with means to support bounded
integers and arithmetic, typed guessed relations, guessed functions, and guessed
permutations. As an example, although in the base version of the language
the entries of guessed relations take values ranging over DOM_D (hence extensions
of an r-ary relation range over subsets of DOM_D^r), in a (just syntactically)
enriched version we could easily force them to be typed and also to model
functions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this way we could easily model, e.g., the Graph k-colouring
problem, where the set of the k available colours is given in the additional
unary relation K, as follows:
      </p>
      <p>Example 3 (Graph k-colouring).
1. Guessed relation: Col : N → K, a total function.
2. fail = π_{$1,$3}(σ_{$1≠$3 ∧ $2=$4}(Col × Col)) ∩ E.</p>
      <p>The guessed function Col is represented as a relation having two typed columns,
$1 and $2, with entries in $1 ranging over N and those in $2 ranging over
K. Being also forced to be a total function, Col is such that any node in N
occurs exactly once in the first column of Col. The expression for fail
evaluates to ∅ iff the guessed extension for Col is such that no two different
nodes ($1 ≠ $3) assigned to the same colour ($2 = $4) share an edge.
It can be shown that the enriched language where guessed relations are
always forced to be typed total functions between two unary database relations
has the same expressive power as the base language. This is because: (i)
tuples of non-unary relations can always be denoted by new constants which
act as keys; and (ii) arbitrary guessed relations can always be encoded by
characteristic functions. Also, the expression fail can always be rewritten
in disjunctive form, as a union of sub-expressions. Since this variation is
syntactically much closer to our target practical language ConSQL, in the
remainder of this paper we will refer exclusively to it. Hence, from now on,
we rely on the following definition:
Definition 1 (NP-Alg expression). An NP-Alg expression ⟨S, fail⟩ over
a relational schema R is defined as:
1. A set S = {S₁, …, Sₙ} of new binary relations, with each Sᵢ : Dᵢ → Cᵢ
forced to be a total function with domain Dᵢ and co-domain Cᵢ, both of
these relations being unary and belonging to R.
2. An expression fail = ⋃ᵢ₌₁ᵏ failᵢ of plain RA on the vocabulary R ∪ S.
The expressions failᵢ encode the constraints to satisfy, with their tuples, for a
given extension of the guessed functions, representing constraint violations.</p>
      <p>The gap between NP-Alg and ConSQL is now much smaller (see
Example 3): guessed functions in NP-Alg neatly correspond to guessed
column-sets of ConSQL, and the expressions failᵢ correspond to check not exists
constraints. Extensions like having multiple guessed column-sets in the same
view, null values, and non-unary relations serving as domains and co-domains
are not substantial, and could be simulated by means of additional joins,
special values in the co-domain relations, and the introduction of unary key fields.
Classical local-search. The following notions play key roles in all
local-search algorithms. We recall them in terms of an NP-Alg expression ⟨S, fail⟩
evaluated over a finite database over schema R.</p>
      <p>Definition 2 (States, search space, costs, solutions). A state, or candidate
solution, is an extension Σ of the guessed functions S. The search space is the
set of all possible states. The cost of a state Σ is |fail₁| + ⋯ + |failₖ|, i.e., the total
number of tuples returned by the sub-expressions failᵢ when evaluated on Σ. A
solution is a state of cost 0.</p>
      <p>
        Most of the local-search algorithms described in the literature [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have, at
their core, the following greedy behaviour:
current-state ← Init-State()
while not Termination-Condition() do
    move ← Choose-Improving-Move(current-state)
    if move ≠ nil then current-state ← Make-Move(move) else break
The idea is, after a (usually random) initialisation, to iteratively evaluate
and perform small changes (moves) to the current state, in order to reduce
the cost of the current candidate solution, until some termination condition
(in terms of, e.g., the cost of current-state or the amount of computational
resources used) is met. The universe of possible moves is chosen in advance
by the programmer based on the structure of the problem. Although this choice
could in principle be arbitrary, in local-search moves are mostly defined to
involve a very small change to the current state.
      </p>
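      <p>The greedy scheme above can be sketched over a dictionary-based representation of guessed total functions (illustrative Python only; the system described in this paper evaluates fail through the DBMS instead):</p>

```python
import random

def greedy_descent(state, domains, codomains, constraints, max_iters=1000):
    """Gradient-descent sketch over NP-Alg-style states.

    state:       {name: {d: c}} -- one dict per guessed total function S : D -> C
    constraints: functions mapping a state to their set of violating tuples
                 (the tuples of fail_i); the cost of a state is the total count.
    """
    cost = lambda s: sum(len(f(s)) for f in constraints)
    current = cost(state)
    for _ in range(max_iters):
        if current == 0:
            break
        # Candidate moves <S, d, c>: reassign one domain value of one function.
        moves = [(S, d, c) for S in state
                 for d in domains[S] for c in codomains[S] if state[S][d] != c]
        random.shuffle(moves)            # gradient descent: first improving move
        improved = False
        for S, d, c in moves:
            old = state[S][d]
            state[S][d] = c              # make-move
            new = cost(state)
            if new < current:
                current, improved = new, True
                break
            state[S][d] = old            # roll back a non-improving move
        if not improved:
            break                        # local minimum reached
    return current

# Usage: 2-colouring of the path a - b - c (toy guessed function Col : N -> K).
N, K = ["a", "b", "c"], ["red", "green"]
E = {("a", "b"), ("b", "c")}
state = {"Col": {n: "red" for n in N}}
conflicts = lambda s: {(u, v) for (u, v) in E if s["Col"][u] == s["Col"][v]}
final_cost = greedy_descent(state, {"Col": N}, {"Col": K}, [conflicts])
print(final_cost)   # 0: a proper colouring was found
```

      <p>On this toy instance an improving move exists from every non-solution state, so the descent always reaches cost 0.</p>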
      <p>The simple structure of guessed functions in NP-Alg suggests a very
natural definition for atomic moves, while any other possible definition could be
given just by composition.</p>
      <p>Definition 3 (Move). A move is a triple μ = ⟨S, d, c⟩ with S ∈ S (S : D → C), d ∈ D, c ∈ C.
Hence, given any state Σ and the corresponding extension S of a guessed
function S, a move μ = ⟨S, d, c⟩ changes the extension S of S by modifying
the co-domain value assigned to domain value d to c. We denote the state
reached from Σ after performing μ by Σ_μ, and the new extension of S by
S_μ. A move μ executed on state Σ is improving, neutral or
worsening iff the cost of state Σ_μ is, respectively, less than, equal to, or
greater than the cost of state Σ. Also, given a constraint i, a move μ executed
on Σ is improving, neutral or worsening wrt constraint i iff the share of the cost
of state Σ_μ due to constraint i (i.e., |failᵢ|) is, respectively, less than, equal
to, or greater than that of state Σ.</p>
      <p>
        The choice of moves to perform usually varies, depending on the
particular algorithm considered. As an example, gradient descent chooses an
improving move randomly, while steepest descent chooses the move that, if
performed, maximally reduces the cost of the new current state. Since steepest
descent needs to consider all possible moves in order to choose the best one, it
can be very inefficient on large-scale problems. A third very successful
algorithm is min-conflicts which, once adapted to the structure of the search space
in the NP-Alg framework, randomly selects one guessed function S : D → C
and one of its domain values d ∈ D, and then chooses the best improving
move among all those regarding S and d. All these greedy algorithms return
nil if no moves lead to a cost reduction, and hence are easily trapped in the
so-called local minima, i.e., states that have cost lower than all their
neighbours (states that could be reached by executing a single move). To avoid
being stuck in local minima, and to be able to continue search towards better
states, usually these basic algorithms are enhanced with more sophisticated
(not purely greedy) techniques. As an example, simulated annealing also
executes worsening moves, with a probability that decreases over time, while
tabu search dynamically filters the set of available moves, by forbidding those that
would bring the search back to recently visited states. Also, to increase
robustness, such algorithms are enhanced with random restarts, joined together in
batches, and/or wrapped by even more sophisticated meta-heuristics [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
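      <p>A single min-conflicts iteration can be sketched as follows (again on a toy colouring state with one guessed function; all names and data are illustrative):</p>

```python
import random

# Toy instance: colour the path a - b - c with two colours.
N, K = ["a", "b", "c"], ["red", "green"]
E = {("a", "b"), ("b", "c")}
cost = lambda s: sum(1 for (u, v) in E if s[u] == s[v])

def min_conflicts_step(state):
    """Randomly pick one domain value d, then take the best move among all
    reassignments of d (keeping the current value on ties)."""
    d = random.choice(N)
    best_c, best_cost = state[d], cost(state)
    for c in K:
        state[d] = c
        if cost(state) < best_cost:
            best_c, best_cost = c, cost(state)
    state[d] = best_c
    return best_cost

random.seed(0)
state = {n: "red" for n in N}
steps = 0
while cost(state) > 0 and steps < 200:
    min_conflicts_step(state)
    steps += 1
print(cost(state))
```

      <p>Unlike steepest descent, each iteration only evaluates the |K| reassignments of one domain value, which is what makes the algorithm attractive at scale.</p>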
      <p>Classical local-search algorithms typically use the problem constraints only
to evaluate the quality of a move, i.e., its impact on the cost of the search
state reached if that move is performed. In the context of constrained
optimisation over massive data, this approach quickly becomes impractical.
Example 4 (University timetabling, continued). Assume that the database
relations have sizes |Room| = r, |Period| = p, |Course| = c. By Definition 3, a
steepest descent algorithm needs, at any step, to evaluate a constant number
p·r·c of candidate moves (each period/room pair can be reassigned to one of the
other c − 1 courses, plus null) to find and choose the one that maximally reduces
the cost of the new current state. With a naive approach, for any such
move, 7 sql queries, one for each constraint, would have to be evaluated on the new
database obtained by changing the course assigned to one room/period pair.</p>
      <p>
Even in not-so-large scenarios, e.g., p = 960 (8 periods/day, 5 days/week
for 6 months; observe that the scheduling is not periodic, given, e.g., teachers’
unavailabilities), r = 30, c = 40, the number of candidate moves to
consider at each step is over 10⁶. Now, assuming that each query takes only
10⁻³ seconds to run, and even ignoring the time for modifying and rolling back
the database each time, a naive version of steepest descent would need about
10⁶ · 7 · 10⁻³ = 7 · 10³ seconds to execute one single move!
Constraint-directed dynamic neighbourhood reduction (DNR). To
overcome these difficulties, the concept of constraint-directed
neighbourhoods has recently been proposed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The idea is to use the problem
constraints also to isolate subsets of moves that possess some useful
property, such as being guaranteed to be improving wrt one or more
constraints. These ideas can be exploited and extended in our context in a
very elegant and natural way, thanks to the great simplicity of NP-Alg.
      </p>
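      <p>The arithmetic of Example 4 can be checked directly (the 7·10³ figure in the text rounds p·r·c down to 10⁶):</p>

```python
# Figures from the example: p periods, r rooms, c courses.
p, r, c = 960, 30, 40
moves_per_step = p * r * c             # each pair gets one of c-1 courses or null
print(moves_per_step)                  # 1152000, i.e. over 10**6 candidate moves

queries_per_move = 7                   # one sql query per constraint con1..con7
seconds_per_query = 1e-3
step_seconds = moves_per_step * queries_per_move * seconds_per_query
print(step_seconds)                    # 8064.0, of the order of the 7*10**3 estimate
```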
      <p>Consider an NP-Alg expression ⟨S, fail⟩, where fail = ⋃ᵢ₌₁ᵏ failᵢ, and
all failᵢ are conjunctive queries, i.e., of the form failᵢ = σ_{θᵢ}(Sᵢ₁ × ⋯ × Sᵢₛ ×
Rᵢ₁ × ⋯ × Rᵢᵣ), with Sᵢⱼ ∈ S (j ∈ [1..s]) and Rᵢₚ ∈ R (p ∈ [1..r]). In
ConSQL this is equivalent to saying that constraints can be expressed as not
exists select-from-where queries, as is often the case. Let us now focus on
a single constraint i. Assume that, in a given state Σ, failᵢ evaluates to a
non-empty set of tuples Vᵢ (which can be regarded as the violations of constraint
i). In order to find a solution we need to resolve all violations of all
constraints. Let us restrict our attention to the extension S ∈ Σ of a distinguished
guessed function S (a binary relation encoding a function S : D → C with D,
C unary relations in R) occurring m ≥ 1 times in the expression for failᵢ.
To emphasise the m occurrences of S, we can, wlog, rewrite failᵢ as follows
(after also changing the references to columns in θᵢ accordingly):</p>
      <p>failᵢ = σ_{θᵢ}(S⁽¹⁾ × ⋯ × S⁽ᵐ⁾ × T),    (1)
with S⁽¹⁾, …, S⁽ᵐ⁾ being the m occurrences of S, and T the Cartesian
product of all the relations other than S in the constraint expression.</p>
      <p>Since a move over S that is improving wrt constraint i must resolve at least one
violation in Vᵢ ⊆ S⁽¹⁾ × ⋯ × S⁽ᵐ⁾ × T, we give the following definition:
Definition 4 (Promising moves). The set of promising moves over guessed
function S wrt constraint i is defined by the following RA query:
Prom_i^S = π_{D,C}(σ_{γ ∧ ¬θ′ᵢ}(Vᵢ × D × C)),    (2)
where γ = ⋁ⱼ₌₁ᵐ (D.$1 = Vᵢ.$(2j−1)) and θ′ᵢ is obtained from θᵢ by replacing
any atom a in which S⁽ˣ⁾.$2 occurs (x ∈ [1..m]) with ((S⁽ˣ⁾.$1 = D.$1) →
a′) ∧ ((S⁽ˣ⁾.$1 ≠ D.$1) → a), where a′ is obtained from a by replacing any
occurrence of S⁽ˣ⁾.$2 with C.$1.</p>
      <p>Despite its apparent complexity, the meaning of formula (2) is quite
intuitive. A generic tuple ν in Vᵢ has the form ⟨d₁, c₁, …, dₘ, cₘ, t⟩, with
d₁, …, dₘ ∈ D and c₁, …, cₘ ∈ C, and t a sub-tuple over schema T. Prom_i^S
computes moves over S (hence pairs μ = ⟨d, c⟩ ∈ D × C, cf. the projection
operator) such that: (i) the domain value d of μ is equal to at least one
among d₁, …, dₘ of at least one violation ν ∈ Vᵢ (condition γ); and (ii) for
such a ν, the tuple ν′, built by replacing in ν any component cⱼ with the new
value c iff the corresponding dⱼ is equal to d, does not satisfy condition θᵢ.
Since in RA we cannot modify tuples when evaluating a condition, in (2)
we take the alternative approach of changing the condition itself into θ′ᵢ.
Requirement (i) guarantees that the synthesised moves involve only domain
values that appear in violations ν ∈ Vᵢ and hence, if performed, resolve at least
ν. Requirement (ii), instead, ensures that, after performing μ, the newly
introduced tuple in S, i.e., ⟨d, c⟩, does not lead to a new violation ν′, at least
with tuple t. The following result holds (proofs omitted for space reasons):
Theorem 1. In any state Σ, for any constraint i, and for any S ∈ S having
extension S in Σ, all moves over S that, if executed from Σ, are improving
wrt constraint i belong to Prom_i^S.</p>
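      <p>The effect of promising moves can be replayed on a small colouring-style constraint with m = 2 occurrences of the guessed function (an illustrative Python loop; in the system the single RA query (2) is evaluated by the DBMS instead):</p>

```python
from itertools import product

# One guessed function S : D -> C; the constraint's violations are
# edge-linked domain pairs mapped to the same codomain value (toy data).
D, C = ["a", "b", "c", "d"], ["red", "green", "blue"]
E = {("a", "b"), ("b", "c")}
S = {"a": "red", "b": "red", "c": "red", "d": "blue"}

def violations(S):
    # V_i, the tuples <d1, c1, d2, c2> returned by fail_i in the current state
    return {(u, S[u], v, S[v]) for (u, v) in E if S[u] == S[v]}

def promising(S):
    """Promising moves <d, c>: d occurs in some violation (requirement (i)),
    and substituting c for d's value falsifies the violating condition on
    that tuple (requirement (ii), the role of the rewritten condition)."""
    V = violations(S)
    prom = set()
    for d, c in product(D, C):
        for (d1, c1, d2, c2) in V:
            if d not in (d1, d2):
                continue                           # (i) fails for this tuple
            new1, new2 = (c if d1 == d else c1), (c if d2 == d else c2)
            if new1 != new2:                       # (ii): violation resolved
                prom.add((d, c))
    return prom

prom = promising(S)
print(sorted(prom))
```

      <p>Every move over d is filtered out, since d occurs in no violation, while every improving move, such as recolouring b, is retained, as Theorem 1 guarantees.</p>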
      <p>In other words, since a move is improving iff it is improving wrt at least
one constraint, by computing, in the current state, Prom_i^S for all guessed
functions S and all constraints i, we obtain a subset of the possible moves
that contains all the improving moves, filtering out many (hopefully most)
of the others. To better appreciate the advantages of this computation, let
us reconsider the University timetabling problem.</p>
      <p>Example 5 (University timetabling, continued). Assume that the problem instance refers to periods p1–p40, rooms r1, r2, r3 with capacities of 30, 40, 50 students, respectively, and courses c1–c7. The number of students enrolled in the various courses is given by view Audience in Figure 2(a). Suppose now that, in the current state, whose extension for TT is partially given in Figure 2(b), the expression for con2 evaluates to the set of violations Vcon2 of Figure 2(c). Tuples in Vcon2 give us precious information to drive the search. As an example, the first tuple shows that the current candidate timetabling has the problem that course c6 is given in period p2 in a room, r2, which is too small. Analogously for the other tuples. The query for Prom^TT_con2 (once generalised to handle domain and co-domain relations of arity greater than 1) returns the set of moves that resolve at least one violation in Vcon2 without producing a new one, at least with the same tuples of table Room and view Audience. Figure 2(d) shows the part of the result of Prom^TT_con2 containing the moves synthesised from the first violation. Note that, with the naive approach, p · r · c = 40 · 3 · 7 = 840 moves would need to be considered. With the presented dynamic reduction, instead, at most 7 moves for each violation of con2 need to be considered. Hence, the more progressed the greedy search, and the closer we are to a solution, the fewer the moves that need to be considered.</p>
      <p>DNR can be exploited also when we do not need knowledge of all improving moves. Consider, e.g., min-conflicts: at any iteration this algorithm randomly selects one guessed function S : D → C and one of its domain values d ∈ D, and then chooses the best improving move among all those regarding S and d. Given that, for a move to be improving, it needs to resolve at least one violation of one constraint, we can design a violation-directed (and generalised) version of min-conflicts in such a way that it first selects a random violation ν ∈ ⋃_{i=1}^{k} Vi, and then, for each pair of components ⟨dj, cj⟩ ∈ ν caused by an occurrence of any guessed function Sj : Dj → Cj in the definition of the corresponding constraint, it considers as promising all moves of the kind ⟨Sj, dj, c′⟩, c′ ∈ Cj, c′ ≠ cj, i.e., all moves that would make ν disappear in the target state. We omit a formal discussion of this variation for space reasons, but we will describe some preliminary experimental results in Section 5.
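      <p>The violation-directed variant can be sketched in a few lines. The following is a simplified in-memory version on graph colouring (not the paper's SQL implementation): the violated pairs ⟨d, c⟩ are read off the current violations, and d is reassigned to the cost-minimising co-domain value; all function names are illustrative.</p>
      <preformat>
```python
import random

def violated_pairs(edges, col):
    """Pairs (node, colour) occurring in some violation (monochromatic edge)."""
    return [(n, col[n]) for u, v in edges if col[u] == col[v] for n in (u, v)]

def cost(edges, col):
    """Number of violations of the colouring constraint."""
    return sum(1 for u, v in edges if col[u] == col[v])

def min_conflicts(edges, colours, col, max_iters=200, seed=0):
    """Violation-directed min-conflicts: pick a random violated pair (d, c),
    then apply the best of the promising moves (d, c2) with c2 != c."""
    rng = random.Random(seed)
    for _ in range(max_iters):
        pairs = violated_pairs(edges, col)
        if not pairs:
            return col                      # no violations left: solved
        d, c = rng.choice(pairs)            # random violation component
        col[d] = min((c2 for c2 in colours if c2 != c),
                     key=lambda c2: cost(edges, {**col, d: c2}))
    return None

# a triangle needs three colours; start from the all-red state
sol = min_conflicts([(1, 2), (2, 3), (1, 3)],
                    ["red", "blue", "green"],
                    {1: "red", 2: "red", 3: "red"})
```
      </preformat>
      <p>Every move considered resolves the randomly chosen violation, so only the co-domain values of one domain element are evaluated per iteration.</p>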
</p>
      <p>Incremental evaluation of constraints (IEC). Although DNR allows us to reduce the set of moves that need to be considered during a greedy search, it tells us nothing about the quantitative impact (in terms of cost change) of the surviving moves on constraint i and on the other constraints. To obtain this information, the naive algorithm would execute each such move, evaluate all problem constraints over the new database, and then roll back.</p>
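      <p>This naive evaluate-and-roll-back loop is straightforward to express on top of any transactional DBMS. The sketch below uses SQLite and the graph-colouring constraint as a stand-in; names are illustrative.</p>
      <preformat>
```python
import sqlite3

con = sqlite3.connect(":memory:")
con.isolation_level = None          # manage transactions explicitly
cur = con.cursor()
cur.executescript("""
CREATE TABLE edge(u, v);
CREATE TABLE col(node PRIMARY KEY, colour);
INSERT INTO edge VALUES (1,2),(2,3);
INSERT INTO col VALUES (1,'red'),(2,'red'),(3,'blue');
""")

# total cost: number of monochromatic edges
COST = """SELECT count(*) FROM edge e
          JOIN col cu ON cu.node = e.u
          JOIN col cv ON cv.node = e.v
          WHERE cu.colour = cv.colour"""

def naive_move_cost(d, c):
    """Execute move (d, c), re-evaluate every constraint, then roll back."""
    cur.execute("BEGIN")
    cur.execute("UPDATE col SET colour = ? WHERE node = ?", (c, d))
    (cost,) = cur.execute(COST).fetchone()
    cur.execute("ROLLBACK")
    return cost

costs = {(d, c): naive_move_cost(d, c)
         for d in (1, 2, 3) for c in ("red", "blue")}
```
      </preformat>
      <p>Every candidate move triggers a full re-evaluation of all constraints over the database, which is exactly the per-move cost that IEC is designed to avoid.</p>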
      <p>In the important and usual case of constraints of the form (1), the following results show that: (i) we can compute the exact cost variation upon all moves in a given set by running only 2 queries for each constraint; and (ii) after one such move has been chosen, we can update incrementally its impact in terms of violations added and resolved for all the constraints.</p>
      <p>Definition 5 (Resolved^S_i and Added^S_i). Given an NP-Alg expression ⟨S, fail⟩, one of the guessed functions S ∈ S, and one sub-expression fail_i having the form (1) and evaluating to Vi in the current state, let MS be a relation over the schema of S encoding an arbitrary set of moves over S. We define the queries:</p>
      <p>Resolved^S_i = σ_{θ′} (MS × Vi);</p>
      <p>Added^S_i = μ σ_{θ′′ ∧ φ′_i} (MS × (S(1) × ⋯ × S(m) × T));</p>
      <p>where θ′ = ⋁_{j=1}^{m} (MS[$1] = Vi[$(2j−1)]), θ′′ = ⋁_{j=1}^{m} (MS[$1] = S(j)[$1]), and φ′_i is as in Definition 4. The μ operator in Added^S_i is a non-standard replacement operator (that could easily be simulated in sql by means of functions and if statements in the target-list) that changes each tuple τ of the input expression into μ(τ). Here, μ(τ) produces tuple τ′ as follows: for all i ∈ [1..m], τ′[S(i).$1, S(i).$2] = τ[MS.$1, MS.$2] if τ[S(i).$1] = τ[MS.$1], and τ′[S(i).$1, S(i).$2] = τ[S(i).$1, S(i).$2] otherwise, while for any other component $j of τ we have that τ′[$j] = τ[$j].</p>
      <p>These queries play a key role in the incremental evaluation of constraints:</p>
      <p>Theorem 2 (Incremental evaluation of constraints). Given S, Vi and MS as in Definition 5, in the state reached after performing any move μ ∈ MS from the current state, the expression fail_i for each constraint i evaluates to:</p>
      <p>(Vi ∖ π_{−MS} σ_{MS = μ} (Resolved^S_i)) ∪ π_{−MS} σ_{MS = μ} (Added^S_i),</p>
      <p>with π_{−MS} denoting the projection operator that filters out the columns due to MS. Also, Resolved^S_i ∩ Added^S_i = ∅.</p>
      <p>Theorem 2 tells us that, given an arbitrary set of moves over a guessed function S that occurs in the expression fail_i, we can compute incrementally the result of fail_i in all the states reached by performing each of these moves, given its result in the current state. In particular, by taking MS = Prom^S_i, we can compute the exact impact, both in terms of cost variation and in terms of the extension of Vi, of all promising moves.</p>
      <p>Query Resolved^S_i returns a set of tuples, each representing a move μ = ⟨d, c⟩ in MS concatenated with a violation ν in Vi in which value d occurs in the first column of at least one S(i). Changing the co-domain value associated to d into c will make this violation disappear from Vi in the target state. As for Added^S_i, it computes the contributions, in terms of tuples added to the result of query fail_i, of the execution of each move μ ∈ MS. Such contributions are the tuples of the Cartesian product in which value d occurs at least once (condition θ′′), ‘modified’ by properly taking into account the new value c that the moves under consideration assign. Again, since in RA we cannot modify tuples under evaluation, we take the dual approach of changing the condition from φ_i to φ′_i. However, since we also must return the modified tuples, we need the non-standard μ operator, which changes the tuples of the input relations in the same way as done by φ′_i.</p>
      <p>Moreover, given that Resolved^S_i ∩ Added^S_i = ∅, just by grouping and counting the tuples in those results, we can compute with one additional query the exact cost of all moves in set MS over the constraints. Also, if all the Vi are materialised, we obtain incremental maintenance of the results of the corresponding fail_i by deleting and inserting the tuples returned by these queries.</p>
      <p>Example 6 (University timetabling, continued). Figure 3 shows the results of computing queries Resolved^TT_con2 (part (a)) and Added^TT_con2 (part (b)) for the subset of moves in Prom^TT_con2 explicitly shown in Figure 2(d). As an example, move (p2, r2, c1) will reduce the cost of con2 by 1, since it occurs once in Resolved^TT_con2 (see Figure 3(a)) and it does not occur in Added^TT_con2 (see Figure 3(b)). Analogously, move (p2, r2, c5) proves to be neutral wrt con2.</p>
      <p>Some brief comments follow on the complexity of the three queries. For each guessed function S and constraint i, and assuming that Vi is materialised and incrementally updated thanks to the results of Theorem 2, we have that: (i) Prom^S_i can be computed with two joins, the first of which, given the tight condition θ, does not lead to an explosion of the number of tuples (any tuple in Vi can match at most m tuples of D, with m being usually very small); (ii) Resolved^S_i involves a single join operation; (iii) Added^S_i involves one more join operation than the expression of constraint i. Also in this case, the presence of the tight condition θ′′ avoids in practice the combinatorial explosion of the number of tuples due to the added join.</p>
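      <p>The grouping-and-counting step can be sketched in sql. For the graph-colouring constraint, Resolved pairs each move with the current violations it touches (condition θ′), Added re-derives the touched product tuples under the rewritten condition (θ′′ and φ′, with CASE expressions standing in for the non-standard μ operator), and a GROUP BY yields the per-move counts whose difference is the exact cost variation. The sketch runs through SQLite; all names are illustrative.</p>
      <preformat>
```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE moves(d, c);           -- MS: candidate moves over Col
CREATE TABLE edge(u, v);
CREATE TABLE col(node PRIMARY KEY, colour);
INSERT INTO edge VALUES (1,2),(2,3);
INSERT INTO col VALUES (1,'red'),(2,'red'),(3,'blue');
INSERT INTO moves VALUES (1,'blue'),(2,'blue');
-- current violations: monochromatic edges
CREATE VIEW v AS
  SELECT e.u AS d1, cu.colour AS c1, e.v AS d2, cv.colour AS c2
  FROM edge e JOIN col cu ON cu.node = e.u
              JOIN col cv ON cv.node = e.v
  WHERE cu.colour = cv.colour;
""")

# Resolved: each move joined with the violations containing its domain value
resolved = """SELECT m.d, m.c, count(*) FROM moves m, v
              WHERE m.d IN (v.d1, v.d2)
              GROUP BY m.d, m.c"""
# Added: touched product tuples that still violate once the move is applied
added = """SELECT m.d, m.c, count(*)
           FROM moves m, edge e
                JOIN col cu ON cu.node = e.u
                JOIN col cv ON cv.node = e.v
           WHERE m.d IN (e.u, e.v)
             AND (CASE WHEN e.u = m.d THEN m.c ELSE cu.colour END) =
                 (CASE WHEN e.v = m.d THEN m.c ELSE cv.colour END)
           GROUP BY m.d, m.c"""

delta = {}                          # exact cost variation per move
for d, c, n in cur.execute(resolved):
    delta[(d, c)] = delta.get((d, c), 0) - n
for d, c, n in cur.execute(added):
    delta[(d, c)] = delta.get((d, c), 0) + n
```
      </preformat>
      <p>On this instance the move (1, blue) is improving (delta -1), while (2, blue) merely trades the violation on edge (1, 2) for one on edge (2, 3) (delta 0), mirroring the situation of Example 6.</p>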
    </sec>
    <sec id="sec-5">
      <title>Implementation, experiments, future work</title>
      <p>We implemented a new ConSQL solver based on the ideas presented in
Section 4, and built on top of a relational DBMS. The system, written in Java,
takes as input a problem specification in ConSQL and uses standard sql
commands for choosing, evaluating, and performing moves, hence being
independent of the DBMS used. The user can also annotate her specification with
a declarative description of the search strategy to be used, given in terms of
the local-search algorithms (and parameters) to run (gradient descent,
steepest descent, tabu search, min-conflicts, simulated annealing) and their
composition (via random restarts or batches). Details are given in the appendix.</p>
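      <p>As a sketch of how such compositions behave, the loop below shows the random-restart wrapper pattern in Python; search and initial are illustrative placeholders for a greedy run and a state initialiser, not the solver's actual API.</p>
      <preformat>
```python
import random

def with_restarts(search, initial, restarts=5, seed=0):
    """Compose a local-search routine via random restarts: run it from a
    fresh starting state each time and keep the best result found."""
    rng = random.Random(seed)
    best_state, best_cost = None, float("inf")
    for _ in range(restarts):
        state, cost = search(initial(rng), rng)
        if best_cost > cost:
            best_state, best_cost = state, cost
        if best_cost == 0:
            break                   # all constraints satisfied: stop early
    return best_state, best_cost

# toy demo: starting states 1, 2, 3, ...; "cost" is the distance from 3
starts = iter(range(1, 100))
best = with_restarts(lambda s, rng: (s, abs(s - 3)),
                     lambda rng: next(starts))
```
      </preformat>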
      <p>In order to test the effectiveness of the techniques above on the
scalability of the approach, we performed preliminary experiments, evaluating
the performance gain of DNR+IEC with respect to the naive evaluation of
moves and constraints. It is worth observing that our purpose was not to
beat state-of-the-art solvers (and in fact we do not even comment on
satisfaction), but rather to seek answers to the following questions: what is
the impact of DNR+IEC on: (i) the reduction of the number of moves to
evaluate; (ii) the overall performance of the greedy part of the search.</p>
      <p>We experimented with two problems, Graph colouring and University
timetabling, and two different greedy algorithms: steepest descent and the
violation-directed version of min-conflicts mentioned in Section 4. The
reasons behind our choices are as follows: steepest descent needs to evaluate,
at every iteration, all possible moves, hence it is a good indicator of the
impact of DNR+IEC in terms of effectiveness in neighbourhood reduction and
the consequent performance gain. On the other hand, min-conflicts needs
to evaluate a much smaller set of moves at any step: hence it can be much
more efficient than the former, especially on large scale instances. However,
we will see that also in this case DNR+IEC may play an important role.</p>
      <p>Experiments have been performed on an Athlon64 X2 Dual Core 3800+
PC with 4GB RAM, using MySQL DBMS v. 5.0.75, with no particular
optimisations. Instances of both problems have been solved running both
algorithms with and without DNR+IEC. In order to perform fair comparisons,
the two runs of the same algorithm on the same instance have been executed
by using the same random seed. As for steepest descent, this implies that
the two runs started from the same state, and executed the same sequence of
moves, terminating at the same local minimum. This does not hold for
min-conflicts, since when IEC is not activated, constraints are not materialised,
and the order with which tuples are retrieved cannot be controlled without
paying a very high price in terms of efficiency.</p>
      <p>
        As for the choice of problems (specifications are given in the appendix)
and instances: the specification for graph colouring is very simple and consists
of one guessed column-set and one constraint (see Example 3). Given its
compactness, it gives clean information about the impact of our techniques
on a per-constraint basis. We used some quite large benchmark instances
taken from http://mat.gsia.cmu.edu/COLOR/instances.html, looking for
minimal colourings, with number of needed colours known in advance. As
for University timetabling, in order to deal with real world instances, we
changed our formulation to that of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where schedulings are periodic, and
some data preprocessing is already done (e.g., constraint about equipment is
already reduced to a relation listing rooms suitable for the various courses).
Also, we had to ignore some constraints which would need non-conjunctive
queries, not yet supported by DNR+IEC. The final formulation resulted in
8 ConSQL constraints (taking into account course lengths, room
suitability, teachers’ periodic unavailabilities, conflicts among courses with common
students, and minimum course spread). As for the instances, we used some
of those of Track 3 of the 2007 Intl. Timetabling Competition (http://www.
cs.qub.ac.uk/itc2007), taken from http://tabu.diegm.uniud.it/ctt.
      </p>
      <p>Some typical-case results are reported in Table 2. Enabling DNR+IEC
considerably boosted performance: impressive speed-ups were experienced
on all instances, especially when all moves need to be evaluated (steepest
descent). Also in the cases where DNR+IEC runs did not terminate within the
1-hour timeout, the number of iterations executed is always much higher than
with DNR+IEC disabled. However, DNR+IEC brings advantages also when
complete knowledge of the possible moves is not needed (min-conflicts),
although speed-ups are unsurprisingly lower. Finally, we observed (see
Figure 4) the strong expected reductions of the size of the neighbourhood, thanks
to the action of DNR (steepest descent). In particular, DNR is able to filter
out often more than 30% of the moves at the beginning of the search, and constantly
more than 80% (with very few exceptions) when close to a local minimum.</p>
      <p>Given the promising, even if preliminary, results, we are working on extending DNR and IEC to constraints defined by queries with aggregates and groupings, on evaluating them with other heuristics, and on investigating database optimisations, e.g., automated index generation, in order to execute queries as efficiently as possible and tackle much larger scenarios, such as those mentioned in Section 1.</p>
      <p>Acknowledgements. This work has been partly carried out during visits of
T. Mancini to Uppsala University, supported by the Swedish Foundation for
Intl Cooperation in Research and Higher Education (STINT, grant
KG20034723) and Blanceflor Foundation. P. Flener and J. Pearson are supported
by grant 2007-6445 of the Swedish Research Council (VR).</p>
      <p>The authors thank the anonymous reviewers for their comments and suggestions.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hull</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vianu</surname>
          </string-name>
          .
          <source>Foundations of Databases. Addison Wesley</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ågren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flener</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pearson</surname>
          </string-name>
          .
          <article-title>Revisiting constraint-directed search</article-title>
          .
          <source>Information and Computation</source>
          ,
          <volume>207</volume>
          (
          <issue>3</issue>
          ):
          <fpage>438</fpage>
          -
          <lpage>457</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cadoli</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mancini</surname>
          </string-name>
          .
          <article-title>Combining Relational Algebra, sql, Constraint Modelling, and local search</article-title>
          .
          <source>Theory and Practice of Logic Programming</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          &amp;2):
          <fpage>37</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>De Cesco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Di</given-names>
            <surname>Gaspero</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Schaerf</surname>
          </string-name>
          .
          <article-title>Benchmarking curriculum-based course timetabling: Formulations, data formats, instances, validation, and results</article-title>
          .
          <source>In Proc. of PATAT'08</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Garey</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <article-title>Computers and Intractability: A Guide to the Theory of NP-Completeness</article-title>
          .
          <string-name>
            <given-names>W.H.</given-names>
            <surname>Freeman</surname>
          </string-name>
          &amp; Co.,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Stützle</surname>
          </string-name>
          .
          <article-title>Stochastic Local Search, Foundations and Applications</article-title>
          . Elsevier/Morgan Kaufmann,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Ternovska</surname>
          </string-name>
          .
          <article-title>A framework for representing and solving NP search problems</article-title>
          .
          <source>In Proc. of AAAI</source>
          <year>2005</year>
          , pages
          <fpage>430</fpage>
          -
          <lpage>435</lpage>
          ,
          <year>2005</year>
          . AAAI Press/MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          . Computational Complexity.
          <source>Addison Wesley</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          .
          <article-title>A survey of deductive database systems</article-title>
          .
          <source>J. of Logic Programming</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>125</fpage>
          -
          <lpage>149</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schaerf</surname>
          </string-name>
          .
          <article-title>A survey of automated timetabling</article-title>
          .
          <source>Artif. Intell. Review</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <fpage>87</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Terracina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Leone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Panetta</surname>
          </string-name>
          .
          <article-title>Experimenting with recursive queries in database and logic programming systems</article-title>
          .
          <source>Theory and Practice of Logic Programming</source>
          ,
          <volume>8</volume>
          (
          <issue>2</issue>
          ):
          <fpage>129</fpage>
          -
          <lpage>165</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Van Hentenryck</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Michel</surname>
          </string-name>
          .
          <article-title>Constraint-Based Local Search</article-title>
          . MIT Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rangan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Looks</surname>
          </string-name>
          .
          <article-title>Backbone guided local search for maximum satisfiability</article-title>
          .
          <source>In Proc. of IJCAI</source>
          <year>2003</year>
          , pages
          <fpage>1179</fpage>
          -
          <lpage>1186</lpage>
          ,
          <year>2003</year>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>