Constrained optimisation over massive databases

Toni Mancini1, Pierre Flener2, Amir Hossein Monshi2, and Justin Pearson2

1 Dipartimento di Informatica, Sapienza Università di Roma, Italy
tmancini@di.uniroma1.it
2 Department of Information Technology, Uppsala University, Sweden
Pierre.Flener,Justin.Pearson@it.uu.se; AmirHossein.Monshi.1351@student.uu.se

Abstract

Constrained optimisation is increasingly considered by industry as a major candidate for coping with hard problems of practical relevance. However, exploiting current Constraint or Mathematical Programming solvers in those scenarios has severe limitations, which clearly demand new methods: data is usually stored in potentially very large databases, and building a problem model in central memory suitable for current solvers may be very challenging or impossible. In this paper, we extend the approach followed in [3] by presenting a declarative language for constrained optimisation based on sql, together with novel techniques for local-search algorithms explicitly designed to handle massive data-sets. We also discuss and experiment with a solver implementation that, working on top of any DBMS, exploits such algorithms transparently to the user, allowing smooth integration of constrained optimisation into industrial environments.

1 Introduction

Constrained combinatorial optimisation is increasingly considered by industry as a major candidate for coping with hard problems of practical relevance, like scheduling, rostering, resource allocation, security, bio-informatics, etc.
However, exploiting current Constraint Programming (CP) or Mathematical Programming (MP) solvers in those scenarios has severe limitations, which clearly demand new methods:

• Problems of industrial relevance may easily exhibit sizes that are way out of the reach of current solvers: data usually resides on information systems, in the form of potentially very large, possibly distributed relational databases with complex integrity constraints, and cannot easily be transferred to central memory for processing.

• Moving data into central memory also potentially affects data integrity. Solving very large combinatorial problems is likely to need a huge amount of computing time, during which the original data must be kept locked, thus freezing the normal activities of the information system.

• In several scenarios, the structure of both the data and the problem is very articulated and complex. Although it is always possible to map them into the constructs offered by current constraint modelling languages, this further step must be carefully designed and implemented, being a real engineering activity. This is usually perceived as a major obstacle by industry programmers, who usually do not have CP/MP and abstraction skills.

Proceedings of the 16th International RCRA workshop (RCRA 2009): Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion. Reggio Emilia, Italy, 11–12 December 2009.

Several attempts to join combinatorial optimisation and databases have already been carried out in the literature, especially in the contexts of Deductive Database Systems (see, e.g., [9]) and Logic Programming (see, e.g., the system DLVDB described in [11]). As a different approach to cope with these issues and to provide effective means to spread combinatorial optimisation into industrial environments, in previous work [3] a non-deterministic extension of Relational Algebra (RA) has been proposed.
This language, called NP-Alg, is able to express all problems in the complexity class NP, where (the decision versions of) many problems of practical relevance lie. NP-Alg has been the formal basis for deriving an extension of the well-known database querying language sql, with similar characteristics, but also able to handle optimisation problems. This second language, called ConSQL (i.e., sql with constraints), proposes a new paradigm for modelling constrained optimisation problems, that of modelling as querying: a problem specification is given as a set of views devoted to encoding solutions, together with constraints that the contents of such views must satisfy. Of course, since we face optimisation problems whose decision versions belong to the complexity class NP, the “views” we are talking about are not first-order definable, hence not expressible in standard sql.

The advantages of this approach wrt the ones above are two-fold: (i) the modelling language is essentially sql, plus a few new and very intuitive constructs: industrial programmers do not have to master CP/MP to formalise their problems; (ii) the paradigm allows solving approaches (like the one we consider in this paper) that interact transparently with the DBMS and its mechanisms for data transfers, access policies, and data integrity, since the solving process, which is seen as a normal transaction by the DBMS, may easily recover from data updates.

In this paper, after recalling NP-Alg (Section 2), we make the following contributions:

1. We re-design our ConSQL language in order to obtain a language with very few deviations from standard sql (Section 3).

2. We reconsider the local-search [6] approach suggested in [3], and argue that, to succeed in solving large problems, classical local-search has to be re-thought, with an explicit focus on the handling of massive data.
In particular, we isolate two major issues in classical local-search and propose two methods to boost scalability: dynamic neighbourhood reduction (DNR) and incremental evaluation of constraints (IEC) (Section 4).

3. We provide a new implementation of the system, based on the ideas above and on a much wider portfolio of local-search algorithms, with full support for declaratively choosing the search strategy to follow, and experimentally evaluate it against the original one in terms of performance and scalability (Section 5).

2 The language NP-Alg

NP-Alg [3] is a non-deterministic extension of RA, suitable for modelling combinatorial decision problems.

Syntax. Let R be a relational schema, i.e., a set of relations of arbitrary finite arities. An NP-Alg expression is a pair ⟨S, fail⟩ where:

1. S = {S_1^(a_1), …, S_n^(a_n)} is a set of new relations of arbitrary finite arities a_1, …, a_n. These relations are called “guessed”. R and S are disjoint.

2. fail is an expression of plain RA on the new database vocabulary R ∪ S.

Semantics. Given a finite database D over schema R, let DOM_D be the unary relation representing the set of all constants occurring in D. The semantics of an NP-Alg expression ⟨S, fail⟩ is as follows:

1. For each possible extension of the relations in S with elements in DOM_D, the expression fail is evaluated, using the ordinary rules of RA.

2. If there exists one such extension for which fail evaluates to the empty relation ∅ (a situation denoted by fail = ∅), the answer to the decision problem is “yes”. Otherwise, the answer is “no”. When the answer is “yes”, such an extension of the relations in S is a solution of the problem instance.

Intuitively, fail encodes the constraints that extensions of S need to satisfy in order to be considered solutions. These are represented in terms of violations, hence the goal is to guess an extension of S that makes fail = ∅ (a dual approach, where constraints are defined in a ‘positive’ way, can however be followed).
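The guess-and-check semantics above can be made concrete with a small brute-force evaluator. The following Python sketch is a hypothetical toy illustration (useful only on tiny instances; scaling beyond them is exactly what Section 4 is about): it enumerates every extension of the unary guessed relations over DOM_D and accepts the first one on which fail evaluates to the empty relation. The instance data (N, E) and the fail expression mirror the Graph 3-colouring encoding of Example 1 below.

```python
from itertools import combinations, product

def powerset(dom):
    """All subsets of dom: the possible extensions of a unary guessed relation."""
    return [frozenset(c) for r in range(len(dom) + 1)
            for c in combinations(dom, r)]

def solve(dom, fail, n_guessed):
    """Semantics of an NP-Alg expression <S, fail>: try every extension of the
    unary guessed relations S1..Sn over DOM_D; the answer is "yes" iff some
    extension makes fail evaluate to the empty relation."""
    for ext in product(powerset(dom), repeat=n_guessed):
        if not fail(*ext):          # fail = empty set: ext is a solution
            return ext
    return None

# Example 1 (Graph 3-colouring): guessed relations R, G, B; fail as in the text.
N = {1, 2, 3}
E = {(1, 2), (2, 3)}

def fail_3col(R, G, B):
    cover = N - (R | G | B)                              # fail_cover
    disj = (R & G) | (R & B) | (G & B)                   # fail_disj
    col = {(u, v) for C in (R, G, B)
           for (u, v) in E if u in C and v in C}         # fail_col, joined with E
    return cover | disj | col

sol = solve(sorted(N), fail_3col, 3)
```

On this tiny graph the evaluator finds a proper 3-colouring; note that the enumeration is exponential in |DOM_D|, which is precisely why the paper turns to local search.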
NP-Alg has strong relationships with existential second-order logic (ESO) [8] and first-order model-expansion (FO-MX) [7], with guessed relations playing the role of quantified predicates in ESO and of expansion vocabulary symbols in FO-MX. It can be shown [3] that NP-Alg can express all (and only) the queries that specify decision problems belonging to NP.

In the following, we assume the standard version of RA [1] with the usual operators {σ, π, ×, −, ∪, ∩}, and positional notation ($1, $2, etc.) for the attributes of the various relations. Also, given a tuple τ with n components, we denote its i-th component by τ[$i], i ∈ [1..n], and the sub-tuple obtained from τ by considering only the components with indices from i to j (i ≤ j) by τ[$i..$j]. Finally, to ease notation, especially when expressing conditions over tuples defined over the Cartesian product of some relations, we may denote tuple indices by the name of the original relation they come from, followed by the relative column number. Hence, e.g., σ_{$1=$3}(R1 × R2) is equivalent to σ_{R1.$1=R2.$1}(R1 × R2) if R1 is binary.

Example 1 (Graph 3-colouring). Let D be a database with two relations N and E, encoding a graph: N is unary and encodes the set of nodes, while E is binary and encodes the set of edges of the graph, as pairs of nodes. Hence DOM_D = N. The NP-complete Graph 3-colouring problem [5] can be expressed in NP-Alg by the following expression:

1. S = {R, G, B}, all unary, encoding the sets of nodes to be coloured in red, green, and blue, respectively;

2. fail = fail_cover ∪ fail_disj ∪ π_{$1}(fail_col), where:

   fail_cover = N − (R ∪ G ∪ B),
   fail_disj = (R ∩ G) ∪ (R ∩ B) ∪ (G ∩ B),
   fail_col = ⋃_{C ∈ {R,G,B}} σ_{$1≠$2}(C × C) ∩ E.

For any guessed extension of R, G, B, fail_cover = ∅ iff every node is assigned at least one colour; similarly, fail_disj = ∅ iff every node is assigned at most one colour, and fail_col = ∅ iff no two adjacent nodes are assigned the same colour.
Hence, the answer is “yes” (fail = ∅ for some extension) iff there exist extensions for R, G, B encoding an assignment of colours to nodes such that no two adjacent nodes have the same colour. Such extensions are solutions to the problem instance.

3 ConSQL: modelling as querying

In this section we present a new and improved version of our ConSQL language [3]. ConSQL is a non-deterministic extension of sql, whose optimisation-free subset has the same expressive power as NP-Alg. ConSQL is a super-set of sql: users can hence safely exploit the rich set of language features of sql during problem modelling. The current version of the language has only very few deviations from standard sql, with the aim of greatly easing the modelling task of constrained optimisation problems for the average industry programmer. A detailed description of the syntax and semantics of ConSQL is given in the appendix (available at http://www.dis.uniroma1.it/~tmancini). Here, we introduce the language by an example.

Example 2 (University timetabling). Assume a university wants to solve the following course timetabling problem (a variant of the ones presented in [4], in order to show the flexibility of the modelling language), namely finding a schedule for the lectures of a set of courses, in a set of classrooms, minimising the total number of students that are enrolled in courses having overlapping lectures, while satisfying constraints about lecture allocation in time and rooms (con1, con2, con6), room equipment (con3), teachers' unavailability (con4) and conflicts (con7), and course lengths (con5).

Data reside in a relational database with the tables listed in Table 1 (primary keys are underlined; foreign key constraints are omitted for brevity). A ConSQL specification for this problem is as follows:

create SPECIFICATION Timetabling (

  // A guessed view (see new construct CHOOSE), encoding a 'guessed' timetable, as an
  // assignment of courses to room/period pairs (with
some possibly unassigned, i.e., null)
  create view TT as
    select p.id as p, r.id as r,
           CHOOSE(select id as c from Course) CAN BE NULL
    from Period p, Room r

  // A view defined in terms of TT (dependent guessed view)
  create view ConflCourses as
    select c1.id as c1, c2.id as c2
    from Course c1, Course c2, TT t1, TT t2
    where c1.id < c2.id and t1.c = c1.id and t2.c = c2.id and t1.p = t2.p

  // Objective function, relying on view ConflCourses
  MINIMIZE select count(*)
           from Student s, ConflCourses cc, Enrolled e1, Enrolled e2
           where e1.student = s.id and e1.course = cc.c1
             and e2.student = s.id and e2.course = cc.c2

Student(id, name, …)              Set of students, with id and other attributes
Teacher(id, name, …)              Set of teachers
Course(id, teacher, num_lect)     Set of courses to be scheduled, with teacher and number of lectures planned
Period(id, day, week, start, end) Set of time-periods, plus information on day, start and finish times
Room(id, capacity)                Available rooms with their capacity
Enrolled(student, course)         Enrolment of students in the various courses
Equip(id, name)                   Equipment available for teaching (e.g., VGA projector)
EquipAvail(equip, room)           Presence of equipment in the various rooms
EquipNeeded(course, equip)        Equipment needed for the various courses
TeacherUnav(teacher, period)      Unavailability constraints for teachers: lectures of their courses cannot be scheduled in these periods

Table 1: Relational database schema for the University timetabling problem.

  // Constraints:
  // con1. No two lectures of the same course on the same day
  check "con1" (not exists (
    select * from TT t1, TT t2, Period p1, Period p2
    where t1.c = t2.c and t1.c is not null and t1.p = p1.id and t2.p = p2.id
      and p1.day = p2.day and t1.p != t2.p ))

  // con2.
  // Capacity constraint, in terms of an ordinary SQL view
  create view Audience as
    select e.course as c, count(*) as nb_stud
    from Enrolled e group by e.course

  check "con2" ( not exists (
    select * from TT t, Room r, Audience a
    where t.r = r.id and t.c = a.c and r.capacity < a.nb_stud ))

  // con3–con5 (room equipment, teachers' unavailability, course lengths): …

  // con6. At most 3 lectures of the same course per week
  check "con6" (3 >= all(
    select count(*) from TT t, Period p
    where t.p = p.id and t.c is not null group by t.c, p.week ))

  // con7. Courses taught by the same teacher not in the same period
  check "con7" ( not exists (
    select * from TT tt1, TT tt2, Course c1, Course c2
    where c1.id < c2.id and c1.teacher = c2.teacher
      and tt1.c = c1.id and tt2.c = c2.id and tt1.p = tt2.p ))
);

As can be seen, a problem specification is defined in terms of the new construct create SPECIFICATION, embedding the definition of a set of views (via the standard sql create view construct), an optional objective function (via the new constructs MINIMIZE and MAXIMIZE), and a set of constraints (via the standard sql check keyword). Some of the views are guessed, in that they have special columns, called guessed column-sets. Such columns are defined by the new construct CHOOSE, which is the main enabler of the non-deterministic mechanism added to sql in order to increase its expressive power. The semantics is as follows:

• The solver is asked to populate non-deterministically the guessed column-sets of views, choosing tuples from the query argument of CHOOSE.

• This choice must be such that all constraints are satisfied and the objective function, if present, takes an optimal value.

In the example, our goal is to assign the lectures of some course to room/period pairs (guessed view TT), with some pairs possibly being empty (CAN BE NULL), in such a way that the number of students enrolled in courses with conflicting lectures is minimised and that all the constraints con1–con7 are satisfied.

p  r  c        p  r  c
p1 ra ?        p1 ra c3
p1 rb ?        p1 rb c1
p2 ra ?        p2 ra -
p2 rb ?        p2 rb c2
p3 ra ?        p3 ra -
p3 rb ?        p3 rb c2
   (a)            (b)

Figure 1: (a) Initialisation of guessed view TT on a sample database with 3 periods and 2 rooms. (b) A possible extension of TT if the database has ≥ 3 courses ('-' denotes null).

The 'residual' (i.e., the pure sql) part of view TT has schema (p, r), storing all period/room pairs. This is extended 'to the right' with a guessed column-set of width 1 having the schema (c) of the argument query. The overall view has schema (p, r, c). While the values in the first two columns are fixed (the period/room pairs), the entries of the third column are non-deterministically picked from the set of values occurring as ids of courses. Some of them can be assigned null, allowing us to model situations, like ours, where we need partial functions. Figure 1(a) shows guessed view TT at the initialisation stage, when the database holds three periods with ids p1, p2, p3, and two rooms with ids ra and rb. Figure 1(b) shows one possible extension of the guessed view, when the database stores courses having ids c1, c2, c3 (plus, possibly, others). Note that, if the modifier distinct were used, it would be impossible for two period/room pairs to be assigned the same course. This is analogous to the all-different constraint used in CP.

Extensions of guessed views need to satisfy all constraints in order to be considered solutions to the problem. Constraints are expressed by using the standard sql keyword check applied to an arbitrary sql Boolean expression (the various forms allowed by standard SQL to specify conditions, hence constraints, are described in the appendix). Finally, an objective function can be defined by using one of the two new keywords MINIMIZE and MAXIMIZE, applied to an arbitrary aggregate query returning a single scalar, defined over database tables, ordinary views, and guessed views.
4 Solving NP-Alg expressions over massive data

In principle, in order to solve a constrained optimisation problem, different competing techniques could be adopted, e.g., CP, Propositional Satisfiability (SAT), or MP. In the presence of massive data-sets, such complete approaches may show severe limitations, since they require the systematic exploration of a huge search space. In these cases, incomplete approaches like local-search [6] may help. These algorithms sacrifice completeness for efficiency, and may be advantageous especially on large instances, and in scenarios where computing a solution of sufficiently high quality is satisfactory. Local-search has been applied successfully to different formalisms, like SAT (cf. bgwalksat [13]) and CP (cf., e.g., comet [12]), and to the resolution of large scheduling and timetabling problems [10] and other large constrained optimisation problems [12]. In this section, after discussing some convenient syntactic extensions to NP-Alg, and recalling the main notions of classical local-search, we present novel techniques to make local-search applicable to massive data-sets.

NP-Alg revisited. The language described in Section 2 can be extended, without affecting its expressive power, with means to support bounded integers and arithmetic, typed guessed relations, guessed functions, and guessed permutations. As an example, although in the base version of the language the entries of guessed relations take values ranging in DOM_D (hence extensions of an r-ary relation range over subsets of DOM_D^r), in a (just syntactically) enriched version we could easily force them to be typed and also to model functions [3]. In this way we could easily model, e.g., the Graph k-colouring problem, where the set of the k available colours is given in the additional unary relation K, as follows:

Example 3 (Graph k-colouring).

1. Guessed relation: Col : N → K, a total function.

2. fail = σ_{$1≠$3 ∧ $2=$4}(Col × Col) ∩ E.

Guessed function Col is represented as a relation having two typed columns, $1 and $2, with entries in $1 ranging over N and those in $2 ranging over K. Being also forced to be a total function, Col is such that any node in N occurs exactly once in the first column of Col. The expression for fail evaluates to ∅ iff the guessed extension of Col is such that no two different nodes ($1 ≠ $3) assigned to the same colour ($2 = $4) share an edge.

It can be shown that the enriched language where guessed relations are always forced to be typed total functions between two unary database relations has the same expressive power as the base language. This is because: (i) tuples of non-unary relations can always be denoted by new constants which act as keys; and (ii) arbitrary guessed relations can always be encoded by characteristic functions. Also, the expression fail can always be rewritten in disjunctive form, as a union of sub-expressions. Since this variation is syntactically much closer to our target practical language ConSQL, in the remainder of this paper we will refer exclusively to it. Hence, from now on, we rely on the following definition:

Definition 1 (NP-Alg expression). An NP-Alg expression ⟨S, fail⟩ over a relational schema R is defined as:

1. A set S = {S_1, …, S_n} of new binary relations, with each S_i : D_i → C_i forced to be a total function with domain D_i and co-domain C_i, both these relations being unary and belonging to R.

2. An expression fail = ⋃_{i=1}^{k} fail_i of plain RA on the vocabulary R ∪ S.

The expressions fail_i encode the constraints to satisfy, with their tuples, for a given extension of the guessed functions, representing constraint violations.

The gap between NP-Alg and ConSQL is now much smaller (see Example 3): guessed functions in NP-Alg neatly correspond to guessed column-sets of ConSQL, and the expressions fail_i correspond to check not exists constraints. Extensions like having multiple guessed column-sets in the same view, null values, and non-unary relations serving as domains and co-domains are not substantial, and could be simulated by means of additional joins, special values in the co-domain relations, and the introduction of unary key fields.

Classical local-search. The following notions play key roles in all local-search algorithms. We recall them in terms of an NP-Alg expression ⟨S, fail⟩ evaluated over a finite database over schema R.

Definition 2 (States, search space, costs, solutions). A state, or candidate solution, is an extension S of the guessed functions S. The search space is the set of all possible states. The cost of a state S is Σ_{i=1}^{k} |fail_i|, i.e., the total number of tuples returned by the sub-expressions fail_i when evaluated on S. A solution is a state of cost 0.

Most of the local-search algorithms described in the literature [6] have, at their core, the following greedy behaviour:

  current-state ← Init-State()
  while not Termination-Condition() do
    move ← choose-improving-move(current-state)
    if move ≠ nil then current-state ← make-move(move)
    else break

The idea is, after a (usually random) initialisation, to iteratively evaluate and perform small changes (moves) to the current state, in order to reduce the cost of the current candidate solution, until some termination condition (in terms of, e.g., the cost of current-state or the amount of computational resources used) is met. The universe of possible moves is chosen in advance by the programmer from the structure of the problem. Although this choice could in principle be arbitrary, in local-search moves are mostly defined to involve a very small change to the current state. The simple structure of guessed functions in NP-Alg suggests a very natural definition of atomic moves, while any other possible definition could be given just by composition.

Definition 3 (Move). A triple ⟨S, d, c⟩ with S ∈ S (S : D → C), d ∈ D, c ∈ C.
Hence, given any state S of S and the corresponding extension S of guessed function S, a move δ = ⟨S, d, c⟩ changes the extension S of S by modifying to c the co-domain value assigned to domain value d. We denote the state reached from S after performing δ by S ⊕ δ. The new extension of S is denoted by S ⊕ δ. A move δ executed on state S is improving, neutral, or worsening iff the cost of state S ⊕ δ is, respectively, less than, equal to, or greater than the cost of state S. Also, given a constraint i, move δ executed on S is improving, neutral, or worsening wrt constraint i iff the share of the cost of state S ⊕ δ due to constraint i (i.e., |fail_i|) is, respectively, less than, equal to, or greater than that of state S.

The choice of the moves to perform usually varies, depending on the particular algorithm considered. As an example, gradient descent chooses an improving move randomly, while steepest descent chooses the move that, if performed, maximally reduces the cost of the new current state. Since steepest descent needs to consider all possible moves in order to choose the best one, it can be very inefficient on large-scale problems. A third very successful algorithm is min-conflicts, which, once adapted to the structure of the search space in the NP-Alg framework, randomly selects one guessed function S : D → C and one of its domain values d ∈ D, and then chooses the best improving move among all those regarding S and d. All these greedy algorithms return nil if no moves lead to a cost reduction, and hence are easily trapped in so-called local minima, i.e., states that have cost lower than all their neighbours (the states that could be reached by executing a single move). To avoid being stuck in local minima, and to be able to continue the search towards better states, these basic algorithms are usually enhanced with more sophisticated (not purely greedy) techniques.
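The greedy core recalled above, instantiated for the NP-Alg search space (a state is a total function D → C, a move reassigns a single domain value, cf. Definition 3), can be sketched in a few lines of Python. This is a hypothetical illustration in the min-conflicts flavour, not the system's implementation; the toy instance encodes graph 3-colouring as a guessed function Col : N → K, as in Example 3.

```python
import random

def local_search(domain, codomain, cost, max_steps=10_000, seed=0):
    """Greedy local search over a single guessed function S: domain -> codomain.
    A state is a dict; a move <S, d, c> reassigns one domain value.
    Min-conflicts flavour: pick a random d, take the best move regarding d."""
    rng = random.Random(seed)
    state = {d: rng.choice(codomain) for d in domain}   # random initialisation
    for _ in range(max_steps):
        if cost(state) == 0:                            # solution found
            return state
        d = rng.choice(domain)
        best = min(codomain, key=lambda c: cost({**state, d: c}))
        if cost({**state, d: best}) <= cost(state):     # accept non-worsening moves
            state[d] = best
    return state if cost(state) == 0 else None

# Toy instance: 3-colouring of a triangle plus a pendant node.
E = [(1, 2), (2, 3), (1, 3), (3, 4)]
def n_conflicts(col):              # the cost |fail|: monochromatic edges
    return sum(col[u] == col[v] for (u, v) in E)

sol = local_search([1, 2, 3, 4], ["red", "green", "blue"], n_conflicts)
```

Accepting neutral (sideways) moves lets the search walk across plateaus, which is one simple way to escape the local minima discussed above; the more robust escapes (annealing, tabu, restarts) follow in the text.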
Simulated annealing, for example, also executes worsening moves, with a probability decreasing over time, while tabu-search dynamically filters the set of available moves, by forbidding those that would bring the search back to recently visited states. Also, to increase robustness, such algorithms are enhanced with random restarts, joined together in batches, and/or wrapped by even more sophisticated meta-heuristics [6].

Classical local-search algorithms typically use the problem constraints only to evaluate the quality of a move, i.e., its impact on the cost of the search state reached if the move is performed. In the context of constrained optimisation over massive data, this approach quickly becomes impractical.

Example 4 (University timetabling, continued). Assume that the database relations have sizes |Room| = r, |Period| = p, |Course| = c. By Definition 3, a steepest descent algorithm needs, at any step, to evaluate a constant number prc of candidate moves (each period/room pair can be reassigned to one of the other c − 1 courses, plus null) to find and choose the one that maximally reduces the cost of the new current state. With a naive approach, for any such move, 7 sql queries, one for each constraint, would have to be evaluated on the new database obtained by changing the course assigned to one room/period pair. Even in not-so-large scenarios, e.g., p = 960 (8 periods/day, 5 days/week for 6 months; observe that the schedule is not periodic, given, e.g., teachers' unavailabilities), r = 30, c = 40, the number of candidate moves to consider at each step is over 10^6. Now, assuming that each query takes only 10^-3 seconds to run, and even ignoring the time for modifying and rolling back the database each time, a naive version of steepest descent would need about 10^6 · 7 · 10^-3 = 7 · 10^3 seconds to execute one single move!

Constraint-directed dynamic neighbourhood reduction (DNR).
To overcome these difficulties, the concept of constraint-directed neighbourhoods has recently been proposed [2]. The idea is to use the problem constraints also to isolate subsets of the moves that possess some useful properties, such as those that are guaranteed to be improving wrt one or more constraints. These ideas can be exploited and extended in our context in a very elegant and natural way, thanks to the great simplicity of NP-Alg.

Consider an NP-Alg expression ⟨S, fail⟩, where fail = ⋃_{i=1}^{k} fail_i and all fail_i are conjunctive queries, i.e., of the form fail_i = σ_{φ_i}(S_{i1} × ··· × S_{is} × R_{i1} × ··· × R_{ir}), with S_{ij} ∈ S (j ∈ [1..s]) and R_{ip} ∈ R (p ∈ [1..r]). In ConSQL this is equivalent to saying that constraints can be expressed as not exists select-from-where queries, as is often the case. Let us now focus on a single constraint i. Assume that, in a given state S, fail_i evaluates to a non-empty set of tuples V_i (which can be regarded as the violations of constraint i). In order to find a solution we need to resolve all violations of all constraints. Let us limit our attention to the extension S ∈ S of a distinguished guessed function S (a binary relation encoding a function S : D → C, with D, C unary relations in R) occurring m ≥ 1 times in the expression for fail_i. To emphasise the m occurrences of S, we can, wlog, rewrite fail_i as follows (after also changing the references to columns in φ_i accordingly):

fail_i = σ_{φ_i}(S^(1) × ··· × S^(m) × T),    (1)

with S^(1), …, S^(m) being the m occurrences of S, and T the Cartesian product of all the relations other than S in the constraint expression. Since a move over S improving wrt constraint i must resolve at least one violation in V_i ⊆ S^(1) × ··· × S^(m) × T, we give the following definition:

Definition 4 (Promising moves).
The set of promising moves over guessed function S wrt constraint i is defined by the following RA query:

Prom^S_i = π_{D,C}( σ_{χ ∧ ¬φ'_i}( V_i × D × C ) )    (2)

where χ = ⋁_{j=1}^{m} (D.$1 = V_i.$(2j−1)), and φ'_i is obtained from φ_i by replacing any atom a in which S^(x).$2 occurs (x ∈ [1..m]) with

((S^(x).$1 = D.$1) → a') ∧ ((S^(x).$1 ≠ D.$1) → a),

where a' is obtained from a by replacing any occurrence of S^(x).$2 with C.$1.

Despite its apparent complexity, the meaning of formula (2) is quite intuitive. A generic tuple λ in V_i has the form ⟨d_1, c_1, …, d_m, c_m, t⟩, with d_1, …, d_m ∈ D, c_1, …, c_m ∈ C, and t a sub-tuple over schema T. Prom^S_i computes moves over S (hence pairs δ = ⟨d, c⟩ ∈ D × C, cf. the projection operator) such that:

(i) The domain value d of δ is equal to at least one among d_1, …, d_m of at least one violation λ ∈ V_i (condition χ); and

(ii) For such a λ, the tuple λ', built by replacing in λ any component c_j with the new value c iff the corresponding d_j is equal to d, does not satisfy condition φ_i.

Since in RA we cannot modify tuples when evaluating a condition, in (2) we take the alternative approach of changing the condition itself into φ'_i. Requirement (i) guarantees that the synthesised moves involve only domain values that appear in violations λ ∈ V_i, hence, if performed, resolve at least λ. Requirement (ii), instead, ensures that, after performing δ, the newly introduced tuple in S, i.e., δ, does not lead to a new violation λ', at least with tuple t. The following result holds (proofs omitted for space reasons):

Theorem 1. In any state S, for any constraint i, and for any S ∈ S having extension S in S, all moves δ over S that, if executed from S, are improving wrt constraint i belong to Prom^S_i.
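To make Definition 4 concrete, the following hypothetical Python sketch instantiates it for the single constraint of Example 3 (fail = σ_{$1≠$3 ∧ $2=$4}(Col × Col) ∩ E, so m = 2 and T = E): a move ⟨d, c⟩ is promising iff d occurs in some violation (condition χ) and re-colouring d with c falsifies the selection condition on that violation (condition ¬φ'). As Theorem 1 states, the set contains all improving moves but possibly also non-improving ones: below, ⟨2, "g"⟩ resolves the displayed violation yet creates a new one on edge (2, 3).

```python
def violations(col, E):
    """fail = sigma_{$1!=$3 and $2=$4}(Col x Col), joined with E: the
    violations, as tuples <u, col(u), v, col(v)> (the two occurrences of Col)."""
    return [(u, col[u], v, col[v]) for (u, v) in E if u != v and col[u] == col[v]]

def promising_moves(col, E, colours):
    """Prom over Col: moves <d, c> whose domain value occurs in some
    violation (chi) and that falsify the condition on it (not phi')."""
    prom = set()
    for (u, cu, v, cv) in violations(col, E):
        for d in (u, v):                   # chi: d matches the $1 of an occurrence
            for c in colours:
                # phi' on this violation, with Col(d) replaced by c:
                new_cu = c if u == d else cu
                new_cv = c if v == d else cv
                if new_cu != new_cv:       # not phi': the violation is resolved
                    prom.add((d, c))
    return prom

E = [(1, 2), (2, 3)]
col = {1: "r", 2: "r", 3: "g"}             # one violation: edge (1, 2)
moves = promising_moves(col, E, ["r", "g", "b"])
```

Only 4 of the 12 possible moves survive: re-colouring node 1 or node 2 away from "r". A steepest-descent step now inspects this small set instead of the full neighbourhood.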
In other words, since a move is improving iff it is improving wrt at least one constraint, by computing, in the current state, Prom^S_i for all guessed functions S and all constraints i, we obtain a subset of the possible moves that contains all the improving moves, filtering out many (hopefully most) of the others. To better appreciate the advantages of this computation, let us reconsider the University timetabling problem.

View Audience (Figure 2(a)):
c   nb_stud
c1  27
c2  26
c3  37
c4  35
c5  48
c6  43
c7  67

Portion of V_con2 (Figure 2(c)):
t.p  t.r  t.c   r.id  r.capacity   a.c  a.nb_stud
p2   r2   c6    r2    40           c6   43
p3   r2   c7    r2    40           c7   67

Figure 2: (a) View Audience. (b) A portion of the extension of guessed view TT in the current state. (c) Portion of V_con2 due to the given portion of TT. (d) Promising moves over TT computed starting from the displayed portion of V_con2.

Example 5 (University timetabling, continued). Assume that the problem instance refers to periods p1–p40, to rooms r1, r2, r3 with capacities of 30, 40, 50 students, respectively, and to courses c1–c7. The number of students enrolled in the various courses is given by view Audience in Figure 2(a). Suppose now that, in the current state, whose extension for TT is partially given in Figure 2(b), the expression for con2 evaluates to the set of violations V_con2 of Figure 2(c). The tuples in V_con2 give us precious information to drive the search. As an example, the first tuple shows that the current candidate timetable has the problem that course c6 is given in period p2 in a room, r2, which is too small. Analogously for the other tuples.
The query for Prom^TT_con2 (once generalised to handle domain and co-domain relations of arity greater than 1) returns the set of moves that resolve at least one violation in V_con2 without producing a new one, at least with the same tuples of table Room and view Audience. Figure 2(d) shows the part of the result of Prom^TT_con2 containing the moves synthesised from the first violation. Note that, with the naive approach, prc = 40 · 3 · 7 = 840 moves would need to be considered. With the presented dynamic reduction, instead, at most 7 moves for each violation of con2 need to be considered. Hence, the more progressed the greedy search, and the closer we are to a solution, the fewer the moves that need to be considered.

DNR can be exploited also when we do not need knowledge of all the improving moves. Consider, e.g., min-conflicts: at any iteration this algorithm randomly selects one guessed function S : D → C and one of its domain values d ∈ D, and then chooses the best improving move among all those regarding S and d. Given that, for a move to be improving, it needs to resolve at least one violation of one constraint, we can design a violation-directed (and generalised) version of min-conflicts in such a way that it first selects a random violation λ ∈ ⋃_{i=1}^{k} V_i, and then, for each pair of components ⟨d_j, c_j⟩ ∈ λ caused by an occurrence of some guessed function S_j : D_j → C_j in the definition of the corresponding constraint, it considers as promising all moves of the kind ⟨S_j, d_j, c'⟩, with c' ∈ C_j, c' ≠ c_j, i.e., all moves that would make λ disappear in the target state. We omit a formal discussion of this variation for space reasons, but we will describe some preliminary experimental results in Section 5.

Incremental evaluation of constraints (IEC).
Although DNR allows us to reduce the set of moves that need to be considered during a greedy search, it tells us nothing about the quantitative impact (in terms of cost change) of the surviving moves over constraint i and the other constraints. To measure this, the naive algorithm would execute each such move, evaluate all problem constraints over the new database, and then roll back.

In the important and usual case of constraints of the form (1), the following results show that: (i) we can compute the exact cost variation of all moves in a given set by running only 2 queries for each constraint; and (ii) after one such move has been chosen, we can incrementally update its impact in terms of violations added and resolved for all the constraints.

Definition 5 (Resolved^S_i and Added^S_i). Given an NP-Alg expression ⟨S, fail⟩, one of the guessed functions S ∈ S, and one sub-expression fail_i having the form (1) and evaluating to V_i in state S, let M_S be a relation over the schema of S encoding an arbitrary set of moves over S. We define the queries:

  Resolved^S_i = σ_χ′(M_S × V_i),
  Added^S_i = Ψ_ξ(σ_{χ″ ∧ φ′_i}(M_S × (S^(1) × ··· × S^(m) × T))),

where χ′ = ⋁_{j=1..m} (M_S[$1] = V_i[$(2j − 1)]), χ″ = ⋁_{j=1..m} (M_S[$1] = S^(j)[$1]), and φ′_i is as in Definition 4.

The operator Ψ in Added^S_i is a non-standard replacement operator (which could easily be simulated in sql by means of functions and if statements in the target-list) that changes each tuple τ of the input expression into ξ(τ). Here, ξ(τ) produces tuple τ′ as follows: for all i ∈ [1..m],

  τ′[S^(i).$1, S^(i).$2] = τ[M_S.$1, M_S.$2]      if τ[S^(i).$1] = τ[M_S.$1],
  τ′[S^(i).$1, S^(i).$2] = τ[S^(i).$1, S^(i).$2]   otherwise,

while for any other component $j of τ, we have τ′[$j] = τ[$j].

These queries play a key role in the incremental evaluation of constraints:

Theorem 2 (Incremental evaluation of constraints).
Given S, V_i and M_S as in Definition 5, in the state S ⊕ δ reached after performing any move δ ∈ M_S from state S, the expression fail_i for each constraint i evaluates to:

  (V_i − π_{−M_S}(σ_{M_S=δ}(Resolved^S_i))) ∪ π_{−M_S}(σ_{M_S=δ}(Added^S_i)),

with π_{−M_S} denoting the projection operator that filters out the columns due to M_S. Also, Resolved^S_i ∩ Added^S_i = ∅.

Theorem 2 tells us that, given an arbitrary set of moves over a guessed function S that occurs in the expression fail_i, we can incrementally compute the result of fail_i in all the states reached by performing each of these moves, given the result in the current state. In particular, by taking M_S = Prom^S_i, we can compute the exact impact, both in terms of cost variation and in terms of the extension of V_i, of all promising moves.

Query Resolved^S_i returns a set of tuples, each representing a move δ = ⟨d, c⟩ in M_S concatenated with a violation in V_i in which value d occurs in the first column of at least one S^(j). Changing the co-domain value associated with d to c will make this violation disappear from V_i in the target state. As for Added^S_i, it computes the contributions, in terms of tuples added to the result of the query fail_i, of the execution of each move δ ∈ M_S. Such contributions are tuples of the Cartesian product in which value d occurs at least once (condition χ″), 'modified' by properly taking into account the new value c that the moves under consideration assign.

[Figure 3: Resolved^TT_con2 (a) and Added^TT_con2 (b) for some moves in Prom^TT_con2 of Figure 2(d).]
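A set-level reading of Definition 5 and Theorem 2, specialised to con2, can be sketched in Python. This is a simplification under assumed representations (a violation as a (slot, course) pair, a move as a (slot, new_course) pair); the `resolved`, `added`, and `violations_after` helpers are hypothetical stand-ins for the actual RA queries.

```python
# Toy data in the spirit of Example 6 (hypothetical names and values).
rooms = {"r1": 30, "r2": 40, "r3": 50}
audience = {"c1": 27, "c5": 48, "c6": 43, "c7": 67}

def resolved(moves, violations):
    # Resolved^S_i: each move paired with every current violation on the
    # same slot -- those violations disappear in the target state.
    return [(m, v) for m in moves for v in violations if m[0] == v[0]]

def added(moves):
    # Added^S_i: the violations each move introduces (the new course is
    # still too large for the room of the slot it is moved into).
    return [(m, (m[0], m[1])) for m in moves
            if m[1] is not None and audience[m[1]] > rooms[m[0][1]]]

def violations_after(V, delta, res, add):
    # Theorem 2: V_i in state S ⊕ δ equals (V_i minus what δ resolves)
    # union (what δ adds).
    return (V - {v for (m, v) in res if m == delta}) \
           | {v for (m, v) in add if m == delta}

V = {(("p2", "r2"), "c6"), (("p3", "r2"), "c7")}
moves = [(("p2", "r2"), "c1"), (("p2", "r2"), "c5")]
res, add = resolved(moves, V), added(moves)
# Move (p2, r2, c1) resolves one violation and adds none; (p2, r2, c5)
# resolves one but adds another, hence is neutral wrt con2.
print(violations_after(V, (("p2", "r2"), "c1"), res, add))
```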
Again, since in RA we cannot modify tuples under evaluation, we take the dual approach of changing the condition from φ_i to φ′_i. However, since we must also return the modified tuples, we need to use the non-standard Ψ operator, which changes the tuples in the input relations in the same way as done by φ′_i. Moreover, given that Resolved^S_i ∩ Added^S_i = ∅, just by grouping and counting tuples in those results, we can compute with one additional query the exact cost of all moves in set M_S over the constraints. Also, if all the V_i are materialised, we obtain incremental maintenance of the results of the corresponding fail_i by deleting and inserting the tuples returned by these queries.

Example 6 (University timetabling, continued). Figure 3 shows the results of computing queries Resolved^TT_con2 (part (a)) and Added^TT_con2 (part (b)) regarding the subset of moves in Prom^TT_con2 explicitly shown in Figure 2(d). As an example, move (p2, r2, c1) will reduce the cost of con2 by 1, since it occurs once in Resolved^TT_con2 (see Figure 3(a)) and does not occur in Added^TT_con2 (see Figure 3(b)). Analogously, move (p2, r2, c5) proves to be neutral wrt con2.

Some brief comments follow on the complexity of the three queries. For each guessed function S and constraint i, and assuming that V_i is materialised and incrementally updated thanks to the results of Theorem 2, we have that: (i) Prom^S_i can be computed with two joins, the first of which, given the tight condition χ, does not lead to an explosion of the number of tuples (any tuple in V_i can match at most m tuples of D, with m being usually very small); (ii) Resolved^S_i involves a single join operation; (iii) Added^S_i involves one more join operation than the expression of constraint i. Also in this case, the presence of the tight condition χ″ avoids in practice a combinatorial explosion of the number of tuples due to the added join.
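The "grouping and counting" step just mentioned might look as follows: the cost change of each move over one constraint is the number of violations it adds minus the number it resolves (in sql, a GROUP BY with COUNT over the two query results). The pair lists of shape (move, violation) and the toy values are assumptions mirroring Example 6.

```python
from collections import Counter

def move_costs(moves, resolved_pairs, added_pairs):
    # Cost change per move: #added violations minus #resolved violations.
    delta = Counter({m: 0 for m in moves})
    delta.update(m for (m, _) in added_pairs)
    delta.subtract(m for (m, _) in resolved_pairs)
    return dict(delta)

# Example 6: move (p2, r2, c1) resolves one violation and adds none,
# while (p2, r2, c5) resolves one and adds one.
moves = [(("p2", "r2"), "c1"), (("p2", "r2"), "c5")]
res = [(m, (("p2", "r2"), "c6")) for m in moves]
add = [((("p2", "r2"), "c5"), (("p2", "r2"), "c5"))]
costs = move_costs(moves, res, add)
print(costs[(("p2", "r2"), "c1")], costs[(("p2", "r2"), "c5")])  # -1 0
```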
5 Implementation, experiments, future work

We implemented a new ConSQL solver based on the ideas presented in Section 4, built on top of a relational DBMS. The system, written in Java, takes as input a problem specification in ConSQL and uses standard sql commands for choosing, evaluating, and performing moves, hence being independent of the DBMS used. The user can also annotate her specification with a declarative description of the search strategy to be used, given in terms of the local-search algorithms (and parameters) to run (gradient descent, steepest descent, tabu search, min-conflicts, simulated annealing) and their composition (via random restarts or batches). Details are given in the appendix.

In order to test the effectiveness of the techniques above on the scalability of the approach, we performed preliminary experiments, evaluating the performance gain of DNR+IEC with respect to the naive evaluation of moves and constraints. It is worth observing that our purpose was not to beat state-of-the-art solvers (and in fact we do not even comment on satisfaction), but rather to seek answers to the following questions: what is the impact of DNR+IEC on (i) the reduction of the number of moves to evaluate, and (ii) the overall performance of the greedy part of the search?

We experimented with two problems, Graph colouring and University timetabling, and two different greedy algorithms: steepest descent and the violation-directed version of min-conflicts mentioned in Section 4. The reasons behind our choices are as follows: steepest descent needs to evaluate, at every iteration, all possible moves, hence it is a good indicator of the impact of DNR+IEC in terms of effectiveness in neighbourhood reduction and the consequent performance gain. On the other hand, min-conflicts needs to evaluate a much smaller set of moves at any step: hence it can be much more efficient than the former, especially on large-scale instances.
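The steepest-descent regime described above (evaluate the whole promising neighbourhood at every iteration, take the best move, stop at a local minimum) can be sketched generically; the `neighbourhood`, `move_costs`, and `apply_move` callbacks are hypothetical stand-ins for the DBMS-side queries.

```python
def steepest_descent(state, neighbourhood, move_costs, apply_move):
    # Evaluate every candidate move, pick the best strictly improving
    # one, and stop when no move has a negative cost change.
    while True:
        moves = neighbourhood(state)
        costs = move_costs(state, moves)
        best = min(moves, key=costs.__getitem__)
        if costs[best] >= 0:
            return state            # local minimum reached
        state = apply_move(state, best)

# Smoke test on a one-dimensional toy problem: minimise f(x) = (x - 3)^2.
f = lambda x: (x - 3) ** 2
result = steepest_descent(
    0,
    neighbourhood=lambda x: [x - 1, x + 1],
    move_costs=lambda x, ms: {m: f(m) - f(x) for m in ms},
    apply_move=lambda x, m: m,
)
print(result)  # 3
```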
However, we will see that also in this case DNR+IEC may play an important role.

Experiments have been performed on an Athlon64 X2 Dual Core 3800+ PC with 4GB RAM, using the MySQL DBMS v. 5.0.75, with no particular optimisations. Instances of both problems have been solved running both algorithms with and without DNR+IEC. In order to perform fair comparisons, the two runs of the same algorithm on the same instance have been executed using the same random seed. For steepest descent, this implies that the two runs started from the same state, executed the same sequence of moves, and terminated at the same local minimum. This does not hold for min-conflicts, since, when IEC is not activated, constraints are not materialised, and the order in which tuples are retrieved cannot be controlled without paying a very high price in terms of efficiency.

As for the choice of problems (specifications are given in the appendix) and instances: the specification for graph colouring is very simple and consists of one guessed column-set and one constraint (see Example 3). Given its compactness, it gives clean information about the impact of our techniques on a per-constraint basis. We used some quite large benchmark instances taken from http://mat.gsia.cmu.edu/COLOR/instances.html, looking for minimal colourings, with the number of needed colours known in advance. As for University timetabling, in order to deal with real-world instances, we changed our formulation to that of [4], where schedules are periodic and some data preprocessing is already done (e.g., the constraint about equipment is already reduced to a relation listing the rooms suitable for the various courses). Also, we had to ignore some constraints which would need non-conjunctive queries, not yet supported by DNR+IEC.
The final formulation resulted in 8 ConSQL constraints (taking into account course lengths, room suitability, teachers' periodic unavailabilities, conflicts among courses with common students, and minimum course spread). As for the instances, we used some of those of Track 3 of the 2007 Intl. Timetabling Competition (http://www.cs.qub.ac.uk/itc2007), taken from http://tabu.diegm.uniud.it/ctt.

Some typical-case results are reported in Table 2. Enabling DNR+IEC considerably boosted performance: impressive speed-ups were experienced on all instances, especially when all moves need to be evaluated (steepest descent). Even in the cases where the DNR+IEC runs did not terminate within the 1-hour timeout, the number of iterations executed is always much higher than with DNR+IEC disabled. However, DNR+IEC brings advantages also when complete knowledge of the possible moves is not needed (min-conflicts), although the speed-ups are unsurprisingly lower. Finally, we observed (see Figure 4) the strong expected reductions of the size of the neighbourhood, thanks to the action of DNR (steepest descent). In particular, DNR often filters out more than 30% of the moves at the beginning of the search, and constantly more than 80% (with very few exceptions) when close to a local minimum.

Given the promising, even if preliminary, results, we are working on extending DNR and IEC to constraints defined by queries with aggregates and groupings, on evaluating them on other heuristics, and on investigating database optimisations, e.g., automated index generation, in order to execute queries as efficiently as possible and tackle much larger scenarios, such as those mentioned in Section 1.

Acknowledgements. This work has been partly carried out during visits of T. Mancini to Uppsala University, supported by the Swedish Foundation for Intl Cooperation in Research and Higher Education (STINT, grant KG2003-4723) and the Blanceflor Foundation. P. Flener and J.
Pearson are supported by grant 2007-6445 of the Swedish Research Council (VR). The authors thank the anonymous reviewers for their comments and suggestions.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] M. Ågren, P. Flener, and J. Pearson. Revisiting constraint-directed search. Information and Computation, 207(3):438–457, 2009.
[3] M. Cadoli and T. Mancini. Combining Relational Algebra, sql, Constraint Modelling, and Local Search. Theory and Practice of Logic Programming, 7(1&2):37–65, 2007.
[4] F. De Cesco, L. Di Gaspero, and A. Schaerf. Benchmarking curriculum-based course timetabling: Formulations, data formats, instances, validation, and results. In Proc. of PATAT'08, 2008.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.
[6] H. H. Hoos and T. Stützle. Stochastic Local Search: Foundations and Applications. Elsevier/Morgan Kaufmann, 2004.
[7] D. Mitchell and E. Ternovska. A framework for representing and solving NP search problems. In Proc. of AAAI 2005, pages 430–435. AAAI Press/MIT Press, 2005.
[8] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[9] R. Ramakrishnan and J. D. Ullman. A survey of deductive database systems. Journal of Logic Programming, 23(2):125–149, 1995.
[10] A. Schaerf. A survey of automated timetabling. Artificial Intelligence Review, 13(2):87–127, 1999.
[11] G. Terracina, N. Leone, V. Lio, and C. Panetta. Experimenting with recursive queries in database and logic programming systems. Theory and Practice of Logic Programming, 8(2):129–165, 2008.
[12] P. Van Hentenryck and L. Michel. Constraint-Based Local Search. MIT Press, 2005.
[13] W. Zhang, A. Rangan, and M. Looks. Backbone guided local search for maximum satisfiability. In Proc. of IJCAI 2003, pages 1179–1186. Morgan Kaufmann, 2003.
Graph colouring (sec (#iter); '–' denotes the 1-hour timeout)

                                 Steepest descent                   Min-conflicts
instance    nodes edges cols    on         off        speedup    on        off        speedup
anna         138   493   11    39 (21)    2682 (21)    69x       11 (14)    38 (33)    1.5x
games120     120   638    9   502 (40)       – (37)     8x       31 (48)    43 (48)    1.4x
homer        561  1629   13   247 (20)       – (2)    146x       36 (22)    53 (22)    1.5x
le450_5a     450  5714    5     – (140)      – (1)    140x      112 (22)  1667 (240)   1.4x
miles1000    128  3216   42   286 (38)       – (1)    478x      418 (25)  1266 (60)    1.3x
miles250     128   387    8    10 (28)    1734 (28)   173x        6 (10)    20 (36)    0.9x
miles500     128  1170   20    38 (28)       – (7)    379x       41 (14)   173 (46)    1.3x
miles750     128  2113   31   143 (36)       – (2)    453x      199 (25)   479 (46)    1.3x
mulsol.i.1   197  3925   49     – (4)        – (0)      ?       637 (36)  1083 (47)    1.3x
mulsol.i.2   188  3885   31  2683 (60)       – (1)     81x      235 (22)   879 (64)    1.3x
mulsol.i.3   184  3916   31  1506 (54)       – (1)    129x      296 (28)  1228 (91)    1.3x
mulsol.i.4   185  3946   31   974 (56)       – (1)    207x      288 (27)  1238 (91)    1.3x
mulsol.i.5   186  3973   31  1827 (53)       – (1)    104x      146 (13)   991 (72)    1.2x
myciel6       95   755    7    14 (42)    1214 (42)    87x       10 (39)    19 (47)    1.6x
queen11_11   121  3960   11   152 (52)       – (9)    137x       36 (14)   200 (63)    1.2x
queen13_13   169  6656   13     – (36)       – (2)     18x       58 (8)    896 (104)   1.2x
zeroin.i.2   211  3541   30   287 (53)       – (1)    665x      274 (26)   917 (68)    1.3x

University timetabling (min-conflicts only; sec (#iter))

instance   courses lectures rooms periods    on         off     speedup
comp01        30     160      6     30     410 (5)     – (6)      7x
comp02        82     283     16     25       – (8)     – (2)      4x
comp03        72     251     16     25       – (12)    – (2)      6x
comp04        79     286     18     25       – (10)    – (1)     10x
comp05        54     152      9     36     511 (8)     – (11)     5x
comp06       108     361     18     25       – (2)     – (0)      ?
comp07       131     434     20     25       – (1)     – (0)      ?
comp08        86     324     18     25       – (8)     – (1)      ?
comp09        76     279     18     25    3079 (15)    – (2)      9x
comp10       115     370     18     25       – (1)     – (0)      ?
comp11        30     162      5     45       – (23)    – (4)      6x
comp12        88     218     11     36       – (16)    – (3)      5x
comp13        82     308     19     25       – (17)    – (0)      ?
comp14        85     275     17     25       – (3)     – (1)      3x
comp15        72     251     16     25       – (10)    – (1)     10x
comp16       108     366     20     25       – (1)     – (0)      ?
comp17        99     339     17     25       – (3)     – (1)      3x
comp18        47     138      9     36     142 (40)    – (47)    21x
comp19        74     277     16     25    1621 (8)     – (2)      9x
comp20       121     390     19     25       – (1)     – (0)      ?
comp21        94     327     18     25       – (1)     – (0)      ?

Table 2: ConSQL results running steepest descent and min-conflicts on Graph colouring (top) and University timetabling (bottom), with and without DNR+IEC (1-hour timeout). Speedup = (iter(on)/time(on)) / (iter(off)/time(off)). Since the size of the university timetabling instances never allowed steepest descent to perform more than one iteration in 1 hour, such data has been omitted. With a longer timeout, we observed behaviours similar to those of Graph colouring.

[Figure 4: Ratio of the number of promising moves wrt the total number of moves, as a function of the state of the run (0% = start, 100% = local minimum reached), for all graph colouring instances in Table 2 (top) for which the steepest descent run with DNR+IEC terminated in 1 hour.]
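The speedup metric in Table 2's caption compares iteration throughputs rather than raw times, which keeps runs comparable even when one side hits the timeout. As a quick check on the 'anna' row (values taken from the table):

```python
def speedup(iter_on, time_on, iter_off, time_off):
    # Speedup = (iter(on)/time(on)) / (iter(off)/time(off)),
    # i.e. the ratio of iterations completed per second.
    return (iter_on / time_on) / (iter_off / time_off)

# Steepest descent on 'anna': 21 iterations in 39 s with DNR+IEC,
# 21 iterations in 2682 s without.
print(round(speedup(21, 39, 21, 2682)))  # 69, matching the reported 69x
```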