The Data Readiness Problem for Relational Databases Rada Chirkova1 , Jon Doyle1 , and Juan L. Reutter2 1 North Carolina State University 2 Pontificia Universidad Católica de Chile Abstract. We consider the problem of determining whether organiza- tions facing a new data-transformation task can avoid building a new transformation procedure from scratch by reusing their stored proce- dures. Because it can be difficult to obtain exact descriptions of what stored procedures do, our framework abstracts data-transforming tools as black-box procedures, in which a procedure description indicates the parts of the database that might be modified by the procedure and con- straints on the states of the database that must hold before and after the application of this procedure. In this paper we present our framework and study the problem of de- termining, given a database and a set of procedures, whether there is a sequence of procedures from this set such that their application to the database results in the satisfaction of a boolean query. This data readi- ness problem is undecidable in general, but we show decidability for a broad and realistic class of procedures. 1 Introduction The databases of many organizations nowadays periodically undergo transforma- tions, due to applications of data-improvement operations, merging of multiple repositories, or management decisions. These transformations are commonly car- ried out by means of stored procedures or other similar artifacts that are kept together with the database, and may have to be applied periodically, becoming at times part of the normal daily operations of the organizations. For example, it is not uncommon to find institutions with separate databases for their ac- countancy and operations divisions, in which the integration is carried out by a stored procedure that runs at the end of every working day. Whenever a new data-transformation task arises, organizations facing the cost of assembling a new procedure to solve this task may ask instead whether one can reuse some of the procedures that are already available. However, to answer this question we need to be able to reason about the outcomes of procedures, or even of sequences of applications of procedures. Several lines of research have been studying these outcomes when procedures are understood as part of the normal operations of an institution (see excellent surveys [14, 8, 1]); many of these works assume a complete description of all the procedures involved in these operations. At the same time, it is not always feasible to obtain an exact description of the inner workings of a procedure (as in when its creator(s) no longer work for the company, see, e.g., [13]). In such cases, one has to work with informal or vague descriptions of what procedures do: “This procedure copies relation A into relation B,” or “this procedure removes all nulls from this relation.” To model such uncertainty, we do not assume that we have a precise description of the operations of the procedures, and adopt instead a black-box view of a procedure. That is, we describe procedures in terms of which parts of the data might be modified by the procedure, as well as by the constraints that specify the required states of the data before and after applying the procedure. Motivating example: Suppose a medical analyst wishes to know the emer- gency rooms used by patients with a certain medical insurance. The data owned by the analyst reside in relation LocVisits (facility,pId ,timestp), with the at- tributes standing, respectively, for the id of the facility where the emergency room is, the social-security number of a patient, and a timestamp marking the date of the visit. The analyst has also been given two procedures he can execute as-is but not modify: One is Pmigrate , which is supposed to migrate data into LocVisits from relation EVisits owned by another analysis company. The other procedure, Pinsur , augments LocVisits with an attribute insId con- taining the insurance id’s of patients, and whose data are drawn from relation Patients(pId, insId) owned by the local authority. Given an insurance id I, the analyst can capture the desired information via query SELECT facility FROM LocVisits WHERE insId = I, posed over LocVisits modified by adding attribute insId containing the insurance id’s of patients. It is natural for the analyst to ask: Can I use any available procedures to transform my data so that this query can be posed on my database? In other words, is there a way to apply these procedures so that I could guarantee that my database satisfies certain fitness-for-use [12] criteria? We propose a formal framework in which data-transforming tools are ab- stracted as black-box procedures, described by the following information: – A specification of which parts of the database the procedure is modifying; – Conditions to be satisfied for the procedure to be applicable; – Conditions that will be satisfied once the procedure has been applied; and – Any additional guarantees on parts of the data that must not be modified. In this paper we study this framework for procedures that do not alter the schema of databases, such as the procedure that migrates the information of the analyst in the example above. We study basic questions arising in the framework, such as whether a procedure can be applied to the outcome of a given procedure over a given instance, and whether the outcome of a (sequence of) procedures is nonempty. Finally, we consider what we call the data-readiness problem: Given an instance I, a set Π of procedures, and a boolean query over instances (that intuitively expresses a desired property of the data), is there a way to construct a sequence of procedures from Π so that each instance in the outcome satisfies this property? While undecidable in its general form, we show that this problem is decidable for some broad classes of procedures. For space reasons we omit proofs from this draft. All of them can be found in the full version of this paper [5]. 2 Preliminaries Since we aim to model procedures over real databases, we write the paper using a specific named assumption over instances and queries. Schemas and Instances. Assume three disjoint sets: a countably infinite set of attribute names A = {A1 , A2 , . . .} totally ordered by ≤A , a countably infinite domain of values (or elements) D, and a countably infinite set of relation names R = {R1 , R2 , . . .}. A relational schema over A and R is a partial function S : R → 2A , which associates a finite set of attributes with a finite set of relation symbols. We say that R is in S if S(R) is defined. An instance I of schema S assigns a set RI of tuples to each relation R in S, so that if S(R) = {A1 , . . . , An } then RI ⊆ Dn , with the set of tuples structured so that the elements of each tuple (a1 , . . . , an ) appear in the assumed attribute order, that is, A1