Information Discovery in Polystores: the Augmented Way (Discussion Paper) Antonio Maccioni1 and Riccardo Torlone2 1 Collective[i], New York City, USA amaccioni@collectivei.com 2 Roma Tre University, Rome, Italy riccardo.torlone@uniroma3.it Abstract. Polystores provide a loosely coupled integration of heteroge- neous data sources based on the direct access, with the local language, to each storage engine for exploiting its distinctive features. In this frame- work, given the absence of a middleware exposing a global schema, it is hard to know if a query to one system can be satisfied by data stored elsewhere in the polystore. We address this problem by illustrating query augmentation, a data manipulation operator for polystores based on the automatic enrichment of the answer to a local query with related data in the rest of the polystore. Augmentation can be used to implement augmented search and augmented exploration: two effective methods for information discovery in polystores that avoid middleware layers, ab- stract query languages, and shared data models. 1 Introduction The concept of polyglot persistence, which consists of using different database technologies to handle different data storage needs [9], is spreading within enter- prises. Recent research has shown that, on average, each enterprise application relies on at least two or three different types of database engines. Example 1. Let us consider a company called Polyphony selling music online. As shown in Fig. 1, each department uses a storage system that best fits its specific business objectives: (i) the sales department guarantees ACID properties for its transactions database with a relational system, (ii) the warehouse department supports its activities with a document store catalogue, where each item is rep- resented by a JSON document, and (iii) the marketing department uses a graph database of similar-items supporting recommendations. In addition, a key-value store containing discounts on products is shared among the above departments. Copyright c 2019 for the individual papers by the papers’ authors. Copying per- mitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy. Fig. 1. A polyglot environment. This approach, however, makes even harder the problem of data integration given the strong heterogeneity of the systems in play [4]. The traditional solution to this issue is based on a middleware layer involving a unified language, a common interface, or a universal data model [10]. However, the addition of this layer to the software stack adds computational overhead at runtime and, more importantly, hides the specificity and functionality for which these systems were adopted. In addition, it is hard to maintain, having an inherent complexity that increases significantly as new database systems take part to the environment. Polystore systems (or simply, polystores) have been proposed recently as an alternative solution for this approach [12]. The basic idea is to provide a loosely coupled integration of data sources and allow the direct access, with the local language, to each specific storage engine to exploit its distinctive features. This solution meets the requirements of scenarios in which users are often aware of a single (or a few) database only but does not know anything about other databases (neither the content, nor the way to query them and, sometimes, not even their existence). On the other hand, it poses new challenges for accessing and combining all the available data. To recall a discussion about this approach, the issue is that “if I knew what query to ask, I would ask it, but I don’t” [12]. In this paper, we present (query) augmentation, a new construct for data ma- nipulation in polystores that, based on a simple notion of probabilistic relation- ship between objects in different data stores, allows the automatic enrichment of a dataset stored in a local database with data outside the database but avail- able in the polystore. The implementation of this operator does not require the addition of an abstraction layer involving query translation and therefore has a minimal impact on the applications running on top of the data layer. The goal is to provide a soft mechanism for data integration in polystores that complements other approaches, such as those based on cross-db joins [3]. Two effective methods for data access in polystore can be defined with the augmentation construct: augmented search and augmented exploration. Augmented search is a form of query relaxation [7] and consists of the au- tomatic expansion of the result of a query over a local database with data that is relevant to the query but is stored elsewhere in the polystore. This is useful in scenarios where information is shared across the organization and the various databases complement or overlap each other. Assume for instance that Lucy, an employee of Polyphony working in the sales department, only knows SQL but needs all the information available on the album “Wish”. Then, she just needs to submit, in augmented mode, an SQL query over the database transactions in Fig. 1. By exploiting augmentation, the result of the query is the augmented ob- ject reported below, revealing details on the product that are not in the database of the sales department, including the fact that it is currently on a 40% discount. < a32, Cure, Wish > ⇒ (catalogue:{ title: Wish, ⇓ artist id: a1, (discounts: 40%) artist: The Cure, year: 1992, ... } ) In an augumented search, each retrieved element e is associated with the prob- ability that e is related to the result of the query. Such probability is derived off-line from integrity constraints and data mining techniques. Colors (as in the example above) and rankings can be used in practice to represent, in a more intuitive way, external data and the degree of relevance. Augmented exploration makes use of the augmentation operator to provide a more interactive and flexible way to access data, which consists of a guided expansion of the result of a local query. For example, if Lucy submits in ex- ploratory mode another SQL query to the database transactions for retrieving all the sales whose total income is greater than 15, she obtains the tuple that fol- lows, in which the links suggest that further information, related to the returned tuple, is available elsewhere and allows her to decide with a click (e.g., on the user name) how to deepen the result. As above, the result is probabilistic. on−click 9999999K ( { id: c1, name: John, surname: Doe, city: NYC, ... } ) This process is iterative and provides a method for database exploration [2], where the user can freely find her way through the polystore, by just clicking on the links as soon as they are made available. We have implemented our approach in QUEPA3 , a polystore system equipped with the augmentation operator that provides the access methods dis- cussed above and is compatible with most modern database engines. QUEPA operates in a plug-and-play mode and does not affect the access modalities of the various storage systems, thus reducing the need for ad-hoc configurations and for a middleware involving unified query languages or shared data models. 3 A demo of QUEPA has been shown in [5]. In addition, we have developed an optimization technique that relies on machine learning methods to tune the execution of the augmentation operator. In the rest of the paper we briefly describe the underlying data model (Sec- tion 2), the augmentation mechanism (Section 3), and its implementation (Sec- tion 4). Details and experimental results are in the full version of the paper [6]. 2 A data model for polystores In PDM (Polystore Data Model) a polystore P is made of a set of databases P = {D1 , . . . , Dn } stored in a variety of data management systems S1 , . . . , Sn respectively (relational, key-value, graph, etc.). A database D ∈ P consists of a set of data collections each of which is a set of (data) objects. An object o ∈ C is just a key-value pair: o = hk, vi where k identifies uniquely o in C and v is an atomic piece of data. A tuple and a JSON document are examples of data objects in a relational database and in a document store of a polystore, respectively. By definition, given a database D of a polystore P, a data collection C in D and a data object o = hk, vi in C, we can uniquely identify o in P by means of k, C and D. We call k̂ = D.C.k the global-key of o in P. Example 2. In the polystore in Figure 1 hd1, { id : d1, title : Wish, . . .}i is an object of the albums collection in the catalogue database whereas hs8, (s8, John Doe, 20.0)i is an object of the sales collection in the transactions database. The global-key of the latter is transactions.sales.s8. This model captures polystores involving any database system satisfying the minimum requirement that every stored data object can be identified and ac- cessed by means of a key. Note that the granularity of objects inevitably depends on the designer’s choice [1]. The other main ingredient of PDM is the ability to correlate data objects of possibly different databases of the polystore by means of the following basic notion, which we call p-relation (for relation in a polystore). Definition 1 (p-relation). A p-relation on two objects o1 and o2 , denoted by o1 Rp o2 , represents the existence of a relation R between o1 and o2 with probability p (0 < p ≤ 1), where R can be one of the following types: – the identity, denoted by ∼: a reflexive, symmetric and transitive relation (i.e., an equivalence relation), representing the fact that o1 and o2 refer to the same real-world entity; – the matching, denoted by : a reflexive and symmetric relation (not neces- sarily transitive), representing the fact that o1 and o2 share some common information. We assume that in a polystore p-relations are “consistent” in the sense that they satisfy the following condition: for each triple of objects o1 , o2 and o3 such that o1 o2 and o2 ∼ o3 it is the case that o1 o3 . While this is a natural condition (two equivalent objects should match the same objects), it guarantees that the augmentation construct behaves consistently with equivalent objects. Example 3. Consider the polystore in Fig. 1. By denoting the objects with their global keys we have for instance that: catalogue.albums.d1 ∼0.8 dis- count.drop.k1:cure:wish, and transactions.inventory.a32 1 transactions.sales-details.i4. Basically, while the identity relation serves to represent multiple occurrences of the same entity in the polystore, the matching relation models general relation- ships between data different from the identity (e.g., those typically captured by foreign keys in relational databases or by links in graph databases). On the prac- tical side, p-relations are derived from the metadata associated with databases in the polystore (e.g., from integrity constraints) or are discovered using prob- abilistic mining techniques. For the latter task, we rely on the state-of-the-art techniques for probabilistic record linkage [8], that is, algorithms able to score the likelihood that a pair of objects in different databases match. 3 Augmentation in polystores The augmentation construct. The basic operator of our approach takes as input an object o of a polystore and returns the augmented set αn (o), which iteratively returns data objects in the polystore that are related to o with a certain probability. This probability is computed by combining the probabilities of the relationships that connect o with the retrieved objects. Definition 2. Let o be a set of objects in a polystore P. The augmentation αn of level n ≥ 0 of o is a set o0 of objects op , where o ∈ P and p is the probability of membership of o to o0 , defined as follows (m > 0): – α0 (o) = o ∪ {op | o ∼p o0 ∧ o0 ∈ o} – αm (o) = αm−1 (o) ∪ {op̂ | (o p0 o0 ) ∧ (o0p ∈ αm−1 (o)) ∧ (p̂ = p · p0 )} Example 4. Let us consider the polystore in Fig. 2 in which nodes represents objects, ovals represent different databases, solid lines represent identity p- relations, and dashed lines represent matching p-relations. Labels associated with p-relations denote their probability. Then, we have α0 ({o4 }) = {o4 , o0.9 1 }, α1 ({o4 }) = {o4 , o0.9 0.8 0.7 3 0.9 0.8 0.7 0.7 1 , o7 , o8 }, and α ({o4 }) = {o4 , o1 , o7 , o8 , o10 }. Note that, by construction, for every k ≥ 0, if o ∈ αk (o) then o0 ∈ αk (o) for each o0 ∼ o. Indeed, if o belongs to αk (o) because of an object o00 ∈ αk−1 (o) such that o00 o, the consistency condition above implies that if o ∼ o0 , we have also 00 that o o0 and so, by Definition 2, o0 ∈ αk (o). The augmented construct can be at the basis of two alternative ways to access a polystore, as described in the following. Augmented Search. An augmented search consists of the expansion of the result of a query over a local database with data that are relevant to the query but are stored elsewhere in the polystore. Consider again a polystore P composed of a set of databases stored in different data management systems each equipped with a specific query language. Now, let QS denote a query expressed in the query language of the storage system S and let QS (D) be the set of objects in the result of QS over a database D stored in S. 0.7 0.9 0.8 0.9 o1 o4 o6 o8 1 o2 o5 o7 o9 0.6 0.6 o3 o10 D1 D2 D3 D4 Fig. 2. A set of databases in a polystore Definition 3. The augmentation of level n ≥ 0 of a query QS over a database D stored in S, denoted by QS (n)(D), consists in the augmentation of level n ≥ 0 of the result of QS over D ordered according to the probability of its elements. Example 5. Let Q be a query over the database D1 in Fig. 2 that returns the objects o1 and o2 . Then we have Q(0)(D1 ) = {o1 , o2 , o0.9 4 } and Q(1)(D1 ) = (o1 , o2 , o0.9 4 , o0.7 0.6 0.6 7 , o8 , o3 ). Augmented Exploration Exploratory computing is a new trend in big data access that aims at helping users to make sense of very big data sets by means of step-by-step interaction with the system oriented to the progressive refinement of the data retrieval process [2]. The augmentation construct can provide effec- tive support for exploratory computing in a polystore through a process that we call augmented exploration. Intuitively, augmented exploration consists of a guided expansion of the result of a query over a local database with related data stored elsewhere in the polystore. It works as follows: we start with a query QS expressed in the query language of the storage system S in the polystore and execute QS over a database D stored in S. We then select, from the answer of the query an object o and apply to o the augmentation construct of level 0 (step 1) and order the result according to the probability of each element. Again, we select, from the result we obtain, an object o1 and apply to it the augmentation construct of level 1 (step 2) and order the result. We then proceed similarly, by selecting an object oi from the result of the previous step and apply the aug- mentation construct of level 1 to oi , until the user is satisfied with her search. This can be formalized as follows. Definition 4. An augmented exploration of a polystore P starting from a query QS over a database D stored in S consists of a sequence of k steps: [(o0 → o0 ); (o1 → o1 ); . . . ; (ok → ok )] where: – o0 ∈ QS (D) and o0 = α0 ({o0 }), – oi ∈ oi−1 and oi = α1 ({oi }) (i > 0). Example 6. Consider the query in Example 5 wich returns the set of objects {o1 , o2 }. Then, a possible augmented exploration involves the steps: (o1 → {o0.9 0.6 0.8 0.7 4 }); (o2 → {o3 }); (o4 → {o7 , o8 }) Fig. 3. Architecture of QUEPA. 4 A tool supporting augmentation We have fully implemented our approach in a system called QUEPA [5]. Its architecture is illustrated in Fig. 3 and includes the main following components: – Augmenter: it implements the augmentation operator and orchestrates aug- mented query answering. Various smart strategies have been adopted for its optimization aimed at being: (i) network-efficient, by retrieving all the ob- jects from one database with a single query and thus reducing the number of queries to local databases and the communication between them, (ii) CPU- efficient, by assigning independent queries to parallel threads and leveraging the multi-core nature of modern CPU, and (iii) memory-efficient, by suit- ably caching the last accessed objects and thus reducing the I/O requests. In addition, since traditional cost-based optimizers are difficult to implement in a polystore, given the limited knowledge of each database in play, we have developed an adaptive, rule-based optimizer to dynamically predict the best combination of the above strategies for the the query under consideration. – A+ index: it is a global, graph-based index in which each global-key is rep- resented in one node and there are two types of edges connecting nodes that represent identity and matching p-relations, respectively. – Collector: it is in charge of discovering, gathering and storing p-relations in the A+ index. This is done by collecting fresh metadata, as soon as new datasets are added, and by leveraging existing techniques for record link- age [8]. Specifically, we used BLAST [11] for a non-supervised step of object blocking and DuKe4 for the following step of pairwise matching. On the basis of the final matching score provided by DuKe, we decide whether a p-relation is an identity or a matching. In addition to the creation of the A+ index from scratch, we have developed a simple, yet effective, learning mechanism that 4 https://github.com/larsga/Duke adds matching p-relations depending on the navigation history of users when they operate in exploratory mode. – Connectors: they are used to interact with the polystore. Each connector is able to communicate with a specific database system by sending queries in the local language and returning the result. Data objects are parsed into an internal representation. – Validator: it is used to pre-evaluate a query and to assess rapidly whether it can be augmented or not. The validator first checks if the query is correct (step Á) and possibly rewrites it (step Â) before its execution over the target database (step Ã). The local answer a is returned to the augmenter which is now ready to compute the augmentation (step Ä). It gets from the A+ index the global keys of data objects reachable from those in a with n applications of the augmentation primitive (step Å). These global keys are used to retrieve data objects from the polystore with local queries Qi (step Æ). Finally, the augmented answer is returned to the user (step Ç). The interaction in the augmented exploration is similar but simplified since, at each step, only a single data object is augmented. A comprehensive campaign of experiments done with QUEPA has shown that our approach is feasible, effective, and, unlike other approaches, scales nicely as the polystore grows in the number of stores and size of databases [6]. References 1. P. Atzeni, F. Bugiotti, L. Cabibbo, and R. Torlone. Data modeling in the NoSQL world. Computer Standards and Interfaces, 2016. 2. M. Buoncristiano et al. Database challenges for exploratory computing. SIGMOD Record, 44(2):17–22, 2015. 3. J. Duggan et al. The BigDAWG polystore system. SIGMOD Record, 44(2):11–16, 2015. 4. L. M. Haas. The power behind the throne: Information integration in the age of data-driven discovery. In SIGMOD, page 661, 2015. 5. A. Maccioni, E. Basili, and R. Torlone. QUEPA: QUerying and exploring a poly- store by augmentation. In SIGMOD, pages 2133–2136, 2016. 6. A. Maccioni and R. Torlone. Augmented access for querying and exploring a polystore. In ICDE, pages 77–88, 2018. 7. D. Martinenghi and R. Torlone. Taxonomy-based relaxation of query answering in relational databases. VLDB J., 23(5):747–769, 2014. 8. I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183–1210, 12 1969. 9. P. J. Sadalage and M. Fowler. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 1st edition, 2012. 10. A. P. Sheth and J. A. Larson. Federated database systems for managing dis- tributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3):183–236, Sep 1990. 11. G. Simonini, S. Bergamaschi, and H. V. Jagadish. BLAST: a loosely schema-aware meta-blocking approach for entity resolution. PVLDB, 9(12):1173–1184, 2016. 12. M. Stonebraker. The case for polystores. ACM SIGMOD Blog. http://wp.sigmod. org/?p=1629, July, 2015.