Enabling Global Big Data Computations

Damianos Chatziantoniou
Athens University of Economics and Business
Athens, Greece
damianos@aueb.gr

Panos Louridas
Athens University of Economics and Business
Athens, Greece
louridas@aueb.gr

ABSTRACT
Most analytics projects focus on the management of the 3Vs of big data and use specific stacks to support this variety. However, they constrain themselves to "local" data: data that exists within or "close" to the organization, or external data imported to local systems. And yet, as it has been recently pointed out, "the value of data explodes when it can be linked with other data." In this paper we present our vision for a global marketplace of analytics—either in the form of per-entity metrics or per-entity data, provided by globally accessible data management tasks—where a data scientist can pick and combine data at will in her data mining algorithms, possibly combining it with her own data. The main idea is to use the dataframe, a popular data structure in R and Python. Currently, the columns of a dataframe contain computations or data found within the data infrastructure of the organization. We propose to extend the concept of a column. A column is now a collection of key-value pairs, produced anywhere by a remotely accessed program (e.g., an SQL query, a MapReduce job, even a continuous query). The key is used for the outer join with the existing dataframe; the value is the content of the column. This whole process should be orchestrated by a set of well-defined, standardized APIs. We argue that the proposed architecture presents numerous challenges and could be beneficial for big data interoperability. In addition, it can be used to build mediation systems involving local or global columns. Columns correspond to attributes of entities, where the primary key of the entity is the key of the involved columns.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION
Currently, most big data deployments follow a highly ad hoc, non-disciplined approach, entailing a high degree of data replication and heterogeneity, both in terms of storage options and analysis tasks. The system administrator has to choose one (or more) data management systems from a plethora of alternatives and facilitate the enterprise's reporting needs utilizing a wide range of query languages and analysis techniques. Data management systems involve traditional RDBMSs, Hadoop clusters, NoSQL databases, and others. Reporting and analysis tasks include plain SQL, spreadsheet scripts, MapReduce jobs, R/Java/Python programs, complex event processing queries, machine learning algorithms, and others. A not-so-new challenge resurfaces: interoperability. How can these systems interact? How can these systems interoperate?

This necessity has been identified by the current authors in [7, 8] and more recently by the Beckman report [1]. The Beckman report recognized the problems that the "diversity in the data management landscape" creates and asserted that "the need for coexistence of multiple Big Data systems and analysis platforms is certain" and that, in order "to support Big Data queries that span systems, platforms will need to be integrated and federated."

Data integration involves combining data residing in different sources and providing users with a unified view of them [15]. Data integration can be seen as constructing a data warehouse, or creating a virtual database (federated/mediated systems). While data warehousing was the way to go in the past—mainly due to the dominance of relational systems in data management—there are well-thought arguments to reconsider a federated approach in big data applications [20]. Polystores [10], closely related to federated databases, address the need for managing information represented in different data models. This is similar to this paper's motivation: using the answers of computations defined over different data models and query languages. However, we focus on standardizing the output of a computation and using it in a conceptual model, rather than integrating data model and query capabilities in the system. It is worth mentioning that defining global views over heterogeneous data sources is not a big data-era issue and has been extensively discussed in the past (e.g., [2]).

We argue that a standardized and protocol-based approach can significantly facilitate the unified dissemination, federation and analysis of data. Once the output of big data computations (from simple SQL queries to complex predictive models) can be standardized and accessed globally, anyone can use it in his own analysis framework.

Section 2 presents an example from the telecom domain and motivates the paper. It introduces the concept of global dataframes: dataframes constructed from columns that are globally accessible and represent a data management task. Section 3 describes the big picture: a dataframe composed of globally addressed columns. Section 4 presents the architecture and the necessary APIs to support the management and usage of these global columns. The challenges of such an architecture (performance, transactionality issues, distribution, etc.) are introduced in Section 5. Section 6 concludes the paper.

2 MOTIVATION
Consider the churn prediction problem in a telecom environment in the presence of structured and unstructured data. For this purpose, a predictive model had to be designed and implemented taking into account the many possible variables (features) characterizing the customer. The goal was to equip the data analyst with a simple tool that enables fast and interactive experimentation by using features from multiple data sources, involving different data management systems and data formats. In our case, the company had a variety of data sources, such as:

• A traditional RDBMS containing basic customer-related data such as gender, age, address and various demographics.
• A relational data warehouse storing billing, usage and traffic activity per contract key—a contract may involve several customer IDs.
• Flat files produced by statistical packages such as SAS and SPSS, containing data transformations and precomputed measures per contract key on different datasets.
• CRM data stored in a relational database, containing metadata of customer-agent interactions, including the agent's notes (text) on the call.
• Email correspondence between customers and the customer service center of the company (text).
• Audio files stored in the file system, containing conversations between customers and agents.

The ultimate goal was to provide the data scientist with a simple way (a tool of some kind) allowing her to choose and experiment in an ad-hoc manner with multiple tabular views of customer-related data. Each different combination of columns yields a different view. Intuitively, there are two simple yet ineffective ways of achieving that: a) collecting all data in a single data repository and performing reporting tasks on top of it (a data lake approach), or b) producing it programmatically, using an RDBMS, for example, as an intermediate storage point. Both of these are impractical. The first one imposes significant costs of moving data around, and data lakes have received criticism in terms of governance, security, and lack of consistency, possibly degenerating into a "data swamp" [13], while research tries to mitigate these problems [11, 12]. The second one requires significant manual intervention and is not flexible to schematic updates (changes) in the underlying data sources.

The solution we chose was to keep data in their respective host systems and define "tabular" views over these systems (a mediator approach): start from one or more "base" columns (e.g., contract ID, customer ID, area code) and incrementally extend this "basic" schema with columns containing data or computations coming from different data sources. For example, one could define the first column (base) to contain the IDs of customers. Then, she could add columns corresponding to the age, gender and educational background of each customer, coming from the Demographics RDBMS. Then, she could add a column containing, for each customer, the set of emails the customer has sent in the last six months. This is a set of texts coming from the CRM database. Then, she could add a column that computes the average sentiment of these emails, using some Python script. Finally, she could add a column corresponding to the customer's monthly average usage in the last six months, coming from the Billing Data Warehouse (DW). This process is depicted in Figure 1.

Figure 1: Defining a Tabular View over Multiple Data Sources

The idea is similar to Multi-Feature SQL [3, 6], MD-Joins [5], Grouping variables [4] and Associated Sets [9]. In these papers, one can express a series of outer joins combined with aggregation, possibly correlated, in a succinct and concise manner. At the same time, several efficient evaluation techniques (based on parallel, distributed and in-memory processing) are presented for these queries. More recently, the same idea is expressed using dataframes in R and Python pandas (without correlation between columns), as well as in Spark. The fundamental difference between our proposal and these approaches is that we treat, in principle, our data as columns coming from heterogeneous sources, and dataframes are composed on the fly from them. In R and pandas, dataframes are typically created from data imported from different, possibly heterogeneous sources, but the dataframe reflects the structure of the underlying data. One could adopt a columnar key-value approach, for example by using exclusively Series in pandas or atomic vectors and lists in R, but that would probably defeat the purpose and the philosophy of these frameworks. Concerning Spark, again dataframes reflect the structure of the underlying data; moreover, most likely the data are copied to an underlying HDFS substrate.

This is the pattern in most big data tasks: building a dataframe over different datasets, to serve as input to data mining algorithms, visualization tools and reporting systems. Essentially this could be modeled conceptually as a single tabular representation of joined results coming from different data management systems. Each result consists of the keys used for the outer join and the corresponding values to be added as a column; it can therefore be represented as a set of key-value pairs. This formalism is quite appropriate for representing columnar data in a dataframe. It is simple, amenable to distributed, fault-tolerant and scalable implementations, and can easily, naturally in a way, represent most well-known data models. At the same time, key-value engines, such as Redis, have been organically developed, through analysis of real applications at Google, Amazon, Facebook, LinkedIn and elsewhere.

Going one step further, one can imagine globally available key-value structures, produced outside the organization and used in the same manner. For example, an analytics provider could generate for each Facebook user some social metrics (e.g., number of check-ins in the last month), useful in a data scientist's analysis. We envision a global "environment" for these key-value structures that analysts pick and try in their data mining algorithms or embed into their visualizations and reporting. We call these widely available key-value structures globalized analytics. The challenge is to provide a framework for this environment, to make this process simple, useful and efficient. In the next section we present our proposal for how such a framework can be built. The framework will leverage the idea of distributed key-value pairs used to compose dataframes over heterogeneous data, allowing users to compose and manipulate tabular views of their data on the fly.

3 THE BIG PICTURE
The main abstraction for representing data is a column, consisting of a set of key-value pairs, called a key-valued structure (KVS). Columns can be joined to create dataframes. Columns have the following characteristics:

• Columns may be distributed among different machines. That means that a dataframe can comprise data residing in different machines, and the data is joined on the fly to create an integrated dataframe.
• The column keys must be unique, but the value associated with a key need not be atomic, so that values can be lists or sets. Therefore, a column can represent both a vector of atomic values, as well as associations between keys and collections of values. In this way a column can act as the stage between a map and a reduce stage in a typical MapReduce job.
• The values of a column may exist in three states: defined, not available, and observer. The defined and not available states correspond to knowns and known unknowns, respectively. The observer state corresponds to values that we expect to be filled in the future, possibly more than once.
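The column abstraction above can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not part of any actual implementation: each column is a set of key-value pairs, and a dataframe is the outer join of columns over a set of base keys, with missing entries marked as "not available".

```python
# Toy sketch of the column abstraction: a column is a set of key-value
# pairs (a KVS); a dataframe outer-joins columns over a set of base keys.
# All names here are illustrative, not part of any concrete system.

NOT_AVAILABLE = None  # stands in for the "not available" state

def compose_dataframe(base_keys, columns):
    """Outer-join KVS columns on their keys; absent keys get NOT_AVAILABLE."""
    return {k: {name: kvs.get(k, NOT_AVAILABLE) for name, kvs in columns.items()}
            for k in base_keys}

# Each column could come from a different system (an RDBMS, a script, a stream).
age = {"c1": 34, "c2": 51}            # e.g., produced by a demographics database
emails = {"c1": ["hi", "bye"]}        # values need not be atomic: lists or sets
usage = {"c2": 120.5, "c3": 80.0}     # e.g., produced by a billing warehouse

df = compose_dataframe(["c1", "c2"], {"age": age, "emails": emails, "usage": usage})
# df["c1"] -> {'age': 34, 'emails': ['hi', 'bye'], 'usage': None}
```

Note that the base keys drive the join: data for key "c3" exists in one column but is dropped, while "c1" lacks a usage value and gets the not-available marker, mirroring how a dataframe is composed on the fly from heterogeneous columns.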
The observer status allows us to handle data dynamically, from disparate sources, without requiring that all data be available and frozen at each source. In effect, this extends the semantics of our proposed model with reactive programming concepts (e.g., see [17]) and facilitates the handling of streams. Streams are collections of multiple values that are pushed to their destinations—in contrast to a pull model, where we request values from a stream, we can also use a push model, where the source of the stream emits values that enter the stream. In this way, a stream is an observable. A column can be an observer that receives the values emitted by the observable.

Columns can be joined based on the values of their keys. A dataframe is a collection of columns that have the same set of keys. The set of keys must be determined at column creation time, so that the values can be filled in from the underlying data. Dataframes themselves do not contain data; their constituent columns do. That means that when we are presented with a dataframe and we interact with it, we are in fact interacting with the underlying columns. Some data may be local, if the corresponding columns are local, but in general data may be remote, or even not yet present, when in the observer state.

Note that there is no assumption that the keys of the columns are consistent. Indeed, keys over heterogeneous data sources are not consistent in most cases. However, at a certain point there must be a mapping step, which can either be transparent to our system, or be handled through an intermediate column. Similarly, values need not be consistent: imagine two monetary columns being joined in the dataframe, but expressed in different currencies.¹ Such issues could be tackled with transformation tasks.

¹ Our thanks to the anonymous reviewer who provided this example.

4 ARCHITECTURE
We propose a layer of columns backed by a commonly-referenced memory space for establishing global views in a tabular data format. The columns contain data, indexed by their keys. The underlying key-value structures contain minimal schema information:

• The values may be atomic, or they may be lists or sets. Lists are ordered, while sets are not; lists may contain the same value multiple times (at different positions), while sets obey the usual set semantics.
• Lists and sets may be composed of other lists and sets, interchangeably, and of atomic items. Therefore, lists and sets can represent arbitrarily complex nested structures.
• Atomic values do not need any schema assumptions, so a column could contain both numeric and string data. For practical and efficiency reasons, however, an implementation could choose to represent columns using specific underlying datatypes. For example, if a column is known to contain integers, the column could be declared to be of integer type to speed up calculations. If a column contains multiple types then it would be represented as a generic object type.

The column data could be stored by any underlying mechanism, which could be a relational database, a NoSQL key-value store, or a dynamic data source. The underlying data source is only relevant to the data scientist at column creation time, when she has to describe the column; it then remains invisible. The data source does not need to physically host the data. It may produce the data that will be filling in the column. In this scenario, the data in the column are in the observer state, as explained in Section 3.

The data of a column are handled by a Column Manager (CM, or simply manager). A manager is responsible for providing the data in the form of key-value pairs. Managers accept incoming requests from Column Consumers (CCs, or simply consumers). Consumers and managers communicate according to a CM-CC protocol that defines the location where the underlying KVS will be stored. The location of the KVSs is independent of both the consumer and the manager. They may be stored at either of them, or somewhere else entirely; moreover, they may be produced programmatically or be saved in a distributed, redundant manner among different machines. The location can be negotiated: for example, the consumer may offer a location that the manager will accept, the consumer may ask the manager to respond with a location, or the consumer may suggest a location and the manager may respond with a different one. In this way, KVSs can be reused among different consumers. To cover different scenarios, KVSs reside in a globally addressable storage space.

The idea is simple: a consumer wants to use data from a manager. The data could be the result of an SQL statement, a MapReduce job, a script, etc. The consumer communicates with the manager and passes to it the address of the KVS. The consumer also communicates with the KVS and passes to it the set of keys whose values will be filled in by the manager; the communication between the consumer and the KVS may take place well before the communication between the consumer and the manager. The manager finds the keys in the KVS and fills in the corresponding values.

To create a dataframe, a consumer communicates with the managers that handle the columns it wants. It passes to each of the managers the same set of keys and agrees with them on the addresses of the KVSs that will contain the data. Then the consumer can obtain the data from the KVSs and present them to the data scientist as an integrated dataframe. Note that the consumer may access the KVSs at any time, asynchronously, even before the managers complete the KVSs. This way, dataframes can incorporate columns corresponding to stream computations.

The above can be implemented with a two-layered architecture. At the upper layer we have the consumer-manager communication. At the lower layer we have the set of KVSs. Consumers and managers communicate with the KVSs using a column-to-KVS communication protocol. Figure 2 summarizes the aforementioned discussion.

Figure 2: Defining a Tabular View over Multiple Data Sources. (Diagram: a CC and a CM communicate through the CC-to-CM protocol; each communicates, through the column-to-KVS protocol, with KVSs residing in a globally addressable KVS space.)

Currently there are efforts pointing towards a separate addressable memory layer, such as RAMCloud [18] and Piccolo [19], which both share the notion of in-memory addressable "tables" supporting key-value operations. Clarifications, challenges and opportunities of the proposed architecture are presented below.

5 CHALLENGES AND OPPORTUNITIES
Commonly-Referenced Memory Space. The architecture implies the presence of a well-defined API so that CCs can create and manage KVSs. The development of such an API requires careful consideration. While CRUD operations are clearly understood, a discussion is required for the exact format and behavior of each. In particular, the read operation should allow some filtering of the KVS, either through a simple predicate over keys and values or by providing a set of keys to be selected. Our first implementation [7] allows filtering conditions over just the key, but in many industrial applications complex expressions involving values are not uncommon. Other issues that have to be addressed are: what if a CC does not delete a KVS that it has created? Should the corresponding KVS management system implement garbage collection? Who is the "owner" of a KVS? What is the "lifetime" of a created KVS? Which CCs are allowed to access this KVS and in what mode (read/write)? For example, consider Webdis, an HTTP interface for Redis, which provides some insights on these issues. A similar kind of middleware between data producers and consumers, in the form of publish-subscribe, is suggested in [14]. A commonly-referenced memory layer is also proposed in the Tachyon system [16], constrained however within a cluster.

Globally Addressable Key-Value Sets. This is a conceptual layer, consisting of systems that provide KVS management according to the proposed framework. To do so, a system should implement the column-to-KVS API mentioned above, and allow access to a KVS through an address, internet-wide, following some standardized addressing scheme. The scheme should capture location hierarchies (e.g., domains, sub-domains, etc.) and identify the position of a KVS in the memory hierarchy. There is no restriction on what such a system could be. It could store KVSs anywhere in the memory hierarchy: main memory, distributed cache, disk; it could guarantee (or not) fault tolerance, availability, etc. In addition, it should provide answers on how it handles ownership, lifetime and access control of KVSs.

Suitability for Stream Engines. The layered architecture essentially introduces a referencing layer (i.e., indirection) between communicating programs (the CC and the CM). This is particularly appropriate for collaborating applications involving stream data: a stream management CM can continuously produce aggregated data (e.g., the average stock price over a sliding window of 10 minutes) consumed by the CC. The asynchronous access to the shared KVS allows the data consumer to retrieve data whenever it deems appropriate (e.g., [18]); alternatively, the data can be observed continuously by the consumer in a reactive approach (recall Section 3).

Transactionality Issues. A potentially challenging aspect of the proposed architecture is the issue of transactional consistency at the KVS layer. We currently consider non-materialized dataframes, so we do not have to deal with transactions and isolation at the data sources. However, transactionality issues still arise in the case of complex workflows where multiple CCs constantly request execution from a CM. For example, consider two separate CC-to-CM connections where both CMs populate the same KVS. As another example, if a CM is using a remote address to store the output of a continuously running query that returns a huge set of key-value pairs, is that large result updated atomically or incrementally? If some data feeds are slow and some are fast, one might get an inconsistent (nonserializable) view of the KVS layer. What (if anything) can the framework do to manage transactional requirements across systems? For instance, when a CC creates a KVS (and thus becomes the "owner" of the KVS), it could also specify the required isolation level for that KVS.

Query Response Times. Mediation approaches do not have, in general, good query response times. This is one of the main reasons for building data warehouses and having Extraction, Transformation, and Loading (ETL) processes: storing data in one system, using a single data model, and having an efficient query processing engine. However, the goal of this work is not performance, but functionality and interoperability. We want to enable users to easily construct dataframes in heterogeneous environments, employing multiple programming languages. Usually, such a dataframe will feed some learning algorithm, instead of being used for online queries. In this context, some of the columns of a dataframe could be formed as queries over a relational system using a metadata catalog like the Hive Metastore. However, statistical information existing in the Metastore is relevant to the query that produces the column and is transparent to our architecture, which will simply use the end result in an outer join.

Opportunities. The proposed architecture can also be used to generalize various existing distributed data management frameworks, such as distributed relational query processors, MapReduce evaluation algorithms and column-oriented processing engines. Moreover, given the diversity in data management systems, it opens up a wide range of interesting possibilities, both in terms of infrastructures and optimization opportunities. The interested reader can refer to [8].
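To ground the API questions raised in this section, a minimal column-to-KVS interface might look like the following sketch. All class and method names are hypothetical and ours alone; the sketch only illustrates CRUD operations plus a filtered read, with an owner and a lifetime recorded at creation time (the basis for answering the ownership and garbage-collection questions above).

```python
# Hypothetical sketch of a column-to-KVS API: CRUD plus a filtered read.
# Names are illustrative only, not part of any existing system.
import time

class KVS:
    def __init__(self, owner, lifetime_secs):
        self.owner = owner                              # the CC that created the KVS
        self.expires_at = time.time() + lifetime_secs   # basis for garbage collection
        self.data = {}

    def put(self, key, value):
        """Create or update a key-value pair (typically the CM side)."""
        self.data[key] = value

    def read(self, keys=None, predicate=None):
        """Filtered read: restrict to a key set and/or a predicate over pairs."""
        items = self.data.items()
        if keys is not None:
            items = [(k, v) for k, v in items if k in keys]
        if predicate is not None:
            items = [(k, v) for k, v in items if predicate(k, v)]
        return dict(items)

    def delete(self, key):
        self.data.pop(key, None)

kvs = KVS(owner="cc-1", lifetime_secs=3600)
kvs.put("c1", 34)
kvs.put("c2", 51)
# Filtering over values, as needed in many industrial applications:
# kvs.read(predicate=lambda k, v: v > 40) -> {"c2": 51}
```

A real design would additionally address the negotiation of the KVS location, access modes per CC, and isolation levels, as discussed above.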
6 DISCUSSION AND CONCLUSIONS
In this paper, we presented a layered architecture for data interoperability based on a ubiquitous universe of remotely accessible key-value sets. The architecture uses a number of concepts that can and will be formalized in an extended version of this paper—doing so here would be beyond the scope of a vision paper. In essence, with the proposed architecture we completely decouple the computation and memory layers of any data management scenario. By doing so we are able to generalize, abstract and effectively encapsulate all the key components of distributed data computation, storage and management. We believe that such an approach is a first step towards an interoperable universe of big data systems. Along the way, however, there are numerous challenges that can serve as fruitful research directions.

REFERENCES
[1] Daniel Abadi, Rakesh Agrawal, Anastasia Ailamaki, Magdalena Balazinska, Philip A. Bernstein, Michael J. Carey, Surajit Chaudhuri, Jeffrey Dean, AnHai Doan, Michael J. Franklin, Johannes Gehrke, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V. Jagadish, Donald Kossmann, Samuel Madden, Sharad Mehrotra, Tova Milo, Jeffrey F. Naughton, Raghu Ramakrishnan, Volker Markl, Christopher Olston, Beng Chin Ooi, Christopher Ré, Dan Suciu, Michael Stonebraker, Todd Walter, and Jennifer Widom. 2016.
The Beckman report on database research. Commun. ACM 59, 2 (2016), 92–99. https://doi.org/10.1145/2845915
[2] Silvana Castano, Valeria De Antonellis, and Sabrina De Capitani di Vimercati. 2001. Global Viewing of Heterogeneous Data Sources. IEEE Trans. Knowl. Data Eng. 13, 2 (2001), 277–297. https://doi.org/10.1109/69.917566
[3] Damianos Chatziantoniou. 1999. Evaluation of Ad Hoc OLAP: In-Place Computation. In ACM/IEEE International Conference on Scientific and Statistical Database Management (SSDBM). 34–43.
[4] Damianos Chatziantoniou. 2007. Using grouping variables to express complex decision support queries. Data Knowl. Eng. 61, 1 (2007), 114–136. https://doi.org/10.1016/j.datak.2006.05.001
[5] Damianos Chatziantoniou, Michael Akinde, Ted Johnson, and Samuel Kim. 2001. The MD-Join: An Operator for Complex OLAP. In IEEE International Conference on Data Engineering. 524–533.
[6] Damianos Chatziantoniou and Kenneth Ross. 1996. Querying Multiple Features of Groups in Relational Databases. In 22nd International Conference on Very Large Databases (VLDB). 295–306.
[7] Damianos Chatziantoniou and Florents Tselai. 2014. Introducing Data Connectivity in a Big Data Web. In Proceedings of the Third Workshop on Data analytics in the Cloud, DanaC 2014, June 22, 2014, Snowbird, Utah, USA, in conjunction with ACM SIGMOD/PODS Conference. 7:1–7:4. https://doi.org/10.1145/2627770.2627773
[8] Damianos Chatziantoniou and Florents Tselai. 2016. The Data Management Entity: A Simple Abstraction to Facilitate Big Data Systems Interoperability. In Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference, EDBT/ICDT Workshops 2016, Bordeaux, France, March 15, 2016. http://ceur-ws.org/Vol-1558/paper38.pdf
[9] Damianos Chatziantoniou and Elias Tzortzakakis. 2009. ASSET queries: a declarative alternative to MapReduce. SIGMOD Record 38, 2 (2009), 35–41. https://doi.org/10.1145/1815918.1815926
[10] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record 44, 2 (2015), 11–16. https://doi.org/10.1145/2814710.2814713
[11] Mina Farid, Alexandra Roatis, Ihab F. Ilyas, Hella-Franziska Hoffmann, and Xu Chu. 2016. CLAMS: Bringing Quality to Data Lakes. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 2089–2092. https://doi.org/10.1145/2882903.2899391
[12] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An Intelligent Data Lake System. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 2097–2100. https://doi.org/10.1145/2882903.2899389
[13] Nick Heudecker and Andrew White. 2014. The Data Lake Fallacy: All Water and Little Substance. Gartner report. (July 23 2014).
[14] Rajive Joshi. 2007. Data-Oriented Architecture: A Loosely-Coupled Real-Time SOA. Real-Time Innovations, Inc., Technical Report. (2007).
[15] Maurizio Lenzerini. 2002. Data Integration: A Theoretical Perspective. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA. 233–246. https://doi.org/10.1145/543613.543644
[16] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of the ACM Symposium on Cloud Computing, Seattle, WA, USA, November 3–5, 2014. 6:1–6:15. https://doi.org/10.1145/2670979.2670985
[17] Erik Meijer. 2012. Your Mouse is a Database. Commun. ACM 55, 5 (May 2012), 66–73. https://doi.org/10.1145/2160718.2160735
[18] John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Guru M. Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2009. The case for RAMClouds: scalable high-performance storage entirely in DRAM. Operating Systems Review 43, 4 (2009), 92–105. https://doi.org/10.1145/1713254.1713276
[19] Russell Power and Jinyang Li. 2010. Piccolo: Building Fast, Distributed Programs with Partitioned Tables. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2010, October 4–6, 2010, Vancouver, BC. 293–306. http://www.usenix.org/events/osdi10/tech/full_papers/Power.pdf
[20] Michael Stonebraker. 2015. The Case for Polystores. ACM SIGMOD Blog. (July 13 2015). http://wp.sigmod.org/?p=1629