Enabling Global Big Data Computations

Damianos Chatziantoniou, Athens University of Economics and Business, Athens, Greece (damianos@aueb.gr)
Panos Louridas, Athens University of Economics and Business, Athens, Greece (louridas@aueb.gr)

ABSTRACT

Most analytics projects focus on the management of the 3Vs of big data and use specific stacks to support this variety. However, they constrain themselves to "local" data, data that exists within or "close" to the organization, or external data imported to local systems. And yet, as it has been recently pointed out, "the value of data explodes when it can be linked with other data." In this paper we present our vision for a global marketplace of analytics—either in the form of per-entity metrics or per-entity data, provided by globally accessible data management tasks—where a data scientist can pick and combine data at will in her data mining algorithms, possibly combining with her own data. The main idea is to use the dataframe, a popular data structure in R and Python. Currently, the columns of a dataframe contain computations or data found within the data infrastructure of the organization. We propose to extend the concept of a column. A column is now a collection of key-value pairs, produced anywhere by a remotely accessed program (e.g., an SQL query, a MapReduce job, even a continuous query). The key is used for the outer join with the existing dataframe; the value is the content of the column. This whole process should be orchestrated by a set of well-defined, standardized APIs. We argue that the proposed architecture presents numerous challenges and could be beneficial for big data interoperability. In addition, it can be used to build mediation systems involving local or global columns. Columns correspond to attributes of entities, where the primary key of the entity is the key of the involved columns.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

Currently, most big data deployments follow a highly ad hoc, non-disciplined approach, entailing a high degree of data replication and heterogeneity, both in terms of storage options and analysis tasks. The system administrator has to choose one (or more) data management systems from a plethora of alternatives and facilitate the enterprise's reporting needs utilizing a wide range of query languages and analysis techniques. Data management systems involve traditional RDBMSs, Hadoop clusters, NoSQL databases, and others. Reporting and analysis tasks include plain SQL, spreadsheet scripts, MapReduce jobs, R/Java/Python programs, complex event processing queries, machine learning algorithms, and others. A not-so-new challenge resurfaces: interoperability. How can these systems interact? How can these systems interoperate?

This necessity has been identified by the current authors in [7, 8] and more recently by the Beckman report [1]. The Beckman report recognized the problems that the "diversity in the data management landscape" creates and asserted that "the need for co-existence of multiple Big Data systems and analysis platforms is certain" and that in order "to support Big Data queries that span systems, platforms will need to be integrated and federated." Data integration involves combining data residing in different sources and providing users with a unified view of them [15]. Data integration can be seen as constructing a data warehouse, or creating a virtual database (federated/mediated systems). While data warehousing was the way to go in the past—mainly due to the dominance of relational systems in data management—there are well-thought arguments to reconsider a federated approach in big data applications [20]. Polystores [10], closely related to federated databases, address the need for managing information represented in different data models. This is similar to this paper's motivation: using the answer of computations defined over different data models and query languages. However, we focus on standardizing the output of a computation and using it in a conceptual model, rather than integrating data model and query capabilities in the system. It is worth mentioning that defining global views over heterogeneous data sources is not a big data-era issue and has been extensively discussed in the past (e.g., [2]).

We argue that a standardized and protocol-based approach can significantly facilitate the unified dissemination, federation and analysis of data. Once the output of big data computations (from simple SQL queries to complex predictive models) can be standardized and accessed globally, anyone can use it in their own analysis framework.

Section 2 presents an example from the telecom domain and motivates the paper. It introduces the concept of global dataframes: dataframes constructed from columns that are globally accessible and represent a data management task. Section 3 describes the big picture: a dataframe composed of globally addressed columns. Section 4 presents the architecture and the necessary APIs to support the management and usage of these global columns. The challenges of such an architecture (performance, transactionality issues, distribution, etc.) are introduced in Section 5. Section 6 concludes the paper.

2 MOTIVATION

Consider the churn prediction problem in a telecom environment in the presence of structured and unstructured data. For this purpose, a predictive model had to be designed and implemented taking into account the many possible variables (features) characterizing the customer. The goal was to equip the data analyst with a simple tool that enables fast and interactive experimentation by using features from multiple data sources, involving different data management systems and data formats. In our case, the company had a variety of data sources, such as:

• A traditional RDBMS containing basic customer-related data such as gender, age, address and various demographics.
• A relational data warehouse storing billing, usage and traffic activity per contract key—a contract may involve several customer IDs.
• Flat files produced by statistical packages such as SAS and SPSS, containing data transformations and precomputed measures per contract key on different datasets.
• CRM data stored in a relational database, containing metadata of customer-agent interactions, including the agent's notes (text) on the call.
• Email correspondence between customers and the customer service center of the company (text).
• Audio files stored in the file system, containing conversations between customers and agents.

The ultimate goal was to provide the data scientist with a simple way (a tool of some kind) allowing her to choose and experiment in an ad-hoc manner with multiple tabular views of customer-related data. Each different combination of columns yields a different view. Intuitively, there are two simple yet ineffective ways of achieving that: a) collecting all data in a single data repository and performing reporting tasks on top of it (a data lake approach), or b) producing it programmatically, using an RDBMS for example as an intermediate storage point. Both of those are impractical. The first one imposes significant costs of moving data around, and data lakes have received criticism in terms of governance, security, and lack of consistency [13], potentially leading to a "data swamp" [11, 12], although research tries to mitigate these problems. The second one requires significant manual intervention and is not flexible with respect to schematic updates (changes) in the underlying data sources.

The solution we chose was to keep data in their respective host systems and define "tabular" views over these systems (a mediator approach): start from one or more "base" columns (e.g., contract ID, customer ID, area code) and incrementally extend this "basic" schema with columns containing data or computations coming from different data sources. For example, one could define the first column (base) to contain the IDs of customers. Then, she could add columns corresponding to the age, gender and educational background of each customer, coming from the Demographics RDBMS. Then, she could add a column containing, for each customer, the set of emails the customer has sent in the last six months. This is a set of texts coming from the CRM database. Then, she could add a column that computes the average sentiment of these emails, using some Python script. Finally, she could add a column corresponding to the customer's monthly average usage in the last six months, coming from the Billing Data Warehouse (DW). This process is depicted in Figure 1.

Figure 1: Defining a Tabular View over Multiple Data Sources

The idea is similar to Multi-Feature SQL [3, 6], MD-Joins [5], Grouping variables [4] and Associated Sets [9]. In these papers, one can express a series of outer joins combined with aggregation, possibly correlated, in a succinct and concise manner. At the same time, several efficient evaluation techniques (based on parallel, distributed and in-memory processing) are presented for these queries. More recently, the same idea is expressed using dataframes in R and Python pandas (without correlation between columns), as well as in Spark. The fundamental difference between our proposal and these approaches is that we treat, in principle, our data as columns coming from heterogeneous sources, and dataframes are composed on the fly from them. In R and pandas, dataframes are typically created from data imported from different, possibly heterogeneous sources, but the dataframe reflects the structure of the underlying data. One could adopt a columnar key-value approach, for example by using exclusively Series in pandas or atomic vectors and lists in R, but that would probably defeat the purpose and the philosophy of these frameworks. Concerning Spark, again dataframes reflect the structure of underlying data; moreover, most likely data are copied to an underlying HDFS substrate.

This is the pattern in most big data tasks: building a dataframe over different datasets, to serve as input to data mining algorithms, visualization tools and reporting systems. Essentially this could be modeled conceptually as a single tabular data representation of joined results coming from different data management systems. Each result consists of the keys used for the outer join and the corresponding values to be added as a column; it can therefore be represented as a set of key-value pairs. This formalism is quite appropriate to represent columnar data in a dataframe. It is simple, amenable to distributed, fault-tolerant and scalable implementations, and can easily, even naturally, represent most well-known data models. At the same time, key-value engines, such as Redis, have been organically developed, through analysis of real applications at Google, Amazon, Facebook, LinkedIn and elsewhere.

Going one step further, one can imagine globally available key-value structures, produced outside the organization and used in the same manner. For example, an analytics provider could generate for each Facebook user some social metrics (e.g., number of check-ins in the last month), useful in a data scientist's analysis. We envision a global "environment" for these key-value structures that analysts pick and try in their data mining algorithms or embed in their visualizations and reporting. We call these widely available key-value structures globalized analytics. The challenge is to provide a framework for this environment, to make this process simple, useful and efficient. In the next section we present our proposal for how such a framework can be built. The framework will leverage the idea of distributed key-value pairs used to compose dataframes over heterogeneous data, allowing users to compose and manipulate tabular views of their data on the fly.

3 THE BIG PICTURE

The main abstraction for representing data is a column, consisting of a set of key-value pairs, called a key-valued structure (KVS). Columns can be joined to create dataframes. Columns have the following characteristics:

• Columns may be distributed among different machines. That means that a dataframe can comprise data residing in different machines, and the data is joined on the fly to create an integrated dataframe.
• The column keys must be unique, but the value associated with a key need not be atomic, so that values can be lists or sets. Therefore, a column can represent both a vector of atomic values, as well as associations between keys and collections of values. In this way a column can act as the intermediate stage between the map and reduce phases of a typical MapReduce job.
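The incremental, outer-join-based view construction described above can be mimicked with ordinary pandas joins. The following is a minimal sketch, not the paper's tool; all identifiers and values are invented for illustration.

```python
import pandas as pd

# "Base" column: customer IDs (e.g., drawn from the Demographics RDBMS).
df = pd.DataFrame({"customer_id": [101, 102, 103]})

# Each source contributes a set of key-value pairs: the key is the
# customer ID, the value becomes the content of the new column.
demographics = pd.DataFrame({"customer_id": [101, 102], "age": [34, 51]})
avg_usage = pd.DataFrame({"customer_id": [102, 103], "monthly_usage": [12.5, 7.0]})

# Outer-joining on the key incrementally extends the view; keys missing
# from a source simply yield NaN in that column.
df = df.merge(demographics, on="customer_id", how="left")
df = df.merge(avg_usage, on="customer_id", how="left")
```

Every added column follows the same pattern regardless of the system that produced it, which is exactly what makes the key-value formalism a convenient common denominator.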
• The values of a column may exist in three states: defined, not available, and observer. The defined and not available states correspond to knowns and known unknowns, respectively. The observer state corresponds to values that we expect to be filled in the future, possibly more than once.

The observer status allows us to handle data dynamically, from disparate sources, without requiring that all data be available and frozen at each source. In effect, this extends the semantics of our proposed model with reactive programming concepts (e.g., see [17]) and facilitates the handling of streams. Streams are collections of multiple values that are pushed to their destinations—in contrast to a pull model, where we request values from a stream, we can also use a push model, where the source of the stream emits values that enter the stream. In this way, a stream is an observable. A column can be an observer that receives the values emitted by the observable.

Columns can be joined based on the values of their keys. A dataframe is a collection of columns that have the same set of keys. The set of keys must be determined at column creation time, so that the values can be filled in from the underlying data. Dataframes themselves do not contain data; their constituent columns do. That means that when we are presented with a dataframe and we interact with it, we are in fact interacting with the underlying columns. Some data may be local, if the corresponding columns are local, but in general data may be remote, or even not yet present, when in the observer state.

Note that there is no assumption that the keys of the columns are consistent. Indeed, keys over heterogeneous data sources are not consistent in most cases. However, at a certain point there must be a mapping step, which can either be transparent to our system, or it can be handled through an intermediate column. Similarly, values need not be consistent: imagine two monetary columns being joined in the dataframe, but being expressed in different currencies.¹ Such issues could be tackled with transformation tasks.

¹Our thanks to the anonymous reviewer who provided this example.

4 ARCHITECTURE

We propose a layer of columns backed by a commonly-referenced memory space for establishing global views in a tabular data format. The columns contain data, indexed by their keys. The underlying key-value structures contain minimal schema information:

• The values may be atomic, or they may be lists or sets. Lists are ordered, while sets are not; lists may contain the same value multiple times (at different positions), while sets obey the usual set semantics.
• Lists and sets may be composed of other lists and sets, interchangeably, and atomic items. Therefore, lists and sets can represent arbitrarily complex nested structures.
• Atomic values do not need any schema assumptions, so a column could contain both numeric and string data. For practical and efficiency reasons, however, an implementation could choose to represent columns using specific underlying datatypes. For example, if a column is known to contain integers, the column could be declared to be of integer type to speed up calculations. If a column contains multiple types then it would be represented as a generic object type.

The column data could be stored by any underlying mechanism, which could be a relational database, or a NoSQL key-value store, or a dynamic data source. The underlying data source is only relevant to the data scientist at column creation time, when she has to describe the column; it then remains invisible. The data source does not need to physically host the data. It may produce the data that will fill in the column. In this scenario, the data in the column are in the observer state, as explained in Section 3.

The data of a column are handled by a Column Manager (CM, or simply manager). A manager is responsible for providing the data in the form of key-value pairs. Managers accept incoming requests from Column Consumers (CCs, or simply consumers). Consumers and managers communicate according to a CM-CC protocol that defines the location where the underlying KVS will be stored. The location of the KVSs is independent of both the consumer and the manager. They may be stored at either of them, or somewhere else entirely; moreover, they may be produced programmatically or be saved in a distributed, redundant manner among different machines. The location can be negotiated: for example, the consumer may offer a location that the manager will accept, the consumer may ask the manager to respond with a location, or the consumer may suggest a location and the manager may respond with a different one. In this way, KVSs can be reused among different consumers. To cover different scenarios, KVSs reside in a globally addressable storage space.

The idea is simple: a consumer wants to use data from a manager. The data could be the result of an SQL statement, a MapReduce job, a script, etc. The consumer communicates with the manager and passes to it the address of the KVS. The consumer also communicates with the KVS and passes to it the set of keys whose values will be filled in by the manager; the communication between the consumer and the KVS may take place well before the communication between the consumer and the manager. The manager finds the keys in the KVS and fills in the corresponding values.

To create a dataframe, a consumer communicates with the managers that handle the columns it wants. It passes to each of the managers the same set of keys and agrees with them on the addresses of the KVSs that will contain the data. Then the consumer can obtain the data from the KVSs and present them to the data scientist as an integrated dataframe. Note that the consumer may access the KVSs at any time, asynchronously, even before the managers complete the KVSs. This way, dataframes can incorporate columns corresponding to stream computations.

The above can be implemented with a two-layered architecture. At the upper layer we have the consumer-manager communication. At the lower layer we have the set of KVSs. Consumers and managers communicate with the KVSs using a column-to-KVS communication protocol. Figure 2 summarizes the aforementioned discussion.

Currently there are efforts pointing towards a separate addressable memory layer, such as RAMCloud [18] and Piccolo [19], which both share the notion of in-memory addressable "tables" supporting key-value operations. Clarifications, challenges and opportunities of the proposed architecture are presented below.

5 CHALLENGES AND OPPORTUNITIES

Commonly-Referenced Memory Space. The architecture implies the presence of a well-defined API so that CCs can create and manage KVSs. The development of such an API requires careful
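The three-state model above can be sketched with a simple in-memory representation. This is our own illustration, not an existing implementation; the class and method names are invented.

```python
from enum import Enum

class State(Enum):
    DEFINED = "defined"        # a known value
    NOT_AVAILABLE = "na"       # a known unknown
    OBSERVER = "observer"      # to be filled in the future, possibly repeatedly

class Column:
    """A set of key-value pairs; each value carries an explicit state."""
    def __init__(self):
        self.entries = {}  # key -> (State, value)

    def define(self, key, value):
        self.entries[key] = (State.DEFINED, value)

    def mark_unavailable(self, key):
        self.entries[key] = (State.NOT_AVAILABLE, None)

    def observe(self, key):
        # The value is expected from an observable source later on.
        self.entries[key] = (State.OBSERVER, None)

    def push(self, key, value):
        # An observable emits a value; only observer entries absorb it,
        # and they may be updated more than once.
        if self.entries.get(key, (None, None))[0] is State.OBSERVER:
            self.entries[key] = (State.OBSERVER, value)

col = Column()
col.define("u1", 42)
col.observe("u2")
col.push("u2", 3.14)
col.push("u2", 2.71)  # a later emission overwrites the earlier one
```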
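The push model just described can be sketched as follows; `StreamSource` and `ObserverColumn` are hypothetical names used only for this illustration.

```python
class StreamSource:
    """The observable: a stream that pushes emitted values to subscribers."""
    def __init__(self):
        self.observers = []

    def subscribe(self, observer):
        self.observers.append(observer)

    def emit(self, key, value):
        for obs in self.observers:
            obs.on_next(key, value)

class ObserverColumn:
    """The observer: a column whose entries are filled as values arrive."""
    def __init__(self):
        self.data = {}  # key -> latest pushed value

    def on_next(self, key, value):
        self.data[key] = value

ticks = StreamSource()
col = ObserverColumn()
ticks.subscribe(col)
ticks.emit("ACME", 10.0)
ticks.emit("ACME", 10.5)  # the same key is filled more than once
```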
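This interaction can be sketched with toy in-memory stand-ins for the KVS, the manager and the consumer. All names below are illustrative assumptions, not a prescribed API.

```python
class KVS:
    """A key-value structure living in the globally addressable space."""
    def __init__(self, address):
        self.address = address
        self.pairs = {}

    def register_keys(self, keys):
        # The consumer deposits the keys whose values are to be filled in.
        for k in keys:
            self.pairs.setdefault(k, None)

class ColumnManager:
    def __init__(self, compute):
        self.compute = compute  # wraps an SQL query, MapReduce job, script, ...

    def fill(self, kvs):
        # The manager finds the keys in the KVS and fills in the values.
        for k in kvs.pairs:
            kvs.pairs[k] = self.compute(k)

class ColumnConsumer:
    def request_column(self, manager, kvs, keys):
        kvs.register_keys(keys)  # consumer -> KVS (may happen well before)
        manager.fill(kvs)        # consumer -> manager, passing the KVS address
        return kvs.pairs

consumer = ColumnConsumer()
kvs = KVS("kvs://example.org/telecom/usage")  # hypothetical address scheme
usage = consumer.request_column(ColumnManager(lambda k: k * 10), kvs, [1, 2, 3])
```

In a real deployment the two calls would go over the CC-to-CM and Column-to-KVS protocols rather than direct method calls, and the fill could complete asynchronously.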
consideration. While CRUD operations are clearly understood, a discussion is required for the exact format and behavior of each. In particular, the read operation should allow some filtering of the KVS, either through a simple predicate over keys and values or by providing a set of keys to be selected. Our first implementation [7] allows filtering conditions over just the key, but in many industrial applications complex expressions involving values are not uncommon. Other issues that have to be addressed are: what if a CC does not delete a KVS that it has created? Should the corresponding KVS management system implement garbage collection? Who is the "owner" of a KVS? What is the "lifetime" of a created KVS? Which CCs are allowed to access this KVS and in what mode (read/write)? For an example, consider Webdis, an HTTP interface for Redis that provides some insights on these issues. A similar kind of middleware between data producers and consumers in the form of publish-subscribe is suggested in [14]. A commonly-referenced memory layer is also proposed in the Tachyon system [16], constrained however within a cluster.

Figure 2: Defining a Tabular View over Multiple Data Sources. (The figure shows a CC and a CM communicating over the CC-to-CM protocol; each accesses the globally addressable KVS space, containing multiple KVSs, over the Column-to-KVS protocol.)

Globally Addressable Key-Value Sets. This is a conceptual layer, consisting of systems that provide KVS management according to the proposed framework. To do so, such a system should implement the column-to-KVS API mentioned above and allow access to a KVS through an address, internet-wide, following some standardized addressing scheme. The scheme should capture location hierarchies (e.g., domains, sub-domains, etc.) and identify the position in the memory hierarchy of a KVS. There is no restriction on what such a system could be. It could store KVSs anywhere in the memory hierarchy: main memory, distributed cache, disk; it could guarantee (or not) fault tolerance, availability, etc. In addition, it should provide answers on how it handles ownership, lifetime and access control of KVSs.

Suitability for Stream Engines. The layered architecture essentially introduces a referencing layer (i.e., indirection) between communicating programs (the CC and the CM). This is particularly appropriate for collaborating applications involving stream data: a stream management CM can continuously produce aggregated data (e.g., the average stock price over a sliding window of 10 minutes) consumed by the CC. The asynchronous access to the shared KVS allows the data consumer to retrieve data whenever it deems appropriate (e.g., [18]); alternatively, the data can be observed continuously by the consumer in a reactive approach (recall Section 3).

Transactionality Issues. A potentially challenging aspect of the proposed architecture is the issue of transactional consistency at the KVS layer. We currently consider non-materialized dataframes, so we do not have to deal with transactions and isolation at the data sources. However, transactionality issues still arise in the case of complex workflows where multiple CCs constantly request execution from a CM. For example, consider two separate CC-to-CM connections where both CMs populate the same KVS. As another example, if a CM is using a remote address to store a continuously running query that returns a huge set of key-value pairs, is that large result updated atomically or incrementally? If some data feeds are slow and some are fast, one might get an inconsistent (non-serializable) view of the KVS layer. What (if anything) can the framework do to manage transactional requirements across systems? For instance, when a CC creates a KVS (and thus becomes the "owner" of the KVS), it could also specify the required isolation level for that KVS.

Query Response Times. Mediation approaches do not, in general, have good query response times. This is one of the main reasons for building data warehouses and having Extraction, Transformation, and Loading (ETL) processes: storing data in one system, using a single data model, and having an efficient query processing engine. However, the goal of this work is not performance, but functionality and interoperability. We want to enable users to easily construct dataframes in heterogeneous environments, employing multiple programming languages. Usually, such a dataframe will feed some learning algorithm, instead of being used for online queries. In this context, some of the columns of a dataframe could be formed as queries over a relational system using a metadata catalog like the Hive Metastore. However, statistical information existing in the Metastore is relevant to the query that produces the column and is transparent to our architecture, which will simply use the end result in an outer join.

Opportunities. The proposed architecture can also be used to generalize various existing distributed data management frameworks, such as distributed relational query processors, MapReduce evaluation algorithms and column-oriented processing engines. Moreover, given the diversity in data management systems, it opens up a wide range of interesting possibilities, both in terms of infrastructures and optimization opportunities. The interested reader can refer to [8].

6 DISCUSSION AND CONCLUSIONS

In this paper, we presented a layered architecture for data interoperability based on a ubiquitous universe of remotely accessible key-value sets. The architecture uses a number of concepts that can and will be formalized in an extended version of this paper—doing so here would be beyond the scope of a vision paper. In essence, with the proposed architecture we completely decouple the computation and memory layers of any data management scenario. By doing so we are able to generalize, abstract and effectively encapsulate all the key components of distributed data computation, storage and management. We believe that such an approach is a first step towards an interoperable universe of big data systems. Along the way, however, there are numerous challenges that can serve as fruitful research directions.

REFERENCES

[1] Daniel Abadi, Rakesh Agrawal, Anastasia Ailamaki, Magdalena Balazinska, Philip A. Bernstein, Michael J. Carey, Surajit Chaudhuri, Jeffrey Dean, AnHai
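One possible shape for such a KVS management API is sketched below, with invented names: CRUD operations plus a read that filters either by an explicit key set or by a predicate over keys and values, together with minimal ownership metadata for lifetime and access-control questions.

```python
class KVSStore:
    """Toy stand-in for a system in the globally addressable KVS layer."""
    def __init__(self):
        self.kvs = {}    # address -> dict of key-value pairs
        self.owner = {}  # address -> owning CC (lifetime / access control)

    def create(self, address, owner):
        self.kvs[address] = {}
        self.owner[address] = owner

    def update(self, address, key, value):
        self.kvs[address][key] = value

    def read(self, address, predicate=None, keys=None):
        pairs = self.kvs[address]
        if keys is not None:          # select a given set of keys
            return {k: pairs[k] for k in keys if k in pairs}
        if predicate is not None:     # predicate over key and value
            return {k: v for k, v in pairs.items() if predicate(k, v)}
        return dict(pairs)

    def delete(self, address):
        # If the owning CC never calls this, a garbage-collection
        # policy would be needed, as discussed above.
        del self.kvs[address]
        del self.owner[address]

store = KVSStore()
store.create("kvs://example.org/metrics", owner="cc-1")
store.update("kvs://example.org/metrics", "u1", 5)
store.update("kvs://example.org/metrics", "u2", 17)
large = store.read("kvs://example.org/metrics", predicate=lambda k, v: v > 10)
```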
     Doan, Michael J. Franklin, Johannes Gehrke, Laura M. Haas, Alon Y. Halevy,
     Joseph M. Hellerstein, Yannis E. Ioannidis, H. V. Jagadish, Donald Kossmann,
     Samuel Madden, Sharad Mehrotra, Tova Milo, Jeffrey F. Naughton, Raghu
     Ramakrishnan, Volker Markl, Christopher Olston, Beng Chin Ooi, Christopher
     Ré, Dan Suciu, Michael Stonebraker, Todd Walter, and Jennifer Widom. 2016.
     The Beckman report on database research. Commun. ACM 59, 2 (2016), 92–99.
     https://doi.org/10.1145/2845915
 [2] Silvana Castano, Valeria De Antonellis, and Sabrina De Capitani di Vimercati.
     2001. Global Viewing of Heterogeneous Data Sources. IEEE Trans. Knowl.
     Data Eng. 13, 2 (2001), 277–297. https://doi.org/10.1109/69.917566
 [3] Damianos Chatziantoniou. 1999. Evaluation of Ad Hoc OLAP: In-Place
     Computation. In ACM/IEEE International Conference on Scientific and Statistical
     Database Management (SSDBM). 34–43.
 [4] Damianos Chatziantoniou. 2007. Using grouping variables to express complex
     decision support queries. Data Knowl. Eng. 61, 1 (2007), 114–136. https:
     //doi.org/10.1016/j.datak.2006.05.001
 [5] Damianos Chatziantoniou, Michael Akinde, Ted Johnson, and Samuel Kim.
     2001. The MD-Join: An Operator for Complex OLAP. In IEEE International
     Conference on Data Engineering. 524–533.
 [6] Damianos Chatziantoniou and Kenneth Ross. 1996. Querying Multiple Fea-
     tures of Groups in Relational Databases. In 22nd International Conference on
     Very Large Databases (VLDB). 295–306.
 [7] Damianos Chatziantoniou and Florents Tselai. 2014. Introducing Data Con-
     nectivity in a Big Data Web. In Proceedings of the Third Workshop on Data
     analytics in the Cloud, DanaC 2014, June 22, 2014, Snowbird, Utah, USA, In
     conjunction with ACM SIGMOD/PODS Conference. 7:1–7:4. https://doi.org/10.
     1145/2627770.2627773
 [8] Damianos Chatziantoniou and Florents Tselai. 2016. The Data Manage-
     ment Entity: A Simple Abstraction to Facilitate Big Data Systems Inter-
     operability. In Proceedings of the Workshops of the EDBT/ICDT 2016 Joint
     Conference, EDBT/ICDT Workshops 2016, Bordeaux, France, March 15, 2016.
     http://ceur-ws.org/Vol-1558/paper38.pdf
 [9] Damianos Chatziantoniou and Elias Tzortzakakis. 2009. ASSET queries: a
     declarative alternative to MapReduce. SIGMOD Record 38, 2 (2009), 35–41.
     https://doi.org/10.1145/1815918.1815926
[10] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska,
     Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and
     Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record
     44, 2 (2015), 11–16. https://doi.org/10.1145/2814710.2814713
[11] Mina Farid, Alexandra Roatis, Ihab F. Ilyas, Hella-Franziska Hoffmann, and
     Xu Chu. 2016. CLAMS: Bringing Quality to Data Lakes. In Proceedings of the
     2016 International Conference on Management of Data (SIGMOD ’16). ACM,
     New York, NY, USA, 2089–2092. https://doi.org/10.1145/2882903.2899391
[12] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An Intel-
     ligent Data Lake System. In Proceedings of the 2016 International Conference
     on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 2097–2100.
     https://doi.org/10.1145/2882903.2899389
[13] Nick Heudecker and Andrew White. 2014. The Data Lake Fallacy: All Water
     and Little Substance. Gartner report. (July 23, 2014).
[14] Rajive Joshi. 2007. Data-Oriented Architecture: A Loosely-Coupled Real-Time
     SOA. Technical Report. Real-Time Innovations, Inc.
[15] Maurizio Lenzerini. 2002. Data Integration: A Theoretical Perspective. In
     Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on
     Principles of Database Systems, June 3-5, Madison, Wisconsin, USA. 233–246.
     https://doi.org/10.1145/543613.543644
[16] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014.
     Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks.
     In Proceedings of the ACM Symposium on Cloud Computing, Seattle, WA, USA,
     November 3–5, 2014. 6:1–6:15. https://doi.org/10.1145/2670979.2670985
[17] Erik Meijer. 2012. Your Mouse is a Database. Commun. ACM 55, 5 (May 2012),
     66–73. https://doi.org/10.1145/2160718.2160735
[18] John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob
     Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Guru M.
     Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan
     Stutsman. 2009. The case for RAMClouds: scalable high-performance storage
     entirely in DRAM. Operating Systems Review 43, 4 (2009), 92–105. https:
     //doi.org/10.1145/1713254.1713276
[19] Russell Power and Jinyang Li. 2010. Piccolo: Building Fast, Distributed Pro-
     grams with Partitioned Tables. In Proceedings of the 9th USENIX Symposium
     on Operating Systems Design and Implementation, OSDI 2010, October 4–6,
     2010, Vancouver, BC. 293–306. http://www.usenix.org/events/osdi10/tech/full_
     papers/Power.pdf
[20] Michael Stonebraker. 2015. The Case for Polystores. ACM SIGMOD Blog.
     (July 13, 2015). http://wp.sigmod.org/?p=1629