Integrating Analytics with Relational Databases

Mark Raasveldt
supervised by Hannes Mühleisen and Stefan Manegold
Centrum Wiskunde & Informatica, Amsterdam, Netherlands
m.raasveldt@cwi.nl

Proceedings of the VLDB 2018 PhD Workshop, August 27, 2018, Rio de Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
In order to uncover insights and trends, it is an increasingly common practice for companies of all shapes and sizes to gather large quantities of data and then analyze that data. This data can come from a multitude of different sources, ranging from data gathered about consumer behavior to data gathered from sensors. The most prevalent way of storing and managing data has traditionally been a relational database management system (RDBMS). However, there is currently a disconnect between the tools used for analyzing data and the tools used for storing that data. Instead of working directly with RDBMSes, these tools are built to work in a stand-alone fashion and offer integration with RDBMSes as an afterthought. The focus of my PhD research is on investigating different methods of combining popular analytical tools (such as R or Python) with database management systems in an efficient and user-friendly fashion.

Figure 1: Different ways of connecting analytical tools with a database management system. (a) Client-server connection. (b) In-database processing. (c) Embedded database.

1. INTRODUCTION
There is a disconnect between data-intensive analytical tools and traditional database management systems. Data scientists using these tools often prefer to manually manage their data by storing it either as structured text (such as CSV or XML files) or as binary files [8]. This approach to managing data introduces many problems, especially when a large amount of data from different sources has to be managed or combined. Flat file storage requires tremendous manual effort to maintain, is often difficult to reason about because of the lack of a rigid schema, and is difficult to share between multiple users. Furthermore, modifying the data is prone to corruption because of the lack of transactional guarantees and atomic write actions. Another consequence of this disconnect is that data scientists have re-implemented many common database operations inside libraries such as dplyr [16] or Pandas [9]. Instead of performing joins or aggregations using a RDBMS, they perform them using these libraries. However, these libraries suffer from having to load all required data and intermediates into memory, leading to frequent out-of-memory problems or poor performance due to swapping.
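To make this disconnect concrete, the following minimal sketch shows the flat-file workflow described above; the file and column names are hypothetical. Both inputs and the join intermediate must fit in the client's memory, and the join and aggregation are re-implemented client-side instead of being executed by a RDBMS.

import pandas as pd

# Both files are read into memory in full before any work can start.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Join and aggregation re-implemented in Pandas instead of a RDBMS;
# the merge result is yet another in-memory intermediate.
revenue = (orders.merge(customers, on="customer_id")
                 .groupby("country")["price"].sum())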
These issues could be solved through the use of a RDBMS. A RDBMS can prevent data corruption through its ACID properties, it can automatically manage data storage for the user, and it can make data easier to reason about by enforcing a rigid schema. In addition, a RDBMS can execute queries efficiently on larger-than-memory data, and it allows concurrent read and write access to the data in a safe way.

Popular analytical tools such as R or Python can be used in conjunction with database systems. There are SQLite bindings for these languages, and it is possible to connect to a database server using standard client connector protocols. However, data scientists prefer flat file storage over these existing approaches, because these connections are either inefficient or inconvenient to use.

The focus of this work is on identifying the problems encountered when combining a RDBMS with analytical tools, and on implementing various solutions to overcome these issues to allow for both a more efficient and more flexible combination of these tools. Figure 1 shows the three main methods in which a relational database can be combined with an analytical tool. We investigate each of these methods, and attempt to improve them from both a usability and a performance perspective.

2. CLIENT-SERVER CONNECTION
The standard method of combining a standalone program with a RDBMS is through a client-server connection, visualized in Figure 1a. In this setup, the database server is completely separate from the analytical tool: it runs as either a separate process on the same machine or on a different machine entirely. The analytical tool can issue queries to the database, after which the server computes the answer to the query and transfers the result to the client through a socket. This process is shown in Figure 2.

Figure 2: Communication between a client and server.

In order to perform analytics on data stored inside the database, the data is exported from the database to the analytical tool over the socket connection, after which that data is processed in the client. The main advantage of this approach is that it is mostly database agnostic, as the standardized ODBC or JDBC connectors can be used to connect to almost every database. In addition, it is relatively easy to integrate into existing pipelines, as loading from flat files can be replaced by loading from a database without having to touch the rest of the pipeline.

However, this approach is problematic when dealing with the large amounts of data that modern analytical pipelines often require. The time spent on serializing large result sets and transferring them from the server to the client can be a significant bottleneck. In addition, this approach requires the full dataset to fit inside the client's memory.

In our work [13], we survey popular RDBMSes and note that they are not optimized for high-volume data export. They take a significant amount of time to export a relatively small amount of data, even when the server and client are located on the same machine or are connected through a high-throughput network. This is because existing client protocols were designed for the transfer of a small number of rows in OLTP workloads, and they have significant per-tuple and per-value overheads that result in the slow export of large tables.

To remedy this problem, we investigate the different design choices that can be made when designing a result set serialization format, and we propose a new client protocol that is optimized for the transfer of large amounts of data from the server to the client. By using a column-major chunk-wise format that utilizes lightweight compression and a binary format close to the native database format, we can export large tables an order of magnitude faster than existing solutions. However, even with a client protocol optimized for this scenario, a significant amount of time is still required to push data "over the wire". In addition, as this approach only replaces the loading of data from flat files with the loading of data from the RDBMS into the client, it still requires the entire dataset and all intermediates to fit inside the client's main memory.
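From the client's perspective, this pattern looks roughly as follows. The sketch uses psycopg2, a Python DB-API driver for PostgreSQL, with hypothetical connection parameters; the fetchall() call is where both bottlenecks appear, as the entire result set is serialized, pushed over the socket, and materialized in client memory.

import pandas as pd
import psycopg2

# Client-server connection of Figure 1a: the server is a separate process.
conn = psycopg2.connect(host="localhost", dbname="tpch", user="analyst")
cur = conn.cursor()
cur.execute("SELECT * FROM lineitem")
rows = cur.fetchall()  # entire result set is transferred and held in memory
df = pd.DataFrame(rows, columns=[col[0] for col in cur.description])
conn.close()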
3. IN-DATABASE PROCESSING
In order to avoid the cost of exporting the data from the database, the analysis can be performed inside the database server. This method, known as in-database processing, is shown in Figure 1b.

In-database processing can be performed in a database-agnostic way by rewriting the analysis pipeline as a set of standard-compliant SQL queries. However, most data analysis, data mining and classification operators are difficult and inefficient to express in SQL. The SQL standard describes a number of built-in scalar functions and aggregates, such as AVG and SUM [7], but this small set of functions and aggregates is not sufficient to perform complex data analysis tasks [15].

Instead of writing the analysis pipelines in SQL, user-defined functions or user-defined aggregates in procedural languages such as C/C++ can be used to implement classification and machine learning algorithms. This is the approach taken by Hellerstein et al. [4]. However, these functions still require significant rewrites of existing analytical pipelines written in vectorized scripting languages. In addition, writing user-defined functions in these languages requires in-depth knowledge of the database internals and of the execution model used by the database [3].

In order to make it easier to perform in-database analytics, we introduced MonetDB/Python UDFs [12] in the open-source DBMS MonetDB [6]. These user-defined functions are written in Python and process data in a vectorized fashion. The input and output variables of the functions and aggregates can be provided as either standard NumPy arrays [14] or Pandas DataFrames [9]. In this way, the user-defined functions mimic the execution of regular analytical Python programs and can be written without any knowledge of the database internals. Because of their vectorized nature, the heavy interpreter overhead is incurred not once for every tuple but only once for every invocation of the function. Combined with the use of zero-copy techniques for both the input and output columns, these functions can be executed efficiently on large datasets.
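As a minimal illustration, a MonetDB/Python UDF is registered with the CREATE FUNCTION ... LANGUAGE PYTHON syntax described in [12]; the function below is illustrative rather than taken from the paper. The input column i arrives as a NumPy array, so the body is ordinary vectorized Python.

CREATE FUNCTION rescale(i INTEGER) RETURNS DOUBLE
LANGUAGE PYTHON {
    # i is the whole input column as a NumPy array, passed in zero-copy;
    # the interpreter is entered once per invocation, not once per tuple.
    import numpy
    return (i - numpy.mean(i)) / numpy.std(i)
};

SELECT rescale(i) FROM tbl;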
Parallelism. Another advantage of UDFs is that they can take advantage of the database's automatic parallelization model. In MonetDB, parallel execution is achieved by marking individual operators as either parallelizable or blocking. When a chain of parallel operators is executed on a column, the column is split up into several chunks and the operator is executed once on each chunk. When a blocking operator is encountered, the chunks are packed back into a single column and the blocking operator is executed on that column. This process is visualized in Figure 3 for the query of Listing 1: the parallelizable SQRT operator runs on each chunk, while the blocking MEDIAN aggregate requires the entire column.

SELECT MEDIAN(SQRT(i * 2)) FROM tbl;

Listing 1: Chain of SQL operators.

Figure 3: Parallel operator chain of Listing 1.

MonetDB/Python UDFs can be parallelized in the same way. A function can be set either to allow parallelization, in which case it is executed as a parallelizable operator, or to disallow parallelization, in which case it operates on the entire column as input. User-defined aggregates are parallelized over the different groups, where the aggregate is called once for each group with the tuples belonging to that group as input. The aggregates computed for each group are then gathered and combined to form the final result.

Development Workflow. A challenge when developing user-defined functions is that, since they are executed inside the database server, standard tools and integrated development environments (IDEs) cannot be used to develop them. As a result, developers cannot use sophisticated debugging techniques (e.g., interactive debugging) and have to resort to inefficient debugging strategies to make their code work. In order to make it easier to develop MonetDB/Python UDFs, we extended the MonetDB client to allow for local testing of user-defined functions [5]. The required data (or a sample of it) is automatically shipped from the database to the client together with the source code of the UDF. The function can then be executed locally and run in either a stand-alone interactive debugger or a full-fledged IDE.

Model Management. Another issue that arises is the management of different machine learning models. Current systems, such as TensorFlow, allow models to be written to disk as individual files. However, much like handling data as flat files, handling models as flat files is cumbersome and error-prone. In our work [11], we investigate how model management can be done using a relational database, as sketched below. By storing the models in a relational database, we can keep the models alongside their training information or other meta-information gathered about each model. This allows us to query and apply the models based on this information, as well as apply multiple models in parallel for ensemble learning.
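A minimal sketch of this idea follows; the schema and all names are hypothetical, and Python's built-in sqlite3 module stands in for any DB-API-compliant connector. A trained model is serialized into a table next to its metadata, so that models can later be selected by querying that metadata.

import pickle
import sqlite3  # any DB-API-compliant connector follows the same pattern

conn = sqlite3.connect("models.db")
conn.execute("""CREATE TABLE IF NOT EXISTS models (
                  name TEXT, trained_on TEXT, accuracy REAL, model BLOB)""")

model = {"weights": [0.3, 0.7]}  # stand-in for a trained model object
conn.execute("INSERT INTO models VALUES (?, ?, ?, ?)",
             ("churn_v1", "train_2018_q2", 0.92, pickle.dumps(model)))
conn.commit()

# The stored metadata can be queried like any other relational data,
# e.g. to retrieve the most accurate model and apply it.
row = conn.execute(
    "SELECT model FROM models ORDER BY accuracy DESC LIMIT 1").fetchone()
best_model = pickle.loads(row[0])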
4. EMBEDDED DATABASE
Both of the previously described approaches require the user to have a running database server. This demands significant manual effort from the user, as the database server must be installed, tuned and continuously maintained. For small-scale data analysis, the effort spent on maintaining the database server often negates the benefits of using one. Embedding a database inside the client program, as shown in Figure 1c, is more suitable for these use cases. As the database can be installed and run from within the client program, setting up and maintaining the database is much simpler than with a full-fledged database server.

The most commonly used embedded database is SQLite [2]. However, SQLite is first and foremost designed for transactional workloads. It is a row-store database that uses a volcano-style processing model for query execution. While popular analytical tools such as Python and R do have SQLite bindings, SQLite does not perform well when used for analytical purposes. Even using SQLite purely as a storage engine typically works poorly in these scenarios: analyses often touch only a few columns of a table, yet SQLite's row-wise storage layout forces it to read entire rows. This can lead to very poor performance when dealing with wide data.

To fill this gap, we created MonetDBLite [10], an open-source embedded database based on the popular columnar database MonetDB. Much like SQLite, it is an in-process database that can be installed and run directly from within popular analytical tools without any external dependencies. However, unlike SQLite, it is designed for analytical workloads, and as such it performs significantly better when executing analytical queries that operate on large amounts of data. Because of the columnar layout of the database and its zero-copy semantics, data can be copied between the database and the analytical tool at constant cost, and no large costs need to be paid when extracting only a subset of columns from a wide table.

This efficient data transfer is illustrated by the experiment in Figure 4, where we transfer the lineitem table of the TPC-H benchmark [1] from the database to the client process using MonetDBLite, SQLite, MonetDB and PostgreSQL. We observe that data can be exported from MonetDBLite an order of magnitude faster than over a socket connection (in the case of MonetDB and PostgreSQL) or from the row-storage model of SQLite.

Figure 4: Transfer of the TPC-H lineitem table from the database to the client process.
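From the user's perspective, working with the embedded database requires no setup beyond installing the package. The following is a minimal sketch assuming the init/sql/insert entry points of the MonetDBLite Python bindings; treat the exact call names as an assumption rather than a definitive API reference.

import monetdblite

# The database runs inside the client process: no server to install or tune.
monetdblite.init("/tmp/dbfarm")  # directory holding the database files
monetdblite.sql("CREATE TABLE tbl (i INTEGER)")
monetdblite.insert("tbl", {"i": [1, 2, 3, 4]})

# Results come back as columnar, NumPy-backed data; with the columnar
# layout and zero-copy semantics, only the requested columns are touched.
result = monetdblite.sql("SELECT AVG(i) FROM tbl")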
5. RESEARCH DIRECTIONS
While our current solutions have made it both easier and more efficient to combine relational databases with analytical tools, connecting them efficiently and effortlessly is by no means a solved problem. In this section, we describe the open research problems that we have identified and how we plan on tackling them in the future.

Automatic Code Shipping. While user-defined functions allow for efficient in-database analytics, significant manual effort is still required to transform an existing analytical pipeline so that it runs inside the database server. Ideally, we would be able to automatically translate an existing analytical pipeline and execute it on data residing in the database without requiring manual user effort.

One solution could be to take a program that uses a database connector to connect to a database, and run that code directly inside the database server. Instead of connecting to the database through a socket, the SQL code could be executed directly inside the server and the results could be used inside the analytical tool without requiring data transfer. This approach does, however, negate the potential advantages of automatic parallelization.

Alternatively, the code could be analyzed and translated into user-defined functions that can be executed within the database server and could potentially be parallelized. Determining whether arbitrary code can be safely parallelized is not possible, though, as it would be equivalent to solving the halting problem. However, it would already be useful if a limited subset of operations could be automatically shipped and executed in parallel inside the database server. For example, a number of commonly used operations from the Pandas and NumPy libraries could be supported.

UDF Co-optimization. Currently, MonetDB/Python UDFs are executed as black-box functions. As a result, there is almost no room for automatic optimization of the actual code. The only optimization we apply is the parallelization of the functions; however, even this requires the user to tell us whether or not the function is parallelizable.

Lazy evaluation could allow us to optimize the UDFs further. Rather than executing the function eagerly, we could defer the execution of certain operations on the input columns (e.g., common NumPy and Pandas operations). This would allow us to build a computation graph, and either (1) run parts of that computation graph inside the database's execution engine (where it could be executed in parallel and take advantage of existing indexes), or (2) feed information extracted from the computation graph to the database optimizer.

Acknowledgments
This work was funded by the Netherlands Organisation for Scientific Research (NWO), project "Process Mining for Multi-Objective Online Control".

6. REFERENCES
[1] TPC Benchmark H (Decision Support) Standard Specification. Technical report, Transaction Processing Performance Council, June 2013.
[2] G. Allen and M. Owens. The Definitive Guide to SQLite. Apress, Berkeley, CA, USA, 2nd edition, 2010.
[3] Q. Chen, M. Hsu, and R. Liu. Extend UDF Technology for Integrated Analytics. In T. Pedersen, M. Mohania, and A. Tjoa, editors, Data Warehousing and Knowledge Discovery, volume 5691 of Lecture Notes in Computer Science, pages 256–270. Springer Berlin Heidelberg, 2009.
[4] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012.
[5] P. Holanda, M. Raasveldt, and M. Kersten. Don't Hold My UDFs Hostage - Exporting UDFs for Debugging Purposes. In Proceedings of the Simpósio Brasileiro de Banco de Dados, SBBD 2017, Uberlândia, Brazil, 2017.
[6] S. Idreos, F. Groffen, N. Nes, S. Manegold, S. Mullender, and M. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng. Bull., 2012.
[7] ISO. ISO/IEC 9075:1992, Database Language SQL. Technical report, International Organization for Standardization (ISO), July 1992.
[8] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926, Dec. 2012.
[9] W. McKinney. Data Structures for Statistical Computing in Python. In S. van der Walt and J. Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.
[10] M. Raasveldt. MonetDBLite: An Embedded Analytical Database. In Proceedings of the 2018 ACM International Conference on Management of Data (SIGMOD '18), 2018.
[11] M. Raasveldt, P. Holanda, H. Mühleisen, and S. Manegold. Deep Integration of Machine Learning Into Column Stores. In Proceedings of the 21st International Conference on Extending Database Technology (EDBT), 2018.
[12] M. Raasveldt and H. Mühleisen. Vectorized UDFs in Column-Stores. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20, 2016, pages 16:1–16:12, 2016.
[13] M. Raasveldt and H. Mühleisen. Don't Hold My Data Hostage: A Case for Client Protocol Redesign. Proc. VLDB Endow., 10(10):1022–1033, June 2017.
[14] S. van der Walt, S. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13(2):22–30, March 2011.
[15] H. Wang and C. Zaniolo. User-Defined Aggregates in Database Languages. In R. Connor and A. Mendelzon, editors, Research Issues in Structured and Semistructured Database Programming, volume 1949 of Lecture Notes in Computer Science, pages 43–60. Springer Berlin Heidelberg, 2000.
[16] H. Wickham. Package 'dplyr': A Grammar of Data Manipulation, 2017.