Integrating Analytics with Relational Databases

Mark Raasveldt
supervised by Hannes Mühleisen and Stefan Manegold
Centrum Wiskunde & Informatica, Amsterdam, Netherlands
m.raasveldt@cwi.nl

Proceedings of the VLDB 2018 PhD Workshop, August 27, 2018, Rio de Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
In order to uncover insights and trends, it is an increasingly common practice for companies of all shapes and sizes to gather large quantities of data and then analyze that data. This data can come from a multitude of different sources, ranging from data gathered about consumer behavior to data gathered from sensors. The most prevalent way of storing and managing data has traditionally been a relational database management system (RDBMS). However, there is currently a disconnect between the tools used for analyzing data and the tools used for storing that data. Instead of working directly with RDBMSes, these tools are built to work in a stand-alone fashion and offer integration with RDBMSes as an afterthought. The focus of my PhD research is on investigating different methods of combining popular analytical tools (such as R or Python) with database management systems in an efficient and user-friendly fashion.

Figure 1: Different ways of connecting analytical tools with a database management system. (a) Client-server connection. (b) In-database processing. (c) Embedded database.

1. INTRODUCTION
There is a disconnect between data-intensive analytical tools and traditional database management systems. Data scientists using these tools often prefer to manually manage their data by storing it either as structured text (such as CSV or XML files) or as binary files [8]. This approach to managing data introduces many problems, especially when a large amount of data from different sources has to be managed or combined. Flat file storage requires tremendous manual effort to maintain, is often difficult to reason about because of the lack of a rigid schema, and is difficult to share between multiple users. Furthermore, modifying the data is prone to corruption because of the lack of transactional guarantees and atomic write actions. Another consequence of this disconnect is that data scientists have re-implemented many common database operations inside libraries such as dplyr [16] or Pandas [9]. Instead of performing joins or aggregations using a RDBMS, they perform them using these libraries. However, these libraries suffer from having to load all required data and intermediates into memory, leading to frequent out-of-memory problems or poor performance due to swapping.
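To make this disconnect concrete, the following minimal sketch shows the flat-file workflow described above; the file and column names are hypothetical. Both inputs and the join intermediate must fit in the client's memory, and the join and aggregation are re-implemented client-side instead of being executed by a RDBMS.

import pandas as pd

# Both files are read into memory in full before any work can start.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Join and aggregation re-implemented in Pandas instead of a RDBMS;
# the merge result is yet another in-memory intermediate.
revenue = (orders.merge(customers, on="customer_id")
                 .groupby("country")["price"].sum())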
These issues could be solved through the use of a RDBMS. A RDBMS can prevent data corruption through its ACID properties, it can automatically manage data storage for the user, and it can make data easier to reason about by enforcing a rigid schema. In addition, a RDBMS can execute queries efficiently on larger-than-memory data, and it allows concurrent read and write access to the data in a safe way.

Popular analytical tools such as R or Python can be used in conjunction with database systems. There are SQLite bindings for these languages, and it is possible to connect to a database server using standard client connector protocols. However, data scientists prefer flat file storage over these existing approaches, because these connections are either inefficient or inconvenient to use.

The focus of this work is on identifying the problems encountered when combining a RDBMS with analytical tools, and on implementing various solutions to overcome these issues to allow for both a more efficient and more flexible combination of these tools. Figure 1 shows the three main methods in which a relational database can be combined with an analytical tool. We investigate each of these methods, and attempt to improve them from both a usability and a performance perspective.

2. CLIENT-SERVER CONNECTION
The standard method of combining a standalone program with a RDBMS is through a client-server connection, visualized in Figure 1a. In this setup, the database server is completely separate from the analytical tool: it runs as either a separate process on the same machine or on a different machine entirely. The analytical tool can issue queries to the database, after which the server computes the answer to the query and transfers the result to the client through a socket. This process is shown in Figure 2.

Figure 2: Communication between a client and server.

In order to perform analytics on data stored inside the database, the data is exported from the database to the analytical tool over the socket connection, after which that data is processed in the client. The main advantage of this approach is that it is mostly database agnostic, as the standardized ODBC or JDBC connectors can be used to connect to almost every database. In addition, it is relatively easy to integrate into existing pipelines, as loading from flat files can be replaced by loading from a database without having to touch the rest of the pipeline.

However, this approach is problematic when dealing with the large amounts of data that modern analytical pipelines often require. The time spent on serializing large result sets and transferring them from the server to the client can be a significant bottleneck. In addition, this approach requires the full dataset to fit inside the client's memory.

In our work [13], we survey popular RDBMSes and note that they are not optimized for high-volume data export. They take a significant amount of time to export a relatively small amount of data, even when the server and client are located on the same machine or are connected through a high-throughput network. This is because existing client protocols were designed for the transfer of a small number of rows in OLTP workloads, and they have significant per-tuple and per-value overheads that result in the slow export of large tables.

To remedy this problem, we investigate the different design choices that can be made when designing a result set serialization format, and we propose a new client protocol that is optimized for the transfer of large amounts of data from the server to the client. By using a column-major chunk-wise format that utilizes lightweight compression and a binary format close to the native database format, we can export large tables an order of magnitude faster than existing solutions. However, even with a client protocol optimized for this scenario, a significant amount of time is still required to push data "over the wire". In addition, as this approach only replaces the loading of data from flat files with the loading of data from the RDBMS into the client, it still requires the entire dataset and all intermediates to fit inside the client's main memory.
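From the client's perspective, this pattern looks roughly as follows. The sketch uses psycopg2, a Python DB-API driver for PostgreSQL, with hypothetical connection parameters; the fetchall() call is where both bottlenecks appear, as the entire result set is serialized, pushed over the socket, and materialized in client memory.

import pandas as pd
import psycopg2

# Client-server connection of Figure 1a: the server is a separate process.
conn = psycopg2.connect(host="localhost", dbname="tpch", user="analyst")
cur = conn.cursor()
cur.execute("SELECT * FROM lineitem")
rows = cur.fetchall()  # entire result set is transferred and held in memory
df = pd.DataFrame(rows, columns=[col[0] for col in cur.description])
conn.close()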
3. IN-DATABASE PROCESSING
In order to avoid the cost of exporting the data from the database, the analysis can be performed inside the database server. This method, known as in-database processing, is shown in Figure 1b.

In-database processing can be performed in a database-agnostic way by rewriting the analysis pipeline as a set of standard-compliant SQL queries. However, most data analysis, data mining and classification operators are difficult and inefficient to express in SQL. The SQL standard describes a number of built-in scalar functions and aggregates, such as AVG and SUM [7], but this small set of functions and aggregates is not sufficient to perform complex data analysis tasks [15].

Instead of writing the analysis pipelines in SQL, user-defined functions or user-defined aggregates in procedural languages such as C/C++ can be used to implement classification and machine learning algorithms. This is the approach taken by Hellerstein et al. [4]. However, these functions still require significant rewrites of existing analytical pipelines written in vectorized scripting languages. In addition, writing user-defined functions in these languages requires in-depth knowledge of the database internals and of the execution model used by the database [3].

In order to make it easier to perform in-database analytics, we introduced MonetDB/Python UDFs [12] in the open-source DBMS MonetDB [6]. These user-defined functions are written in Python and process data in a vectorized fashion. The input and output variables of the functions and aggregates can be provided as either standard NumPy arrays [14] or Pandas DataFrames [9]. In this way, the user-defined functions mimic the execution of regular analytical Python programs and can be written without any knowledge of the database internals. Because of their vectorized nature, the heavy interpreter overhead is incurred not once for every tuple but only once for every invocation of the function. Combined with the use of zero-copy techniques for both the input and output columns, these functions can be executed efficiently on large datasets.
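As a minimal illustration, a MonetDB/Python UDF is registered with the CREATE FUNCTION ... LANGUAGE PYTHON syntax described in [12]; the function below is illustrative rather than taken from the paper. The input column i arrives as a NumPy array, so the body is ordinary vectorized Python.

CREATE FUNCTION rescale(i INTEGER) RETURNS DOUBLE
LANGUAGE PYTHON {
    # i is the whole input column as a NumPy array, passed in zero-copy;
    # the interpreter is entered once per invocation, not once per tuple.
    import numpy
    return (i - numpy.mean(i)) / numpy.std(i)
};

SELECT rescale(i) FROM tbl;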
Parallelism. Another advantage of UDFs is that they can take advantage of the database's automatic parallelization model. In MonetDB, parallel execution is achieved by marking individual operators as either parallelizable or blocking. When a chain of parallel operators is executed on a column, the column is split up into several chunks and the operator is executed once on each chunk. When a blocking operator is encountered, the chunks are packed back into a single column and the blocking operator is executed on that column. This process is visualized in Figure 3 for the query of Listing 1: the parallelizable SQRT operator runs on each chunk, while the blocking MEDIAN aggregate requires the entire column.

SELECT MEDIAN(SQRT(i * 2)) FROM tbl;

Listing 1: Chain of SQL operators.

Figure 3: Parallel operator chain of Listing 1.

MonetDB/Python UDFs can be parallelized in the same way. A function can be set either to allow parallelization, in which case it is executed as a parallelizable operator, or to disallow parallelization, in which case it operates on the entire column as input. User-defined aggregates are parallelized over the different groups, where the aggregate is called once for each group with the tuples belonging to that group as input. The aggregates computed for each group are then gathered and combined to form the final result.

Development Workflow. A challenge when developing user-defined functions is that, since they are executed inside the database server, standard tools and integrated development environments (IDEs) cannot be used to develop them. As a result, developers cannot use sophisticated debugging techniques (e.g., interactive debugging) and have to resort to inefficient debugging strategies to make their code work. In order to make it easier to develop MonetDB/Python UDFs, we extended the MonetDB client to allow for local testing of user-defined functions [5]. The required data (or a sample of it) is automatically shipped from the database to the client together with the source code of the UDF. The function can then be executed locally and run in either a stand-alone interactive debugger or a full-fledged IDE.

Model Management. Another issue that arises is the management of different machine learning models. Current systems, such as TensorFlow, allow models to be written to disk as individual files. However, much like handling data as flat files, handling models as flat files is cumbersome and error-prone. In our work [11], we investigate how model management can be done using a relational database, as sketched below. By storing the models in a relational database, we can keep the models alongside their training information or other meta-information gathered about each model. This allows us to query and apply the models based on this information, as well as apply multiple models in parallel for ensemble learning.
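A minimal sketch of this idea follows; the schema and all names are hypothetical, and Python's built-in sqlite3 module stands in for any DB-API-compliant connector. A trained model is serialized into a table next to its metadata, so that models can later be selected by querying that metadata.

import pickle
import sqlite3  # any DB-API-compliant connector follows the same pattern

conn = sqlite3.connect("models.db")
conn.execute("""CREATE TABLE IF NOT EXISTS models (
                  name TEXT, trained_on TEXT, accuracy REAL, model BLOB)""")

model = {"weights": [0.3, 0.7]}  # stand-in for a trained model object
conn.execute("INSERT INTO models VALUES (?, ?, ?, ?)",
             ("churn_v1", "train_2018_q2", 0.92, pickle.dumps(model)))
conn.commit()

# The stored metadata can be queried like any other relational data,
# e.g. to retrieve the most accurate model and apply it.
row = conn.execute(
    "SELECT model FROM models ORDER BY accuracy DESC LIMIT 1").fetchone()
best_model = pickle.loads(row[0])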
4. EMBEDDED DATABASE
Both of the previously described approaches require the user to have a running database server. This demands significant manual effort from the user, as the database server must be installed, tuned and continuously maintained. For small-scale data analysis, the effort spent on maintaining the database server often negates the benefits of using one. Embedding a database inside the client program, as shown in Figure 1c, is more suitable for these use cases. As the database can be installed and run from within the client program, setting up and maintaining the database is much simpler than with a full-fledged database server.

The most commonly used embedded database is SQLite [2]. However, SQLite is first and foremost designed for transactional workloads. It is a row-store database that uses a volcano-style processing model for query execution. While popular analytical tools such as Python and R do have SQLite bindings, SQLite does not perform well when used for analytical purposes. Even using SQLite purely as a storage engine typically works poorly in these scenarios: analyses often touch only a few columns of a table, yet SQLite's row-wise storage layout forces it to read entire rows. This can lead to very poor performance when dealing with wide data.

To fill this gap, we created MonetDBLite [10], an open-source embedded database based on the popular columnar database MonetDB. Much like SQLite, it is an in-process database that can be installed and run directly from within popular analytical tools without any external dependencies. However, unlike SQLite, it is designed for analytical workloads, and as such it performs significantly better when executing analytical queries that operate on large amounts of data. Because of the columnar layout of the database and its zero-copy semantics, data can be copied between the database and the analytical tool at constant cost, and no large costs need to be paid when extracting only a subset of columns from a wide table.

This efficient data transfer is illustrated by the experiment in Figure 4, where we transfer the lineitem table of the TPC-H benchmark [1] from the database to the client process using MonetDBLite, SQLite, MonetDB and PostgreSQL. We observe that data can be exported from MonetDBLite an order of magnitude faster than over a socket connection (in the case of MonetDB and PostgreSQL) or from the row-storage model of SQLite.

Figure 4: Transfer of the TPC-H lineitem table from the database to the client process.
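From the user's perspective, working with the embedded database requires no setup beyond installing the package. The following is a minimal sketch assuming the init/sql/insert entry points of the MonetDBLite Python bindings; treat the exact call names as an assumption rather than a definitive API reference.

import monetdblite

# The database runs inside the client process: no server to install or tune.
monetdblite.init("/tmp/dbfarm")  # directory holding the database files
monetdblite.sql("CREATE TABLE tbl (i INTEGER)")
monetdblite.insert("tbl", {"i": [1, 2, 3, 4]})

# Results come back as columnar, NumPy-backed data; with the columnar
# layout and zero-copy semantics, only the requested columns are touched.
result = monetdblite.sql("SELECT AVG(i) FROM tbl")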
5. RESEARCH DIRECTIONS
While our current solutions have made it both easier and more efficient to combine relational databases with analytical tools, connecting them efficiently and effortlessly is by no means a solved problem. In this section, we describe the open research problems that we have identified and how we plan on tackling them in the future.

Automatic Code Shipping. While user-defined functions allow for efficient in-database analytics, significant manual effort is still required to transform an existing analytical pipeline so that it runs inside the database server. Ideally, we would be able to automatically translate an existing analytical pipeline and execute it on data residing in the database without requiring manual user effort.

One solution could be to take a program that uses a database connector to connect to a database, and run that code directly inside the database server. Instead of connecting to the database through a socket, the SQL code could be executed directly inside the server and the results could be used inside the analytical tool without requiring data transfer. This approach does, however, negate the potential advantages of automatic parallelization.

Alternatively, the code could be analyzed and translated into user-defined functions that can be executed within the database server and could potentially be parallelized. Determining whether arbitrary code can be safely parallelized is not possible, though, as it would be equivalent to solving the halting problem. However, it would already be useful if a limited subset of operations could be automatically shipped and executed in parallel inside the database server. For example, a number of commonly used operations from the Pandas and NumPy libraries could be supported.

UDF Co-optimization. Currently, MonetDB/Python UDFs are executed as black-box functions. As a result, there is almost no room for automatic optimization of the actual code. The only optimization we apply is the parallelization of the functions; however, even this requires the user to tell us whether or not the function is parallelizable.

Lazy evaluation could allow us to optimize the UDFs further. Rather than executing the function eagerly, we could defer the execution of certain operations on the input columns (e.g., common NumPy and Pandas operations). This would allow us to build a computation graph, and either (1) run parts of that computation graph inside the database's execution engine (where it could be executed in parallel and take advantage of existing indexes), or (2) feed information extracted from the computation graph to the database optimizer.

Acknowledgments
This work was funded by the Netherlands Organisation for Scientific Research (NWO), project "Process Mining for Multi-Objective Online Control".

6. REFERENCES
[1] TPC Benchmark H (Decision Support) Standard Specification. Technical report, Transaction Processing Performance Council, June 2013.
[2] G. Allen and M. Owens. The Definitive Guide to SQLite. Apress, Berkeley, CA, USA, 2nd edition, 2010.
[3] Q. Chen, M. Hsu, and R. Liu. Extend UDF Technology for Integrated Analytics. In T. Pedersen, M. Mohania, and A. Tjoa, editors, Data Warehousing and Knowledge Discovery, volume 5691 of Lecture Notes in Computer Science, pages 256–270. Springer Berlin Heidelberg, 2009.
[4] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012.
[5] P. Holanda, M. Raasveldt, and M. Kersten. Don't Hold My UDFs Hostage - Exporting UDFs for Debugging Purposes. In Proceedings of the Simpósio Brasileiro de Banco de Dados, SBBD 2017, Uberlândia, Brazil, 2017.
[6] S. Idreos, F. Groffen, N. Nes, S. Manegold, S. Mullender, and M. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng. Bull., 2012.
[7] ISO. ISO/IEC 9075:1992, Database Language SQL. Technical report, International Organization for Standardization (ISO), July 1992.
[8] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926, Dec. 2012.
[9] W. McKinney. Data Structures for Statistical Computing in Python. In S. van der Walt and J. Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.
[10] M. Raasveldt. MonetDBLite: An Embedded Analytical Database. In Proceedings of the 2018 ACM International Conference on Management of Data (SIGMOD '18), 2018.
[11] M. Raasveldt, P. Holanda, H. Mühleisen, and S. Manegold. Deep Integration of Machine Learning Into Column Stores. In Proceedings of the 21st International Conference on Extending Database Technology (EDBT), 2018.
[12] M. Raasveldt and H. Mühleisen. Vectorized UDFs in Column-Stores. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20, 2016, pages 16:1–16:12, 2016.
[13] M. Raasveldt and H. Mühleisen. Don't Hold My Data Hostage: A Case for Client Protocol Redesign. Proc. VLDB Endow., 10(10):1022–1033, June 2017.
[14] S. van der Walt, S. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13(2):22–30, March 2011.
[15] H. Wang and C. Zaniolo. User-Defined Aggregates in Database Languages. In R. Connor and A. Mendelzon, editors, Research Issues in Structured and Semistructured Database Programming, volume 1949 of Lecture Notes in Computer Science, pages 43–60. Springer Berlin Heidelberg, 2000.
[16] H. Wickham. Package 'dplyr': A Grammar of Data Manipulation, 2017.