=Paper=
{{Paper
|id=Vol-3163/paper6
|storemode=property
|title=Homoiconicity For End-to-end Machine Learning with BOSS
|pdfUrl=https://ceur-ws.org/Vol-3163/BICOD21_paper_7.pdf
|volume=Vol-3163
|authors=Hubert Mohr-Daurat,Holger Pirk
|dblpUrl=https://dblp.org/rec/conf/bncod/Mohr-DauratP21
}}
==Homoiconicity For End-to-end Machine Learning with BOSS==
Hubert Mohr-Daurat (Imperial College London), Holger Pirk (Imperial College London)

BICOD'21: British International Conference on Databases, March 28, 2022, London, UK.
Contact: h.mohr-daurat19@imperial.ac.uk (H. Mohr-Daurat); hlgr@ic.ac.uk (H. Pirk)

Abstract. It is common for machine learning applications to have complex operator pipelines, with several steps before and after the main execution of the core machine learning algorithm. Consequently, data movement between processes has a performance cost for the execution of the application. In this paper, we propose an end-to-end machine learning framework based on the principle of homoiconicity: data and code are represented in a single unified syntax. This approach allows the efficient interpretation of custom operators at the core of a relational database system. We also present the implementation of a data cleaning framework as an initial milestone. This work is part of a larger research project to implement BOSS, a novel, general-purpose relational database system that stores and processes homoiconic data.

Keywords: relational database, homoiconicity, symbolic expressions, machine learning, data cleaning, data imputation

1. Introduction

The core part of any machine learning (ML) pipeline is the model to be learned during training and executed during inference. It is, thus, not surprising that this is the most extensively researched and optimized aspect of ML applications.

However, model training and inference are not the only components of the data processing pipeline that play a role in the application's overall performance. Many applications move large amounts of data as inputs and produce large output datasets. This data movement is costly and, often, principally unnecessary.

In addition, many applications perform internal data movement for various data operations, such as data augmentation or data normalisation. Some of these steps involve the use of dedicated libraries and even systems. The complexity of such pipelines with multiple layers increases not only the overhead for data movement but also the engineering cost of implementing and maintaining the logic that connects them.

Data cleaning is a particularly interesting example of such a data movement-intensive component in an ML pipeline. Cleaning the data is necessary because model learning and inference are extremely sensitive to data quality. Input data is therefore generally cleaned using methods such as imputing missing values and searching for and fixing incorrect values (based on common sense, expert knowledge, and user-defined constraints). Algorithms for these tasks are integrated into the pipeline, ranging from simple averages or interpolations to more advanced regression methods.

Consequently, ML pipelines are often complex, delegating these additional data transformations to separate processes that cause significant data movement during their execution: loading data into memory, querying it and passing it to the data cleaning component (before or while querying the data), passing it to the core ML component (learning or inference) and then again moving the output data to the application layer, which consumes the result.

Database Management Systems (DBMS), where the data lives for any data-heavy processing pipeline, play a significant role in the data movement problem, with a high overhead for transferring the data in and out.

Prior research [1] has investigated methods to reduce the overhead of moving data out of the DBMS. With ad-hoc serialisation protocols and efficient compression technologies, the overhead is reduced but not eliminated and can still be significant for large-scale data movement.

Instead of making the data transfer more efficient, an alternative approach is to reduce the cases where data movement is needed, e.g. by executing the various data retrieval and transformation steps in the same process or by using shared memory when independent processes are required. However, data movement can rarely be eliminated due to the additional transformations needed to match the data layout used by the libraries and applications in each layer.
A more radical solution to this problem is to integrate all the layers of data processing steps into a single central system. This approach is much more effective at reducing the data movement overhead.

The DBMS is a natural starting point for such a central system. DBMS are designed to store, retrieve and process large amounts of data efficiently, taking advantage of strategies and technologies studied, developed and improved over several decades.

However, implementing all the desired features in the DBMS requires new data representations, operators and logic, which are not standard in typical DBMS. This is challenging to implement while preserving the advantages of the DBMS (such as the query optimizer).

Implementing the desired features in a database comes with a trade-off. Implementing them as user-defined functions (UDFs) is straightforward but has a high overhead and prevents query optimizations; implementing them at the core of the DBMS backend is more efficient but requires significant effort and expert knowledge.

Instead, we propose a more generic, extensible system that is flexible and takes advantage of the DBMS's efficiency at manipulating data. To achieve that, the DBMS stores as data the executable code that extends the DBMS functionality. Storing the code as data means that the code is an integral part of the relational representation and can be manipulated by the DBMS kernel. This method is fundamentally different from UDFs and triggers, which exist only in the schema and are retrieved and run as a black box at query time. With our approach, the DBMS handles this custom logic efficiently, taking advantage of the optimization strategies produced by DBMS research.

Fundamentally, homoiconicity refers to the idea that code can be read, modified and stored like data. The practically most common form of homoiconic data is the symbolic expression (s-expression). An s-expression is represented as a (potentially nested) list and can, therefore, be modified as well as evaluated. For example, '1 + 2' would be represented by the list '(Plus 1 2)'. Lisp-like languages represent all code as s-expressions and allow Turing-complete programming [2]. They even allow the representation of unknown or undefined values in the form of "Symbols". Unfortunately, Lisp implementations are ill-suited to supporting data-intensive applications due to prohibitively high interpretation overhead (see Section 4).

Inspired by these homoiconic languages, we are implementing a DBMS called BOSS, for Bulk-Oriented Symbol-Store. This database stores symbolic expressions in the form of symbols and complex expressions. However, in our implementation, the evaluation of the expressions is performed for a large batch of data at once, thus amortizing the interpretation overhead.
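To make the code-as-data idea concrete, the following minimal sketch (our own C++ illustration, not BOSS's actual implementation) stores '1 + 2' as the nested value '(Plus 1 2)', which can be inspected, rewritten or evaluated like any other data:

```cpp
// Minimal homoiconic-expression sketch: code is held as nested data.
#include <iostream>
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Expression;
using Value = std::variant<long, double, std::string /*symbol*/,
                           std::shared_ptr<Expression>>;

struct Expression {
  std::string head;              // e.g. "Plus", "Times"
  std::vector<Value> arguments;  // literals, symbols or sub-expressions
};

// Evaluate an expression tree; floats and unresolved symbols are left out
// of this sketch and simply contribute 0.
long evaluate(Value const& v) {
  if (auto const* i = std::get_if<long>(&v)) return *i;
  if (auto const* e = std::get_if<std::shared_ptr<Expression>>(&v)) {
    long result = (*e)->head == "Times" ? 1 : 0;
    for (auto const& arg : (*e)->arguments) {
      long x = evaluate(arg);
      result = (*e)->head == "Times" ? result * x : result + x;
    }
    return result;
  }
  return 0;
}

int main() {
  // (Plus 1 2): the same object can be stored, inspected or evaluated.
  auto plus = std::make_shared<Expression>(
      Expression{"Plus", {Value{1L}, Value{2L}}});
  std::cout << "(" << plus->head << " 1 2) = " << evaluate(Value{plus}) << "\n";
}
```

Because the expression is an ordinary data structure, a system holding it is free to rewrite it (for example, constant-fold it) before, or instead of, evaluating it.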
Applied to the ML domain, native operators are implemented in the database only for the basic building blocks of ML operations (e.g. transformations, activations). Only the logic of the ML algorithm, which is less generic, is left to be implemented in userspace with symbolic expressions. The benefits of this approach are:

• ease, for the user, of implementing new ML algorithms and tools by assembling them from operators using symbolic expression logic, with flexibility while keeping native performance (a hypothetical sketch follows this list);

• with the logic-as-data paradigm, the approach opens new possibilities for optimizations, unified for both the data (as usually handled by the query optimizer) and the logic (as usually handled by the compiler), by exposing the logic as symbolic data to the database at every level, from the query plan to the storage layers;

• symbolic expressions are not used only to express the logic of the queries: they are also stored in the database and evaluated efficiently at query time, whose benefits and implementation are detailed in the use case in Section 3.
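As a purely hypothetical illustration of the first point (the operator names, the column binding and the composition style below are our assumptions, not BOSS's API), a user-level feature transform, standardisation followed by a ReLU activation, can be assembled from generic native operators and evaluated over a whole column:

```cpp
// Hypothetical sketch: userspace ML logic assembled from native operators.
#include <cstdio>
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Expr;
using Node = std::variant<double, std::string, std::shared_ptr<Expr>>;
struct Expr { std::string head; std::vector<Node> args; };

using Column = std::vector<double>;

// Evaluate a (binary) expression over a whole column; the symbol "A" is
// bound to the input column, constants are broadcast, and operator heads
// dispatch to native kernels.
Column evaluate(Node const& n, Column const& a) {
  if (auto const* c = std::get_if<double>(&n)) return Column(a.size(), *c);
  if (std::get_if<std::string>(&n)) return a;  // the only symbol here is A
  auto const& e = *std::get<std::shared_ptr<Expr>>(n);
  Column l = evaluate(e.args[0], a), r = evaluate(e.args[1], a), out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = e.head == "Minus"  ? l[i] - r[i]
           : e.head == "Divide" ? l[i] / r[i]
           : /* "Max" */          (l[i] > r[i] ? l[i] : r[i]);
  return out;
}

std::shared_ptr<Expr> op(std::string head, Node l, Node r) {
  return std::make_shared<Expr>(Expr{std::move(head), {l, r}});
}

int main() {
  Column A = {2.0, 5.0, 11.0};
  // (Max (Divide (Minus A 6.0) 3.0) 0.0): standardise, then ReLU.
  Node transform =
      op("Max", op("Divide", op("Minus", std::string("A"), 6.0), 3.0), 0.0);
  for (double v : evaluate(transform, A)) std::printf("%.2f ", v);
  std::printf("\n");
}
```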
2. Storing and Processing Homoiconic Data

As data cleaning is a data-intensive task, we naturally selected the decomposed storage model (DSM) [3] to represent (plain) relational data. On top of DSM data, we implemented the BOSS kernel following the X100 model [4]: BOSS's query processor partitions data into micro-batches to keep intermediate results cache-resident. In addition to plain relational data, however, BOSS needs to process homoiconic expressions that represent computation efficiently. Interpreting such expressions usually incurs significant overhead.

Figure 1: shape-wise partitioning mechanism and storage representation example. (The figure shows a store holding the expressions '1', '1 + 2' and '1.5 + (A * 2)', partitioned by shape into batches 1-3 and decomposed into column storage.)

To amortize this overhead, we developed a novel technique called shape-wise partitioning & decomposition, shown in Figure 1. First, collections of expressions are (horizontally) partitioned by shape, i.e., all expressions that only differ in their base-type arguments form a partition. Second, the components of the expressions are (recursively) decomposed, thus storing base data in a decomposed form. In Figure 1, e.g., batch 1 stores plain integers, while batch 2 stores expressions that add two integers and batch 3 stores expressions that multiply two integers and add them to a third. As shape-wise partitioning ensures homogeneous partitions, entire partitions can be processed with a minimal number of function calls (one per subexpression). This ensures highly CPU-efficient processing.
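The following sketch (a simplification under our own assumptions, not BOSS's storage code) mirrors the three batches of Figure 1: each partition keeps the arguments of its shape in decomposed, column-wise form and is evaluated with one tight loop per subexpression, so interpretation cost is paid once per batch rather than once per row:

```cpp
// Shape-wise partitioning & decomposition, reduced to three fixed batches.
#include <cstdio>
#include <vector>

int main() {
  // Batch 1: plain integers (shape "x").
  std::vector<int> batch1 = {1, 2, 3};

  // Batch 2: expressions of shape "(Plus x y)"; the arguments are stored
  // column-wise instead of as per-row expression trees.
  std::vector<int> b2_x = {1, 4, 7}, b2_y = {2, 5, 8};

  // Batch 3: expressions of shape "(Plus c (Times x 2))", e.g. "1.5 + (A * 2)".
  std::vector<double> b3_c = {1.5, 2.5};
  std::vector<int>    b3_x = {3, 4};

  // Bulk evaluation: one pass per subexpression over the whole partition,
  // amortizing interpretation overhead across all rows of the batch.
  std::vector<int> out2(b2_x.size());
  for (std::size_t i = 0; i < b2_x.size(); ++i) out2[i] = b2_x[i] + b2_y[i];

  std::vector<double> tmp3(b3_x.size()), out3(b3_x.size());
  for (std::size_t i = 0; i < b3_x.size(); ++i) tmp3[i] = b3_x[i] * 2;       // Times
  for (std::size_t i = 0; i < b3_x.size(); ++i) out3[i] = b3_c[i] + tmp3[i]; // Plus

  for (int v : batch1)  std::printf("batch1: %d\n", v);
  for (int v : out2)    std::printf("batch2: %d\n", v);
  for (double v : out3) std::printf("batch3: %.1f\n", v);
}
```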
3. Use case: data imputation

We implemented a data imputation framework in the context of a project called HyAI (https://www.h2gopower.com/hyai). The project's overall goal was to optimize the decisions to charge or consume hydrogen batteries using machine learning technologies that model future electricity consumption from the analysis of many different data sources. Because the batteries were intended to be deployed in areas where the data transferred from these sources might be unreliable, a robust imputation framework was implemented in the database management system so that the user could define custom and potentially complex imputation logic.

Missing data is commonly handled in database systems by inserting null values, internally defined either as an arbitrary sentinel value or via a boolean flag. This method does not allow specifying why the data is missing, nor letting the data cleaning process decide how the value is imputed. However, there are multiple reasons why data may be missing. Ideally, the user should be allowed to express the intent through code logic and decide how to handle the missing data.

One approach for the user to handle missing values is to call the imputation method on the fly before inserting the data into the database. However, this method does not allow the use of future data for the imputation. For example, both previous and subsequent values may be needed for the interpolation of time-series data.

Alternatively, the user could handle missing values as part of the query logic by detecting the missing values and using nested queries or UDFs. However, this method usually incurs an execution overhead that affects performance and increases the queries' complexity.

With the imputation framework that we implemented, the user stores missing data as complex expressions. The expressions describe the logic to use for the data imputation: e.g., instead of inserting a 'null' value, the user inserts the expression '(PreviousValue() + NextValue()) / 2' to compute a simple local average. As presented in the previous section, this approach provides more flexibility to the user and benefits from the optimizations allowed by the bulk evaluation method.
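The sketch below is an assumed simplification (the stored logic is hard-coded to the single shape '(PreviousValue() + NextValue()) / 2', whereas BOSS stores arbitrary expressions): a missing cell carries its imputation logic and is filled lazily when the column is evaluated at query time:

```cpp
// Missing cells hold imputation logic instead of a null sentinel.
#include <cstdio>
#include <optional>
#include <vector>

struct Cell {
  std::optional<double> value;  // empty => impute via the stored expression
};

// Evaluate the column: concrete values pass through, missing cells are
// filled with "(PreviousValue() + NextValue()) / 2", i.e. the local average
// of the nearest known neighbours.
std::vector<double> evaluateColumn(std::vector<Cell> const& column) {
  std::vector<double> result(column.size());
  for (std::size_t i = 0; i < column.size(); ++i) {
    if (column[i].value) { result[i] = *column[i].value; continue; }
    double prev = 0, next = 0;
    for (std::size_t j = i; j-- > 0;)
      if (column[j].value) { prev = *column[j].value; break; }
    for (std::size_t j = i + 1; j < column.size(); ++j)
      if (column[j].value) { next = *column[j].value; break; }
    result[i] = (prev + next) / 2;
  }
  return result;
}

int main() {
  // A time series with two missing readings; note that the second missing
  // value uses a *future* value, which insertion-time imputation could not.
  std::vector<Cell> readings = {
      {10.0}, {std::nullopt}, {14.0}, {std::nullopt}, {20.0}};
  for (double v : evaluateColumn(readings)) std::printf("%.1f ", v);
  std::printf("\n");  // prints: 10.0 12.0 14.0 17.0 20.0
}
```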
4. Evaluation

All the experiments were run on a PC with an Intel i9-10990K CPU and 32GB of DDR4 RAM at 3600MHz.

Figure 2: Performance comparison tests for TPC-H Q1/Q6 (relational) and HyAI (imputation), reporting query time (ms) for BulkBOSS, MonetDB (TPC-H), ImputeDB (HyAI) and Wolfram. The missing bars for Wolfram are caused by the engine running out of memory.

To confirm that the relational operators in BOSS perform competitively with a state-of-the-art DBMS, we compared the performance of BOSS with MonetDB on the TPC-H queries Q1 and Q6 at various scale factors. The results in Figure 2 show that BOSS performs within the same order of magnitude as MonetDB, without an impact from having to support homoiconicity in our implementation.

Also shown in Figure 2, we compared the handling of missing data with another implementation, ImputeDB [5], which has similar functionality. We evaluated a dataset from the HyAI project (50K rows) where 50% of the data is missing at random in five columns. Data is imputed by taking a random value from the data in the same column. BOSS performs better by a factor of 261. This result shows that the bulk approach to implementing imputation operators is highly efficient. Our method to evaluate expressions efficiently could also benefit more advanced imputation techniques.

In addition, the comparison of both the relational operators and the missing-data evaluation with another backend based on the Wolfram engine shows that our implementation outperforms the commercial homoiconic engine by several orders of magnitude. It confirms that evaluating homoiconic data comes with a high interpretation cost for the expressions, which is solved by the technique we have implemented in BOSS.

5. Related Work

Data cleaning is a critical aspect of many decision-making and analytics applications, particularly in ML-based software. Some solutions have been proposed to automate the process, such as HoloClean [6], an entire system run side-by-side with the DBMS. Alternative solutions have been proposed to integrate the data cleaning operators as part of the database query, such as ImputeDB as an imputation module for the Data Civilizer framework [7]. Neither approach allows the user to make decisions during the data cleaning, except for some initial setup in the case of HoloClean. They differ from our method, which performs data imputation during the query but based on the information provided by the user during data insertion.

Most research on accelerating the ML pipeline with a unified data management solution focuses on accelerating the training step. The approach in [8] has similarities with our work: they propose lambda expressions as generic user-defined operators, which are more efficient than usual UDFs. However, the lambda expressions are used only in the queries and do not store code logic in the database. Their focus is primarily on training and inference and not on data cleaning. Their solution also differs from ours since they use JIT compilation to optimize the evaluation of the expressions.

Tupleware [9] also proposes a similar approach for their general end-to-end analytics solution, analysing the code of UDFs and taking advantage of JIT compilation to find the best pipelining strategy when merging the custom operator code with the query code. The drawback is a warm-up overhead before the workload execution, which is negligible for heavy workloads but can be a problem when a fast response is required [10] (e.g. visualisation or high-frequency inference).

In addition, there is research on integrating ML operators as part of the database kernel ([11], [12], [13]). This approach is efficient since the ML operators are integrated deep inside the database implementation, but it requires investing time to implement ad-hoc code for each needed operator. Similarly, LMFAO [14] proposes to represent ML operators as aggregate queries. This method allows interesting optimizations but is limited to operators that can be represented as aggregates. Some research takes the opposite approach and implements the relational operators in a linear algebra kernel [15]. Finally, another approach uses a unified intermediate language for both relational and ML operators ([16], [17]).

6. Conclusion and Future Work

Data movement is a significant performance factor for modern data processing pipelines: the complexity of data processing tasks involving steps such as data cleaning, normalization, model training and visualization already stresses interfaces and will only increase. To address this problem, we developed a new data processing system architecture around the idea of homoiconicity, i.e., the uniform representation of data and code. In this design, task-specific code can be stored and processed with the data, thus providing unprecedented extensibility. In this paper, we have demonstrated the advantages of this design for handling data imputation.

In the future, we plan to extend the system with additional features for data processing pipelines: operators to support additional data cleaning use cases, algebraic operators needed for model training and inference, and even domain-specific operators. The system will further benefit from relational components such as query optimization, transactional semantics and data visualization.

References

[1] M. Raasveldt, H. Mühleisen, Don't hold my data hostage: A case for client protocol redesign (2017).
[2] J. McCarthy, Recursive functions of symbolic expressions and their computation by machine, Part I, CACM (April 1960).
[3] P. Boncz, M. L. Kersten, Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications, Ph.D. thesis, Universiteit van Amsterdam, 2002.
[4] M. Zukowski, et al., MonetDB/X100 - A DBMS in the CPU cache, IEEE Data Eng. Bull. (2005).
[5] J. Cambronero, et al., Query optimization for dynamic imputation, PVLDB (2017).
[6] T. Rekatsinas, X. Chu, I. F. Ilyas, C. Ré, HoloClean: Holistic data repairs with probabilistic inference, PVLDB (2017).
[7] D. Deng, et al., The Data Civilizer system, in: CIDR, 2017.
[8] M. Schüle, et al., In-database machine learning: Gradient descent and tensor algebra for main memory database systems, BTW (2019).
[9] A. Crotty, A. Galakatos, K. Dursun, T. Kraska, U. Cetintemel, S. Zdonik, Tupleware: Redefining Modern Analytics, arXiv:1406.6667 (2014).
[10] T. Kraska, Northstar: An interactive data science system, PVLDB (2018).
[11] B. De Boe, et al., IntegratedML: Every SQL Developer is a Data Scientist, in: DEEM@SIGMOD, 2020.
[12] K. Karanasos, et al., Extending Relational Query Processing with ML Inference, CIDR (2020).
[13] J. Hellerstein, et al., The MADlib Analytics Library or MAD Skills, the SQL, arXiv:1208.4165 (2012).
[14] M. Schleich, et al., A Layered Aggregate Engine for Analytics Workloads, in: SIGMOD, 2019.
[15] L. Chen, A. Kumar, J. Naughton, J. M. Patel, Towards linear algebra over normalized data, PVLDB (2017).
[16] A. Shaikhha, et al., Multi-layer optimizations for end-to-end data analytics, in: CGO@PPoPP, 2020.
[17] S. Palkar, et al., Weld: A Common Runtime for High Performance Data Analytics (2017).