Homoiconicity For End-to-end Machine Learning
with BOSS
Hubert Mohr-Daurat¹, Holger Pirk¹
¹ Imperial College London

BICOD’21: British International Conference on Databases, March 28, 2022, London, UK
h.mohr-daurat19@imperial.ac.uk (H. Mohr-Daurat); hlgr@ic.ac.uk (H. Pirk)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
It is common for machine learning applications to have complex operator pipelines, with several steps before and after the main execution of the core machine learning algorithm. Consequently, data movement between processes has a performance cost for the execution of the application. In this paper, we propose an end-to-end machine learning framework based on the principle of homoiconicity: data and code are represented in a single unified syntax. This approach allows the efficient interpretation of custom operators at the core of a relational database system. We also present the implementation of a data cleaning framework as an initial milestone. This work is a part of a larger research project to implement BOSS, a novel, general-purpose relational database system that stores and processes homoiconic data.

Keywords
relational database, homoiconicity, symbolic expressions, machine learning, data cleaning, data imputation



1. Introduction

The core part of any machine learning (ML) pipeline is the model to be learned during training and executed during inference. It is, thus, not surprising that this is the most extensively researched and optimized aspect of ML applications.
   However, model training and inference are not the only components of the data processing pipeline that play a role in the application’s overall performance. Many applications move large amounts of data as inputs and produce large output datasets. This data movement is costly and, often, principally unnecessary.
   In addition, many applications perform internal data movement for various data operations, such as data augmentation or data normalisation. Some of these steps involve the usage of dedicated libraries and even systems.
   The complexity of such pipelines with multiple layers increases not only the overhead for data movement but also the engineering cost of implementing and maintaining the logic that connects them.
   Data cleaning is a particularly interesting example of such data movement-intensive components in the case of an ML pipeline. Cleaning the data is necessary because model learning and inference are extremely sensitive to data quality. Input data is therefore generally cleaned using methods such as imputing missing values and searching for and fixing incorrect values (based on common sense, expert knowledge, and user-defined constraints). Algorithms for these tasks are integrated into the pipeline, from simple averages or interpolations to more advanced regression methods.
   Consequently, ML pipelines are often complex, delegating these additional data transformations to separate processes that cause significant data movement during their execution: data is loaded into memory, queried and passed to the data cleaning component (before or while querying), passed to the core ML component (for learning or inference) and then moved once more to the application layer, which consumes the result.
   Database Management Systems (DBMS), where the data lives for any data-heavy processing pipeline, play a significant role in the data movement problem, with a high overhead for transferring data in and out.
   Prior research [1] has investigated methods to reduce the overhead of moving data out of the DBMS. With ad-hoc serialisation protocols and efficient compression technologies, the overhead is reduced but not eliminated and can still be significant for large-scale data movement.
   Instead of making the data transfer more efficient, an alternative approach is to reduce the cases where data movement is needed, e.g. by executing the various data retrieval and transformation steps in the same process or by using shared memory when independent processes are required. However, data movement can rarely be eliminated due to the additional transformations needed to match the data layout used by the libraries and applications in each layer.
   A more radical solution to this problem is integrating all the layers of data processing steps into a single central system. This approach is much more effective at reducing the data movement overhead.
   The DBMS is a natural starting point for such a central system. DBMS are designed to store, retrieve and
process a large amount of data efficiently, taking advantage of strategies and technologies studied, developed and improved over several decades.
   However, implementing all the desired features in the DBMS requires new data representations, operators and logic, which are not standard in a typical DBMS. This is challenging to implement while preserving the advantages of the DBMS (such as the query optimizer).
   Implementing the desired features in a database therefore comes with a trade-off. Implementing them as user-defined functions (UDFs) is straightforward but has a high overhead and prevents query optimizations; implementing them at the core of the DBMS backend is more efficient but requires significant effort and expert knowledge.
   Instead, we propose implementing a more generic, extensible system that is flexible and takes advantage of the DBMS’s efficiency at manipulating data. To achieve that, the DBMS stores as data the executable code that extends the DBMS functionality. Storing the code as data means that the code is an integral part of the relational representation and can be manipulated by the DBMS kernel. This method is fundamentally different from UDFs and triggers, which exist only in the schema and are retrieved and run as a black box at query time. With our approach, the DBMS handles this custom logic efficiently, taking advantage of the optimization strategies produced by DBMS research.
   Fundamentally, homoiconicity refers to the idea that code can be read, modified and stored like data. In practice, the most common form of homoiconic data is the symbolic expression (s-expression). An s-expression is represented as a (potentially nested) list and can, therefore, be modified as well as evaluated. For example, ’1 + 2’ would be represented by the list ’(Plus 1 2)’. Lisp-like languages represent all code as s-expressions and allow Turing-complete programming [2]. They even allow the representation of unknown or undefined values in the form of ”Symbols”. Unfortunately, Lisp implementations are ill-suited to support data-intensive applications due to prohibitively high interpretation overhead (see Section 4).
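   To make this representation concrete, the following minimal sketch (our illustration only, written in C++; it is not BOSS’s actual implementation or API) models such homoiconic expressions as a variant over base values, symbols and complex head/argument expressions, mirroring the ’(Plus 1 2)’ example above:

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// A minimal, hypothetical s-expression representation: a value is either a
// base type (integer or floating point), a symbol standing for an unknown or
// undefined value, or a complex expression with a head (e.g. "Plus") and
// nested arguments.
struct Expression;

struct Symbol {
  std::string name;  // e.g. "A" for an unknown value
};

struct ComplexExpression {
  std::string head;                   // e.g. "Plus"
  std::vector<Expression> arguments;  // nested sub-expressions
};

struct Expression {
  std::variant<std::int64_t, double, Symbol, ComplexExpression> value;
};

// '1 + 2' represented as data: (Plus 1 2). Because it is data, an engine can
// store it, rewrite it, or evaluate it.
inline Expression plusOneTwo() {
  return Expression{ComplexExpression{
      "Plus", {Expression{std::int64_t{1}}, Expression{std::int64_t{2}}}}};
}
```

Under this view, the same structure describes both data and the computation over it, which is what allows the engine to keep both in the store.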
   Inspired by these homoiconic languages, we are implementing a DBMS called BOSS, for Bulk-Oriented Symbol-Store. This database stores symbolic expressions in the form of symbols and complex expressions. However, in our implementation, the evaluation of the expressions is performed for a large batch of data at once, thus amortizing the interpretation overhead.
   Applied to the ML domain, native operators are implemented in the database only for the basic building blocks of ML operations (e.g. transformations, activations). Only the less generic logic of the ML algorithm is left to be implemented in userspace with symbolic expressions.
   The benefits of this approach are:

   • ease for the user to implement new ML algorithms and tools by assembling them from operators using symbolic expression logic, with flexibility while keeping native performance (see the sketch after this list);
   • with the logic-as-data paradigm, the approach opens new possibilities for optimizations, unified for both the data (as usually handled by the query optimizer) and the logic (as usually handled by the compiler), by exposing the logic as symbolic data to the database at every level, from the query plan to the storage layers;
   • symbolic expressions are not only used to express the logic of the queries: they are also stored in the database and evaluated efficiently at query time; the benefits and implementation of this are detailed in the use case in the next section.
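   To give a flavour of the first point, the sketch below reuses the hypothetical Expression type introduced above and assembles a simple inference step purely by composing symbolic operators; the operator names Multiply, Plus and Sigmoid are illustrative assumptions, not BOSS’s documented operator set.

```cpp
// Hypothetical example: assembling a logistic-regression inference step from
// symbolic operators only. The engine evaluates the native operators in bulk;
// the user merely composes them.
inline Expression logisticInference(double weight, double bias) {
  // Sigmoid(Plus(Multiply(weight, x), bias)), where the symbol 'x' is bound
  // to an input column at query time.
  return Expression{ComplexExpression{
      "Sigmoid",
      {Expression{ComplexExpression{
          "Plus",
          {Expression{ComplexExpression{
               "Multiply", {Expression{weight}, Expression{Symbol{"x"}}}}},
           Expression{bias}}}}}}};
}
```

Because the composed expression is itself data, it can be stored next to the inputs and optimized together with the rest of the query plan.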
[Figure 1: Shape-wise partitioning mechanism and storage representation example.]

2. Storing and Processing Homoiconic Data

As data cleaning is a data-intensive task, we naturally selected the decomposed storage model (DSM) [3] to represent (plain) relational data. On top of DSM data, we implemented the BOSS kernel following the X100 model [4]: BOSS’s query processor partitions data into micro-batches to keep intermediate results cache-resident. In addition to plain relational data, however, BOSS needs to efficiently process homoiconic expressions that represent computation. Interpreting such expressions usually incurs significant overhead.
   To amortize this overhead, we developed a novel technique called shape-wise partitioning & decomposition, as shown in Figure 1. First, collections of expressions are (horizontally) partitioned by shape, i.e., all expressions that only differ in their base type arguments form a partition. Second, the components of the expressions are (recursively) decomposed, thus storing base data in a decomposed form. In Figure 1, e.g., batch 1 stores plain integers, while batch 2 stores expressions that add two integers and batch 3 stores expressions that multiply two integers and add them to a third. As shape-wise partitioning ensures homogeneous partitions, entire partitions can be processed with a minimal number of function calls (one per subexpression). This ensures highly CPU-efficient processing.
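   The following sketch illustrates the effect of this partitioning on evaluation; it is a simplified illustration under assumed data structures, not BOSS’s internal layout. Once a partition is known to contain only expressions of the shape Plus(int, int), its arguments can be stored as two plain integer columns and the whole batch can be evaluated with a single function call instead of one interpretation step per expression.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decomposed storage for one shape-wise partition in which every
// expression has the shape Plus(int, int): the two integer arguments are
// stored as plain columns (the shared 'Plus' head is implicit in the shape).
struct PlusIntIntPartition {
  std::vector<std::int64_t> lhs;
  std::vector<std::int64_t> rhs;
};

// A single call evaluates the whole partition, amortizing the interpretation
// overhead over all its expressions and keeping the inner loop CPU-friendly.
std::vector<std::int64_t> evaluatePlusPartition(const PlusIntIntPartition& batch) {
  std::vector<std::int64_t> result(batch.lhs.size());
  for (std::size_t i = 0; i < batch.lhs.size(); ++i) {
    result[i] = batch.lhs[i] + batch.rhs[i];
  }
  return result;
}
```

Nested shapes, such as batch 3 in Figure 1, would be handled in the same way, with one such bulk call per subexpression.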

3. Use case: data imputation
We implemented a data imputation framework in the context of a project called HyAI (https://www.h2gopower.com/hyai). The project’s overall goal was to optimize the decisions to charge or consume hydrogen batteries using machine learning technologies that model future electricity consumption from the analysis of many different data sources. Because the batteries were intended to be deployed in areas where data transferred from these sources might be unreliable, a robust imputation framework was implemented in the database management system so that the user could define custom and potentially complex imputation logic.
   Missing data is commonly handled in database systems by inserting null values, internally represented either by designating an arbitrary value as null or by using a boolean flag. This method does not allow the user to specify why the data is missing, nor does it let the data cleaning process decide how the value should be imputed. However, there are multiple reasons why data may be missing. Ideally, the user should be able to express the intent through code logic and decide how to handle the missing data.
   One approach for the user to handle missing values is to call the imputation method on the fly before inserting the data into the database. However, this method does not allow the use of future data for the imputation. For example, both previous and subsequent values may be needed to interpolate time-series data.
   Alternatively, the user could handle missing values as part of the query logic by detecting the missing values and using nested queries or UDFs. However, this method usually incurs an execution overhead that affects performance and increases the queries’ complexity.
   With the imputation framework that we implemented, the user stores missing data as complex expressions. The expressions describe the logic to use for the data imputation: e.g., instead of inserting a ’null’ value, the user inserts the expression ’(PreviousValue() + NextValue()) / 2’ to compute a simple local average. As presented in the previous section, this approach provides more flexibility to the user and benefits from the optimizations allowed by the bulk evaluation method.
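   Reusing the hypothetical Expression representation sketched in Section 1, the snippet below illustrates this idea; the operator names PreviousValue, NextValue and Divide are illustrative assumptions. Instead of a null, the user inserts an expression describing the imputation, and the engine evaluates it in bulk at query time once the neighbouring values are available.

```cpp
// Hypothetical construction of the imputation expression
// (PreviousValue() + NextValue()) / 2, stored in place of a null.
inline Expression localAverageImputation() {
  return Expression{ComplexExpression{
      "Divide",
      {Expression{ComplexExpression{
           "Plus",
           {Expression{ComplexExpression{"PreviousValue", {}}},
            Expression{ComplexExpression{"NextValue", {}}}}}},
       Expression{std::int64_t{2}}}}};
}

// A column can then mix plain values and imputation expressions; shape-wise
// partitioning keeps the plain values on a fast path while the imputation
// expressions are evaluated in bulk when the column is queried.
std::vector<Expression> sensorReadings = {
    Expression{std::int64_t{42}},
    localAverageImputation(),  // this reading is missing
    Expression{std::int64_t{40}},
};
```

Since the expression records the intent, the decision of how to impute stays with the user who inserted the data.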
[Figure 2: Performance comparison tests for TPC-H Q1/Q6 (relational) and HyAI (imputation); query time (ms) for BulkBOSS, MonetDB, Wolfram and ImputeDB. The missing bars for Wolfram are caused by the engine running out of memory.]

4. Evaluation

All the experiments were run on a PC with an Intel i9-10990K CPU and 32GB of DDR4-3600 RAM.
   To confirm that the relational operators in BOSS perform competitively with a state-of-the-art DBMS, we compared the performance of BOSS with MonetDB using the TPC-H queries Q1 and Q6 at various scale factors. The results in Figure 2 show that BOSS performs within the same order of magnitude as MonetDB, without an impact from having to support homoiconicity in our implementation.
   As also shown in Figure 2, we compared the handling of missing data with another implementation, ImputeDB [5], which has similar functionality. We evaluated a dataset from the HyAI project (50K rows) where 50% of the data is missing at random for five columns. Data is imputed by taking a random value from the data in the same column. BOSS performs better by a factor of 261. This result shows that the bulk approach to implementing imputation operators is highly efficient. Our method to evaluate expressions efficiently could also benefit more advanced imputation techniques.
   In addition, the comparison of both relational operators and missing-data evaluation with another backend based on the Wolfram engine shows that our implementation outperforms the commercial homoiconic engine by several orders of magnitude. This confirms that evaluating homoiconic data comes with a high cost for interpreting the expressions, a cost which is addressed by the technique we have implemented in BOSS.

5. Related Work

Data cleaning is a critical aspect of many decision-making and analytics applications, particularly in ML-based software. Some solutions have been proposed to automate the process, such as HoloClean [6], an entire system run
side-by-side with the DBMS. Alternative solutions have been proposed to integrate the data cleaning operators as part of the database query, such as ImputeDB as an imputation module for the Data Civilizer framework [7]. Neither approach allows the user to make decisions during the data cleaning, except for some initial setup in the case of HoloClean. They differ from our method, which performs data imputation during the query but based on the information provided by the user during data insertion.
   Most research on accelerating the ML pipeline with a unified data management solution focuses on accelerating the training step. The approach in [8] has similarities with our work. They propose lambda expressions as generic user-defined operators, which are more efficient than usual UDFs. However, the lambda expressions are used only in the queries and do not store code logic in the database. Their focus is primarily on training and inference and not on data cleaning. Their solution also differs from ours in that they use JIT compilation to optimize the evaluation of the expressions.
   Tupleware [9] also proposes a similar approach for its general end-to-end analytics solution by analysing the code of UDFs and taking advantage of JIT compilation to find the best pipeline strategy when merging the custom operator code with the query code. The drawback is a warm-up overhead before the workload execution, which is negligible for heavy workloads but can be a problem when a fast response is required [10] (e.g. visualisation or high-frequency inference).
   In addition, there is research work on integrating ML operators as part of the database kernel ([11], [12], [13]). This approach is efficient since the ML operators are integrated deep inside the database implementation, but it requires investing time to implement ad-hoc code for each needed operator. Similarly, LMFAO [14] proposes to represent ML operators as aggregate queries. This method allows interesting optimizations but is limited to operators that can be represented as aggregates. Some research takes the opposite approach and implements the relational operators in a linear algebra kernel [15]. Finally, another approach uses a unified intermediate language for both relational and ML operators ([16], [17]).

6. Conclusion and Future Work

Data movement is a significant performance factor for modern data processing pipelines: the complexity of data processing tasks involving steps such as data cleaning, normalization, model training and visualization already stresses interfaces and will only increase. To address this problem, we developed a new data processing system architecture around the idea of homoiconicity, i.e., the uniform representation of data and code. In this design, task-specific code can be stored and processed with the data, thus providing unprecedented extensibility. In this paper, we have demonstrated the advantages of this design for handling data imputation.
   In the future, we plan to extend the system with additional features for data processing pipelines: operators to support additional data cleaning use cases, algebraic operators needed for model training and inference, and even domain-specific operators. The system will, further, benefit from relational components such as query optimization, transactional semantics and data visualization.

References

 [1] M. Raasveldt, H. Mühleisen, Don’t hold my data hostage: A case for client protocol redesign (2017).
 [2] J. McCarthy, Recursive functions of symbolic expressions and their computation by machine, Part I, CACM (April 1960).
 [3] P. Boncz, M. L. Kersten, Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications, Ph.D. thesis, Universiteit van Amsterdam, 2002.
 [4] M. Zukowski, et al., MonetDB/X100: A DBMS in the CPU Cache, IEEE Data Eng. Bull. (2005).
 [5] J. Cambronero, et al., Query optimization for dynamic imputation, PVLDB (2017).
 [6] T. Rekatsinas, X. Chu, I. F. Ilyas, C. Ré, HoloClean: Holistic data repairs with probabilistic inference, PVLDB (2017).
 [7] D. Deng, et al., The Data Civilizer system, in: CIDR, 2017.
 [8] M. Schüle, et al., In-database machine learning: Gradient descent and tensor algebra for main memory database systems, BTW (2019).
 [9] A. Crotty, A. Galakatos, K. Dursun, T. Kraska, U. Cetintemel, S. Zdonik, Tupleware: Redefining Modern Analytics, arXiv:1406.6667 (2014).
[10] T. Kraska, Northstar: An interactive data science system, PVLDB (2018).
[11] B. De Boe, et al., IntegratedML: Every SQL Developer is a Data Scientist, in: DEEM@SIGMOD, 2020.
[12] K. Karanasos, et al., Extending Relational Query Processing with ML Inference, CIDR (2020).
[13] J. Hellerstein, et al., The MADlib Analytics Library or MAD Skills, the SQL, arXiv:1208.4165 (2012).
[14] M. Schleich, et al., A Layered Aggregate Engine for Analytics Workloads, in: SIGMOD, 2019.
[15] L. Chen, A. Kumar, J. Naughton, J. M. Patel, Towards linear algebra over normalized data, PVLDB (2017).
[16] A. Shaikhha, et al., Multi-layer optimizations for end-to-end data analytics, in: CGO@PPoPP, 2020.
[17] S. Palkar, et al., Weld: A Common Runtime for High Performance Data Analytics (2017).