<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Integrating Analytics with Relational Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Raasveldt</string-name>
          <aff>Centrum Wiskunde &amp; Informatica, Amsterdam, Netherlands</aff>
          <email>m.raasveldt@cwi.nl</email>
        </contrib>
        <contrib contrib-type="supervisor">
          <string-name>Hannes Mühleisen</string-name>
        </contrib>
        <contrib contrib-type="supervisor">
          <string-name>Stefan Manegold</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>27</volume>
      <issue>2018</issue>
      <abstract>
        <p>In order to uncover insights and trends, it is an increasingly common practice for companies of all shapes and sizes to gather large quantities of data and to then analyze that data. This data can come from a multitude of different sources, ranging from data gathered about consumer behavior to data gathered from sensors. The most prevalent way of storing and managing data has traditionally been a relational database management system (RDBMS). However, there is currently a disconnect between the tools used for analysis of data and the tools used for storing that data. Instead of working directly with RDBMSes, these tools are built to work in a stand-alone fashion, and offer integration with RDBMSes as an afterthought. The focus of my PhD research is on investigating different methods of combining popular analytical tools (such as R or Python) with database management systems in an efficient and user-friendly fashion.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        There is a disconnect between data-intensive analytical
tools and traditional database management systems. Data
scientists using these tools often prefer to manually manage
their data by storing it either as structured text (such as
CSV or XML files), or as binary files [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This approach
to managing data introduces many problems, especially
when a large amount of data from different sources has to
be managed or combined. Flat-file storage requires
tremendous manual effort to maintain, is often difficult to reason
about because of the lack of a rigid schema, and is difficult to
share between multiple users. Furthermore, modifying the
data is prone to corruption because of the lack of transactional
guarantees and atomic write actions. Another consequence
of this disconnect is that data scientists have re-implemented
many common database operations inside libraries such as
dplyr [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or Pandas [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Instead of performing joins or
aggregations using an RDBMS, they perform them using these
libraries. However, these libraries suffer from having to load
all required data and intermediates into memory, leading to
frequent out-of-memory problems or poor performance due
to swapping.
      </p>
      <p>These issues could be solved through the use of an RDBMS.
The RDBMS can prevent data corruption through ACID
properties, it can automatically manage data storage for
the user, and it can make data easier to reason about by
enforcing a rigid schema. In addition, the RDBMS can perform
efficient execution on larger-than-memory data, and allows
concurrent read and write access to the data in a safe way.</p>
      <p>Popular analytical tools such as R or Python can be used
in conjunction with database systems. There are SQLite
bindings for these languages, and it is possible to connect
to a database server using standard client connector
protocols. However, data scientists prefer to use flat-file storage
methods over these existing approaches, because these
connections are either inefficient or inconvenient to use.</p>
      <p>The focus of this work is on identifying the problems
encountered when combining an RDBMS with analytical tools,
and on implementing various solutions to overcome these
issues to allow for both a more efficient and more flexible
combination of these tools. Figure 1 shows the three main
methods in which a relational database can be combined
with an analytical tool. We investigate each of these
methods, and attempt to improve them from both a usability and
a performance perspective.</p>
    </sec>
    <sec id="sec-2">
      <title>2. CLIENT-SERVER CONNECTION</title>
      <p>The standard method of combining a standalone program
with an RDBMS is through a client-server connection. This
is visualized in Figure 1a. In this setup, the database server
is completely separate from the analytical tool. It runs as
either a separate process on the same machine or on a
different machine entirely. The analytical tool can issue queries to
the database, after which the server will compute the answer
to the query and transfer the results to the client through
the socket. This process is shown in Figure 2.</p>
      <p>In order to perform analytics on data stored inside the
database, the data is exported from the database to the
analytical tool over the socket connection, after which that
data is processed in the client. The main advantage of this
approach is that it is mostly database agnostic, as the
standardized ODBC or JDBC connectors can be used to connect
to almost every database. In addition, it is relatively easy to
integrate into existing pipelines, as the loading from flat files
can be replaced by loading from a database without having
to touch the rest of the pipeline.</p>
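      <p>As a concrete illustration, the change to an existing pipeline can be as small as swapping a file read for a query over a standard connector. The sketch below is illustrative only; the connector, database file, table, and column names are placeholders rather than anything from our work.</p>
      <preformat>
# Minimal sketch (not from the paper): replacing a flat-file load with a
# database load through a standard DB-API connection. Table and column
# names are hypothetical.
import sqlite3          # any DB-API/ODBC/JDBC-style connector works the same way
import pandas as pd

# df = pd.read_csv("measurements.csv")          # flat-file pipeline
con = sqlite3.connect("measurements.db")        # database-backed pipeline
df = pd.read_sql("SELECT sensor_id, value FROM measurements", con)
con.close()

print(df.head())
</preformat>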
      <p>However, this approach is problematic when dealing with
a large amount of data, as is often required in modern
analytical pipelines. The time spent on serializing large result sets
and transferring them from the server to the client can be
a significant bottleneck. In addition, this approach requires
the full dataset to fit inside the client's memory.</p>
      <p>
        In our work [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we perform a survey of popular RDBMSes
and note that they are not optimized for the scenario of
high-volume data export. They take a significant amount of
time to export a relatively small amount of data, even when
the server and client are located on the same machine or
connected through a high-throughput network connection.
This is because existing client protocols were designed for
the transfer of a small number of rows in OLTP workloads,
and have significant per-tuple and per-value overheads that
result in the slow export of large tables.
      </p>
      <p>To remedy this problem, we investigate the different
design choices that can be made when designing a result set
serialization format, and we propose a new client protocol
that is optimized for the transfer of large amounts of data
from the server to the client. By using a column-major
chunk-wise format that utilizes lightweight compression and
a binary format that is close to the native database format,
we can export large tables an order of magnitude faster than
existing solutions.</p>
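      <p>To make the idea of a column-major, chunk-wise format concrete, the sketch below serializes one chunk of a result set column by column into a binary message with optional lightweight compression. This is a simplified illustration, not the protocol from [13]; zlib merely stands in for whichever lightweight compression scheme the protocol would use.</p>
      <preformat>
# Simplified sketch of column-major, chunk-wise result set serialization.
# Each chunk is sent as per-column byte blobs in native binary form,
# optionally compressed. Not the actual protocol from [13].
import struct
import zlib
import numpy as np

def serialize_chunk(columns, compress=True):
    """columns: list of NumPy arrays holding one chunk of the result set."""
    message = bytearray()
    message += struct.pack("!I", len(columns))           # number of columns
    for col in columns:
        raw = col.tobytes()                              # binary, close to native layout
        payload = zlib.compress(raw, 1) if compress else raw
        message += struct.pack("!?", compress)           # compression flag
        message += struct.pack("!I", len(payload))       # payload size in bytes
        message += payload
    return bytes(message)

# Example: one chunk with an integer column and a float column.
chunk = [np.arange(1024, dtype=np.int32), np.random.rand(1024)]
wire_bytes = serialize_chunk(chunk)
print(len(wire_bytes), "bytes on the wire for this chunk")
</preformat>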
      <p>However, even with a client protocol optimized for this
scenario, there is still a significant amount of time required
to push data "over the wire". In addition, as this approach
only replaces the loading of data from a flat-file storage
system with the loading of data from the RDBMS into the
client, it still requires the entire dataset and intermediates
to fit inside the client's main memory.</p>
    </sec>
    <sec id="sec-3">
      <title>3. IN-DATABASE PROCESSING</title>
      <p>In order to avoid the cost of exporting the data from the
database, the analysis can be performed inside the database
server. This method, known as in-database processing, is
shown in Figure 1b.</p>
      <p>
        In-database processing can be performed in a
database-agnostic way by rewriting the analysis pipeline as a set of
standard-compliant SQL queries. However, most data
analysis, data mining and classification operators are difficult
and inefficient to express in SQL. The SQL standard
describes a number of built-in scalar functions and aggregates,
such as AVG and SUM [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, this small number of
functions and aggregates is not sufficient to perform complex
data analysis tasks [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Instead of writing the analysis pipelines in SQL,
user-defined functions or user-defined aggregates in procedural
languages such as C/C++ can be used to implement
classification and machine learning algorithms. This is the
approach taken by Hellerstein et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, these
functions still require significant rewrites of existing analytical
pipelines written in vectorized scripting languages. In
addition, writing user-defined functions in these languages
requires in-depth knowledge of the database internals and the
execution model used by the database [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In order to make it easier to perform in-database
analytics, we introduced MonetDB/Python UDFs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in the
open-source DBMS MonetDB [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These user-defined
functions can be written in Python, and process data in a
vectorized way. The input and output variables of the
functions and aggregates can be provided as either standardized
NumPy arrays [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or Pandas DataFrames [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this way,
the user-defined functions mimic the execution of regular
analytical Python programs and can be written without any
knowledge of the database internals. Because of their
vectorized nature, the heavy interpreter overhead is not incurred
once for every tuple but only once for every invocation of the
function. Combined with the use of zero-copy techniques for
both the input and output columns, these functions can be
executed efficiently on large datasets.
      </p>
      <p>SELECT MEDIAN(SQRT(i * 2)) FROM tbl;
Listing 1: Chain of SQL operators.</p>
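      <p>As an illustration of how such a vectorized UDF is defined and invoked, the snippet below registers a Python UDF through a client connection and calls it from SQL. The CREATE FUNCTION ... LANGUAGE PYTHON form follows [12]; the connection parameters, table name, and function name are assumptions made for this sketch.</p>
      <preformat>
# Hedged sketch: registering and calling a MonetDB/Python UDF from a client.
# Connection details, table name, and function name are placeholders.
import pymonetdb

conn = pymonetdb.connect(database="demo", hostname="localhost",
                         username="monetdb", password="monetdb")
cur = conn.cursor()

# The UDF body receives the input column as a NumPy array and returns
# one output value per input value (vectorized execution).
cur.execute("""
CREATE FUNCTION times_two(i INTEGER) RETURNS INTEGER
LANGUAGE PYTHON {
    return i * 2
};
""")

cur.execute("SELECT times_two(i) FROM tbl;")
print(cur.fetchall())
conn.close()
</preformat>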
      <p>Parallelism. Another advantage of UDFs is that they
can take advantage of the database's automatic
parallelization model. In MonetDB, parallel execution is achieved
by marking individual operators as either parallelizable or
blocking. When a chain of parallel operators is executed on
a column, the column is split up into several chunks and the
operator is executed once on each chunk. When a blocking
operator is encountered, the chunks are packed into a
single column and the blocking operator is executed on that
column. This process is visualized in Figure 3.</p>
      <p>MonetDB/Python UDFs can be parallelized in the same
way. The functions can be set to either allow parallelization,
in which case they are executed as a parallelizable operator,
or to disallow parallelization, in which case they will operate
with the entire column as input. User-defined aggregates are
parallelized over the different groups, where the aggregate is
called once for each group with the tuples belonging to that
group as input. The aggregates computed for each group
are then gathered and combined to form the final result.</p>
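      <p>Conceptually, the per-group execution of a user-defined aggregate follows the split-apply-combine loop sketched below. This is a plain-Python illustration of the scheme rather than MonetDB's internal implementation, and the geometric-mean aggregate is an arbitrary example.</p>
      <preformat>
# Conceptual sketch of per-group evaluation of a user-defined aggregate:
# the input column is split by group, the aggregate runs once per group
# (possibly in parallel), and the per-group results are combined.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def geometric_mean(values):
    """An example user-defined aggregate over one group's values."""
    return float(np.exp(np.mean(np.log(values))))

groups = np.array([0, 0, 1, 1, 1, 2])               # group id per tuple
values = np.array([1.0, 4.0, 2.0, 8.0, 4.0, 3.0])   # aggregated column

per_group = [values[groups == g] for g in np.unique(groups)]
with ThreadPoolExecutor() as pool:                  # each group handled independently
    results = list(pool.map(geometric_mean, per_group))

print(dict(zip(np.unique(groups).tolist(), results)))
</preformat>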
      <p>Development Workflow. A challenge when developing
user-defined functions is that, since they are executed inside
the database server, standard tools and integrated
development environments (IDEs) cannot be used to develop them.
As a result, developers cannot use sophisticated debugging
techniques (e.g., interactive debugging) and have to resort
to inefficient debugging strategies to make their code work.</p>
      <p>
        In order to make it easier to develop MonetDB/Python
UDFs, we extended the client of MonetDB to allow for local
testing of user-defined functions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The required data (or
a sample of it) is automatically shipped from the database
to the client together with the source code of the UDF. It
can then be executed locally and run in either a stand-alone
interactive debugger or a full-fledged IDE.
      </p>
      <p>Model Management. Another issue that arises is the
management of different machine learning models. Current
systems, such as TensorFlow, allow the models to be written
to disk as individual files. However, much like handling data
as flat files, handling models as flat files is cumbersome and
error-prone.</p>
      <p>
        In our work [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we investigate how we can do model
management using a relational database. By storing the models
in a relational database, we can store the models
alongside their training information or meta-information gathered
about the model. This allows us to query and apply the
models based on this information, as well as apply multiple
models in parallel for ensemble learning.
      </p>
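      <p>A minimal sketch of this idea is shown below: a serialized model is stored in a table next to its metadata, so that models can later be selected and applied with a query. The schema, the use of pickle, and SQLite as the backing store are assumptions made purely for illustration; the work in [11] integrates the models into MonetDB itself.</p>
      <preformat>
# Minimal sketch of relational model management: store a serialized model
# as a BLOB next to queryable metadata. Schema and storage engine are
# illustrative only.
import pickle
import sqlite3

class ThresholdModel:
    """Toy stand-in for a trained model."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, x):
        return [v > self.threshold for v in x]

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE models (
    name TEXT, trained_on TEXT, accuracy REAL, blob BLOB)""")

model = ThresholdModel(threshold=0.5)
con.execute("INSERT INTO models VALUES (?, ?, ?, ?)",
            ("toy-v1", "2018-01-01", 0.93, pickle.dumps(model)))

# Select the best model by its stored metadata and apply it.
blob, = con.execute(
    "SELECT blob FROM models ORDER BY accuracy DESC LIMIT 1").fetchone()
best = pickle.loads(blob)
print(best.predict([0.2, 0.7, 0.9]))
</preformat>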
    </sec>
    <sec id="sec-4">
      <title>4. EMBEDDED DATABASE</title>
      <p>Both of the previously discussed approaches require the user
to have a running database server. This requires significant
manual effort from the user, as the database server
must be installed, tuned and continuously maintained. For
small-scale data analysis, the effort spent on maintaining
the database server often negates the benefits of using one.</p>
      <p>Embedding a database inside the client program, as shown
in Figure 1c, is more applicable for these use cases. As the
database can be installed and run from within the client
program, maintaining and setting up the database is much
simpler than with full-fledged database systems.</p>
      <p>
        The most commonly used embedded database is SQLite [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
However, SQLite is first and foremost designed for
transactional workloads. It is a row-store database that uses a
volcano-based processing model for query execution. While
popular analytical tools such as Python and R do have
SQLite bindings, it does not perform well when used for
analytical purposes. Even exclusively using SQLite as a storage
engine typically does not work out well in these scenarios.
Often only select columns of a table are used in analyses,
and its row-wise storage layout forces it to always load
entire tables. This can lead to very poor performance when
dealing with wide data.
      </p>
      <p>
        To fill this gap, we created MonetDBLite [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], an
open-source embedded database based on the popular columnar
database MonetDB. Much like SQLite, it is an in-process
database that can be installed and run directly from within
popular analytical tools without any external dependencies.
However, unlike SQLite it is designed for analytical
workloads, and as such performs significantly better when
executing analytical queries that operate on large amounts
of data. Because of the columnar layout of the database
and zero-copy semantics, data can be copied between the
database and the analytical tool for a constant cost, and no
large costs need to be paid when extracting only a subset of
columns from a wide table.
      </p>
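      <p>The intended workflow looks roughly like the snippet below: the database runs inside the Python process and query results come back as in-memory columns. The exact entry points of the MonetDBLite Python package are assumed here and may differ between versions; the database directory, table, and column names are placeholders.</p>
      <preformat>
# Hedged sketch of using an embedded analytical database in-process.
# The monetdblite entry points shown here are assumptions and may differ
# between package versions.
import monetdblite

monetdblite.init("/tmp/analytics_db")             # database lives inside this process
monetdblite.sql("CREATE TABLE measurements (sensor_id INTEGER, value DOUBLE)")
monetdblite.insert("measurements",
                   {"sensor_id": [1, 1, 2], "value": [0.5, 0.7, 0.2]})

# Results come back as in-memory columns (NumPy arrays), so only the
# selected columns of the table are materialized in the client.
result = monetdblite.sql("SELECT sensor_id, AVG(value) AS avg_value "
                         "FROM measurements GROUP BY sensor_id")
print(result)
monetdblite.shutdown()
</preformat>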
      <p>
        This efficient data transfer is illustrated by the
experiment in Figure 4, where we transfer the lineitem table from
the TPC-H benchmark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] from the database to the client
process using MonetDBLite, SQLite, MonetDB and
PostgreSQL. We observe that data can be exported from
MonetDBLite an order of magnitude faster than over either a
socket connection (in the case of MonetDB and PostgreSQL)
or from the row-storage model of SQLite.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. RESEARCH DIRECTIONS</title>
      <p>While our current solutions have made it both easier and
more efficient to combine relational databases with
analytical tools, connecting them efficiently and effortlessly is by
no means a solved problem. In this section, we describe
the open research problems that we have identified and how
we plan on tackling them in the future.</p>
      <p>Automatic Code Shipping. While user-defined
functions allow for efficient in-database analytics, significant
manual transformation effort is still required to take an
existing analytical pipeline and make it run inside the database
server. Ideally, we would be able to automatically translate
an existing analytical pipeline and execute it on data
residing in the database without requiring manual user effort.</p>
      <p>A solution to this problem could be to take a program
that uses a database connector to connect to a database,
and run the code directly inside the database server.
Instead of connecting with the database through a socket, the
SQL code could be directly executed inside the server and
the results could be used inside the analytical tool without
requiring data transfer. This approach does negate the
potential advantages of automatic parallelization, however.</p>
      <p>Alternatively, the code could be analyzed and translated
into user-defined functions that can be executed within the
database server and could potentially be parallelized.
Analyzing whether arbitrary code can be safely parallelized is not
possible in general, as it would be equivalent to solving the
halting problem. However, it would already be useful if a
limited subset of operations could be automatically shipped
and executed in parallel inside the database server. For
example, a number of commonly used operations of the Pandas
and NumPy libraries could be supported, as sketched below.</p>
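      <p>The following sketch shows what such a limited translation could look like for one common pattern: a Pandas group-by aggregation rewritten into a SQL query that the server can execute and parallelize. The mapping table, function name, and supported pattern are hypothetical; a real translator would need to cover many more operations.</p>
      <preformat>
# Hypothetical sketch of shipping a common Pandas operation to the server:
# df.groupby(key)[col].agg(...) is rewritten into an equivalent SQL query.
# The supported pattern and function names are illustrative assumptions.
AGGREGATES = {"sum": "SUM", "mean": "AVG", "min": "MIN", "max": "MAX"}

def groupby_to_sql(table, key, column, agg):
    """Translate a single group-by aggregation into SQL."""
    if agg not in AGGREGATES:
        raise NotImplementedError(f"cannot ship aggregate '{agg}'")
    return (f"SELECT {key}, {AGGREGATES[agg]}({column}) AS {column}_{agg} "
            f"FROM {table} GROUP BY {key}")

# Instead of: df.groupby("sensor_id")["value"].mean()
print(groupby_to_sql("measurements", "sensor_id", "value", "mean"))
# SELECT sensor_id, AVG(value) AS value_mean FROM measurements GROUP BY sensor_id
</preformat>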
      <p>UDF Co-optimization. Currently, MonetDB/Python
UDFs are executed as black-box functions. As a result, there
is almost no room for automatic optimization of the actual
code. The only optimization we apply is the parallelization
of the functions; however, even this requires the user to tell
us whether or not the function is parallelizable.</p>
      <p>Lazy evaluation could allow us to optimize the UDFs
further. Rather than executing the function in an eager
manner, we could defer the execution of certain operations on
the input columns (e.g., common NumPy and Pandas
operations). This would allow us to build a computation graph,
and either (1) run parts of that computation graph inside
the database's execution engine (where it could be executed
in parallel and take advantage of existing indexes), or (2)
feed information extracted from the computation graph to
the database optimizer.</p>
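      <p>A toy version of the idea is sketched below: a column proxy records operations instead of executing them, so the resulting expression tree can later be inspected, rewritten, or handed to the database before anything is computed. The proxy class and its methods are invented purely for this illustration.</p>
      <preformat>
# Toy sketch of lazy evaluation for UDF inputs: operations on a column
# proxy build an expression tree instead of computing results eagerly.
# The class and method names are invented for this illustration.
import numpy as np

class LazyColumn:
    def __init__(self, op, args):
        self.op, self.args = op, args
    def __mul__(self, other):
        return LazyColumn("mul", [self, other])
    def sqrt(self):
        return LazyColumn("sqrt", [self])
    def plan(self):
        """Render the recorded computation graph as a nested expression."""
        args = ", ".join(a.plan() if isinstance(a, LazyColumn) else repr(a)
                         for a in self.args)
        return f"{self.op}({args})"
    def evaluate(self, data):
        """Fallback: execute the recorded graph eagerly inside the UDF."""
        if self.op == "input":
            return data[self.args[0]]
        vals = [a.evaluate(data) if isinstance(a, LazyColumn) else a
                for a in self.args]
        return {"mul": lambda x, y: x * y, "sqrt": np.sqrt}[self.op](*vals)

i = LazyColumn("input", ["i"])
expr = (i * 2).sqrt()                      # nothing is computed yet
print(expr.plan())                         # sqrt(mul(input('i'), 2))
print(expr.evaluate({"i": np.arange(4)}))  # eager fallback execution
</preformat>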
      <p>Acknowledgments</p>
      <p>This work was funded by the Netherlands Organisation
for Scientific Research (NWO), project "Process Mining for
Multi-Objective Online Control".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] TPC Benchmark H (Decision Support) Standard Specification. Technical report, Transaction Processing Performance Council, June 2013.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. Allen and M. Owens. The Definitive Guide to SQLite. Apress, Berkeley, CA, USA, 2nd edition, 2010.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Q. Chen, M. Hsu, and R. Liu. Extend UDF Technology for Integrated Analytics. In T. Pedersen, M. Mohania, and A. Tjoa, editors, Data Warehousing and Knowledge Discovery, volume 5691 of Lecture Notes in Computer Science, pages 256–270. Springer Berlin Heidelberg, 2009.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] P. Holanda, M. Raasveldt, and M. Kersten. Don't Hold My UDFs Hostage - Exporting UDFs For Debugging Purposes. In Simpósio Brasileiro de Banco de Dados (SBBD 2017), Uberlândia, Brazil, 2017.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. Idreos, F. Groffen, N. Nes, S. Manegold, S. Mullender, and M. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng. Bull., 2012.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] ISO. ISO/IEC 9075:1992, Database Language SQL. Technical report, International Organization for Standardization (ISO), July 1992.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926, Dec. 2012.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] W. McKinney. Data Structures for Statistical Computing in Python. In S. van der Walt and J. Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Raasveldt. MonetDBLite: An Embedded Analytical Database. In SIGMOD '18: Proceedings of the 2018 ACM International Conference on Management of Data, 2018.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Raasveldt, P. Holanda, H. Mühleisen, and S. Manegold. Deep Integration of Machine Learning Into Column Stores. In Proceedings of the 21st International Conference on Extending Database Technology (EDBT), 2018.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Raasveldt and H. Mühleisen. Vectorized UDFs in Column-Stores. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20, 2016, pages 16:1–16:12, 2016.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Raasveldt and H. Mühleisen. Don't Hold My Data Hostage: A Case for Client Protocol Redesign. Proc. VLDB Endow., 10(10):1022–1033, June 2017.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science &amp; Engineering, 13(2):22–30, March 2011.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] H. Wang and C. Zaniolo. User-Defined Aggregates in Database Languages. In R. Connor and A. Mendelzon, editors, Research Issues in Structured and Semistructured Database Programming, volume 1949 of Lecture Notes in Computer Science, pages 43–60. Springer Berlin Heidelberg, 2000.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] H. Wickham. Package 'dplyr': A Grammar of Data Manipulation, 2017.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>