=Paper=
{{Paper
|id=Vol-2841/SIMPLIFY_3
|storemode=property
|title=Scale-independent Data Analysis with Database-backed Dataframes: a Case Study
|pdfUrl=https://ceur-ws.org/Vol-2841/SIMPLIFY_3.pdf
|volume=Vol-2841
|authors=Phanwadee Sinthong,Michael Carey,Yuhan Yao
|dblpUrl=https://dblp.org/rec/conf/edbt/Sinthong0Y21
}}
==Scale-independent Data Analysis with Database-backed Dataframes: a Case Study==
Phanwadee Sinthong, Yuhan Yao, Michael J. Carey
University of California, Irvine, California, USA
psinthon@uci.edu, yuhany2@uci.edu, mjcarey@ics.uci.edu

ABSTRACT

Big Data analytics are being used in many businesses and organizations. However, the requirements that big data analytics introduce cannot be solved by a single system, and they require distributed-system knowledge and data management expertise from data analysts who should instead be focusing on gleaning information from the data. AFrame is a data exploration library that provides a data scientist's familiar interface, the Pandas DataFrame, and scales its operations to large volumes of data through a big data management system to enable a seamless analysis experience when moving from small to large datasets. In this paper, we present a detailed case study that uses AFrame to perform an end-to-end data analysis, from data cleaning and modeling through to deployment.

1 INTRODUCTION

As large volumes of data are generated and made available through various sources (e.g., social platforms, IoT, and daily transactions), Big Data analytics are gaining popularity across all types of industries. Harnessing information and patterns from data can help businesses identify opportunities, make well-informed decisions, and manage risks. That, in turn, leads to higher profits and satisfied customers. Even though Big Data production is evident, efficient Big Data processing and analysis remain challenging. Existing data analytic tools and libraries require extensive knowledge of distributed data management from data analysts, who should instead be focusing on tasks like developing and using machine learning models.

AFrame [13] is a data exploration and analysis library that delivers a scale-independent analysis experience by providing a Pandas-like DataFrame interface while transparently scaling the analytical operations to be executed on a database system. AFrame leverages database data storage and management in order to accommodate the rate and volume at which the data arrives. It allows data scientists to perform data analysis right where the data is stored. AFrame simplifies user interactions with large amounts of data, as the framework provides data scientists with a familiar interface and operates directly on database systems without requiring multiple system setups.

In this paper, we illustrate the usability of AFrame through a case study that uses it to perform an end-to-end data analysis. We highlight some of AFrame's functionalities that help simplify Big Data analysis through each of the data analytics lifecycle stages. The notebook that we use in this paper is also available.¹

The rest of this paper is organized as follows: Section 2 discusses background and related work. Section 3 details our case study. We conclude and describe our current work in Section 4.

¹ https://nbviewer.jupyter.org/github/psinthong/SF_CRIME_Notebook/blob/master/sf_crimes_paper.ipynb

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 BACKGROUND AND RELATED WORK

We are developing AFrame to lighten user workloads and ease the transition from a local workstation to a production environment by providing data scientists with their familiar interface, Pandas DataFrames, and transparently scaling its operations to execute in a distributed database environment. Here we briefly review Pandas, AFrame's architecture, and related work.

2.1 Pandas

Pandas [5] is an open source Python data analytic library. Pandas provides a data structure called DataFrame, designed specifically for data manipulation and analysis. A DataFrame is similar to a table in Excel, with labeled rows and columns. Pandas works with several popular machine learning libraries such as Scikit-learn [11] and TensorFlow [8]. Pandas is one of the most popular data analytic libraries partly due to its flexible data structure and the rich set of features that the library provides. However, Pandas' shortcoming lies in its scalability, as it only operates on a single workstation and utilizes a single processing core.

2.2 AFrame

AFrame is a Python library that provides a Pandas DataFrame-like syntax to interact with distributed data stored in a big data management system, Apache AsterixDB [9]. AFrame utilizes lazy evaluation to scale Pandas operations by incrementally constructing database queries from each Pandas DataFrame operation, and it only sends the queries over to AsterixDB when results are actually called for. AFrame delivers a Pandas DataFrame-like experience on Big Data, requiring minimal effort from data scientists and allowing them to focus on analyzing the data.

2.3 Related Work

There are several scalable dataframe libraries that try to deliver a Pandas-like experience and scale operations onto large volumes of file-based data using different methods. These libraries either provide a similar Pandas-like interface on a distributed compute engine, execute several Pandas DataFrames in parallel, or use memory mapping to optimize the computation and speed up data access. To our knowledge, there has not been any effort to develop a Pandas-like interface directly on top of database systems where large volumes of data are stored. AFrame can leverage database indexes and query optimizers to efficiently access and operate on data at scale without moving it into a system-specific environment or creating an intermediate data representation. Here we briefly compare and contrast two of the most well-known scalable dataframe libraries, Spark [10] and Dask [12].

2.3.1 Spark. Apache Spark [2] is a general-purpose cluster computing framework. Spark provides a DataFrame API, an interface for data analysts to interact with data in a distributed file system. However, Spark's DataFrame syntax is quite different from Pandas', as Spark is heavily influenced by SQL's syntax while Pandas relies on features of the Python language. As a result, the Koalas project [4] was introduced to bridge the gap between Spark and Pandas DataFrames. Koalas is a Python library implemented on top of the Spark DataFrame API, and its syntax is designed to be largely identical to that of Pandas'. In order to support Pandas DataFrame features (e.g., row labels, eager evaluation) in a distributed environment, Koalas implements an intermediate data representation that results at times in expensive operations and performance trade-offs. Since Koalas operates directly on top of Spark, users are also required to set up distributed file storage as well as tune Spark to suit their data access patterns.

2.3.2 Dask. Dask is an open source Python framework that provides advanced parallelism for data scientists' familiar analytical libraries such as Pandas, NumPy, and Scikit-learn. Dask DataFrame is a scalable dataframe library that is composed of multiple Pandas DataFrames. Dask partitions its data files row-wise and operates on them in parallel while utilizing all of the available computational cores. Since Dask uses Pandas internally, it also inherits the high memory consumption problem that Pandas has. Dask tries to compensate for this disadvantage by delaying its expression evaluation in order to reduce the amount of needed computation. However, Dask does not aim to implement all of the Pandas operations, because supporting certain Pandas operations in a distributed environment results in poor performance trade-offs, as mentioned in its documentation [3].

3 AFRAME CASE STUDY

There are several methodologies that provide a guideline for the stages in a data science lifecycle, such as the traditional method called CRISP-DM for data mining and an emerging methodology introduced by Microsoft called TDSP [7]. They have defined general phases in a data analysis lifecycle that can be summarized as follows: business understanding, data understanding and preparation, modeling, evaluation, and deployment.

Our goal for AFrame is to deliver a scale-independent data analysis experience. Data analysts should be able to use AFrame interchangeably with Pandas DataFrames and utilize their favorite ML libraries to perform each of the data science lifecycle stages on large volumes of data with minimum effort. In order to see if AFrame delivers up to our expectations, we conducted a case study by asking a user to perform a data analysis using AFrame in places where Pandas DataFrames would otherwise be used (with the exception of training machine learning models). We used a running example of an analysis of a San Francisco police department historical incident report dataset [6] to predict the probability of incidents being resolved. Due to space limitations, we will only be displaying a subset of the printed attributes in our figures.

We present the case study here through each of the previously mentioned data science lifecycle stages. (Interested readers can find information about AFrame's performance in [13].)

3.1 Data understanding and preparation

The goal of this stage in the data science lifecycle is to obtain and clean the data to prepare it for analysis. Figure 1 is a snapshot from a Jupyter notebook that shows the process of creating an AFrame object and displaying two records from a dataset. Input line 2, labeled 'In[2]', shows how to create an AFrame object by utilizing AFrame's AsterixDB connector and providing a dataverse name and a dataset name. For this example, the dataset is called 'Crimes_sample' and it is contained in a dataverse named 'SF_CRIMES'. The AsterixDB server is running locally on port 19002. Input line 3 displays a sample of two records from the dataset, and input line 4 displays the underlying SQL++ query that AFrame generated and extended. More about AFrame's incremental query formation process can be found in [13].

Figure 1: Acquire data

Next, our user drops some columns from the dataset and explores the data values, as shown in Figure 2. Input line 5 drops several columns using Pandas' 'drop' function and prints out two records from the resulting data. Our user then prints out the unique values from the columns 'pdDistrict' and 'dayOfWeek', as shown in input lines 6 and 7 respectively.

Figure 2: Data cleaning and exploration

3.2 Modeling

The next stage in a data science project lifecycle is to determine and optimize features through feature engineering to facilitate machine learning model training. This stage also includes machine learning model development, which is the process of constructing and selecting a model that can predict the target values most accurately considering the chosen success metrics.

In Figure 3, the user applies one-hot encoding by utilizing Pandas' 'get_dummies' function to create multiple features from the columns that he previously explored. Input line 8 applies one-hot encoding to the 'pdDistrict' column, and line 9 displays the resulting data with ten new columns, each indicating whether or not the record has that particular feature. Input lines 10 and 11 perform the same operation on the 'category' and 'dayOfWeek' columns respectively. The user then appends all of the one-hot encodings to the original data in input line 12.

Figure 3: One-hot encodings

In order to extract important features from the data, users can also apply AsterixDB's builtin functions directly on the entire data or part of the data. This is done through the 'map' and 'apply' functions. The complete list of all available builtin functions can be found at [1]. Figure 4 shows an example of using the map operation on a subset of the data attributes. Input line 13 creates two new columns, 'month' and 'hour'. For 'month', the user applies AsterixDB's 'parse_date' to the 'date' column to generate a date object and then applies the 'get_month' function to extract only the month before appending it as a new column called 'month'. Similarly, for the 'hour' column, 'parse_time' is applied to the 'time' column followed by the 'get_hour' function to extract the hour of day from the data before appending it as a new column. Finally, to finish up the feature engineering process, the target column 'resolution' is converted into a binary value column called 'resolved' using AsterixDB's 'to_number' function.

Figure 4: Applying functions to create new columns

Once the feature engineering process is done, the data is split into training and testing sets for use in training and evaluating machine learning models. Figure 5 shows the process of splitting the data. Input line 19 converts the data referenced by an AFrame object into a Pandas DataFrame. Currently, feeding data into existing Scikit-learn models directly from a database system is not supported. As such, AFrame provides an operation called 'toPandas', which converts data into Pandas DataFrame objects.² On input line 20, 'Y' is the binary encoded 'resolved' column, and the remaining columns will be used to train the models. Input line 22 splits the data into an 80% training set and a 20% testing set.

Figure 5: Preparing data for model training

² Pandas requires the resulting data to fit in memory. This currently limits the training dataset size, but there is no limit to the amount of data to which the model may be applied (see Section 3.3).

Input lines 23–26 in Figure 6 are standard Scikit-learn model training steps that take a Pandas DataFrame as their input. Input line 23 trains a Logistic Regression model on the training data, while input line 24 calls the 'predict' function on the test data and displays a subset of the results. Instead of returning binary results indicating whether or not a particular incident will get resolved, users can utilize Scikit-learn's 'predict_proba' method to get the raw probability values that the model outputs for each of the prediction labels (0 and 1 in our case). Input line 25 shows the probability values that the model outputs for each of the labels in order (0 followed by 1) on a subset of the records. Our user decided to use the 'predict_proba' function and output the probability of an incident getting resolved, as shown in line 26.

Figure 6: Model training and inferencing

3.3 Evaluation and deployment

In order to deploy the model into a production environment and apply it to large datasets at full scale, users can export and package their models using Python's pickle and then deposit them into AsterixDB for use as external user-defined functions (UDFs). In our example, the user deposited their model and created a function called 'getResolution' in AsterixDB that invokes the trained model's 'predict_proba' function on a data record. Due to space limitations, we omit the steps to deposit the model into AsterixDB and create a new UDF to call it, but the required steps can be found in [1]. After creating a UDF, users can then utilize their model using AFrame in the same way that they would use a builtin function. Figure 7 shows the user applying their Python UDF 'getResolution' on an AFrame object. Line 37 uses the 'apply' operation to apply the Python UDF on the previously transformed data. The results of the function are the probabilities of crime incidents getting resolved, as displayed in line 38.

Figure 7: Calling the model using the function syntax

At this point in the analysis, our user is done with the training and evaluation process and wants to apply the model to other, larger sets of data. However, to apply the model to a different dataset, that dataset has to have the same features arranged in the same order as the training data. To simplify the model inferencing process, AFrame allows users to save their data transformation as a function that can then be applied to other datasets effortlessly. Line 40 in Figure 8 displays the persist-transformation syntax. The user has named the resulting function 'is_resolvable'. The underlying SQL++ query that transforms and appends the engineered features to a data record before calling the trained model on it is shown in input line 41.

Figure 8: Persisting the transformation

Finally, applying the data transformation and the ML model to a large distributed dataset can be done through AFrame using the apply function. Figure 9 shows the user accessing a different dataset, called 'Crimes', from AsterixDB and applying the 'is_resolvable' function to it. In input line 43, our user filters only the crime incidents that happened in the Central district, possibly assisted under the hood by an index on 'pdDistrict', before applying the trained ML model. The model's prediction results are appended to the dataset as a new column called 'is_resolvable' in line 44. These results can be used for further analysis or visualized with Pandas-compatible visualization libraries by converting the AFrame object into a Pandas DataFrame.

Figure 9: Model inferencing

3.4 Lessons Learned

The case study was not only helpful in demonstrating the usability of AFrame, but it also helped us identify both useful and missing features. For example, the transformation saving mode was created to record the data transformation steps as a function so that they can easily be applied to other datasets using the existing function syntax. Pandas-specific functions (e.g., describe, get_dummies) are implemented in AFrame by internally calling multiple simple operations in sequence. This was influenced by the engineered features that the user manually created.

4 CONCLUSION

In this short paper, we have detailed an end-to-end case study that utilizes AFrame in a data analysis. We have demonstrated the flexibility and simplicity of using AFrame to perform data preparation, modeling, and deployment on different dataset sizes, moving seamlessly from a small dataset to datasets at scale. We believe that AFrame can deliver a scale-independent data preparation and analysis experience while requiring less effort from data analysts, allowing them to focus on more critical tasks.

Currently, we are extending AFrame by retargeting its incremental query transformation process onto multiple database systems to make its benefits available to existing users of other databases. Our re-architected version of AFrame, called PolyFrame, can operate against AsterixDB using SQL++, PostgreSQL using SQL, MongoDB using MongoDB's query language, and Neo4j using Cypher. More information about PolyFrame, including its architecture and language configuration rules, can be found in [14].

REFERENCES

[1] Apache AsterixDB. https://asterixdb.apache.org/.
[2] Apache Spark. http://spark.apache.org/.
[3] Dask Anti-Uses. https://docs.dask.org/en/latest/dataframe.html.
[4] Koalas. http://koalas.readthedocs.io.
[5] Pandas. http://pandas.pydata.org/.
[6] Police department incident reports. http://data.sfgov.org/.
[7] Team Data Science Process. https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview.
[8] M. Abadi et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX OSDI Symposium, pages 265–283, 2016.
[9] S. Alsubaiee et al. AsterixDB: A scalable, open source BDMS. PVLDB, 7(14):1905–1916, 2014.
[10] M. Armbrust et al. Scaling Spark in the real world: Performance and usability. PVLDB, 8(12):1840–1843, 2015.
[11] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[12] M. Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Proc. 14th Python in Science Conf., pages 130–136, 2015.
[13] P. Sinthong and M. J. Carey. AFrame: Extending DataFrames for large-scale modern data analysis. In 2019 IEEE Big Data Conf., pages 359–371, 2019.
[14] P. Sinthong and M. J. Carey. PolyFrame: A query-based approach to scaling Dataframes (extended version). arXiv preprint arXiv:2010.05529, 2020.
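The incremental, lazy query construction that AFrame uses (Section 2.2) can be illustrated with a toy sketch: each dataframe operation only extends a SQL++-style query string, and nothing is sent to the database until results are requested. The `LazyFrame` class and its methods here are hypothetical, invented for illustration; they are not AFrame's actual API.

```python
# Toy illustration of lazy query building (hypothetical names, not AFrame's API).
class LazyFrame:
    def __init__(self, dataverse, dataset, query=None):
        self.dataverse = dataverse
        self.dataset = dataset
        # The base query scans the dataset; later operations wrap/extend it.
        self.query = query or f"SELECT VALUE t FROM {dataverse}.{dataset} t"

    def __getitem__(self, condition):
        # A filter wraps the current query in a sub-select; no data is fetched.
        wrapped = f"SELECT VALUE t FROM ({self.query}) t WHERE {condition}"
        return LazyFrame(self.dataverse, self.dataset, wrapped)

    def head(self, n=5):
        # Only at this point would the accumulated query be sent to the
        # database; here we simply return the query string for illustration.
        return f"{self.query} LIMIT {n}"

df = LazyFrame("SF_CRIMES", "Crimes_sample")
central = df["t.pdDistrict = 'CENTRAL'"]
print(central.head(2))
```

Because every intermediate object is just a query description, the database's optimizer sees the whole expression at once when results are finally materialized.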
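The row-wise partitioning idea behind Dask DataFrame can be mimicked with plain Pandas: split the rows into chunks, compute on each chunk independently (each chunk could go to a separate core), and combine the partial results. The `partitioned_sum` helper below is an invented illustration of the concept, not part of Dask's API.

```python
import pandas as pd

# Invented helper mimicking Dask-style row-wise partitioning.
def partitioned_sum(df, column, npartitions=4):
    step = -(-len(df) // npartitions)  # ceiling division: rows per partition
    # Each slice is an independent Pandas DataFrame, processed separately.
    parts = [df.iloc[i:i + step] for i in range(0, len(df), step)]
    # Combine the per-partition partial results.
    return sum(p[column].sum() for p in parts)

df = pd.DataFrame({"x": range(10)})
print(partitioned_sum(df, "x"))  # → 45, same as df["x"].sum()
```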
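Since AFrame mirrors Pandas' syntax, the exploration steps of Figure 2 read as ordinary Pandas code; the same `drop`/`unique` calls are intended to work unchanged on an AFrame object backed by AsterixDB. The toy records below are invented for illustration.

```python
import pandas as pd

# Invented toy records standing in for the crime incident dataset.
df = pd.DataFrame({
    "incidentNum": [1, 2, 3],
    "pdDistrict": ["CENTRAL", "SOUTHERN", "CENTRAL"],
    "dayOfWeek": ["Monday", "Friday", "Monday"],
    "pdId": [10, 11, 12],
})

df = df.drop(columns=["pdId"])      # In[5]: drop unneeded columns
print(df.head(2))                   # inspect two records of the result
print(df["pdDistrict"].unique())    # In[6]: unique districts
print(df["dayOfWeek"].unique())     # In[7]: unique weekdays
```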
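The one-hot encoding of Figure 3 uses Pandas' `get_dummies`, which AFrame exposes with the same syntax on database-resident data. The toy values below are invented; the real `pdDistrict` column yields ten district feature columns.

```python
import pandas as pd

# Invented toy values standing in for the explored categorical columns.
df = pd.DataFrame({"pdDistrict": ["CENTRAL", "SOUTHERN"],
                   "dayOfWeek": ["Monday", "Friday"]})

district = pd.get_dummies(df["pdDistrict"])      # In[8]: 0/1 district features
weekday = pd.get_dummies(df["dayOfWeek"])        # In[10]/In[11]: weekday features
df = pd.concat([df, district, weekday], axis=1)  # In[12]: append the encodings
print(df.columns.tolist())
```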
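A client-side Pandas equivalent of the feature engineering in Figure 4 looks like the following. In the paper, these steps run inside AsterixDB through AFrame's `map`/`apply`, using the builtin functions `parse_date`/`get_month`, `parse_time`/`get_hour`, and `to_number`; the toy records here are invented.

```python
import pandas as pd

# Invented toy records with the date/time/resolution attributes used in the paper.
df = pd.DataFrame({"date": ["2018-05-03", "2018-11-20"],
                   "time": ["23:45", "08:10"],
                   "resolution": ["ARREST, BOOKED", "NONE"]})

# Pandas stand-ins for parse_date + get_month and parse_time + get_hour.
df["month"] = pd.to_datetime(df["date"]).dt.month
df["hour"] = pd.to_datetime(df["time"], format="%H:%M").dt.hour
# Binary-encode the target, analogous to the 'resolved' column.
df["resolved"] = (df["resolution"] != "NONE").astype(int)
print(df[["month", "hour", "resolved"]])
```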
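The model training and packaging steps of Figures 5–7 are standard Scikit-learn and pickle calls. This sketch uses a small synthetic stand-in for the one-hot encoded crime features (the real notebook obtains its DataFrame via AFrame's `toPandas`); the pickle round-trip mirrors how the trained model is packaged before being deposited into AsterixDB as a UDF.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 3))            # synthetic stand-in feature columns
y = (X[:, 0] > 0.5).astype(int)     # synthetic stand-in 'resolved' target

# In[22]: 80% training set, 20% testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # In[23]
proba = model.predict_proba(X_test)                 # In[25]: columns P(0), P(1)
print(proba[:2, 1])                                 # P(resolved), as in In[26]

blob = pickle.dumps(model)          # package the model (Section 3.3)
restored = pickle.loads(blob)       # what the deposited UDF would load
assert np.allclose(restored.predict_proba(X_test), proba)
```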
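The "persist transformation" idea of Figure 8 can be sketched in plain Python: the feature engineering steps are captured once as a function and re-applied to any dataset with the same schema. In AFrame the transformation is persisted server-side as a SQL++ function; this client-side analogue, with an invented `transform` helper and toy data, is only illustrative.

```python
import pandas as pd

# Invented helper capturing the engineered-feature steps as one function.
def transform(df):
    out = df.copy()
    out["hour"] = pd.to_datetime(out["time"], format="%H:%M").dt.hour
    out = pd.concat([out, pd.get_dummies(out["pdDistrict"])], axis=1)
    return out

sample = pd.DataFrame({"time": ["23:45"], "pdDistrict": ["CENTRAL"]})
other = pd.DataFrame({"time": ["08:10", "12:00"],
                      "pdDistrict": ["SOUTHERN", "CENTRAL"]})

# The same saved transformation applies to both datasets unchanged.
print(transform(sample).columns.tolist())
print(transform(other).columns.tolist())
```

As the paper notes, a dataset fed to the model must end up with the same features in the same order as the training data, which is exactly what persisting the transformation guarantees.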
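The inference step of Figure 9 expressed in Pandas terms: select only Central-district incidents (in AFrame, this filter runs in the database and can be served by an index on `pdDistrict`), then append the model's output as a new `is_resolvable` column. The `score` function is a hypothetical stand-in for the deposited `getResolution` UDF, not the real model.

```python
import pandas as pd

# Invented toy records standing in for the larger 'Crimes' dataset.
df = pd.DataFrame({"pdDistrict": ["CENTRAL", "SOUTHERN", "CENTRAL"],
                   "hour": [23, 8, 14]})

def score(hour):
    # Invented scoring, for illustration only; the real UDF calls the
    # trained model's predict_proba on a record.
    return round(hour / 24, 2)

central = df[df["pdDistrict"] == "CENTRAL"].copy()     # In[43]: filter
central["is_resolvable"] = central["hour"].map(score)  # In[44]: append scores
print(central)
```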