=Paper=
{{Paper
|id=Vol-2841/SIMPLIFY_3
|storemode=property
|title=Scale-independent Data Analysis with Database-backed Dataframes: a Case Study
|pdfUrl=https://ceur-ws.org/Vol-2841/SIMPLIFY_3.pdf
|volume=Vol-2841
|authors=Phanwadee Sinthong,Michael Carey,Yuhan Yao
|dblpUrl=https://dblp.org/rec/conf/edbt/Sinthong0Y21
}}
==Scale-independent Data Analysis with Database-backed Dataframes: a Case Study==
Phanwadee Sinthong, Yuhan Yao, Michael J. Carey
University of California, Irvine, California, USA
psinthon@uci.edu, yuhany2@uci.edu, mjcarey@ics.uci.edu

ABSTRACT

Big Data analytics are being used in many businesses and organizations. However, the requirements that big data analytics introduce cannot be solved by a single system, and they require distributed-system knowledge and data management expertise from data analysts who should instead be focusing on gleaning information from the data. AFrame is a data exploration library that provides a data scientist's familiar interface, the Pandas DataFrame, and scales its operations to large volumes of data through a big data management system to enable a seamless analysis experience when moving from small to large datasets. In this paper, we present a detailed case study that uses AFrame to perform an end-to-end data analysis, from data cleaning and modeling through to deployment.

1 INTRODUCTION

As large volumes of data are generated and made available through various sources (e.g., social platforms, IoT, and daily transactions), Big Data analytics are gaining popularity across all types of industries. Harnessing information and patterns from data can help businesses identify opportunities, make well-informed decisions, and manage risks. That, in turn, leads to higher profits and satisfied customers. Even though Big Data production is evident, efficient Big Data processing and analysis remain challenging. Existing data analytic tools and libraries require extensive knowledge of distributed data management from data analysts, who should instead be focusing on tasks like developing and using machine learning models.

AFrame [13] is a data exploration and analysis library that delivers a scale-independent analysis experience by providing a Pandas-like DataFrame interface while transparently scaling the analytical operations to be executed on a database system. AFrame leverages database data storage and management in order to accommodate the rate and volume at which the data arrives. It allows data scientists to perform data analysis right where the data is stored. AFrame simplifies user interactions with large amounts of data, as the framework provides data scientists with a familiar interface and operates directly on database systems without requiring multiple system setups.

In this paper, we illustrate the usability of AFrame through a case study that uses it to perform an end-to-end data analysis. We highlight some of AFrame's functionalities that help simplify Big Data analysis through each of the data analytics lifecycle stages. The notebook that we use in this paper is also available.¹

The rest of this paper is organized as follows: Section 2 discusses background and related work. Section 3 details our case study. We conclude and describe our current work in Section 4.

¹ https://nbviewer.jupyter.org/github/psinthong/SF_CRIME_Notebook/blob/master/sf_crimes_paper.ipynb

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 BACKGROUND AND RELATED WORK

We are developing AFrame to lighten user workloads and ease the transition from a local workstation to a production environment by providing data scientists with their familiar interface, Pandas DataFrames, and transparently scaling its operations to execute in a distributed database environment. Here we briefly review Pandas, AFrame's architecture, and related work.

2.1 Pandas

Pandas [5] is an open source Python data analytic library. Pandas provides a data structure called DataFrame, designed specifically for data manipulation and analysis. A DataFrame is similar to a table in Excel, with labeled rows and columns. Pandas works with several popular machine learning libraries such as Scikit-learn [11] and TensorFlow [8]. Pandas is one of the most popular data analytic libraries partly due to its flexible data structure and the rich set of features that the library provides. However, Pandas' shortcoming lies in its scalability, as it only operates on a single workstation and utilizes a single processing core.

2.2 AFrame

AFrame is a Python library that provides a Pandas DataFrame-like syntax to interact with distributed data stored in a big data management system, Apache AsterixDB [9]. AFrame utilizes lazy evaluation to scale Pandas operations by incrementally constructing database queries from each Pandas DataFrame operation, and it only sends the queries over to AsterixDB when results are actually called for. AFrame delivers a Pandas DataFrame-like experience on Big Data, requiring minimal effort from data scientists and allowing them to focus on analyzing the data.

2.3 Related Work

There are several scalable dataframe libraries that try to deliver a Pandas-like experience and scale operations onto large volumes of file-based data using different methods. These libraries either provide a similar Pandas-like interface on a distributed compute engine, execute several Pandas DataFrames in parallel, or use memory mapping to optimize the computation and speed up data access. To our knowledge, there has not been any effort to develop a Pandas-like interface directly on top of database systems where large volumes of data are stored. AFrame can leverage database indexes and query optimizers to efficiently access and operate on data at scale without moving it into a system-specific environment or creating an intermediate data representation. Here we briefly compare and contrast two of the most well-known scalable dataframe libraries, Spark [10] and Dask [12].

2.3.1 Spark. Apache Spark [2] is a general-purpose cluster computing framework. Spark provides a DataFrame API, an interface for data analysts to interact with data in a distributed file system. However, Spark's DataFrame syntax is quite different from Pandas', as Spark is heavily influenced by SQL's syntax while Pandas relies on features of the Python language. As a result, the Koalas project [4] was introduced to bridge the gap between Spark and Pandas DataFrames. Koalas is a Python library implemented on top of the Spark DataFrame API, and its syntax is designed to be largely identical to that of Pandas'. In order to support Pandas DataFrame features (e.g., row labels, eager evaluation) in a distributed environment, Koalas implements an intermediate data representation that results at times in expensive operations and performance trade-offs. Since Koalas operates directly on top of Spark, users are also required to set up distributed file storage as well as tune Spark to suit their data access patterns.

2.3.2 Dask. Dask is an open source Python framework that provides advanced parallelism for data scientists' familiar analytical libraries such as Pandas, NumPy, and Scikit-learn. Dask DataFrame is a scalable dataframe library that is composed of multiple Pandas DataFrames. Dask partitions its data files row-wise and operates on them in parallel while utilizing all of the available computational cores. Since Dask uses Pandas internally, it also inherits the high memory consumption problem that Pandas has. Dask tries to compensate for this disadvantage by delaying its expression evaluation in order to reduce the amount of needed computation. However, Dask does not aim to implement all of the Pandas operations, because supporting certain Pandas operations in a distributed environment results in poor performance trade-offs, as mentioned in its documentation [3].

3 AFRAME CASE STUDY

There are several methodologies that provide a guideline for the stages in a data science lifecycle, such as the traditional method called CRISP-DM for data mining and an emerging methodology introduced by Microsoft called TDSP [7]. They have defined general phases in a data analysis lifecycle that can be summarized as follows: business understanding, data understanding and preparation, modeling, evaluation, and deployment.

Our goal for AFrame is to deliver a scale-independent data analysis experience. Data analysts should be able to use AFrame interchangeably with Pandas DataFrames and utilize their favorite ML libraries to perform each of the data science lifecycle stages on large volumes of data with minimum effort. In order to see if AFrame delivers up to our expectations, we conducted a case study by asking a user to perform a data analysis using AFrame in places where Pandas DataFrames would otherwise be used (with the exception of training machine learning models). We used a running example of an analysis of a San Francisco police department historical incident report dataset [6] to predict the probability of incidents being resolved. Due to space limitations, we will only be displaying a subset of the printed attributes in our figures.

We present the case study here through each of the previously mentioned data science lifecycle stages. (Interested readers can find information about AFrame's performance in [13].)

3.1 Data understanding and preparation

The goal of this stage in the data science lifecycle is to obtain and clean the data to prepare it for analysis. Figure 1 is a snapshot from a Jupyter notebook that shows the process of creating an AFrame object and displaying two records from a dataset. Input line 2, labeled 'In[2]', shows how to create an AFrame object by utilizing AFrame's AsterixDB connector and providing a dataverse name and a dataset name. For this example, the dataset is called 'Crimes_sample' and it is contained in a dataverse named 'SF_CRIMES'. The AsterixDB server is running locally on port 19002. Input line 3 displays a sample of two records from the dataset, and input line 4 displays the underlying SQL++ query that AFrame generated and extended. More about AFrame's incremental query formation process can be found in [13].

Figure 1: Acquire data

Next, our user drops some columns from the dataset and explores the data values, as shown in Figure 2. Input line 5 drops several columns using Pandas' 'drop' function and prints out two records from the resulting data. Our user then prints out the unique values from the columns 'pdDistrict' and 'dayOfWeek', as shown in input lines 6 and 7 respectively.

Figure 2: Data cleaning and exploration

3.2 Modeling

The next stage in a data science project lifecycle is to determine and optimize features through feature engineering to facilitate machine learning model training. This stage also includes machine learning model development, which is the process of constructing and selecting a model that can predict the target values most accurately considering the chosen success metrics.

In Figure 3, the user applies one-hot encoding by utilizing Pandas' 'get_dummies' function to create multiple features from the columns that he previously explored. Input line 8 applies one-hot encoding to the 'pdDistrict' column, and line 9 displays the resulting data with ten new columns, each indicating whether or not the record has that particular feature. Input lines 10 and 11 perform the same operation on the 'category' and 'dayOfWeek' columns respectively. The user then appends all of the one-hot encodings to the original data in input line 12.

Figure 3: One-hot encodings

In order to extract important features from the data, users can also apply AsterixDB's builtin functions directly on the entire data or part of the data. This is done through the 'map' and 'apply' functions. The complete list of all available builtin functions can be found at [1]. Figure 4 shows an example of using the map operation on a subset of the data attributes. Input line 13 creates two new columns, 'month' and 'hour'. For 'month', the user applies AsterixDB's 'parse_date' to the 'date' column to generate a date object and then applies the 'get_month' function to extract only the month before appending it as a new column called 'month'. Similarly, for the 'hour' column, 'parse_time' is applied to the 'time' column followed by the 'get_hour' function to extract the hour of day from the data before appending it as a new column. Finally, to finish up the feature engineering process, the target column 'resolution' is converted into a binary value column called 'resolved' using AsterixDB's 'to_number' function.

Figure 4: Applying functions to create new columns

Once the feature engineering process is done, the data is split into training and testing sets for use in training and evaluating machine learning models. Figure 5 shows the process of splitting the data. Input line 19 converts the data referenced by an AFrame object into a Pandas DataFrame. Currently, feeding data into existing Scikit-learn models directly from a database system is not supported. As such, AFrame provides an operation called 'toPandas', which converts data into Pandas DataFrame objects.² On input line 20, 'Y' is the binary encoded 'resolved' column, and the remaining columns will be used to train the models. Input line 22 splits the data into an 80% training set and a 20% testing set.

Figure 5: Preparing data for model training

² Pandas requires the resulting data to fit in memory. This currently limits the training dataset size, but there is no limit to the amount of data to which the model may be applied (see Section 3.3).

Input lines 23–26 in Figure 6 are standard Scikit-learn model training steps that take a Pandas DataFrame as their input. Input line 23 trains a Logistic Regression model on the training data, while input line 24 calls the 'predict' function on the test data and displays a subset of the results. Instead of returning binary results indicating whether or not a particular incident will get resolved, users can utilize Scikit-learn's 'predict_proba' method to get the raw probability values that the model outputs for each of the prediction labels (0 and 1 in our case). Input line 25 shows the probability values that the model outputs for each of the labels in order (0 followed by 1) on a subset of the records. Our user decided to use the 'predict_proba' function and output the probability of an incident getting resolved, as shown in line 26.

Figure 6: Model training and inferencing

3.3 Evaluation and deployment

In order to deploy the model into a production environment and apply it to large datasets at full scale, users can export and package their models using Python's pickle and then deposit them into AsterixDB for use as external user-defined functions (UDFs). In our example, the user deposited their model and created a function called 'getResolution' in AsterixDB that invokes the trained model's 'predict_proba' function on a data record. Due to space limitations, we omit the steps to deposit the model into AsterixDB and create a new UDF to call it, but the required steps can be found in [1]. After creating a UDF, users can then utilize their model using AFrame in the same way that they would use a builtin function. Figure 7 shows the user applying their Python UDF 'getResolution' on an AFrame object. Line 37 uses the 'apply' operation to apply the Python UDF on the previously transformed data. The results of the function are the probabilities of crime incidents getting resolved, as displayed in line 38.

Figure 7: Calling the model using the function syntax

At this point in the analysis, our user is done with the training and evaluation process and wants to apply the model to other, larger sets of data. However, to apply the model to a different dataset, that dataset has to have the same features arranged in the same order as the training data. To simplify the model inferencing process, AFrame allows users to save their data transformation as a function that can then be applied to other datasets effortlessly. Line 40 in Figure 8 displays the persist-transformation syntax. The user has named the resulting function 'is_resolvable'. The underlying SQL++ query that transforms and appends the engineered features to a data record before calling the trained model on it is shown in input line 41.

Figure 8: Persisting the transformation

Finally, applying the data transformation and the ML model to a large distributed dataset can be done through AFrame using the apply function. Figure 9 shows the user accessing a different dataset, called 'Crimes', from AsterixDB and applying the 'is_resolvable' function to it. In input line 43, our user filters only the crime incidents that happened in the Central district, possibly assisted under the hood by an index on 'pdDistrict', before applying the trained ML model. The model's prediction results are appended to the dataset as a new column called 'is_resolvable' in line 44. These results can be used for further analysis or visualized with Pandas-compatible visualization libraries by converting the AFrame object into a Pandas DataFrame.

Figure 9: Model inferencing

3.4 Lessons Learned

The case study was not only helpful in demonstrating the usability of AFrame, but it also helped us identify both useful and missing features. For example, the transformation saving mode was created to record the data transformation steps as a function so that they can easily be applied to other datasets using the existing function syntax. Pandas-specific functions (e.g., describe, get_dummies) are implemented in AFrame by internally calling multiple simple operations in sequence. This was influenced by the engineered features that the user manually created.

4 CONCLUSION

In this short paper, we have detailed an end-to-end case study that utilizes AFrame in a data analysis. We have demonstrated the flexibility and simplicity of using AFrame to perform data preparation, modeling, and deployment on different dataset sizes, moving seamlessly from a small dataset to datasets at scale. We believe that AFrame can deliver a scale-independent data preparation and analysis experience while requiring less effort from data analysts, allowing them to focus on more critical tasks.

Currently, we are extending AFrame by retargeting its incremental query transformation process onto multiple database systems to make its benefits available to existing users of other databases. Our re-architected version of AFrame, called PolyFrame, can operate against AsterixDB using SQL++, PostgreSQL using SQL, MongoDB using MongoDB's query language, and Neo4j using Cypher. More information about PolyFrame, including its architecture and language configuration rules, can be found in [14].

REFERENCES

[1] Apache AsterixDB. https://asterixdb.apache.org/.
[2] Apache Spark. http://spark.apache.org/.
[3] Dask Anti-Uses. https://docs.dask.org/en/latest/dataframe.html.
[4] Koalas. http://koalas.readthedocs.io.
[5] Pandas. http://pandas.pydata.org/.
[6] Police department incident reports. http://data.sfgov.org/.
[7] Team Data Science Process. https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview.
[8] M. Abadi et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX OSDI Symposium, pages 265–283, 2016.
[9] S. Alsubaiee et al. AsterixDB: A scalable, open source BDMS. PVLDB, 7(14):1905–1916, 2014.
[10] M. Armbrust et al. Scaling Spark in the real world: Performance and usability. PVLDB, 8(12):1840–1843, 2015.
[11] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[12] M. Rocklin. Dask: Parallel computation with blocked algorithms and task scheduling. In Proc. 14th Python in Science Conf., pages 130–136, 2015.
[13] P. Sinthong and M. J. Carey. AFrame: Extending DataFrames for large-scale modern data analysis. In 2019 IEEE Big Data Conf., pages 359–371, 2019.
[14] P. Sinthong and M. J. Carey. PolyFrame: A query-based approach to scaling Dataframes (extended version). arXiv preprint arXiv:2010.05529, 2020.
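The incremental, lazy query construction that AFrame uses (Section 2.2) can be illustrated with a toy sketch: each dataframe operation only extends a SQL++-style query string, and nothing is sent to the database until results are requested. The `LazyFrame` class and its methods here are hypothetical, invented for illustration; they are not AFrame's actual API.

```python
# Toy illustration of lazy query building (hypothetical names, not AFrame's API).
class LazyFrame:
    def __init__(self, dataverse, dataset, query=None):
        self.dataverse = dataverse
        self.dataset = dataset
        # The base query scans the dataset; later operations wrap/extend it.
        self.query = query or f"SELECT VALUE t FROM {dataverse}.{dataset} t"

    def __getitem__(self, condition):
        # A filter wraps the current query in a sub-select; no data is fetched.
        wrapped = f"SELECT VALUE t FROM ({self.query}) t WHERE {condition}"
        return LazyFrame(self.dataverse, self.dataset, wrapped)

    def head(self, n=5):
        # Only at this point would the accumulated query be sent to the
        # database; here we simply return the query string for illustration.
        return f"{self.query} LIMIT {n}"

df = LazyFrame("SF_CRIMES", "Crimes_sample")
central = df["t.pdDistrict = 'CENTRAL'"]
print(central.head(2))
```

Because every intermediate object is just a query description, the database's optimizer sees the whole expression at once when results are finally materialized.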
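The row-wise partitioning idea behind Dask DataFrame can be mimicked with plain Pandas: split the rows into chunks, compute on each chunk independently (each chunk could go to a separate core), and combine the partial results. The `partitioned_sum` helper below is an invented illustration of the concept, not part of Dask's API.

```python
import pandas as pd

# Invented helper mimicking Dask-style row-wise partitioning.
def partitioned_sum(df, column, npartitions=4):
    step = -(-len(df) // npartitions)  # ceiling division: rows per partition
    # Each slice is an independent Pandas DataFrame, processed separately.
    parts = [df.iloc[i:i + step] for i in range(0, len(df), step)]
    # Combine the per-partition partial results.
    return sum(p[column].sum() for p in parts)

df = pd.DataFrame({"x": range(10)})
print(partitioned_sum(df, "x"))  # → 45, same as df["x"].sum()
```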
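Since AFrame mirrors Pandas' syntax, the exploration steps of Figure 2 read as ordinary Pandas code; the same `drop`/`unique` calls are intended to work unchanged on an AFrame object backed by AsterixDB. The toy records below are invented for illustration.

```python
import pandas as pd

# Invented toy records standing in for the crime incident dataset.
df = pd.DataFrame({
    "incidentNum": [1, 2, 3],
    "pdDistrict": ["CENTRAL", "SOUTHERN", "CENTRAL"],
    "dayOfWeek": ["Monday", "Friday", "Monday"],
    "pdId": [10, 11, 12],
})

df = df.drop(columns=["pdId"])      # In[5]: drop unneeded columns
print(df.head(2))                   # inspect two records of the result
print(df["pdDistrict"].unique())    # In[6]: unique districts
print(df["dayOfWeek"].unique())     # In[7]: unique weekdays
```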
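The one-hot encoding of Figure 3 uses Pandas' `get_dummies`, which AFrame exposes with the same syntax on database-resident data. The toy values below are invented; the real `pdDistrict` column yields ten district feature columns.

```python
import pandas as pd

# Invented toy values standing in for the explored categorical columns.
df = pd.DataFrame({"pdDistrict": ["CENTRAL", "SOUTHERN"],
                   "dayOfWeek": ["Monday", "Friday"]})

district = pd.get_dummies(df["pdDistrict"])      # In[8]: 0/1 district features
weekday = pd.get_dummies(df["dayOfWeek"])        # In[10]/In[11]: weekday features
df = pd.concat([df, district, weekday], axis=1)  # In[12]: append the encodings
print(df.columns.tolist())
```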
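A client-side Pandas equivalent of the feature engineering in Figure 4 looks like the following. In the paper, these steps run inside AsterixDB through AFrame's `map`/`apply`, using the builtin functions `parse_date`/`get_month`, `parse_time`/`get_hour`, and `to_number`; the toy records here are invented.

```python
import pandas as pd

# Invented toy records with the date/time/resolution attributes used in the paper.
df = pd.DataFrame({"date": ["2018-05-03", "2018-11-20"],
                   "time": ["23:45", "08:10"],
                   "resolution": ["ARREST, BOOKED", "NONE"]})

# Pandas stand-ins for parse_date + get_month and parse_time + get_hour.
df["month"] = pd.to_datetime(df["date"]).dt.month
df["hour"] = pd.to_datetime(df["time"], format="%H:%M").dt.hour
# Binary-encode the target, analogous to the 'resolved' column.
df["resolved"] = (df["resolution"] != "NONE").astype(int)
print(df[["month", "hour", "resolved"]])
```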
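The model training and packaging steps of Figures 5–7 are standard Scikit-learn and pickle calls. This sketch uses a small synthetic stand-in for the one-hot encoded crime features (the real notebook obtains its DataFrame via AFrame's `toPandas`); the pickle round-trip mirrors how the trained model is packaged before being deposited into AsterixDB as a UDF.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 3))            # synthetic stand-in feature columns
y = (X[:, 0] > 0.5).astype(int)     # synthetic stand-in 'resolved' target

# In[22]: 80% training set, 20% testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # In[23]
proba = model.predict_proba(X_test)                 # In[25]: columns P(0), P(1)
print(proba[:2, 1])                                 # P(resolved), as in In[26]

blob = pickle.dumps(model)          # package the model (Section 3.3)
restored = pickle.loads(blob)       # what the deposited UDF would load
assert np.allclose(restored.predict_proba(X_test), proba)
```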
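The "persist transformation" idea of Figure 8 can be sketched in plain Python: the feature engineering steps are captured once as a function and re-applied to any dataset with the same schema. In AFrame the transformation is persisted server-side as a SQL++ function; this client-side analogue, with an invented `transform` helper and toy data, is only illustrative.

```python
import pandas as pd

# Invented helper capturing the engineered-feature steps as one function.
def transform(df):
    out = df.copy()
    out["hour"] = pd.to_datetime(out["time"], format="%H:%M").dt.hour
    out = pd.concat([out, pd.get_dummies(out["pdDistrict"])], axis=1)
    return out

sample = pd.DataFrame({"time": ["23:45"], "pdDistrict": ["CENTRAL"]})
other = pd.DataFrame({"time": ["08:10", "12:00"],
                      "pdDistrict": ["SOUTHERN", "CENTRAL"]})

# The same saved transformation applies to both datasets unchanged.
print(transform(sample).columns.tolist())
print(transform(other).columns.tolist())
```

As the paper notes, a dataset fed to the model must end up with the same features in the same order as the training data, which is exactly what persisting the transformation guarantees.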
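The inference step of Figure 9 expressed in Pandas terms: select only Central-district incidents (in AFrame, this filter runs in the database and can be served by an index on `pdDistrict`), then append the model's output as a new `is_resolvable` column. The `score` function is a hypothetical stand-in for the deposited `getResolution` UDF, not the real model.

```python
import pandas as pd

# Invented toy records standing in for the larger 'Crimes' dataset.
df = pd.DataFrame({"pdDistrict": ["CENTRAL", "SOUTHERN", "CENTRAL"],
                   "hour": [23, 8, 14]})

def score(hour):
    # Invented scoring, for illustration only; the real UDF calls the
    # trained model's predict_proba on a record.
    return round(hour / 24, 2)

central = df[df["pdDistrict"] == "CENTRAL"].copy()     # In[43]: filter
central["is_resolvable"] = central["hour"].map(score)  # In[44]: append scores
print(central)
```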