<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pharo DataFrame: Past, Present, and Future</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Larisa Safina</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Zaitsev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cyril Ferlicot-Delbecque</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Papa Ibrahima Sow</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecole Supérieur Polytechnique (ESP)</institution>
          ,
          <addr-line>UMMISCO, Dakar</addr-line>
          ,
          <country country="SN">Senegal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UMR SENS, Cirad</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL</institution>
          ,
          <addr-line>Lille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>DataFrame is a tabular data structure for data analysis. It is a two-dimensional table (similar to a spreadsheet) with an extensive API for querying and manipulating the data. Data frames are available in many programming languages (e.g., pandas in Python or data.frame in R), they are the go-to tools for data scientists and machine learning practitioners. Pharo DataFrame was first released in 2017. Since then, the library underwent many changes and improvements. In this paper, we present the Pharo DataFrame library, show examples of its usage, and compare its API to that of pandas. We overview the changes that have been made since DataFrame v1.0, discuss the limitations of the current implementation, and present the roadmap for future.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Pharo</kwd>
        <kwd>DataFrame</kwd>
        <kwd>data analysis</kwd>
        <kwd>data structure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>We live in a data-intensive world, where data in itself constitutes a form of wealth, its amounts are
growing exponentially, as well as the level of integration of data-powered tools in our everyday
lives. Acquiring data, extracting knowledge from it, and acting based on that knowledge became
one of the key activities in modern industries. That is why, modern programming languages
and environments are expected to provide tools for data analysis, data visualization, machine
learning, data mining, and business intelligence.</p>
      <p>
        Among such tools are data frames — tabular data structures that provide extensive API for
data analysis and manipulation. Available in various programming languages (e.g., pandas
in Python, data.frame in R), data frames are the go-to tools for data scientists and machine
learning practitioners. The first implementation of DataFrame was introduced into Pharo in
2017 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as part of the Google Summer of Code project.1 During the last six years, DataFrame
underwent many modifications and improvements: from adding new features and extending its
API to larger architectural changes. In this paper, we present the DataFrame library, overview
the major changes in its new version, and discuss the future developments that are envisioned
by the DataFrame community. We also contribute the early results of exploring the DataFrame
API. We study (1) the evolution of API by comparing it across two versions of the library and
(2) the completeness of API by comparing it to the non-exhaustive list of the most important
features of pandas.
      </p>
      <p>The rest of this paper is structured in the following way. In Section 2, we explain what
are data frames and why do we need them. In Section 3, we overview the major changes and
improvements that have been introduced into the library over the last years of development.
Section 4 contains the overview of data frames in other programming language and a
nonextensive comparison of DataFrame’s API to that of pandas. In Section 5, we discuss the
envisioned future developments, and finally, Section 6 concludes this paper. Additionally, in
Appendix A, we provide an example of how DataFrame can be used to analyse the gender wage
gaps dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. What are DataFrames and Why We Need Them?</title>
      <p>
        Data can be represented in many ways. Tree data structures (e.g., JSON, STON, XML) are
often used to model complex objects that can be composed of instances or collections of other
objects. Such data structures are simple and powerful because they can be used to store objects
of diferent complexity, from primitive values to large complex object structures with nested
elements. However, they are not optimal for analysing data. The more complex such a tree
structure becomes, the more dificult it is to write queries, add or remove features, merge
multiple datasets, aggregate and group their values. That is why we often prefer to represent
data with a more restrictive tabular data structures [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] that can be characterized by the
following features:
      </p>
      <sec id="sec-2-1">
        <title>1. Tables have rows and columns;</title>
      </sec>
      <sec id="sec-2-2">
        <title>2. Each row has same columns and in the same order;</title>
        <p>3. Columns are homogeneous, meaning that they store values of the same type (e.g., only
strings or only integers) while rows can be heterogeneous (one row can contain values of
diferent data types);</p>
      </sec>
      <sec id="sec-2-3">
        <title>4. Rows are stored in a particular order.</title>
        <p>Each row in such table can be seen as an object (observation) and each column as a parameter
(feature). For example, to store a data about employees, one may create a table where each
employee is represented by a row and each column corresponds to certain property: name
(string), age (integer), salary (float), etc. The data types of each column are usually of
primitive data types, therefore such tables are not optimal for storing complex objects. However,
the same limitation makes them well suited for data analysis.</p>
        <p>Although tables of data can be implemented with a combination of standard collections (e.g., a
list of dictionaries), the data manipulation of such a composed collection would be cumbersome.
This raises a need for a dedicated data structure that represents a table and provides a simple API
for accessing and manipulating its rows and columns, querying and analysing the data stored
in it. Such data sets are called data frames. The example in Appendix A provides a hands-on
demonstration of data frames and their applications.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Pharo DataFrame: past and present</title>
      <p>
        The first version of DataFrame [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for Pharo was released in September 2017 as a result of
Google Summer of Code. Although that version was fully functional for the basic use cases of
data frame, the project was rather immature and had several major problems. Most importantly:
• Lack of functionality. The API of Pharo DataFrame was far smaller compared to that
of other open-source data frame libraries. For example, it did not provide methods for
handligh missing values, data loading from CSV could not be configured, etc. Although all
those functionalities could still be achieved through the standard API of Pharo collections,
there was a clear need for introducing dedicated methods into the DataFrame. As can be
seen in Table 1, the number of methods in DataFrame has doubled over the recent years.
Although many important functionalities are still missing (see Section 4), the modern
version of DataFrame is more complete and according to the community of its developers,
today DataFrame covers all the most common use cases.
• Low performance. Both in terms of speed and memory capacity, DataFrame was by far
inferior to similar libraries in other languages. Some operations that could be performed
in less than a second using pandas, would take more than 20 min in the first version
of DataFrame. The datasets with several million rows could freeze the Pharo image.
Although, performance is still an issue in modern version of DataFrame, most
computationally expensive operations have been optimized. There is currently an ongoing efort
to make Pharo DataFrame as eficient as its analogues in other languages (see Section 5).
• Incomplete coherence with Pharo collections. Although the DataFrame community was
always striving for the compatibility with standard API of Pharo collections, v1.0 contained
multiple methods that were inconsistent with other collections. The most striking example
were the SQL-like querying methods such as select: columnNames where: aBlock which
were incoherent with Smalltalk-style select: aBlock methods. Those and many other
cases of incoherent API were fixed in the modern version of DataFrame.
• Dependency on Roassal2. The first version of DataFrame provided methods for data
visualizations using the Roassal2 library [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Although, visualizations are very important
for data analysis, such a large dependency was unnecessary and hard to manage. In later
versions, the community has decided that DataFrame should remain a simple collection
for data analysis, and data visualizations must be delegated to a diferent library that
would be expected to support data frames. Today, data frames can be visualized using
the Charting packages of Roassal3.2 In close collaboration with DataFrame community,
the developers of Roassal are currenly working to improve those packages and build a
powerful data visualization library.
      </p>
      <sec id="sec-3-1">
        <title>2https://github.com/ObjectProfile/Roassal3</title>
        <p>
          • Lack of detailed documentation. The first version of DataFrame was only documented
through blog posts, a README file on GitHub, and a short paper by Zaitsev et al., [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In
recent years, DataFrame community has produced several more forms of documentation,
including the DataFrame Booklet [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and examples of applying data frames for machine
learning and data mining on pharo-ai Wiki.3
        </p>
        <p>In Table 1, we compare two versions of DataFrame: the first stable varsion v1.0 and the most
recent pre-release version pre-v3. Both versions were loaded into Pharo 9 on May 29, 2023. As
can be seen in the table, the number of methods in both DataFrame and DataSeries classes has
doubled. Although the test coverage4 of DataFrame has always been high, in recent version, it
was increased to 95.43% and the total number of test methods was increased almost six times.</p>
        <p>In addition to the changes listed above, the DataFrame community has also introduced several
new tools that can be used todether with data frames. Those include the new Spec-based data
inspector tool,5 data imputers6 that provide diferent strategies for filling the empty (nil) values
in columns by filling them with zeroes, average values, etc. and remembering the statistical
properties to ensure reproducibility, and a tool for loading default datasets.7</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. DataFrame Outside of Pharo</title>
      <p>
        Being an essential structure in data analysis, data frames have been implemented in many
programming languages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the most popular of which are pandas (Python) and data.frame (R).
Pandas DataFrame is widely used in data analysis and provides a flexible, high-performance
way to manipulate, analyse, and visualize data. In R, the data.frame is a fundamental data
structure used for storing and manipulating structured data as well. In both libraries data frames
ofer wide range of functionalities, including data alignment, indexing, merging, filtering, and
statistical operations. Their diferences are not critical and mostly related to the question of
syntax, and spesific strategies of indexing, handling missing vlues etc.
      </p>
      <p>
        In this section we compare Pharo DataFrame with pandas. We explore commonly used data
analysis features based on the litrature in the field [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ] and review operations for data import
      </p>
      <sec id="sec-4-1">
        <title>3https://github.com/pharo-ai/wiki</title>
        <p>4Test coverage was calculated using the DrTests tool in Pharo 9 as the percentage of methods from the core package
(DataFrame-Core package in v1.0 and DataFrame package in pre-v3) that are covered by tests from
DataFrameTests package
5https://github.com/pharo-ai/data-inspector
6https://github.com/pharo-ai/data-imputers
7https://github.com/pharo-ai/datasets
and export, data manipulation (aggregation, grouping, joining, merging, sorting and ranking),
visualization, performance optimization and operations for providing more specific ways of
analysis (time series, statistical analysis, handling categorical data, etc). We do not go deep into
commands details, and do not cover command specific parameters. In Table 2, we show which
features from a non-exhaustive list of features from the corresponding categories are present in
DataFrame.</p>
        <p>As can be seen, the Pharo implementation supports all the listed methods with limited support
for Regression analysis (implemented in the Pharo-ai), and some input and output formats
(currently XML and SQL are missing). Functionality which is fully missing in DataFrame for
the moment, concerns time series analysis and handling categorical data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Pharo DataFrame: future</title>
      <p>
        During the several years of using and developing DataFrame, we have collected the list of
improvements, that we plan to implement in the future, which has been extended now by the
information received from comparing DataFrame with pandas. The aspects we would like to
focus on in the short term regard the library documentation, functional improvements, and
issues with a library’s general performance. Later we would like to work on adding suport for
managing big data in DataFrame.
5.1. Functionality Enhancements
Information retrieved from the comparison with pandas shows that as a first step it will be
necessary to add support for time series analysis which is a statistical method for analyzing
and forecasting data points collected over time and which is useful for understanding patterns,
trends, and dependencies in sequential data (e.g. stocks, weather patterns). The next step
will be to add supoprt for handling categorical data for managing and analyzing variables,
that represent categories or groups. Although those operations could be performed using the
standard Pharo API (e.g.,DateAndTime class), this would require writing multiple lines of code
for a simple operation. DataFrame could therefore benefit from a dedicated API for time series
analysis and handling categorical variables. We also plan to improve support for handling
missing values by introducing NaN (“not a number” values), and adding support for them to
the numerical algorithms of DataFrame.
5.2. Performance
DataFrame performance remains to be a weak spot of the library from both volume and velocity
points of view. DataFrame users and developers have observed certain delays in managing data
comparing to the other data frames libraries. We plan to benchmark most used operations in
DataFrame and compare them to the corresponding implementation in pandas to find the most
costly ones and reimplement them to support better velocity. Some operations of DataFrame
are also limited by the amount of data that they can handle. Additional study would be required
to test DataFrame on the datasets of diferent sizes and identify the maximum number of rows
and columns that can be processed by this library. It must be noted that the performance
issues are not due to the programming environment (Pharo) but are caused by the poorly
optimized implementation of certain methods in DataFrame. As indicated by Zaitsev et al., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
numerical algorithms in Pharo can be as fast as those in Python (numpy, pandas, scikit-learn) if
implemented using the same low-level optimization techniques. At the moment, Pharo data
frames are not capable of handling infinite streams of data (operations should be limited to
a given window, e.g. 1000 rows). We would like to improve data frames in this regard that
would allow us to work with machine learning projects involving streams of real-time data (e.g.
stock predictions, financial transactions etc) We can take the inspiration from the windowing
aggregation operations from the streamz.dataframe module.
5.3. Big Data Support
Big Data [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] is the concept that has started to gain popularity since the late 90s, which
refers to a large, complex and often heterogeneous sets of data that are hard to be efectively
manipulated by the traditional data processing techniques due to their inhereted velocity, volume
and variety. Big Data is widely used in research and industry fields (including the critical sectors
of banking, security, healthcare etc.) and serves as an input for data science and machine
learning algorithms. Advantages gained by adoption of Big Data (better information extraction
and discovering of patterns and correlations that can lead to improved decision-making and a
deeper understanding of underlying (business) processes) create a serious demand for building
more tools capable of efectively dealing with it.
      </p>
      <p>
        Pharo approaches this demand with the Spa framework [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] used in Pharo-based parallel
and distributed applications. Spa supports a "spark-alike" MapReduce programming model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
with diferent debugging features enabled, allowing to deploy and coordinate various instances
and threads of the same Pharo image using diferent Pharo VMs. At the moment, DataFrame in
Pharo is not adapted to be used with big data. DataFrame can not process more data than a Pharo
image can allocate and can not be scaled. However, being implemented as a façade providing
user with a frontend (API containing all public methods) and hiding a backend (internal data
frame representation: a collection for storing data and a set of 29 main methods for manipulating
it that all other methods of API are based on), DataFrame provides the possibility to easily
substitute its backend implementation. This can be used, for example, to provide optimised
data storage, database connection, etc. We would like to change the internal implementation
of DataFrame to make it able to treat big data by making it agnostic to the underlying data
structure and adding support for various stand-alone database connectors or for distributed
framework as Hadoop,8 Spark,9 and Spa: a Pharo-native framework. Due to the DataFrame
architecture, these changes will be seamless and invisible for library users.
5.4. Better synchronisation with PolyMath and pharo-ai
PolyMath10 is a Pharo library for scientific computing, similar to existing libraries like NumPy, 11
SciPy12 for Python or SciRuby13 for Ruby. It provides basic support for complex and quaternions
extensions, random number generators, fuzzy algorithms, automatic diferentiation,
KDEtrees, numerical methods, Ordinary Diferential Equation (ODE) solvers, etc. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Although
DataFrame was never part of the PolyMath library itself, it was originally implemented by the
community of PolyMath developers and under the umbrella of the PolyMath Organization.14
      </p>
      <p>Pharo-ai15 is another Pharo library that implements algorithms of artificial intelligence. Most
of those are algorithms of shallow machine learning (the one that does not rely on deep
neural networks), but it also provides tools for data mining, natural language processing, graph
algorithms, etc. All algorithms of pharo-ai are meant to be compatible with DataFrame, which
comes in handy because training machine learning models is often preceded by data
manipulations (cleaning, preprocessing) — the task for which DataFrame is well suited. That being
said, users of pharo-ai are not required to use DataFrame as every algorithm can also accept
any OrderedCollection of correct shape (i.e. a collection of collections representing a tabular
dataset of shape  × ). This is possible thanks to the fact that DataFrame is coherent with
the standard API of Pharo collections. The algorithms of pharo-ai do not reference DataFrame
directly but only use the methods of Pharo collections. DataFrame has the responsibility to
implement those methods.</p>
      <p>Despite having diferent purposes, the three libraries, PolyMath, pharo-ai, and DataFrame,
8https://hadoop.apache.org
9https://spark.apache.org
10https://github.com/PolyMathOrg/PolyMath
11https://numpy.org/
12https://scipy.org/
13http://sciruby.com/
14https://github.com/PolyMathOrg
15https://github.com/orgs/pharo-ai
have a lot in common. There is an intersection between the communities of their core developers.
Those libraries are meant to be compatible with each other (although sometimes this is not the
case) and are often used together in the same application. However, for the historical reasons, the
open-source development process of those three libraries is not always synchronized. There are
cases when machine learning algorithms are placed into PolyMath, mathematical algorithms are
implemented in DataFrame, and data processing algorithms are stored in pharo-ai. Although
in some cases it makes sense (e.g., data preprocessing and data partitioning algorithms are in
pharo-ai because they are designed specifically for machine learning purposes), overall the
three communities would benefit from better communication, clearly defined scopes, and more
synchronized development process.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we have discussed the DataFrame library in Pharo programming language. We
have presented its current state, and its evolution and provide a high-level comparison with
the pandas framework. This gave us valuable insight into the future directions for the library’s
improvement. As a future work, we foresee the rigorous implementation of the features missing
in Pharo DataFrame that are presented in other libraries and are in high demand by library
users. We would also like to tackle the performance issues the library possesses at the moment
and evaluate the results by benchmarking DataFrame and comparing results with other data
frame libraries. We are especially interested to apply future improvemnts for the scenario of
handling big data, and we plan rethink the back-end implementation of DataFrame so it could
profit from using the external data storages: databases and distributed clusters, and to support
the Spa: Pharo-native framework for distributed handling of big data.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Example: Using DataFrame to Analyse Gender Wage Gaps</title>
      <p>In this example, we use DataFrame to analyse a dataset of wage gaps between male and female
employees in diferent countries and find the countries with largest and smallest wage gaps in a
given year. We use a public Gender Wage Gap dataset provided by OECD (Organisation for
Economic Co-operation and Development).16 The dataset is provided as a CSV file which can
be loaded as into Pharo as DataFrame:
wageGapFile := 'data/wagegap.csv' asFileReference.
data := DataFrame readFromCsv: wageGapFile.</p>
      <p>In Table 3, we show five randomly selected columns of that DataFrame. The original dataset
combines records from employees and self-employed people. To keep only the data about
employees, we use select: method which is supported by all Pharo collections. In case of
DataFrame, at each iteration, the select block will be evaluated with one row.
data := data select: [ :row | (row at: 'SUBJECT') = 'EMPLOYEE' ].</p>
      <p>To answer the questions listed above, we only need three columns of the dataset: country,
year, and wage gap. The columns: method creates a subset of DataFrame with only the given
columns. In the next line, we rename those columns using the columnNames: setter.
16https://data.oecd.org/earnwage/gender-wage-gap.htm
data := data columns: #('LOCATION' 'TIME' 'Value').
data columnNames: #('Country' 'Year' 'Gap').</p>
      <p>At the final step of data preprocessing, we replace the three-letter country codes with full
country names. To do that, we load another dataset which contains the list of ISO-3166 country
and dependent territories with UN regional codes.17
countryCodesFile := 'data/countryCodes.csv' asFileReference.
countryCodes := DataFrame readFromCsv: countryCodesFile.</p>
      <p>To get the mapping in the form of a dictionary, we first set the alpha-3 column (country
codes) as row names and then access column name of the new DataFrame. The column will be
an object of class DataSeries which is a kind of OrderedDictionary containing country codes
as its keys and country names as its values.
countryCodes rowNames: (countryCodes column: 'alpha-3').
countryCodeMapping := countryCodes column: 'name'.</p>
      <p>The original Wage Gaps dataset contains two values which can not be recognized as valid
country codes: OECD and EU27. We remove them and then apply the mapping to the Country
column.
data := data reject: [ :row |</p>
      <p>#('OECD' 'EU27') includes: (row at: 'Country') ].
data toColumn: 'Country' applyElementwise: [ :each |
countryCodeMapping at: each ].</p>
      <p>In Table 4, we show the same rows were randomly selected for Table 3, after applying the
preprocessing described above. The data is now clean and ready to be analysed. We first select
the year for which we will compare the gender wage gaps across diferent countries. The most
recent year in the dataset (2022) only contains one entry. We therefore search for the year
after 2013 years which has the most entries (we are not interested in data from more than 10
years ago). To do that, we first filter data by year. In the next line, we group the data by years
and aggregate it by counting the number of country names per year. Then, using the argmax
method, we get the year for which the count value is the largest. That year is 2018.
selectedYear := ((data select: [ :row | (row at: 'Year') &gt; 2013 ])
group: 'Country' by: 'Year' aggregateUsing: [ :group | group size ])
argmax.
17https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes</p>
      <p>We take a subset of data for only the selected year and sort it in the descending order of wage
gaps. Using methods head and tail, we get the top-5 and the bottom-5 countries in the dataset.
The result can be seen in Tables 5 and 6.
oneYearData := (data select: [ :row | (row at: 'Year') = selectedYear ])
sortDescendingBy: 'Gap'.
oneYearData head. "top-5 rows"
oneYearData tail. "bottom-5 rows"</p>
      <p>Gap
5.06
4.86
3.49
3.40
3.03</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zaytsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papoulias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stinckwich</surname>
          </string-name>
          ,
          <article-title>Towards exploratory data analysis for pharo</article-title>
          ,
          <source>in: Proceedings of the 12th edition of the International Workshop on Smalltalk Technologies</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>McKinney</surname>
          </string-name>
          ,
          <article-title>Python for data analysis: Data wrangling with Pandas, NumPy, and</article-title>
          <string-name>
            <surname>Jupyter</surname>
          </string-name>
          ,
          <string-name>
            <surname>" O'Reilly Media</surname>
          </string-name>
          ,
          <source>Inc."</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Petersohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Macke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xin</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <article-title>Towards scalable dataframe systems</article-title>
          , arXiv preprint arXiv:
          <year>2001</year>
          .
          <volume>00888</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bergel</surname>
          </string-name>
          , Agile Visualization, LULU Press,
          <year>2016</year>
          . URL: http://agilevisualization.com/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>C. F.-D. Oleksandr</surname>
            <given-names>Zaitsev</given-names>
          </string-name>
          ,
          <article-title>Data analysis made simple with pharo dataframe</article-title>
          ,
          <year>2023</year>
          . URL: https://github.com/SquareBracketAssociates/Booklet-DataFrame.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Awesome</given-names>
            <surname>Dataframes</surname>
          </string-name>
          , https://github.com/jcmkk3/awesome-dataframes,
          <year>2023</year>
          . Accessed:
          <fpage>2023</fpage>
          -05-29.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Géron</surname>
          </string-name>
          ,
          <article-title>Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools, and techniques to build intelligent systems,</article-title>
          <string-name>
            <surname>O'Reilly Media</surname>
          </string-name>
          , Sebastopol, CA,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zaitsev</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. M.</surname>
          </string-name>
          <article-title>no Sebastian, S. Ducasse, How fast is ai in pharo? benchmarking linear regression</article-title>
          , in: IWST 2022-International Workshop on Smalltalk Technologies,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zakir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seymour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <article-title>Big data analytics</article-title>
          .,
          <source>Issues in Information Systems</source>
          <volume>16</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mayer-Schonberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cukier</surname>
          </string-name>
          ,
          <string-name>
            <surname>Big Data</surname>
            :
            <given-names>A Revolution</given-names>
          </string-name>
          <string-name>
            <surname>That Will Transform How We Live</surname>
          </string-name>
          , Work, and
          <string-name>
            <surname>Think</surname>
          </string-name>
          , Houghton Miflin Harcourt, Boston,
          <year>2013</year>
          . URL: http://www.amazon. com/books/dp/0544002695.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Marra</surname>
          </string-name>
          ,
          <article-title>A live debugging approach for big data processing applications</article-title>
          ,
          <source>Ph.D. thesis, Vrije Universiteit Brussel</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <source>Mapreduce: Simplified data processing on large clusters, Commun. ACM</source>
          <volume>51</volume>
          (
          <year>2008</year>
          )
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          . URL: https://doi.org/10.1145/1327452.1327492. doi:
          <volume>10</volume>
          .1145/ 1327452.1327492.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Besset</surname>
          </string-name>
          ,
          <article-title>Object-Oriented Implementation of Numerical Methods An Introduction with Pharo</article-title>
          , Square Bracket Associates,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>