=Paper=
{{Paper
|id=Vol-3627/paper12
|storemode=property
|title=Pharo DataFrame: Past, Present, and Future
|pdfUrl=https://ceur-ws.org/Vol-3627/paper12.pdf
|volume=Vol-3627
|authors=Larisa Safina,Oleksandr Zaitsev,Cyril Ferlicot-Delbecque,Papa Ibrahima Sow
|dblpUrl=https://dblp.org/rec/conf/iwst/SafinaZFS23
}}
==Pharo DataFrame: Past, Present, and Future==
Larisa Safina¹,*,†, Oleksandr Zaitsev²,*,†, Cyril Ferlicot-Delbecque¹ and Papa Ibrahima Sow³

¹ Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, France
² UMR SENS, Cirad, Montpellier, France
³ Ecole Supérieure Polytechnique (ESP), UMMISCO, Dakar, Senegal
Abstract
DataFrame is a tabular data structure for data analysis. It is a two-dimensional table (similar to a spreadsheet) with an extensive API for querying and manipulating the data. Data frames are available in many programming languages (e.g., pandas in Python or data.frame in R), and they are the go-to tools for data scientists and machine learning practitioners. Pharo DataFrame was first released in 2017. Since then, the library has undergone many changes and improvements. In this paper, we present the Pharo DataFrame library, show examples of its usage, and compare its API to that of pandas. We overview the changes that have been made since DataFrame v1.0, discuss the limitations of the current implementation, and present the roadmap for the future.
Keywords
Pharo, DataFrame, data analysis, data structure
1. Introduction
We live in a data-intensive world, where data itself constitutes a form of wealth: its amounts are growing exponentially, as is the level of integration of data-powered tools into our everyday lives. Acquiring data, extracting knowledge from it, and acting based on that knowledge have become key activities in modern industries. That is why modern programming languages and environments are expected to provide tools for data analysis, data visualization, machine learning, data mining, and business intelligence.
Among such tools are data frames — tabular data structures that provide an extensive API for data analysis and manipulation. Available in various programming languages (e.g., pandas in Python, data.frame in R), data frames are the go-to tools for data scientists and machine learning practitioners. The first implementation of DataFrame was introduced into Pharo in 2017 [1] as part of a Google Summer of Code project.¹ During the last six years, DataFrame underwent many modifications and improvements: from adding new features and extending its API to larger architectural changes. In this paper, we present the DataFrame library, overview
IWST 2023: International Workshop on Smalltalk Technologies. Lyon, France; August 29th-31st, 2023
* Corresponding author.
† Both authors contributed equally.
larisa.safina@inria.fr (L. Safina); oleksandr.zaitsev@cirad.fr (O. Zaitsev); cyril@ferlicot.fr (C. Ferlicot-Delbecque); papaibrahimasow@esp.sn (P. I. Sow)
ORCID: 0000-0002-4490-7451 (L. Safina); 0000-0003-0267-2874 (O. Zaitsev)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
¹ https://summerofcode.withgoogle.com/archive/2017/organizations/5691803940421632
the major changes in its new version, and discuss the future developments that are envisioned
by the DataFrame community. We also contribute the early results of exploring the DataFrame
API. We study (1) the evolution of API by comparing it across two versions of the library and
(2) the completeness of API by comparing it to the non-exhaustive list of the most important
features of pandas.
The rest of this paper is structured in the following way. In Section 2, we explain what data frames are and why we need them. In Section 3, we overview the major changes and improvements that have been introduced into the library over the last years of development. Section 4 contains an overview of data frames in other programming languages and a non-exhaustive comparison of DataFrame's API to that of pandas. In Section 5, we discuss the envisioned future developments, and finally, Section 6 concludes this paper. Additionally, in Appendix A, we provide an example of how DataFrame can be used to analyse a gender wage gap dataset.
2. What are DataFrames and Why We Need Them?
Data can be represented in many ways. Tree data structures (e.g., JSON, STON, XML) are
often used to model complex objects that can be composed of instances or collections of other
objects. Such data structures are simple and powerful because they can be used to store objects
of different complexity, from primitive values to large complex object structures with nested
elements. However, they are not optimal for analysing data. The more complex such a tree
structure becomes, the more difficult it is to write queries, add or remove features, merge
multiple datasets, aggregate and group their values. That is why we often prefer to represent
data with more restrictive tabular data structures [2, 3] that can be characterized by the
following features:
1. Tables have rows and columns;
2. Each row has the same columns, in the same order;
3. Columns are homogeneous, meaning that they store values of the same type (e.g., only
strings or only integers) while rows can be heterogeneous (one row can contain values of
different data types);
4. Rows are stored in a particular order.
Each row in such a table can be seen as an object (observation) and each column as a parameter (feature). For example, to store data about employees, one may create a table where each employee is represented by a row and each column corresponds to a certain property: name (string), age (integer), salary (float), etc. Column values are usually of primitive data types; therefore, such tables are not optimal for storing complex objects. However, the same limitation makes them well suited for data analysis.
Although tables of data can be implemented with a combination of standard collections (e.g., a list of dictionaries), manipulating such a composed collection would be cumbersome. This raises the need for a dedicated data structure that represents a table and provides a simple API for accessing and manipulating its rows and columns, and for querying and analysing the data stored in it. Such data structures are called data frames. The example in Appendix A provides a hands-on demonstration of data frames and their applications.
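As an illustration, the employees table described above can be built and queried directly. The following is a sketch only: the withRows: and columnNames: messages follow the DataFrame API used later in this paper, and the data is invented.

```smalltalk
"Build a small employees table: each row is an observation,
each column is a homogeneous feature (String, Integer, Float)."
employees := DataFrame withRows: #(
	#('Alice' 34 48000.0)
	#('Bob' 28 41500.0)
	#('Carol' 41 52300.0) ).
employees columnNames: #('name' 'age' 'salary').

"Rows can then be queried with the familiar collection protocol."
seniors := employees select: [ :row | (row at: 'age') > 30 ].
```

Every further manipulation (grouping, joining, sorting) goes through the same object, instead of being scattered over a hand-built composition of collections.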
3. Pharo DataFrame: past and present
The first version of DataFrame [1] for Pharo was released in September 2017 as a result of a Google Summer of Code project. Although that version was fully functional for the basic use cases of a data frame, the project was rather immature and had several major problems. Most importantly:
• Lack of functionality. The API of Pharo DataFrame was far smaller than that of other open-source data frame libraries. For example, it did not provide methods for handling missing values, data loading from CSV could not be configured, etc. Although all those functionalities could still be achieved through the standard API of Pharo collections, there was a clear need to introduce dedicated methods into DataFrame. As can be seen in Table 1, the number of methods in DataFrame has doubled over the recent years. Although many important functionalities are still missing (see Section 4), the modern version of DataFrame is more complete and, according to the community of its developers, today covers all the most common use cases.
• Low performance. Both in terms of speed and memory consumption, DataFrame was by far inferior to similar libraries in other languages. Some operations that could be performed in less than a second using pandas would take more than 20 minutes in the first version of DataFrame. Datasets with several million rows could freeze the Pharo image. Although performance is still an issue in the modern version of DataFrame, the most computationally expensive operations have been optimized. There is currently an ongoing effort to make Pharo DataFrame as efficient as its analogues in other languages (see Section 5).
• Incomplete coherence with Pharo collections. Although the DataFrame community has always striven for compatibility with the standard API of Pharo collections, v1.0 contained multiple methods that were inconsistent with other collections. The most striking example was the SQL-like querying method select: columnNames where: aBlock, which was incoherent with the Smalltalk-style select: aBlock method. This and many other cases of incoherent API were fixed in the modern version of DataFrame.
• Dependency on Roassal2. The first version of DataFrame provided methods for data visualization using the Roassal2 library [4]. Although visualizations are very important for data analysis, such a large dependency was unnecessary and hard to manage. In later versions, the community decided that DataFrame should remain a simple collection for data analysis, and data visualizations should be delegated to a different library that would be expected to support data frames. Today, data frames can be visualized using the Charting packages of Roassal3.² In close collaboration with the DataFrame community, the developers of Roassal are currently working to improve those packages and build a powerful data visualization library.
² https://github.com/ObjectProfile/Roassal3
• Lack of detailed documentation. The first version of DataFrame was only documented through blog posts, a README file on GitHub, and a short paper by Zaitsev et al. [1]. In recent years, the DataFrame community has produced several more forms of documentation, including the DataFrame Booklet [5] and examples of applying data frames to machine learning and data mining on the pharo-ai Wiki.³
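To make the coherence point above concrete, here is a sketch of how filtering looks in the modern, collection-style API (the data frame and column name are illustrative):

```smalltalk
"Smalltalk-style filtering, coherent with Collection>>select:"
adults := people select: [ :row | (row at: 'age') >= 18 ].

"The v1.0 form that was removed mixed SQL keywords into the selector:
	people select: #('name' 'age') where: [ :row | (row at: 'age') >= 18 ]"
```

Because the modern form has the same shape as select: on any other Pharo collection, code written against OrderedCollection can be reused on a DataFrame unchanged.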
In Table 1, we compare two versions of DataFrame: the first stable version v1.0 and the most recent pre-release version pre-v3. Both versions were loaded into Pharo 9 on May 29, 2023. As can be seen in the table, the number of methods in both the DataFrame and DataSeries classes has doubled. Although the test coverage⁴ of DataFrame has always been high, in the recent version it was increased to 95.43%, and the total number of test methods grew almost six times.
Table 1
Comparing two versions of Pharo DataFrame: v1.0 and pre-v3

                                 v1.0 (2017)   pre-v3 (2023)
  Methods in DataFrame class          73            186
  Methods in DataSeries class         63            108
  Test methods                       103            595
  Test coverage                   72.02%         95.43%
In addition to the changes listed above, the DataFrame community has also introduced several new tools that can be used together with data frames. Those include the new Spec-based data inspector tool,⁵ data imputers⁶ that provide different strategies for filling the empty (nil) values in columns (e.g., with zeroes or average values) while remembering the statistical properties to ensure reproducibility, and a tool for loading default datasets.⁷
4. DataFrame Outside of Pharo
Being an essential structure in data analysis, data frames have been implemented in many
programming languages [6], the most popular of which are pandas (Python) and data.frame (R).
Pandas DataFrame is widely used in data analysis and provides a flexible, high-performance way to manipulate, analyse, and visualize data. In R, data.frame is likewise a fundamental data structure for storing and manipulating structured data. In both libraries, data frames offer a wide range of functionalities, including data alignment, indexing, merging, filtering, and statistical operations. Their differences are not critical and mostly concern syntax and specific strategies of indexing, handling missing values, etc.
In this section, we compare Pharo DataFrame with pandas. We explore commonly used data analysis features based on the literature in the field [2, 7] and review operations for data import
³ https://github.com/pharo-ai/wiki
⁴ Test coverage was calculated using the DrTests tool in Pharo 9 as the percentage of methods from the core package (DataFrame-Core package in v1.0 and DataFrame package in pre-v3) that are covered by tests from the DataFrame-Tests package.
⁵ https://github.com/pharo-ai/data-inspector
⁶ https://github.com/pharo-ai/data-imputers
⁷ https://github.com/pharo-ai/datasets
and export, data manipulation (aggregation, grouping, joining, merging, sorting, and ranking), visualization, performance optimization, and operations providing more specific kinds of analysis (time series, statistical analysis, handling categorical data, etc.). We do not go into command details and do not cover command-specific parameters. In Table 2, we show which features from a non-exhaustive list of the corresponding categories are present in DataFrame.
Table 2
Selected features of Python's pandas and their presence in Pharo DataFrame.

Data Import / Export:
  CSV                                 yes
  Excel                               yes
  SQL                                 no
  XML                                 no

Data Manipulation:
  Select data                         yes
  Filter data                         yes
  Add/remove column/row               yes
  Transpose                           yes
  Handle missing values               yes
  Grouping and Aggregation            yes
  Join (inner, outer, left, right)    yes
  Merge                               yes
  Sort                                yes
  Rank                                no

Time Series Analysis:
  Handle date/time                    no
  Resample                            no
  Frequency conversion                no
  Time shifting                       no
  Rolling window                      no

Statistical Analysis:
  Descriptive statistics              yes
  Correlation                         yes
  Covariance                          yes
  Regression                          no

Handling Categorical Data:
  Encode categorical variables        no
  Transform categorical variables     no
  Create dummy variables              no
  Categorical data analysis           no
As can be seen, the Pharo implementation supports most of the listed data import and manipulation features, with limited support for regression analysis (implemented in pharo-ai) and for some input and output formats (currently XML and SQL are missing). The functionality that is fully missing from DataFrame at the moment concerns time series analysis and handling categorical data.
5. Pharo DataFrame: future
During the several years of using and developing DataFrame, we have collected a list of improvements that we plan to implement in the future, now extended with the information obtained from comparing DataFrame with pandas. The aspects we would like to focus on in the short term concern the library documentation, functional improvements, and the library's general performance. Later, we would like to work on adding support for managing big data in DataFrame.
5.1. Functionality Enhancements
The comparison with pandas shows that, as a first step, it will be necessary to add support for time series analysis: a statistical method for analyzing and forecasting data points collected over time, useful for understanding patterns, trends, and dependencies in sequential data (e.g., stocks, weather patterns). The next step will be to add support for handling categorical data, i.e., managing and analyzing variables that represent categories or groups. Although those operations could be performed using the standard Pharo API (e.g., the DateAndTime class), this would require writing multiple lines of code for a simple operation. DataFrame could therefore benefit from a dedicated API for time series analysis and handling categorical variables. We also plan to improve support for handling missing values by introducing NaN ("not a number") values and adding support for them to the numerical algorithms of DataFrame.
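For instance, until a dedicated time-series API exists, timestamps have to be handled through the standard DateAndTime class. A minimal sketch (the data frame and column name are illustrative):

```smalltalk
"Parse a column of ISO-8601 timestamp strings into DateAndTime
objects by hand, using the elementwise transformation API."
data toColumn: 'Date' applyElementwise: [ :each |
	DateAndTime fromString: each ].

"Anything beyond parsing (resampling, time shifting, rolling windows)
currently requires hand-written loops over the rows."
```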
5.2. Performance
DataFrame performance remains a weak spot of the library from both the volume and velocity points of view. DataFrame users and developers have observed certain delays in managing data compared to other data frame libraries. We plan to benchmark the most used operations in DataFrame and compare them to the corresponding implementations in pandas, in order to find the most costly ones and reimplement them for better velocity. Some operations of DataFrame are also limited by the amount of data that they can handle. An additional study would be required to test DataFrame on datasets of different sizes and identify the maximum number of rows and columns that can be processed by the library. It must be noted that the performance issues are not due to the programming environment (Pharo) but are caused by the poorly optimized implementation of certain methods in DataFrame. As indicated by Zaitsev et al. [8], numerical algorithms in Pharo can be as fast as those in Python (numpy, pandas, scikit-learn) if implemented using the same low-level optimization techniques. At the moment, Pharo data frames are not capable of handling infinite streams of data (operations should be limited to a given window, e.g., 1000 rows). We would like to improve data frames in this regard, which would allow us to work on machine learning projects involving streams of real-time data (e.g., stock predictions, financial transactions, etc.). We can take inspiration from the windowing aggregation operations of the streamz.dataframe module.
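Such benchmarks can be started with the standard Pharo block protocol. A minimal sketch (the data frame and column are illustrative):

```smalltalk
"Measure the wall-clock time of a single DataFrame operation;
timeToRun answers a Duration."
duration := [ data select: [ :row | (row at: 'Year') > 2013 ] ] timeToRun.
Transcript show: duration printString; cr.
```

Repeating such measurements over growing dataset sizes would also help locate the point at which an operation stops being practical in the image.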
5.3. Big Data Support
Big Data [9, 10] is a concept that started to gain popularity in the late 90s. It refers to large, complex, and often heterogeneous sets of data that are hard to manipulate effectively with traditional data processing techniques due to their inherent velocity, volume, and variety. Big Data is widely used in research and industry (including the critical sectors of banking, security, healthcare, etc.) and serves as an input for data science and machine learning algorithms. The advantages gained by the adoption of Big Data (better information extraction and discovery of patterns and correlations that can lead to improved decision-making and a deeper understanding of underlying business processes) create a serious demand for more tools capable of dealing with it effectively.
Pharo approaches this demand with the Spa framework [11], used in Pharo-based parallel and distributed applications. Spa supports a Spark-like MapReduce programming model [12] with different debugging features enabled, allowing developers to deploy and coordinate various instances and threads of the same Pharo image using different Pharo VMs. At the moment, DataFrame in Pharo is not adapted for big data: it cannot process more data than a Pharo image can allocate, and it cannot be scaled. However, DataFrame is implemented as a façade that provides users with a frontend (an API containing all public methods) and hides a backend (the internal data frame representation: a collection for storing data and a set of 29 core methods for manipulating it, on which all other API methods are based). This makes it possible to easily substitute the backend implementation, for example, to provide optimised data storage, a database connection, etc. We would like to change the internal implementation of DataFrame to make it able to handle big data, by making it agnostic to the underlying data structure and adding support for various stand-alone database connectors or for distributed frameworks such as Hadoop,⁸ Spark,⁹ and Spa, a Pharo-native framework. Due to the DataFrame architecture, these changes will be seamless and invisible to library users.
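A hypothetical sketch of what such a backend substitution could look like; SqlDataFrameBackend, the backend: setter, and onConnection:table: are invented names, not part of the current DataFrame API:

```smalltalk
"Hypothetical: plug a database-backed representation behind
the unchanged public API (all class and selector names invented)."
df := DataFrame new.
df backend: (SqlDataFrameBackend onConnection: conn table: 'wages').

"Public methods such as select: or group:by:aggregateUsing: would then
delegate to the backend, which could translate them to SQL queries
instead of iterating over an in-image collection."
```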
5.4. Better synchronisation with PolyMath and pharo-ai
PolyMath¹⁰ is a Pharo library for scientific computing, similar to existing libraries like NumPy¹¹ and SciPy¹² for Python or SciRuby¹³ for Ruby. It provides basic support for complex numbers and quaternion extensions, random number generators, fuzzy algorithms, automatic differentiation, KD-trees, numerical methods, Ordinary Differential Equation (ODE) solvers, etc. [13]. Although DataFrame was never part of the PolyMath library itself, it was originally implemented by the community of PolyMath developers and under the umbrella of the PolyMath organization.¹⁴
Pharo-ai¹⁵ is another Pharo library that implements artificial intelligence algorithms. Most of those are shallow machine learning algorithms (ones that do not rely on deep neural networks), but it also provides tools for data mining, natural language processing, graph algorithms, etc. All algorithms of pharo-ai are meant to be compatible with DataFrame, which comes in handy because training machine learning models is often preceded by data manipulations (cleaning, preprocessing) — the task for which DataFrame is well suited. That being said, users of pharo-ai are not required to use DataFrame, as every algorithm can also accept any OrderedCollection of the correct shape (i.e., a collection of collections representing a tabular dataset of shape 𝑚 × 𝑛). This is possible thanks to the fact that DataFrame is coherent with the standard API of Pharo collections: the algorithms of pharo-ai do not reference DataFrame directly but only use the methods of Pharo collections, which DataFrame has the responsibility to implement.
⁸ https://hadoop.apache.org
⁹ https://spark.apache.org
¹⁰ https://github.com/PolyMathOrg/PolyMath
¹¹ https://numpy.org/
¹² https://scipy.org/
¹³ http://sciruby.com/
¹⁴ https://github.com/PolyMathOrg
¹⁵ https://github.com/orgs/pharo-ai
Despite having different purposes, the three libraries (PolyMath, pharo-ai, and DataFrame) have a lot in common. There is an intersection between the communities of their core developers. The libraries are meant to be compatible with each other (although sometimes this is not the case) and are often used together in the same application. However, for historical reasons, the open-source development process of the three libraries is not always synchronized. There are cases when machine learning algorithms are placed into PolyMath, mathematical algorithms are implemented in DataFrame, and data processing algorithms are stored in pharo-ai. Although in some cases this makes sense (e.g., data preprocessing and data partitioning algorithms are in pharo-ai because they are designed specifically for machine learning purposes), overall the three communities would benefit from better communication, clearly defined scopes, and a more synchronized development process.
6. Conclusion
In this paper, we have discussed the DataFrame library for the Pharo programming language. We have presented its current state and its evolution, and provided a high-level comparison with the pandas framework. This gave us valuable insight into future directions for the library's improvement. As future work, we foresee the rigorous implementation of the features that are missing in Pharo DataFrame, present in other libraries, and in high demand by library users. We would also like to tackle the performance issues the library has at the moment and evaluate the results by benchmarking DataFrame and comparing the results with other data frame libraries. We are especially interested in applying future improvements to the scenario of handling big data, and we plan to rethink the back-end implementation of DataFrame so that it can profit from external data storages (databases and distributed clusters) and support Spa, the Pharo-native framework for distributed handling of big data.
References
[1] O. Zaytsev, N. Papoulias, S. Stinckwich, Towards exploratory data analysis for pharo, in:
Proceedings of the 12th edition of the International Workshop on Smalltalk Technologies,
2017, pp. 1–6.
[2] W. McKinney, Python for data analysis: Data wrangling with Pandas, NumPy, and Jupyter, O'Reilly Media, Inc., 2022.
[3] D. Petersohn, S. Macke, D. Xin, W. Ma, D. Lee, X. Mo, J. E. Gonzalez, J. M. Hellerstein,
A. D. Joseph, A. Parameswaran, Towards scalable dataframe systems, arXiv preprint
arXiv:2001.00888 (2020).
[4] A. Bergel, Agile Visualization, LULU Press, 2016. URL: http://agilevisualization.com/.
[5] O. Zaitsev, C. Ferlicot-Delbecque, Data analysis made simple with pharo dataframe, 2023. URL: https://github.com/SquareBracketAssociates/Booklet-DataFrame.
[6] Awesome Dataframes, https://github.com/jcmkk3/awesome-dataframes, 2023. Accessed:
2023-05-29.
[7] A. Géron, Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools,
and techniques to build intelligent systems, O’Reilly Media, Sebastopol, CA, 2017.
[8] O. Zaitsev, S. Jordan Montaño, S. Ducasse, How fast is AI in Pharo? Benchmarking linear regression, in: IWST 2022 - International Workshop on Smalltalk Technologies, 2022.
[9] J. Zakir, T. Seymour, K. Berg, Big data analytics., Issues in Information Systems 16 (2015).
[10] V. Mayer-Schonberger, K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Houghton Mifflin Harcourt, Boston, 2013. URL: http://www.amazon.com/books/dp/0544002695.
[11] M. Marra, A live debugging approach for big data processing applications, Ph.D. thesis,
Vrije Universiteit Brussel, 2022.
[12] J. Dean, S. Ghemawat, Mapreduce: Simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113. URL: https://doi.org/10.1145/1327452.1327492. doi:10.1145/1327452.1327492.
[13] D. H. Besset, Object-Oriented Implementation of Numerical Methods: An Introduction with Pharo, Square Bracket Associates, 2016.
A. Example: Using DataFrame to Analyse Gender Wage Gaps
In this example, we use DataFrame to analyse a dataset of wage gaps between male and female employees in different countries and find the countries with the largest and smallest wage gaps in a given year. We use the public Gender Wage Gap dataset provided by the OECD (Organisation for Economic Co-operation and Development).¹⁶ The dataset is provided as a CSV file which can be loaded into Pharo as a DataFrame:
wageGapFile := 'data/wagegap.csv' asFileReference.
data := DataFrame readFromCsv: wageGapFile.
Table 3
Five rows of the Gender Wage Gap dataset before preprocessing
LOCATION INDICATOR SUBJECT MEASURE FREQUENCY YEAR Value
PRT WAGEGAP EMPLOYEE PC A 2014 15.321756895
BRA WAGEGAP EMPLOYEE PC A 2013 16.363636364
LUX WAGEGAP SELFEMPLOYED PC A 2020 22.295013428
SVK WAGEGAP EMPLOYEE PC A 2003 20.689655172
GRC WAGEGAP EMPLOYEE PC A 2002 23.565754634
In Table 3, we show five randomly selected rows of that DataFrame. The original dataset combines records from employees and self-employed people. To keep only the data about employees, we use the select: method, which is supported by all Pharo collections. In the case of DataFrame, at each iteration, the select block is evaluated with one row.
data := data select: [ :row | (row at: 'SUBJECT') = 'EMPLOYEE' ].
To answer the questions listed above, we only need three columns of the dataset: country,
year, and wage gap. The columns: method creates a subset of DataFrame with only the given
columns. In the next line, we rename those columns using the columnNames: setter.
¹⁶ https://data.oecd.org/earnwage/gender-wage-gap.htm
data := data columns: #('LOCATION' 'TIME' 'Value').
data columnNames: #('Country' 'Year' 'Gap').
At the final step of data preprocessing, we replace the three-letter country codes with full country names. To do that, we load another dataset which contains the list of ISO-3166 countries and dependent territories with UN regional codes.¹⁷
countryCodesFile := 'data/countryCodes.csv' asFileReference.
countryCodes := DataFrame readFromCsv: countryCodesFile.
To get the mapping in the form of a dictionary, we first set the alpha-3 column (country codes) as row names and then access the name column of the new DataFrame. The column will be an object of class DataSeries, which is a kind of OrderedDictionary containing country codes as its keys and country names as its values.
countryCodes rowNames: (countryCodes column: 'alpha-3').
countryCodeMapping := countryCodes column: 'name'.
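Since the resulting DataSeries behaves like a dictionary, the mapping can be probed directly; the answers below follow the ISO-3166 list loaded above:

```smalltalk
countryCodeMapping at: 'PRT'. "'Portugal'"
countryCodeMapping at: 'BRA'. "'Brazil'"
```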
The original Wage Gaps dataset contains two values which cannot be recognized as valid country codes: OECD and EU27. We remove them and then apply the mapping to the Country column.
data := data reject: [ :row |
#('OECD' 'EU27') includes: (row at: 'Country') ].
data toColumn: 'Country' applyElementwise: [ :each |
countryCodeMapping at: each ].
Table 4
The cleaned dataset ready to be analysed.
Country Year Gap
Portugal 2014 15.321756895
Brazil 2013 16.363636364
Slovakia 2003 20.689655172
Greece 2002 23.565754634
In Table 4, we show the same rows that were randomly selected for Table 3, after applying the preprocessing described above (the Luxembourg row is gone because it described self-employed people). The data is now clean and ready to be analysed. We first select the year for which we will compare the gender wage gaps across different countries. The most recent year in the dataset (2022) only contains one entry. We therefore search for the year after 2013 that has the most entries (we are not interested in data from more than 10 years ago). To do that, we first filter the data by year. In the next line, we group the country names by year and aggregate them by counting the number of entries per year. Then, using the argmax method, we get the year for which the count is the largest. That year is 2018.
selectedYear := ((data select: [ :row | (row at: 'Year') > 2013 ])
group: 'Country' by: 'Year' aggregateUsing: [ :group | group size ])
argmax.
¹⁷ https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes
We take a subset of the data for only the selected year and sort it in descending order of wage gap. Using the methods head and tail, we get the top 5 and bottom 5 countries in the dataset. The result can be seen in Tables 5 and 6.
oneYearData := (data select: [ :row | (row at: 'Year') = selectedYear ])
sortDescendingBy: 'Gap'.
oneYearData head. "top-5 rows"
oneYearData tail. "bottom-5 rows"
Table 5
Countries with largest wage gaps
  Country               Gap
  Korea, Republic of    34.10
  Japan                 23.53
  Estonia               22.68
  Israel                22.65
  Latvia                20.28

Table 6
Countries with smallest wage gaps
  Country               Gap
  Hungary               5.06
  Denmark               4.86
  Romania               3.49
  Belgium               3.40
  Bulgaria              3.03
This simple example demonstrated the typical use case for a DataFrame library. Although
the same preprocessing and analysis steps could be performed on standard Pharo collections,
DataFrame makes this process simpler and more intuitive by providing a dedicated API for data
manipulation.