Liliia V. Pavlenko et al. CEUR Workshop Proceedings                                                                                                 102–117


                         Teaching statistics to future programmers using real data
                         sets and R programming language
                         Liliia V. Pavlenko, Maksym P. Pavlenko and Vitalii H. Khomenko
                         Berdyansk State Pedagogical University, 4 Schmidta Str., Berdyansk, 71100, Ukraine


                                      Abstract
                                      This paper addresses the problem of teaching statistics to future programmers. It argues that the theoretical
                                      content of teaching statistics needs to be updated and oriented towards the practical field, even at the higher
                                      education level. It suggests that the teaching of statistics to students should move from theoretical methods to
                                      practical solutions of applied problems and emphasize the analysis and interpretation of results rather than the
                                      statistical calculations. The paper proposes a system of tasks based on real data sets obtained from statistical
                                      research as a way of improving the learning of statistics for future programmers. It shows that such tasks can
                                      increase the students’ motivation compared to synthetic examples, which are commonly used in statistics courses.
                                      The paper also reviews the software tools for statistical data analysis and identifies their features and advantages
                                      for the learning process. It recommends using R, a specialized programming language, as the main tool for
                                      teaching statistics.

                                      Keywords
                                      statistics education, future programmers, real data sets, R programming language, applied problems


                         1. Introduction
                         The rapid growth of information in the modern world poses a challenge for the statistical education of
                         society. Statistics is an essential component of the educational programs for training specialists in the
                         field of IT [1, 2]. However, teaching statistics to students often encounters various problems, such as
                         different levels of prior knowledge, low motivation, and a lack of understanding of the relevance and
                         applicability of statistics to their future profession [3].
                            The discipline of statistics has been undergoing significant changes and developments in recent
                         years. Cox [4], Moore [5], Smith and Staetsky [6] raise many questions about the need to improve the
                         objectives, content, methods and forms of teaching statistics.
                            Many researchers have investigated the issues and challenges of teaching statistics. They have
                         provided recommendations for teaching statistics in different types of educational institutions [7, 8,
                         9, 10, 11, 12, 13]. Some of them have suggested moving from the theoretical learning to the practical
                         application of statistical methods [10, 14].
                            Nicholl [15] notes that the theoretical content of teaching statistics has expanded significantly over
                         the past 50 years, but that this process has been uncoordinated: new concepts were added without removing
                         old ones. As a result, the content of teaching statistics has become overloaded with theoretical concepts
                         that do not enhance the students’ motivation and interest in learning statistics. Rumsey [16], Gal and
                         Garfield [17] draw attention to the problems of teaching statistics and propose to change the paradigm
                         of teaching and focus on the practical field, even at the higher education level.
                            Education is seen as an investment in human capital and production. The World Economic Forum
                         predicts that the global demand for statistical data analysts will increase by almost six times in the next
                         five years.

                          CoSinE 2024: 11th Illia O. Teplytskyi Workshop on Computer Simulation in Education, co-located with the XVI International
                          Conference on Mathematics, Science and Technology Education (ICon-MaSTEd 2024), May 15, 2024, Kryvyi Rih, Ukraine
                          Email: liliya.pavlenko@meta.ua (L. V. Pavlenko); pavlenko.2277@gmail.com (M. P. Pavlenko); v_g_homenko@ukr.net
                          (V. H. Khomenko)
                          URL: https://bdpu.org/en/faculties/fmkto/structure-fmkto/kaf-ktun/composition-ktun/pavlenkolv/ (L. V. Pavlenko);
                          https://bdpu.org/en/faculties/fmkto/structure-fmkto/kaf-ktun/composition-ktun/pavlenko-2/ (M. P. Pavlenko);
                          https://bdpu.org/en/faculties/fmkto/structure-fmkto/kaf-ktun/composition-ktun/homenko-2/ (V. H. Khomenko)
                          ORCID: 0000-0001-7823-7399 (L. V. Pavlenko); 0000-0003-0091-696X (M. P. Pavlenko); 0000-0002-7361-2169 (V. H. Khomenko)
                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                          CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   According to the Modis survey [18], 97.44% of respondents (representatives of banks and industry)
consider data analysis as a promising skill for success in sales and marketing. However, they are more
interested in interpreting the data than in performing the calculations. 42% of respondents complain
about the shortage of qualified professionals who have skills in statistical data analysis in the labor
market. 55% of respondents say that it is difficult to find specialists who can calculate and interpret the
results.
   Every day, large amounts of various data are generated and accumulated in the world [19]. Therefore,
the labor market demand for data analysts and data scientists is constantly growing. Varian [20] notes
that the data analyst will become one of the most popular professions in the future.
   Therefore, improving the education of students in statistics requires moving from the theoretical
methods to the practical solutions of applied problems and shifting the emphasis from the statistical
calculations to the analysis and interpretation of results.
   To prepare an intellectually active, knowledgeable and skillful specialist, education needs to move
from reproductive to innovative learning. Innovative learning is a creative combination of
traditional and new teaching methods, tailored to each discipline and based on its theoretical content and
practical orientation [21]. Moreover, teaching students is not only about
developing certain professional competencies, but also about aligning them with current
requirements [22]. This means that future specialists should be able to express their thoughts and
concepts verbally and to understand the language of symbols, signs and schemes. This requires not just the ability
to think creatively, but also the ability to make original decisions and take original actions.
   To organize innovative teaching of statistics in accordance with modern requirements, it is advisable to
use special software tools for statistical data analysis. However, there are also specialized programming
languages and environments that can be used to analyze data, interpret results and prepare conclusions
and reports in various formats quickly and efficiently.
   Therefore, there is a contradiction between the traditional approaches to teaching statistics and the
society’s expectations for the level of training of modern IT specialists in the field of statistical data
analysis, as well as between the theoretical orientation of the content of statistics teaching and the need
to train a specialist with applied tools and methods of statistical data analysis.
   The aim of this paper is to justify the use of the R programming language as a tool for
teaching statistics.


2. Results
The following main methods were used in the research process: content analysis of scientific and
methodical literature, with generalization and systematization, to clarify how far the problem has been developed;
a questionnaire survey of higher education students, with initial statistical processing of the results,
to clarify the current state of the problem; and generalization of theoretical and practical data to
justify the introduction of innovative approaches to the study of statistics by students based on the use
of the R programming language.
   Teaching statistics to students is associated with certain difficulties: the study
material in this course contains a large number of definitions and formulas. Students
need not only to reproduce them, but also to understand their meaning and be able to apply them in practice.
However, with the traditional organization of the educational process, practical tasks are far removed from the
economic, social and other processes that occur in real life. The analyzed data are generalized
and do not allow students to fully understand why this discipline needs to be studied
or how the acquired competencies can be applied in their further professional
activities.
   Therefore, most students learn statistics in fragments and, as a result, do not form systemic knowledge.
In addition, a mainly verbal presentation of information increases fatigue, which reduces
the productivity of the learning process [23].


   The number of statistically educated people is decreasing. It is difficult for potential employers to find
a specialist who will be able to perform statistical calculations without prior training and explanation.
Therefore, there is a need to improve the content of teaching this discipline through the introduction of
practical tasks.
   Improving the content of the statistics course requires the introduction of changes in the methods
and means of its teaching using innovative technologies.
   Scientific innovations that promote scientific progress cover all areas of knowledge. There are socio-
economic, organizational and managerial, technical and technological innovations. One of the types of
social innovations is pedagogical innovation.
   Pedagogical innovation is innovation in the field of pedagogy: purposeful progressive change that
introduces stable elements (innovations) into the educational environment and improves the characteristics of
both its individual components and the educational system as a whole [24].
   Pedagogical innovations can be carried out both with the educational system’s
own resources (intensive way of development) and with the involvement of additional capacities
(investments), such as new means, equipment, technologies and capital investments (extensive way of
development).
   Kazakov [25] notes that combining the intensive and extensive ways of developing pedagogical systems
makes it possible to carry out so-called “integrated innovations”, which are built at the junction of
various multilevel pedagogical subsystems and their components.
   The main ways and objects of innovative transformations in the teaching of statistics are:
    • developing concepts and strategies for the development of statistical education [26];
    • updating the content of statistics training;
    • changing existing learning technologies and developing new ones;
    • improving the training of IT specialists in the field of statistical data analysis;
    • designing new models of the educational process for teaching statistics;
    • improving the monitoring of the educational process and student learning;
    • developing new-generation electronic teaching aids.
  Innovation can take place at different levels. The highest level includes innovations that affect the
entire pedagogical system.
  Kulinenko [27] notes that while organizing the innovation, it should be considered that:

    • innovative ideas must be clear, convincing and adequate to the real educational needs of man and
      society, they must be transformed into specific goals, objectives and technologies;
    • innovation activity should be morally and materially stimulated, legal support of innovation
      activity is necessary;
    • not only results are important in pedagogical activity, but also ways, means, methods of their
      achievement.

   The current problems of teaching statistics in modern higher educational institutions include the
review of experience associated with the intensification of learning. One of the teacher’s main tasks
is to teach students to obtain the necessary information independently and to process it
consciously [28]. For students to be able to study the teaching materials on
their own, the materials need to be designed primarily for students and not for teachers.
   The value of the “Statistics” discipline for IT specialists is twofold. First, knowledge of
mathematical language and modeling helps students to navigate the forecasting of
economic, social, technical and other processes. Second, statistics by its very nature offers rich
opportunities for developing students’ algorithmic thinking.
   Future IT professionals must not only know the theoretical foundations, but also be able to apply the
means of automating statistical analysis. Such tools include specialized statistical software packages
and programming languages.
   Statistical packages can be divided by functionality into three main groups.


   1. Universal statistical packages, such as Statistica, SPSS, Statgraphics, STATA, Stadia, SYSTAT, S-PLUS and
      MS Excel. These packages are not targeted at a specific subject area and can be used to analyze
      data from different industries. Typically, they offer a wide range of statistical methods and have a
      relatively simple interface.
      Such packages are recommended for novice users who have only basic knowledge of
      statistics, as well as for experienced users in the initial stages of working with data, when
      the statistical methods to be used for a particular problem are not yet clearly defined. The
      versatility of a universal package allows a pilot analysis of different data types using a
      wide range of statistical methods. The vast majority of existing universal packages share much
      common functionality and are similar in the composition of their built-in statistical procedures.
   2. Professional statistical packages, such as SAS or BMDP. In contrast
      to the universal ones, professional packages allow you to work with extremely large amounts of data, apply highly
      specialized methods of analysis and create your own data processing system. As a rule, such
      packages are complex and should not be used in the educational process.
   3. Specialized statistical packages, such as BioStat, Datastream and Datascope. These are designed for statistical
      analysis in specific areas of activity that use special methods of statistical analysis, usually not
      available in the universal packages.

   Specialized packages support analysis with a limited number of specialized statistical methods or are
used in a specialized subject area. As a rule, such statistical packages are used by specialists who are
well acquainted with data analysis methods in the field on which the package is focused. For example,
the BioStat statistical package was created to analyze data in the field of biology and medicine.
   Most of the existing statistical packages have a flexible modular structure that can be supplemented
and expanded with custom modules that are purchased separately or freely available on the
Internet. Such flexibility allows you to adapt packages to the needs of a particular user.
   Statistical packages are just tools for an experienced professional. If the specialist does not have
sufficient knowledge and competencies, then even the most advanced software product will not ensure
high-quality data analysis. Conversely, the wrong software, which does not contain the required set of
statistical procedures, can make the work of even an experienced specialist more difficult.
   Therefore, during the training of IT specialists it is necessary to acquaint students
with the available statistical packages and their characteristics. However, specialized
programming languages are closer and more understandable to these students when conducting statistical
data analysis.
   For statistical data analysis it is possible and appropriate to use R and Python programming languages.
   We will consider the features of the R programming language. R is a powerful high-level
object-oriented programming language and environment for statistical computing and for visualization of
source and computed data, which allows you to solve many problems in the field of data processing.
It is free open-source software distributed under the GNU GPL and runs on common operating systems (Windows,
macOS, Linux).
   Tens of thousands of specialized modules and utilities have been developed for this language. One of
the most important features of the programming language R is the efficient implementation of vector
operations, which allows the application of compact notation while processing large amounts of data.
All this makes R an effective tool for obtaining useful information from large amounts of various
statistics, including Big Data.
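
   As a rough illustration of the compact vector notation mentioned above, the following sketch standardizes a vector of scores in a single expression and compares the result with an explicit loop. The variable names and the five sample values are illustrative assumptions and are not taken from the course materials.

```r
# Illustrative scores (hypothetical sample values, chosen only for this sketch)
scores <- c(186, 173, 160, 145, 128)

# Vectorized form: one expression operates on the whole vector at once
z_vec <- (scores - mean(scores)) / sd(scores)

# Equivalent explicit loop, shown only for comparison
z_loop <- numeric(length(scores))
for (i in seq_along(scores)) {
  z_loop[i] <- (scores[i] - mean(scores)) / sd(scores)
}

# Both approaches give the same standardized values
print(all.equal(z_vec, z_loop))

# A logical vector selects elements without any loop
above_mean <- scores[scores > mean(scores)]
print(above_mean)
```

   The vectorized form is shorter and closer to the mathematical notation, which is one reason it may suit students who are learning statistics and programming at the same time.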
   The R language is a convenient and effective tool for teaching statistical analysis, data processing
and visualization.
   The Python programming language can also be used for data analysis and interactive
research calculations with visualization of results. Python is an open-source object-oriented programming
language. The relatively recent advent of improved libraries for Python (primarily pandas) has made
it a serious competitor to the R language for statistical data analysis. Combined with the benefits of
Python as a general-purpose programming language, this makes it an excellent choice for creating data processing
applications.


   Thus, the use of a specialized programming language as a learning tool contributes to the development
of statistical data analysis skills as well as the algorithmic thinking of future IT
professionals.
   To examine the relevance of the research problem, an ascertaining experiment
was conducted among students of IT specialties. It studied questions that reveal the opinion of
higher education students on the problem of improving the methods of teaching statistics to future IT
professionals.
   The results of the ascertaining experiment are presented as percentages and indicate the share
of positive answers to the questions. The survey was organized using Google Forms. A total of 83 students
majoring in 015 Professional Education (Digital Technology) and 015 Professional Education (Computer
Technology) took part in the ascertaining experiment.

2.1. Declared interest of students in studying the course of statistics
In this block students were asked two questions. The results of the answers to the first
question of the survey are shown in figure 1. The analysis of the answers establishes the level of students’
awareness of the labor market demand for specialists who know how to analyze data.


Figure 1: Results of answers to the question regarding students’ awareness of the demand for data analysis
specialists in the labor market.


   Analysis of students’ answers shows that 42.17% of respondents believe
that a data analysis specialist is in demand in the labor market. This supports the relevance of and need for
a statistics course for IT professionals.
   The second question clarified which specialties in data analysis students consider the most relevant
today. The results of the student survey are shown in figure 2.


Figure 2: Results of the questionnaire regarding students’ awareness of modern data analysis professions
in the labor market.

   The best-known profession among future programmers is data analyst
(65.06%), followed by data scientist (51.81%). These professions are known to
more than 50% of students, which indicates their awareness of and interest in this field.
   Based on the answers to the questions of this block, we can draw the
following conclusion. Teaching statistics to future IT professionals is relevant, because students are
aware of the existence of professions in the field of data analysis and believe that they will need statistics
in their future professional activities.


2.2. Students’ opinion about the need to fill the content with tasks of an applied
     nature
Students were asked to answer an open-ended question: “In which subject area are you
interested in conducting data analysis?” The students’ answers showed that the most popular data for processing are
data from sociology, medicine, engineering, economics and biology.
   We also studied what kind of data students are interested in working with in practice. The
results are shown in figure 3.


Figure 3: Students’ opinion on data origin for practical tasks.

  Among the surveyed respondents, 71.08% believe that data obtained as a result of practical research
and having an applied nature are most attractive for them. This indicates the need to develop practical
and laboratory work based on real data obtained from statistical studies.

2.3. Students’ interest in using programming languages and software for statistical
     data analysis
The purpose of the third block of questions was to study the opinion of respondents about the need and
feasibility of using software and programming languages for statistical data analysis.
    Students were asked the following questions: “Do you know programming languages with which
it is possible to perform statistical data analysis (enter)?”, “Which software product interface is more
user friendly for you?”, “Are you more interested in data analysis using special software or using a
programming language?”
    For the first question, the opinions of the respondents were divided as follows: 55.42% indicated
the R programming language and 28.92% indicated the Python programming language. Programming
languages such as C++ (9.64%) and Java (6.02%) were also indicated (figure 4).


Figure 4: Respondents’ answers to the question on convenience of program packages interface.

   The obtained results allow us to state that the R language is the best-known means of statistical
data analysis among the respondents. Therefore, we will use this programming language to solve applied problems.
   In terms of the convenience of the software package interface, respondents preferred MS Excel
(56.63%), followed by the Statistica software package (28.92%) and SPSS (14.46%) (figure 5).
   Therefore, the students will be asked to use MS Excel and Statistica for practical calculations.
   According to the students’ answers to the third question of this block, a programming
language (57.83%) was chosen by the students as the main tool for organizing the training in statistical
data analysis (figure 6).


Figure 5: Choosing program packages for statistical data analysis.


Figure 6: Respondents’ answers regarding the choice of means for solving statistical data analysis tasks.


  So, students in the class will be asked to use the programming language R as the main tool for
practical calculations. MS Excel and Statistica will be used as aids in statistical analysis.


3. Using applied tasks for teaching statistics
Taking into account and summarizing the results of the study, in our opinion, it is advisable to build the
content and structure of the course considering the wishes of students. In practical classes, tasks that
are of a real applied nature and based on real statistics should be considered. One of the main teaching
methods should be a practical method of learning based on programming. The means of statistical data
analysis in practical classes can be both software tools for data analysis (MS Excel and Statistica) and
the language and programming environment R.
   A system of tasks has been developed for the course. Let us consider an example of teaching
statistical analysis in the R environment. For the analysis we will take data from the
website https://abit-poisk.org.ua, namely data on entrants for 2017. This site contains large
amounts of data; for our example we will take only entrants who entered the Faculty of Physical and
Mathematical Computer and Technological Education of Berdyansk State Pedagogical University in
the specialties “Professional Education (Computer Technology)” and “Professional Education (Digital
Technology)” at the bachelor level.
   A total of 31 applications were submitted for these specialties. We will analyze these data using
descriptive statistics in R and present the results using the most common types of graphs in R.
   Step 1. We define the last name, specialty, id, total score of the external evaluation and status (budget / contract),
then enter the data into a table. We set the values in the form of vectors with the command
<- c("vector_value1", "vector_value2", ...). We build the table from the resulting
vectors with the command data.frame and display it by typing studentdata. Commands for creating the table with the information
about applicants:

     > # last names of the 31 applicants
     > last_name <- c("Shvachko", "Dybiaga", "Kartashov", "Sytosenko",
     "Filipenko", "Klimenko", "Veretelnik", "Diakov", "Salionov",
     "Bagnuk", "Kombarov", "Baranovsky", "Kiseliov", "Sakun", "Bova",
     "Potapova", "Kobzar", "Sementsov", "Cybulka", "Teplov",
     "Mitushkin", "Kartinik", "Gavrylenko", "Trotsenko",
     "Panchukov", "Kyslynsky", "Sagirov", "Korobov",
     "Shatalina", "Tichovod", "Popov")
     > # specialty code of each applicant
     > specialty <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
     1,1,1,1,1,2,2,2,2,2)
     > id <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
     21,22,23,24,25,26,27,28,29,30,31)
     > # total score of the external evaluation
     > rating <- c(186,184,180,179,173,173,170,168,167,166,163,
     162,160,156,148,145,145,142,142,140,140,139,135,131,129,123,
     147,146,140,136,128)
     > # status code (budget / contract)
     > status <- c(1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,1,1,1,1,0)
     > # combine the vectors into one table (specialty is not included)
     > studentdata <- data.frame(id, last_name, rating, status)
     > studentdata
        id  last_name rating status
     1   1   Shvachko    186      1
     2   2    Dybiaga    184      1
     3   3  Kartashov    180      1
     4   4  Sytosenko    179      1
     5   5  Filipenko    173      1
     6   6   Klimenko    173      1
     7   7 Veretelnik    170      1
     8   8     Diakov    168      1
     9   9   Salionov    167      1
     10 10     Bagnuk    166      1
     11 11   Kombarov    163      1
     12 12 Baranovsky    162      1
     13 13   Kiseliov    160      0
     14 14      Sakun    156      0
     15 15       Bova    148      0
     16 16   Potapova    145      0
     17 17     Kobzar    145      0
     18 18  Sementsov    142      0
     19 19    Cybulka    142      0
     20 20     Teplov    140      0
     21 21  Mitushkin    140      0
     22 22   Kartinik    139      0
     23 23 Gavrylenko    135      0
     24 24  Trotsenko    131      0
     25 25  Panchukov    129      0
     26 26  Kyslynsky    123      0
     27 27    Sagirov    147      1
     28 28    Korobov    146      1
     29 29  Shatalina    140      1
     30 30   Tichovod    136      1
     31 31      Popov    128      0
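
   Once the table exists, a natural follow-up exercise is to compare groups of applicants. The sketch below is one possible way to do this with a labeled factor and tapply. The score and status vectors are copied from Step 1; the mapping of the codes to labels (1 = budget, 0 = contract) is an assumption made for illustration, since the coding is not stated in the text.

```r
# Scores and status codes copied from Step 1
rating <- c(186, 184, 180, 179, 173, 173, 170, 168, 167, 166, 163,
            162, 160, 156, 148, 145, 145, 142, 142, 140, 140, 139,
            135, 131, 129, 123, 147, 146, 140, 136, 128)
status <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0)

# ASSUMPTION: 1 = budget, 0 = contract (the source does not state the coding)
status_f <- factor(status, levels = c(1, 0),
                   labels = c("budget", "contract"))

# Number of applicants in each group
group_sizes <- table(status_f)

# Mean score within each group
group_means <- tapply(rating, status_f, mean)

print(group_sizes)
print(group_means)
```

   Turning numeric codes into a factor with readable labels keeps later tables and plots self-explanatory, which helps when students are asked to interpret the output rather than only produce it.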

  Step 2. We will calculate the main statistical values: mean, standard deviation, variance, median
absolute deviation, minimum and maximum value. The results of the calculation:

     > rating_mean <- mean(rating)
     > rating_mean
     [1] 153
     > # results get their own names so that sd(), var(), etc. are not masked
     > rating_sd <- sd(rating)
     > rating_sd
     [1] 18.03145
     > rating_var <- var(rating)
     > rating_var
     [1] 325.1333
     > rating_mad <- mad(rating)
     > rating_mad
     [1] 22.239
     > rating_min <- min(rating)
     > rating_min
     [1] 123
     > rating_max <- max(rating)
     > rating_max
     [1] 186

   According to the results of the calculations, the following data were obtained: the mean score of
entrants in the external evaluation is 153 points, the standard deviation is about 18 points, the scaled median
absolute deviation (mad) is about 22 points, the lowest result (min) is 123 points and the best result (max) is 186 points.
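
   The descriptive statistics above can be complemented with position-based measures. The following sketch, which is an addition and not part of the original task, computes the median, the quartiles and the interquartile range for the same scores using base R functions.

```r
# Scores copied from Step 1
rating <- c(186, 184, 180, 179, 173, 173, 170, 168, 167, 166, 163,
            162, 160, 156, 148, 145, 145, 142, 142, 140, 140, 139,
            135, 131, 129, 123, 147, 146, 140, 136, 128)

# Middle value of the sorted scores
rating_median <- median(rating)

# Quartiles (R's default quantile definition, type 7)
rating_quartiles <- quantile(rating, probs = c(0.25, 0.5, 0.75))

# Spread of the middle half of the data
rating_iqr <- IQR(rating)

print(rating_median)
print(rating_quartiles)
print(rating_iqr)

# Compact overview: minimum, quartiles, mean and maximum in one call
print(summary(rating))
```

   Comparing the median with the mean gives students a first, informal indication of whether the distribution is symmetric, which prepares the ground for the graphical analysis in the next steps.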
   Step 3. Let us construct a histogram of frequencies (a bar chart of score counts) for the external evaluation
points using the command barplot (figure 7):

     > # count how many applicants obtained each score
     > counts <- table(studentdata$rating)
     > barplot(counts, main="Frequency diagram", xlab="Rating",
     ylab="Frequency")


Figure 7: The histogram of frequencies for external evaluation points.


  The histogram of frequencies shows that the largest number of entrants have a score from 139 to
142 points, and that the vast majority of external evaluation scores are unique, i.e. each occurs only
once.
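
   The reading of the frequency diagram can be checked numerically. The sketch below is an illustrative addition: it uses the same table of counts to find the most frequent score and to count how many distinct scores occur only once. The helper variable names are assumptions introduced for this example.

```r
# Scores copied from Step 1
rating <- c(186, 184, 180, 179, 173, 173, 170, 168, 167, 166, 163,
            162, 160, 156, 148, 145, 145, 142, 142, 140, 140, 139,
            135, 131, 129, 123, 147, 146, 140, 136, 128)

# Frequency of each distinct score (the same table used for barplot)
counts <- table(rating)

# Score(s) with the highest frequency
most_frequent <- names(counts)[counts == max(counts)]

# Number of distinct scores and how many of them occur exactly once
n_distinct <- length(counts)
n_once <- sum(counts == 1)

print(most_frequent)
print(n_distinct)
print(n_once)

# Scores that are shared by more than one applicant
print(counts[counts > 1])
```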


  Step 4. We construct a histogram of points / frequencies with a normal distribution curve. For this
purpose we use the commands hist and lines; the command box draws a frame around the finished plot. We plot
the parameter rating on the 𝑥-axis and the frequency of the score in the table on the 𝑦-axis (figure 8):

     > x <- studentdata$rating
     > h <- hist(x, breaks=12, col="darkblue", xlab="ZNO score",
     main="Frequencies histogram with the curve of distribution")
     > # normal density on a grid of 40 points across the observed range
     > xfit <- seq(min(x), max(x), length.out=40)
     > yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))
     > # rescale density to counts: bin width * number of observations
     > yfit <- yfit * diff(h$mids[1:2]) * length(x)
     > lines(xfit, yfit, col="red", lwd=3)
     > box()


Figure 8: Frequencies histogram with the curve of distribution.


   The histogram shows that the scores of the applicants do not follow the normal distribution. There
are many “average” entrants, i.e. those who scored from 135 to 145 points on the external examination.
There are also entrants who scored 165 points, i.e. those with a “sufficient” level. Very few entrants
scored more than 180 points.
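
The visual judgement about normality can be complemented by a formal check. The sketch below is an assumption of this edition, not a step taken in the paper: it applies the Shapiro-Wilk test (shapiro.test from base R) to a hypothetical two-peaked sample that only imitates the shape described above.

```r
# Hypothetical sample with two groups of scores (not the paper's data).
set.seed(1)
x <- c(rnorm(60, mean = 140, sd = 5), rnorm(20, mean = 165, sd = 4))

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a normal distribution.
sw <- shapiro.test(x)
print(sw$statistic)
print(sw$p.value)

# A normal Q-Q plot gives a graphical version of the same check:
# qqnorm(x); qqline(x)
```

A p-value below the chosen significance level would support the conclusion drawn from the histogram; a larger p-value would mean that the sample gives no evidence against normality.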
   Step 5. We construct a kernel density estimate plot for the external evaluation scores using the
commands density and plot (figure 9):

     > par(mfrow=c(2,1))
     > d <- density(studentdata$rating)
     > plot(d)
     > box()

  The kernel density plot shows that the highest density is observed in the range from 130 to 155 points.
Judging by the graph, the values in this interval differ by 25 points, whereas over the full table they
differ by about 22 points (see the median absolute deviation, mad, computed above).


Figure 9: The kernel density estimate plot.
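
The values read off the density plot can also be extracted from the object returned by density(). The following minimal sketch uses a hypothetical sample (not the paper's data); the names peak, in_range and mass are assumptions of this example, and the interval from 130 to 155 points is taken from the interpretation above.

```r
# Hypothetical scores standing in for studentdata$rating (not the paper's data).
set.seed(2)
rating <- round(rnorm(100, mean = 150, sd = 18))

d <- density(rating)   # kernel density estimate on an equally spaced grid d$x

# Location of the highest point of the estimated density.
peak <- d$x[which.max(d$y)]

# Approximate share of the estimated density that falls between 130 and 155 points
# (a simple Riemann sum over the grid).
in_range <- d$x >= 130 & d$x <= 155
mass <- sum(d$y[in_range]) * diff(d$x[1:2])

print(peak)
print(mass)
```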


   As a result of solving applied problems using theoretical knowledge from different sections of
statistics, students will not only master the skills of using statistical methods, but also develop the
ability to interpret the results and predict the studied processes. It should be emphasized that the use of
programming as a practical teaching method will allow students to improve their knowledge and skills
in the field of programming as well as the use of algorithms and design patterns.
   Using real data for statistical analysis, students will be able to understand the need and feasibility of
statistical research in future professional activities.
   One of the problems of using applied tasks with real data is the selection and use of data sets.
Many datasets are closed and unavailable for free research and use. However, there are organizations
that provide free access to data:

    • World Bank Open Data (https://data.worldbank.org/) provides more than 3,000 sets of economic
      and social data on various indicators. Data can be downloaded in csv and xml formats. The service
      supports API access, which allows you to automate data downloads using the programming
      language R.
    • The unified state web portal of open data (https://data.gov.ua/) contains 15 categories of data
      sets that are constantly updated. Datasets are available for download in Excel, csv, json and xml
      formats. All data are available under the Creative Commons Attribution 4.0 International license.
    • The official page of the All-Ukrainian Population Census (http://database.ukrcensus.gov.ua)
      provides access to information on the population living in the country: socio-economic
      characteristics, demographic indicators, level of education, national composition and language
      characteristics. Datasets can be downloaded in txt, csv and html formats.
    • Open World Health Organization data repository (https://www.who.int/data/gho/). The site
      provides datasets on the health status of citizens of World Health Organization member states.
      Datasets are divided into over 100 categories. Data can be downloaded in Excel format or
      accessed directly through the API.
    • UNICEF Dataset (https://data.unicef.org/) collects relevant data on education, child labor, child
      disability, infant mortality, maternal mortality, water and sanitation, pneumonia, malaria and
      more. Datasets are available in Excel and csv formats.
    • Registry of Open Data on AWS (RODA) (https://registry.opendata.aws/) contains data located
      on AWS servers. The service offers access to over 200 datasets. There is a page with additional
      information, usage examples, license information, and more for each data set. Using the wide
      range of computing products offered by AWS (Amazon EC2, Amazon Athena, AWS Lambda and
      Amazon EMR), it is possible to share data in the cloud. This allows users to spend more time
      analyzing data rather than collecting data. When using data sets hosted on AWS, it is necessary
      to consider the type of license of each specific data set, as they belong to different agencies,
      government organizations, researchers, businesses and individuals.
    • Data.gov (https://www.data.gov/) provides open data sets of the US government. The resource
      contains more than 200,000 data sets from various sources: federal agencies, states, counties,
      cities, etc. Data can be obtained in various formats, including Excel, csv, json, xml.
    • GroupLens Research (https://grouplens.org/) provides several sets of movie ratings data
      collected from MovieLens users. The sets contain movie ratings, movie metadata (genre and release
      year), and user demographics (age, gender, and occupation). Such data can be used to develop a
      recommendation system based on regression analysis.
    • Yelp Open Dataset (https://www.yelp.com/dataset) is a subset of Yelp’s businesses, reviews, and
      user data intended for personal, educational, and academic purposes. It is available as JSON files
      and can be used to teach students about databases, to learn NLP, or as sample production data
      while learning statistics.
    • Kaggle (https://www.kaggle.com/datasets) is a social network for researchers that provides access
      to various data sets for analysis and research. The convenience of Kaggle is that it is not just a
      data warehouse. Each data set brings together a community of researchers in which data are
      discussed and approaches to data processing are elucidated.
    • Google Public Data Explorer (https://www.google.com/publicdata/directory) provides access to
      more than 130 datasets submitted by the World Bank, the U.S. Bureau of Labor Statistics, OECD,
      IMF and other organizations.
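
Most of the services listed above offer csv downloads, so a first classroom step is usually to load such a file into a data frame. The following is a minimal sketch under stated assumptions: the file is created locally with invented contents so that the example is self-contained, and the column names country, year and value are hypothetical. A real file could be fetched beforehand with download.file() using the link published by the chosen service.

```r
# Create a small local csv file with invented contents (a stand-in for a downloaded dataset).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,value",
             "A,2020,1.5",
             "A,2021,2.0",
             "B,2020,3.5"), csv_path)

# Load the file into a data frame and inspect its structure.
indicators <- read.csv(csv_path, stringsAsFactors = FALSE)
str(indicators)

# Basic descriptive statistics and a grouped mean per country.
print(summary(indicators$value))
by_country <- aggregate(value ~ country, data = indicators, FUN = mean)
print(by_country)
```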

   All the services considered provide access to open data sets. This makes it possible to fill the
content of statistics courses for future programmers with applied tasks.


4. Experimental verification of the effectiveness of using applied
   tasks to teach statistics to future programmers
This section describes the pedagogical experiment on using the R programming language and applied tasks
in teaching statistics to future IT specialists.
   The main purpose of the pedagogical experiment is to test the hypothesis that the use of the R
programming language and applied problems in teaching statistics to future IT professionals helps to
increase the educational motivation of students.
   According to the hypothesis of the study, the experiment involved checking the level of motivation
of students of IT specialties to study statistics after the introduction of applied problems and the R
programming language. The experiment was conducted at Berdyansk State Pedagogical University. Students
majoring in 015 Professional Education (Digital Technology) and 015 Professional Education (Computer
Technology) took part in it.
   Control and experimental groups were organized. In the control group, the educational process
was carried out according to the traditional method. This method involved the use of specialized
software (Microsoft Excel, Statistica, etc.) and synthetic tasks, the content of which did not take into
account the specifics of the future professional activities of students of IT specialties. The control
group (CG) consisted of 42 students. The experimental group used applied problems and the R programming
language to solve them. The experimental group (EG) included 32 students.
   When the control and experimental groups were formed, they were balanced with respect to the initial
level of educational motivation of students.
   Standardized methods were applied throughout the pedagogical research to support the reliability of
the results.
   The experimental method of teaching statistics to future programmers was based on the use of
professional tasks and the R programming language at all stages of learning: as a motivating task when
learning new material, at the stage of consolidation, and as a professionally oriented project in the
independent work of students.


   An electronic learning tool has been developed for student programmers to provide information
and methodological support for the statistics course. The development of the tool took into account
the students’ age and level of preparation. The learning tool contains theoretical
materials, tasks for practical implementation, visual materials with examples of the application of
the R programming language, a guide to the commands of the R language and a list of recommended
reading. The e-learning tool is available on the Internet at https://r.ktuni.bdpu.org/.
   To test the effectiveness of the experimental training, the level of educational motivation was chosen
as the criterion. To assess the dynamics of changes in the motivation of future IT specialists to study
statistics, we used the method of Rean and Yakunin [29], which is aimed at diagnosing educational
motivation in general. The technique makes it possible to identify the predominant types of motives for
learning and to trace the dynamics of changes in the structure of educational motivation. The methodology
is standardized and involves the study of 16 types of educational motives of students.
   Positive motivation for learning ensures the successful formation of knowledge and skills. High
positive motivation can compensate for insufficiently high abilities of students. With the right choice of
means of motivation for learning, there is a positive pedagogical influence. Focusing only on “negative”
motives (avoidance, fear of failure, fear) is always less effective than focusing on “positive” ones. In
our study, we determine the impact of the developed system of tasks on the level of educational motivation
of students.
   Table 1 presents the average scores for each type of educational motive on the scale of Rean and
Yakunin [29]. Comparative analysis of table 1 allows us to conclude that before the experiment the levels
of educational motives of students in the control and experimental groups did not differ significantly.
After the experiment, the experimental group shows an increase in the levels of the internal educational
motives of students. In general, the level of educational motivation after the experiment is higher in
the experimental group than in the control group, except for the motive of avoiding condemnation and
punishment (item 15).

Table 1
The results of students’ questionnaire according to the methods of Rean and Yakunin [29].
                                                                                    Before the experiment   After the experiment
        Educational motivation                                                      CG          EG          CG          EG
        1. To become a qualified specialist                                         6.6         6.6         6.7         6.8
        2. To get the diploma                                                       6.7         6.6         6.2         6.8
        3. To continue successful studies at further courses                        5.6         6.3         6.0         6.2
        4. To study successfully, to pass exams for “good” and “excellent” marks    6.0         5.3         4.5         6.2
        5. To get constant scholarship                                              5.5         5.2         4.9         5.5
        6. To gain deep and profound knowledge                                      6.0         6.3         6.3         6.8
        7. To be always ready for classes                                           4.5         4.5         5.0         5.2
        8. Not to give up learning the subjects of the educational cycle            5.5         5.6         5.5         6.5
        9. Not to lag behind the classmates                                         6.0         5.6         5.5         5.8
        10. To provide future successful professional activity                      6.8         6.6         6.5         6.9
        11. To execute pedagogical requirements                                     5.0         4.7         5.2         5.5
        12. To get teachers’ respect                                                4.8         5.2         3.6         4.9
        13. To be an example for the classmates                                     3.2         4.7         3.5         4.3
        14. To gain parents’ and relatives’ respect                                 4.5         4.8         5.0         6.6
        15. To avoid condemnation and punishment for bad studying                   4.1         4.9         5.0         4.5
        16. To get intellectual satisfaction                                        4.9         4.91        4.5         6.6
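
The comparison of the two groups after the experiment can be checked directly from the values in Table 1. The sketch below is an illustration added in this edition; the vector names cg_after, eg_after, diff_after, lower_in_eg and largest_gain are assumptions of the example, and the numbers are the per-motive averages copied from the last two columns of the table.

```r
# Average scores after the experiment for motives 1-16, copied from Table 1.
motive   <- 1:16
cg_after <- c(6.7, 6.2, 6.0, 4.5, 4.9, 6.3, 5.0, 5.5, 5.5, 6.5, 5.2, 3.6, 3.5, 5.0, 5.0, 4.5)
eg_after <- c(6.8, 6.8, 6.2, 6.2, 5.5, 6.8, 5.2, 6.5, 5.8, 6.9, 5.5, 4.9, 4.3, 6.6, 4.5, 6.6)

# Difference between the experimental and control groups for each motive.
diff_after <- eg_after - cg_after
print(round(diff_after, 1))

# Motives for which the experimental group scored lower than the control group.
lower_in_eg <- motive[diff_after < 0]
print(lower_in_eg)   # 15

# Motive with the largest advantage of the experimental group.
largest_gain <- motive[which.max(diff_after)]
print(largest_gain)  # 16
```

Only motive 15 (avoiding condemnation and punishment) is lower in the experimental group, and the largest difference in its favour is for motive 16 (intellectual satisfaction).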

  Table 2 shows the results of statistical comparison of the control and experimental groups before
and after the experiment. The following statements were formulated as working hypotheses: H0: the
levels of learning motivation in the compared groups do not differ; H1: the levels of learning motivation
in the compared groups differ. The Mann-Whitney U-test was used to determine the difference between
the samples. This is a non-parametric statistical criterion used to assess the difference between two
samples in the level of a trait measured on at least an ordinal scale. It can detect differences in the
value of the parameter between small samples.

Table 2
Statistical comparison of the educational motivation levels of students in the control and experimental
groups before and after the experiment.
                           Before the experiment                        After the experiment
                  W_emp      W_crit     Accepted hypothesis    W_emp      W_crit     Accepted hypothesis
                  0.1508     1.96       H0                     2.186      1.96       H1

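   Statistical analysis allows us to conclude that at the significance level α = 0.05 the initial states of
the experimental and control groups (before the experiment) do not differ significantly, since the
empirical value 0.1508 is below the critical value 1.96. At the end of the experiment, the levels of
educational motivation differ, since the empirical value 2.186 exceeds the critical value 1.96.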
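
A comparison of this kind can be sketched in R with wilcox.test, the base R implementation of the Mann-Whitney (Wilcoxon rank-sum) test. The example below is only an illustration under stated assumptions: it treats the 16 per-motive averages after the experiment from Table 1 as the two samples, which is not necessarily how the empirical values in Table 2 were obtained, and it standardizes the U statistic with the large-sample normal approximation without tie or continuity correction. It is therefore not expected to reproduce the values in Table 2 exactly.

```r
# Per-motive average scores after the experiment, copied from Table 1 (items 1-16).
cg_after <- c(6.7, 6.2, 6.0, 4.5, 4.9, 6.3, 5.0, 5.5, 5.5, 6.5, 5.2, 3.6, 3.5, 5.0, 5.0, 4.5)
eg_after <- c(6.8, 6.8, 6.2, 6.2, 5.5, 6.8, 5.2, 6.5, 5.8, 6.9, 5.5, 4.9, 4.3, 6.6, 4.5, 6.6)

# Mann-Whitney test; exact = FALSE because the data contain ties.
res <- wilcox.test(eg_after, cg_after, exact = FALSE)
u <- unname(res$statistic)   # U statistic for the first sample

# Standardized statistic under the normal approximation
# (no tie or continuity correction; an assumption of this sketch).
n1 <- length(eg_after)
n2 <- length(cg_after)
z <- (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Two-sided critical value at the significance level 0.05 (about 1.96).
alpha <- 0.05
z_crit <- qnorm(1 - alpha / 2)
decision <- if (abs(z) > z_crit) "H1" else "H0"

print(c(U = u, z = z, z_crit = z_crit))
print(decision)
```

The decision rule is the same as in Table 2: H0 is kept when the standardized statistic does not exceed the critical value, and H1 is accepted otherwise.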
   Thus, the results of the study indicate that the hypothesis was confirmed: introducing the R
programming language and applied problems into the teaching of statistics helps to increase the level
of educational motivation of future IT professionals.


5. Conclusions
This paper has provided a theoretical justification for the introduction of innovative approaches to
teaching statistics. It has shown that the teaching of statistics to future programmers should be based
on the use of applied tasks developed with real data sets obtained from statistical research. Such tasks
can increase the students’ motivation and interest compared to synthetic examples, which are often
used in statistics courses.
   Real data sets for statistical analysis are a rich source of applied tasks. They are freely available on
the Internet and cover various subject areas, such as sociology, medicine, engineering, economics and
biology. Therefore, the development of practical and laboratory work for future IT professionals should
include tasks that involve real data from these domains.
   Using the R programming language to teach statistics to future programmers allows the use of a
practical training method based on programming. This method engages students in familiar and relevant
activities and develops their programming skills. Therefore, we propose to use R as the main tool for
teaching statistics. MS Excel and Statistica software packages can be used as supplementary tools.
   In further research, we plan to develop a methodology for implementing and applying R and Python
programming languages for statistical data analysis.


References
 [1] O. V. Bondarenko, O. V. Hanchuk, O. V. Pakhomova, G. Tsutsunashvili, A. Zagórski, Visualization
     of demographic statistical data, IOP Conference Series: Earth and Environmental Science 1049
     (2022) 012076. doi:10.1088/1755-1315/1049/1/012076.
 [2] L. F. Panchenko, V. Y. Velychko, Unveiling the potential of structural equation modelling in
     educational research: a comparative analysis of Ukrainian teachers’ self-efficacy, Educational
     Technology Quarterly 2023 (2023) 157–172. doi:10.55056/etq.601.
 [3] A. Zieffler, J. Garfield, S. Alt, D. Dupuis, K. Holleque, B. Chang, What Does Research Suggest
     About the Teaching and Learning of Introductory Statistics at the College Level? A Review of the
     Literature, Journal of Statistics Education 16 (2008) 8. URL:
     https://www.tandfonline.com/doi/full/10.1080/10691898.2008.11889566. doi:10.1080/10691898.2008.11889566.
 [4] D. R. Cox, The current position of statistics: a personal view, International statistical review 65
     (1997) 261–276.
 [5] D. S. Moore, New pedagogy and new content: The case of statistics, International statistical review
     65 (1997) 123–137.
 [6] T. M. F. Smith, L. Staetsky, The teaching of statistics in UK universities, Journal of the Royal
     Statistical Society: Series A (Statistics in Society) 170 (2007) 581–622.
     doi:10.1111/j.1467-985X.2007.00482.x.
 [7] D. Ben-Zvi, J. B. Garfield, The challenge of developing statistical literacy, reasoning and thinking,
     Springer, 2004.
 [8] R. Biehler, D. Frischemeier, C. Reading, J. M. Shaughnessy, Reasoning about data, in: International
     handbook of research in statistics education, Springer, 2018, pp. 139–192.
 [9] A. J. Bishop, M. A. K. Clements, K. Clements, C. Keitel, J. Kilpatrick, C. Laborde, International
     Handbook of Mathematics Education, Springer Science & Business Media, 1996.
[10] J. Garfield, D. Ben-Zvi, Developing students’ statistical reasoning: Connecting research and
     teaching practice, Springer Science & Business Media, 2008.
[11] C. W. Langrall, K. Makar, P. Nilsson, J. M. Shaughnessy, Teaching and learning probability and
     statistics: An integrated perspective, 2017.
[12] J. M. Shaughnessy, Research in Probability and Statistics: Reflections and Directions, in:
     Handbook on Research in Mathematics Education, 1992, pp. 465–494. URL:
     https://ci.nii.ac.jp/naid/10029959707/.
[13] J. M. Shaughnessy, Research on Statistics’ Reasoning and Learning, in: Second Handbook of
     Research on Mathematics Teaching and Learning, 2007, pp. 957–1009. URL: https://ci.nii.ac.jp/
     naid/10029959708/.
[14] J. M. Watson, N. E. Fitzallen, P. Carter, Top drawer teachers: Statistics, 2013. URL: http://ecite.utas.
     edu.au/87993.
[15] D. F. Nicholl, Future directions for the teaching and learning of statistics at the tertiary level,
     International Statistical Review 69 (2001) 11–15.
[16] D. J. Rumsey, Statistical literacy as a goal for introductory statistics courses, Journal of Statistics
     Education 10 (2002).
[17] I. Gal, J. Garfield, Curricular goals and assessment challenges in statistics education, The assessment
     challenge in statistics education (1997) 1–13.
[18] Modis, STEM IQ Survey Results 2018, 2018. URL: https://www.modis.com/en-us/resources/
     employers/stem-iq-survey-2018/.
[19] V. H. Khomenko, L. V. Pavlenko, M. P. Pavlenko, S. V. Khomenko, Cloud technologies in
     informational and methodological support of university students’ independent study, Information
     Technologies and Learning Tools 77 (2020) 223–239. URL:
     https://journal.iitta.gov.ua/index.php/itlt/article/view/2941. doi:10.33407/itlt.v77i3.2941.
[20] H. R. Varian, Nel 2020 il data analyst sarà la professione più ricercata, 2017. URL:
     https://www.giornaledibrescia.it/rubriche/impresa-4-0/nel-2020-il-data-analyst-sar%C3%A0-la-professione-pi%C3%B9-ricercata-1.3182021.
[21] A. V. Kaminskaya, Forming of readiness of future teachers to innovative activity in higher
     educational establishment, Scientific Bulletin of Donbass (2011). URL: http://nvd.luguniv.edu.ua/
     archiv/NN13/11kavvnz.pdf.
[22] A. M. Striuk, S. O. Semerikov, Professional competencies of future software engineers in the
     software design: teaching techniques, Journal of Physics: Conference Series 2288 (2022) 012012.
     doi:10.1088/1742-6596/2288/1/012012.
[23] M. M. Fitsula, Pedagogy, 2000.
[24] E. S. Rapatsevych, Psychological and pedagogical dictionary, Minsk, 2006.
[25] V. H. Kazakov, New times - new technologies of professional training, Professional education
     (2006) 12.
[26] S. Tishkovskaya, G. A. Lancaster, Statistical Education in the 21st Century: A Review of Challenges,
     Teaching Innovations and Strategies for Reform, Journal of Statistics Education 20 (2012) 4.
     doi:10.1080/10691898.2012.11889641.
[27] L. Kulinenko, Technologies of innovative educational space, Naukovyi chasopys Natsionalnoho
     pedahohichnoho universytetu imeni M. P. Drahomanova. Seriia 07. Relihiieznavstvo. Kulturolohiia.
     Filosofiia (2013). URL: http://enpuir.npu.edu.ua/handle/123456789/12492.
[28] M. Pavlenko, L. Pavlenko, Formation of communication and teamwork skills of future IT-specialists
     using project technology, Journal of Physics: Conference Series 1840 (2021) 012031.
     doi:10.1088/1742-6596/1840/1/012031.
[29] E. P. Il’in, Human motives: theory and methods of study, High school, 1998. URL: https://www.
     elibrary.ru/item.asp?id=21748410.

