Computational modelling of stochastic processes for
learning research
Oleksandr H. Kolgatin1 , Larisa S. Kolgatina2 and Nadiia S. Ponomareva3,4
1 Simon Kuznets Kharkiv National University of Economics, 9A Science Ave., Kharkiv, 61166, Ukraine
2 H. S. Skovoroda Kharkiv National Pedagogical University, 29 Alchevskyh Str., Kharkiv, 61002, Ukraine
3 Kryvyi Rih State Pedagogical University, 54 Gagarin Ave., Kryvyi Rih, 50086, Ukraine
4 Kharkiv University of Technology “STEP”, 9/11 Malomyasnytska Str., Kharkiv, 61000, Ukraine


Abstract
The objective of our work was to use computer-based statistical modelling to compare and systematise
various approaches to non-parametric null hypothesis significance testing. A statistical model for
simulating null hypothesis significance testing has been built for educational purposes. Fisher’s
angular transformation, Chi-square, Mann-Whitney and Fisher’s exact tests were analysed. Appropriate
software has been developed; it enabled us to propose new illustrative materials describing the
limitations of the analysed tests. Learning research is suggested as a method of understanding
inductive statistics, taking into account that modern personal computers run such simulations with
high precision in acceptable time. The obtained results showed the low power of the most popular
non-parametric tests for small samples. Students cannot analyse the test power in traditional null
hypothesis significance testing, because the real differences between samples are unknown. Therefore,
it is necessary to shift the emphasis in Ukrainian statistical education, including PhD studies, from
null hypothesis significance testing to statistical modelling as a modern and effective method of
proving scientific hypotheses. These conclusions are consistent with the reviewed scientific
publications and the recommendations of the American Statistical Association.

Keywords
computational modelling, computer-based simulation, statistical hypothesis significance testing, education, learning research




1. Introduction
1.1. Statement of the problem
Computational modelling and the use of computer-based models for simulation are an essential
part of educational content and methodology. On the one hand, computer-based simulation

CoSinE 2021: 9th Illia O. Teplytskyi Workshop on Computer Simulation in Education,
co-located with the 17th International Conference on ICT in Education, Research, and Industrial Applications:
Integration, Harmonization, and Knowledge Transfer (ICTERI 2021), October 1, 2021, Kherson, Ukraine
kolgatin@ukr.net (O. H. Kolgatin); LaraKL@ukr.net (L. S. Kolgatina); ponomareva.itstep@gmail.com (N. S. Ponomareva)
http://www.is.hneu.edu.ua/?q=node/294 (O. H. Kolgatin); http://hnpu.edu.ua/uk/kolgatina-larysa-sergiyivna (L. S. Kolgatina); https://tinyurl.com/5xc89ntp (N. S. Ponomareva)
ORCID: 0000-0001-8423-2359 (O. H. Kolgatin); 0000-0003-2650-8921 (L. S. Kolgatina); 0000-0001-9840-7287 (N. S. Ponomareva)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).



has become one of the main methods of pedagogical research, offering new facilities for forecasting
and proving the efficiency of new pedagogical technologies. On the other hand, computational
modelling and computer-based simulations carried out by students give them new experience
with difficult aspects of educational content and develop their competences in independent work
and research [1]. Thus, students’ learning research with computational simulations
has been developed as a method for improving students’ self-management through creative
learning activity [2, 3]. Semerikov et al. [4, 5] suggested computer simulation of
neural networks using spreadsheets. These results make it possible to introduce this modern
technology into the educational process of a wide range of educational programmes that are not
directly connected with computer science. Elements of a technique for using CoCalc to study the topic
“Neural network and pattern recognition” in the special course “Foundations of Mathematic
Informatics” are shown in the work of Markova et al. [6]. The method of computational
simulation and modelling is supported by the work of Modlo and Semerikov [7], where new tools
for modelling electromechanical technical objects in a cloud-based learning environment were
suggested. Khazina et al. [8] also considered computer modelling as a scientific means of
training. So we can conclude that computational modelling and simulation is a popular and
relevant learning method. This field of research is very interesting for our present study, because
it promotes the development of computational modelling as a method of learning research.
   One such field of pedagogical investigation, where new information technologies can
provide a new level of understanding of the modelled processes, is the comparative pedagogical
experiment and statistical hypothesis testing as a part of it. Computer-based statistical analysis
has become a major part of monitoring the quality of learning resources [9]. In our work, we
consider statistical processing of the results of a pedagogical experiment as one aspect of pedagogical
research. Traditionally this problem is solved by methods of mathematical statistics on the
basis of statistical hypothesis testing. Two hypotheses are put forward: the null hypothesis,
which states that there are no differences between the compared random variables in the studied
parameter, and the alternative hypothesis, which argues that the observed differences are caused
by the studied impact. A researcher uses some criterion that integrates the observed differences
in numeric form and calculates the probability of obtaining the same or larger differences in
a random process in order to accept one of these hypotheses. The number of participants in pedagogical
studies is usually small, so we accept the alternative hypothesis if the probability of the type I
error (the probability that the observed differences are due to random factors) does not exceed
5 %. Understanding the essence of statistical hypothesis testing is a hard problem. Thus, Sotos
et al. [10] pointed out common misconceptions about statistical inference. They noted that, in
response to the persistence of these misconceptions, educational researchers and practitioners
have initiated and promoted a thorough reform of the teaching of statistics. One direction of
this reform was the integration of technology into the statistics classroom, using
simulations to help students understand the ideas behind statistical processes [10]. The problems of
using statistical hypothesis tests are so deep that discussions continue even now, more than
a hundred years after this approach was introduced into science. For example, scientists
discuss the problem of dichotomisation of p-values, because it makes matters worse
(Wasserstein et al. [11]), and suggest describing the data using other approaches (McShane et al.
[12]). The use of computer-based modelling provides a new look at the system of inductive
methods of statistics, makes it possible to highlight the most powerful methods and to determine



the limits of their applicability, which is particularly important in psychological and pedagogical
studies, where samples are small [13].
   However, the practice of statistical data analysis in Ukrainian pedagogical research
is grounded in the traditional approach. So we need to show the educational community modern
computer-based techniques for data analysis that are based on simulation of stochastic
processes, and to compare these techniques with the traditional criteria for null
hypothesis significance testing. This work is devoted to simulating the use of popular classical
criteria of statistical hypothesis testing: Pearson’s Chi-square criterion, Fisher’s angular
transformation and the Mann-Whitney U criterion. Information and communication technologies offer
new perspectives for analysing the boundaries of applicability of these tests, investigating
their sensitivity, and developing approaches to statistical analysis for small samples.
Learning research with appropriate models will be useful not only for students, but also for
researchers, to improve understanding of the essence of statistical data processing in pedagogical
research.

1.2. Analysis of previous research
Recently, researchers have paid great attention to statistical modelling as an alternative approach
to proving research hypotheses. Computer-based simulation has made it possible to show
the boundaries of using Pearson’s Chi-square criterion in null hypothesis significance testing
(Kolgatin [13]). A computational model for investigating the efficiency of statistical hypothesis
testing was proposed. This model did not use any assumptions about the probability distribution
and test features, so it could be used to compare methods built on different principles.
It was shown that the Chi-square test and Fisher’s angular transformation test in the studied
range of sample sizes (from 9 to 200) do not provide good accuracy for frequency tables with 2
categories. The idea of these tests is to guarantee that the error of the first type (type I error)
occurs in 5 % of cases (a 5 % significance level was used) if the null hypothesis is true.
The real value of the type I error varied essentially within the interval from 0.04 to 0.08 instead
of 0.05. The accuracy of the type I error estimation is better (within the interval from 0.04 to 0.06) if
the sample sizes are not less than 70. The accuracy of the type I error estimation by the Chi-square
test for frequency tables with 3 categories is better even for very small sample sizes. This
accuracy essentially depends on the number of measures in the samples and is worse when one of
the samples is small and the other one is large. Therefore, some recommendations for combining
categories to use the Chi-square test for small sample sizes are debatable. Another result
obtained in [13] concerns the Chi-square test power, its ability to show differences between
distributions. The type II error was quite high; it decreased with increasing sample sizes and
when 3 categories were used in the frequency tables instead of 2. This question is
discussed later in this paper in detail, but we can conclude here that these results correlate with
the statement of the American Statistical Association (ASA) about the limitations of p-values
(Wasserstein and Lazar [14]).
   Statistical modelling as a powerful alternative to null hypothesis significance testing was
described by Lang et al. [15]. They noted that statistical modelling is a more complicated
approach than null hypothesis significance testing, but this added complexity affords researchers
the opportunity to quantify evidence in support of specific substantive hypotheses relative to



competing hypotheses — not simply against a null hypothesis [15]. The authors underlined that
the purpose of statistical modelling is to represent, as accurately and completely as possible,
a data generation process, with the goal of understanding and gathering evidence about its
structure [15]. These authors suggested and compared Bayesian and “frequentist” models for
exploring how child temperament mediates the relationship between age and developmental
progress in communication and motor skills [15].
   Statistical modelling as an educational tool is analysed in many scientific works. A good
review of the corresponding literature was given by Jamie [16]. The main idea of the
authors is to use computer simulation methods (CSMs) for the purpose of clarifying abstract
and difficult concepts and theorems of statistics. Some systems of computer mathematics
and spreadsheets are considered: SAS PROC IML, Excel, MINITAB, SAS, SPSS. Approaches to
teaching and illustrating the following parts of statistical education were analysed: the central
limit theorem, Student’s t-distribution, confidence intervals, the binomial distribution, regression
analysis, sampling distributions, survey sampling [16].
   Many of the models for simulating statistical hypothesis testing for educational purposes
were suggested at the end of the last century. Flusser and Hanna [17] used BASIC computer
programs to simulate a binomial experiment and test a simple statistical hypothesis. Taylor and
Bosch [18] suggested an interactive clinical trial simulation program that performs a few thousand
simulations in about 5 minutes. Bradley et al. [19] developed a comprehensive simulation
laboratory for statistics that could work with real experimental data from a database and generate
samples according to given parameters. This software calculated the p-value according to the F-test.
Students could see that the decisions about the null hypothesis differ for various series and
analyse the Type I and Type II errors. Ricketts and Berry [20] used statistical modelling in the
package Resampling Stats to demonstrate a histogram of differences between means. These
results, obtained for very small samples, helped students to understand the essence of null
hypothesis significance testing without any formulas.
   This software made it possible to demonstrate the Type I and Type II errors to students, but
did not provide enough performance for analysing the qualities of the criteria used. Therefore, it
is relevant to develop a computer-based model for comparing various approaches to null
hypothesis significance testing and analysing the boundaries of their use. Such a model will be useful
not only for understanding the essence of null hypothesis significance testing; it will also be
useful for understanding the limitations of the traditional null hypothesis significance testing
approach and will motivate pedagogical scientists towards computational modelling as a promising
method of statistical data analysis.

1.3. Objectives
We started this work in 2014 with the objective of using computer-based statistical modelling
for comparison and systematisation of various approaches to non-parametric null hypothesis
significance testing. The information accessible to Ukrainian students in textbooks and
handbooks was contradictory and insufficient for a confident and reasonable choice of statistical
method for data analysis in pedagogical research. We have tried to develop a statistical model
for learning research on null hypothesis significance testing by university and postgraduate
students.



   Now, however, we are finishing this work with the objective of demonstrating the advantage of
statistical modelling over null hypothesis significance testing. We ground this on our own simulations,
the rapid development of information and communication technologies, and the newest publications
in the statistical scientific literature. The aim of this research is to show the limitations of classical
null hypothesis significance testing and to motivate students and researchers towards computational
modelling as an effective method of proving research hypotheses.
   This change in the objectives of our study leads to some inconsistency in this paper and
deprives us of the opportunity to introduce this analysis directly into the educational process, because
the programme of statistical education should be revised taking the obtained results into account. So
we offer our results to statistical educators as a matter for critical thinking and for developing
educational programmes.


2. Theoretical framework
Statistical modelling of various criteria for null hypothesis significance testing requires implementing
procedures for these criteria. Moreover, some criteria, such as Pearson’s Chi-square and
Fisher’s angular transformation, need data in the form of a frequency table. Our model
generates the samples on a metric scale, so the data should be collapsed into intervals to
obtain the frequency table.
   Pearson’s Chi-square criterion was used in the form:
$$
\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} \frac{(E_{i,j} - T_{i,j})^2}{T_{i,j}}, \tag{1}
$$
where $E_{i,j}$, $T_{i,j}$ are the empirical and theoretical frequencies; $i$, $m$ are the index and the number of
categories; $j$, $k$ are the index and the number of samples ($k = 2$ in this study).
   The form of the Chi-square criterion with Yates’s correction for continuity was analysed by
D’Agostino et al. [21] and Kolgatin [22] and was not used here. All studies in this work were
carried out at a significance level of 5 %; the critical values of the Chi-square criterion were taken
from Verma [23].
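   The statistic (1) is straightforward to compute. Below is a minimal Python sketch (our illustrative code, not the software described later in this paper); it assumes, as is usual for a homogeneity test on two samples, that the theoretical frequencies are computed from the marginal totals of the table:

```python
def chi_square(E):
    """Compute the statistic (1) for a frequency table E with m rows
    (categories) and k columns (samples)."""
    m, k = len(E), len(E[0])
    n = sum(sum(row) for row in E)                      # total count
    row_sums = [sum(row) for row in E]                  # per-category totals
    col_sums = [sum(E[i][j] for i in range(m)) for j in range(k)]
    chi2 = 0.0
    for i in range(m):
        for j in range(k):
            T = row_sums[i] * col_sums[j] / n           # theoretical frequency
            chi2 += (E[i][j] - T) ** 2 / T
    return chi2
```

For a 2×2 table the result is compared with the critical value 3.841 (the Chi-square distribution with 1 degree of freedom at the 5 % level).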
   The criterion of Fisher’s angular transformation was used in the two-tailed form:
$$
\varphi^* = 2 \cdot \left| \arcsin \sqrt{\frac{E_{1,1}}{n_1}} - \arcsin \sqrt{\frac{E_{1,2}}{n_2}} \right| \cdot \sqrt{\frac{n_1 \cdot n_2}{n_1 + n_2}}, \tag{2}
$$

where $E_{1,1}$ and $E_{1,2}$ are the frequencies in one of the categories for samples 1 and 2, and $n_1$ and
$n_2$ are the sizes of samples 1 and 2 [24]. The critical value of this criterion was assumed to be 1.96 at a
significance level of 5 % (two-tailed). Mostly, Fisher’s angular transformation is used as a one-sided test
with the critical value 1.64 [24]; the test power is higher in that case [13]. We used the two-sided test in
this work to allow a correct comparison with Pearson’s Chi-square test, which has no one-sided form.
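   Formula (2) translates directly into code. A minimal sketch (the function name and arguments are ours):

```python
import math

def fisher_phi_star(e11, n1, e12, n2):
    """Compute the two-tailed statistic (2) from the frequencies e11, e12
    of one category in samples of sizes n1 and n2."""
    phi1 = math.asin(math.sqrt(e11 / n1))   # angular transformation of p1
    phi2 = math.asin(math.sqrt(e12 / n2))   # angular transformation of p2
    return 2 * abs(phi1 - phi2) * math.sqrt(n1 * n2 / (n1 + n2))
```

The result is compared with 1.96 for the two-tailed test at the 5 % significance level.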
   The words “exact test” are magical for some students and even researchers. Fisher’s exact
test for consistency in a 2×2 table was analysed by D’Agostino et al. [21], Berkson [25], Liddell
[26] and others. Their results were pessimistic. All these researchers believed that this test is called
exact only because it does not use any approximations. The theoretical basis of this test is not exact,
so let us try



our simulations to understand and illustrate the problem. We used the two-sided form of
Fisher’s exact test, which gives us the p-value (the probability of the Type I error) [27, 28].
The probability of a given combination of observed frequencies can be calculated with the formula:

$$
p^* = \frac{n_1! \, n_2! \, n_a! \, n_b!}{a_1! \, b_1! \, a_2! \, b_2! \, n!}, \tag{3}
$$
where $a_1$, $b_1$, $a_2$, $b_2$ are the observed frequencies of samples $A$ and $B$ in categories 1 and 2
accordingly; $n = a_1 + a_2 + b_1 + b_2$ is the total number of measures; $n_1 = a_1 + b_1$ is the number of
measures in category 1; $n_2 = a_2 + b_2$ is the number of measures in category 2; $n_a = a_1 + a_2$ is
the size of sample $A$; $n_b = b_1 + b_2$ is the size of sample $B$.
   The probability of the random realisation of the given combination together with all other less
probable combinations is
$$
p = p^* + \sum_{\forall p_i < p^*, \; i \in [0; n_a]} p_i, \tag{4}
$$
where
$$
p_i = \frac{n_1! \, n_2! \, n_a! \, n_b!}{i! \, (n_1 - i)! \, (n_a - i)! \, (n_b - n_1 + i)! \, n!}. \tag{5}
$$
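   Formulas (3)-(5) can be evaluated without computing large factorials explicitly: each term is algebraically equal to a hypergeometric probability expressed through binomial coefficients. A minimal sketch (our illustrative code):

```python
from math import comb

def fisher_exact_p(a1, b1, a2, b2):
    """Two-sided Fisher exact p-value for a 2x2 table, following (3)-(5)."""
    n1, n2 = a1 + b1, a2 + b2            # category totals
    na, nb = a1 + a2, b1 + b2            # sample sizes
    n = na + nb

    def p_table(i):
        # probability of the table with a1 = i: the hypergeometric form of (5)
        return comb(n1, i) * comb(n2, na - i) / comb(n, na)

    p_star = p_table(a1)                                 # formula (3)
    lo, hi = max(0, na - n2), min(na, n1)                # valid values of i
    return p_star + sum(p_table(i) for i in range(lo, hi + 1)
                        if p_table(i) < p_star)          # formula (4)
```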
   The Mann-Whitney test and its modifications are currently a field of researchers’ attention, and
statistical modelling is the main method of comparing the efficiency of its various modifications
[29, 30]. The assumptions of this group of tests were analysed by Fay and Proschan [31].
   The Mann-Whitney test was used in our work, based on the research of Sidorenko [32], Gubler
and Genkin [24] and Billiet [33], in the form

$$
U = \min(U_a, U_b), \tag{6}
$$
where
$$
U_a = n_a n_b + \frac{n_a (n_a + 1)}{2} - T_a, \tag{7}
$$
$$
U_b = n_a n_b - U_a, \tag{8}
$$
where $n_a$ and $n_b$ are the sizes of samples $A$ and $B$ accordingly, and $T_a$ is the sum of the ranks in
sample $A$. The calculated values of the Mann-Whitney criterion were compared with its critical values
according to the table when both $n_a$ and $n_b$ were not greater than 30 [33]. The Z-test for the U
criterion was used in cases where at least one of the sample sizes was greater than 30 [33]:
$$
Z = \frac{U - \frac{1}{2} n_a n_b}{\sqrt{\dfrac{n_a n_b (n_a + n_b + 1)}{12}}}. \tag{9}
$$
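   Formulas (6)-(9) are also easy to implement. The sketch below (our illustrative code) computes $T_a$ with mid-ranks for tied values and applies the normal approximation (9) for large samples:

```python
from math import sqrt

def mann_whitney_u(sample_a, sample_b):
    """Compute U = min(Ua, Ub) according to formulas (6)-(8)."""
    na, nb = len(sample_a), len(sample_b)
    pooled = sorted([(x, 0) for x in sample_a] + [(x, 1) for x in sample_b])
    t_a, i = 0.0, 0
    while i < len(pooled):                   # assign mid-ranks to tied values
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2           # average of ranks i+1 .. j
        t_a += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = na * nb + na * (na + 1) / 2 - t_a  # formula (7)
    return min(u_a, na * nb - u_a)           # formulas (6) and (8)

def mann_whitney_z(u, na, nb):
    """Normal approximation (9) for large samples."""
    return (u - na * nb / 2) / sqrt(na * nb * (na + nb + 1) / 12)
```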


3. Statistical model
The method of statistical modelling was used for the investigation. The model allows forming
two samples either from one population or from different populations that have some differences
in their probability distributions.



   The first regime was used for the Type I error investigation. Two series of numbers were created
on the basis of the same random number generator. The values obtained were distributed into $m$
categories; we could control the distribution to ensure a uniform distribution or the predominance
of frequencies in certain categories. An empirical value of the criterion was calculated for the obtained
frequency tables and compared with the critical value of this criterion at the specified level of
significance, and a decision about the possibility of rejecting the null hypothesis was made. We
knew that the null hypothesis was actually true, because both samples (series of numbers) were
generated with one random number generator. But the alternative hypothesis was accepted in
some of the tests as a result of random factors. The relative frequency of such false decisions
was estimated as the probability of a type I error and should correspond to the significance level
that was used to choose the critical value of a criterion.
   We needed a large number of trials to obtain a satisfactory precision of the analysis: 1,000,000
trials were conducted in the computational experiments for each case. The precision of the obtained
values of the probability of a type I error was estimated on the basis of the standard deviation in
consecutive identical trials. The estimated absolute error was about 0.0005 for the 95 % confidence
interval. So all numbers are shown with 2-3 significant digits, and the last digit in all shown results
is a spare guard digit, needed for further data processing. The number of trials can be smaller in
students’ investigations, to save computational time, when the power of the tests is analysed.
In some of the trials with very small samples, zero frequencies were obtained in some categories,
and it was not possible to calculate the value of a criterion. These results were removed from the
analysis, and if their share in the total number of trials exceeded 1 %, the study was not conducted
under the corresponding conditions.
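   To make the first regime concrete, the following sketch estimates the Type I error rate of the Chi-square test for 2 categories. The uniform generator, the collapsing threshold of 0.5 and the critical value 3.841 (Chi-square with 1 degree of freedom at the 5 % level) are assumptions chosen for this illustration; chi_square() is the sketch from section 2:

```python
import random

def type_one_error_rate(n_a, n_b, trials=100_000, chi2_crit=3.841):
    """Estimate the Type I error: both samples come from the same
    uniform generator, so every rejection is a false positive."""
    rejections = valid = 0
    for _ in range(trials):
        a = [random.random() for _ in range(n_a)]
        b = [random.random() for _ in range(n_b)]
        # collapse the metric data into 2 categories at the threshold 0.5
        e = [[sum(x < 0.5 for x in a), sum(x < 0.5 for x in b)],
             [sum(x >= 0.5 for x in a), sum(x >= 0.5 for x in b)]]
        if e[0][0] + e[0][1] == 0 or e[1][0] + e[1][1] == 0:
            continue              # criterion undefined for an empty category
        valid += 1
        if chi_square(e) > chi2_crit:
            rejections += 1
    return rejections / valid
```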
   The second regime was used for investigating the power of the tests. Two different random
number generators were used to analyse the criteria sensitivity. In this case we knew that
the alternative hypothesis was actually true, because the samples (series of numbers) were generated
with different random number generators. We could control the level of variation. The relative
frequency of true positive decisions corresponds to the criterion power, which is determined by the
level of differences between the parameters of the random number generators used for the
samples.
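   A sketch of the second regime differs only in the generators: the samples now come from two shifted uniform distributions, mirroring the ranges used for figures 3-5, so the rejection rate estimates the power (the shift value, threshold and critical value are again our illustrative assumptions):

```python
def power_estimate(n_a, n_b, shift=0.1, trials=100_000, chi2_crit=3.841):
    """Estimate the power: the samples come from uniform generators on
    [-shift, 1 - shift] and [shift, 1 + shift], so the null hypothesis
    is known to be false and every rejection is a true positive.
    Uses random and chi_square() from the sketches above."""
    rejections = valid = 0
    for _ in range(trials):
        a = [random.uniform(-shift, 1 - shift) for _ in range(n_a)]
        b = [random.uniform(shift, 1 + shift) for _ in range(n_b)]
        e = [[sum(x < 0.5 for x in a), sum(x < 0.5 for x in b)],
             [sum(x >= 0.5 for x in a), sum(x >= 0.5 for x in b)]]
        if e[0][0] + e[0][1] == 0 or e[1][0] + e[1][1] == 0:
            continue
        valid += 1
        if chi_square(e) > chi2_crit:
            rejections += 1
    return rejections / valid
```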


4. Learning research
4.1. Motivation with leading questions
One method of motivating students for learning research is to offer them leading
questions after a brief theoretical review [2]. Such questions attract students’ attention to
the most problematic and debatable issues of the educational content. The limitations of a
theory, accuracy estimation and possible practical problems in specific cases are always a
challenge not only for students, but also for professional researchers. Concerning statistical
education in null hypothesis significance testing, we would suggest to students the following
leading questions:

    • Is the Type I error fixed in null hypothesis significance testing?
    • What factors affect the power of null hypothesis significance testing?



Table 1
Type I error when the null hypothesis is true (an example for equal sample sizes)

                          Frequency of null hypothesis rejection using the test, %
 Size of    Size of    Pearson Chi-square test for    Fisher’s angular   Fisher’s     Mann-Whitney
 sample A   sample B   the number of categories       transformation     exact test   test
                          2        3        5
 4          4            7.07     –        –              –                0.79         –
 5          5            6.05     –        –              –                2.17         3.18
 6          6            5.79     –        –              –                0.63         4.11
 7          7            5.71     –        –              –                1.29         3.81
 8          8            7.64     –        –              –                2.09         4.96
 9          9            4.99     –        –              9.02             2.10         4.01
 10         10           4.21     4.83     –              4.84             1.26         4.03
                  ... data recording continues with the step given by the teacher ...
 198        198          5.01     5.02     5.02           5.01             3.95         5.00
 199        199          5.07     5.04     5.01           5.07             4.00         4.98
 200        200          5.10     5.00     5.04           5.01             4.02         4.96


    • Should we collapse metric scale data into intervals?
    • Which tests should we use for small samples?
    • What can we know about the test power when implementing null hypothesis significance
      testing in practice?
    • Can we prove that the null hypothesis is true?

  Students find the answers to these leading questions during independent work according to
the plan of the learning research. The answers are not predetermined, so the process of drawing
conclusions is creative for students.
  Students should be equipped with clear instructions for the research steps and data collection.
Also, some templates for conclusions should be prepared. Below we show examples of tables
and diagrams as possible results of the investigations. The level of detail of the instructional
materials is determined by students’ readiness for independent work.

4.2. Learning research on the performance of the non-parametric criteria with a
     true null hypothesis
Now we address students directly. You know some recommendations about the limitations of
some null hypothesis significance tests. But how important is each of these limitations? What
error can occur in each practical case of using a test? Textbooks do not give us detailed
information. We can find the answers in professional research papers, but this source of
information may not be so easy to use, and perhaps some practical questions have not been
analysed in scientific works yet. So we need to master the method of statistical modelling to
explore specific practical problems. Study the statistical model above and use it to test the
non-parametric criteria performance with a true null hypothesis. This model applies Pearson’s
Chi-square, Fisher’s angular transformation, Fisher’s exact test for consistency in a 2×2 table and
the Mann-Whitney test to testing the null hypothesis for 2 samples from some given probability
distribution. Both samples



Table 2
Conclusions on the accuracy of the Type I error estimation with the analysed tests

Templates
 • Accuracy of the Type I error estimation with the analysed tests was [better / worse] when
   processing the data organised in 2 categories.
 • Accuracy of the Type I error estimation with the analysed tests [improved / worsened] with
   increasing sizes of samples.
 • The reason for the observed periodic behaviour of the Type I error estimation when using Fisher’s
   exact test for 2×2 frequency tables (2 categories) is [the discrete nature of the base model of the
   criterion / the low accuracy of the statistical simulations].
 • The reason for the observed periodic behaviour of the Type I error estimation when using Fisher’s
   angular transformation and Pearson’s Chi-square test for 2×2 frequency tables (2 categories) is
   [the approximation error and the discrete nature of the criteria / the low accuracy of the statistical
   simulations].
 • Fisher’s exact test was [less conservative / more conservative] than Fisher’s angular transformation
   and Pearson’s Chi-square test for 2×2 frequency tables (2 categories).
 • Collapsing the data of small samples into a smaller number of categories [led / did not lead] to
   improving the Type I error estimation.
 • The observed behaviour [will be the same / can differ] in simulations with another probability
   distribution in the population and other sizes of samples.




Figure 1: Accuracy of the Type I error estimation in Fisher’s angular transformation, Chi-square and
Fisher’s exact tests for 2 categories, for samples of equal sizes $n_a = n_b = 4 \ldots 200$.


are random samples from the same population. So the ideal test should reject the null
hypothesis in 5 % of cases (Type I error at the 5 % significance level). Try your simulations for the
given probability distribution in the population according to your individual variant and fill in table 1.
It will be useful to create this table in a spreadsheet by copying the data from the


Table 3
Conclusions on the power of the analysed tests with the given data

Templates
 • Collapsing the data of small samples into a smaller number of categories [led / did not lead] to
   improving the power of null hypothesis significance testing.
 • Power of the analysed tests [improved / worsened] asymptotically with increasing sizes of samples.
 • To keep the Type II error below 5 % (power greater than 95 %) when using the Chi-square test for
   2×2 frequency tables (2 categories), we needed sample sizes $n_a$ = _____ and $n_b$ = _____.
 • To keep the Type II error below 5 % (power greater than 95 %) when using the Chi-square test for
   3×2 frequency tables (3 categories), we needed sample sizes $n_a$ = _____ and $n_b$ = _____.
 • To keep the Type II error below 5 % (power greater than 95 %) when using the Chi-square test for
   5×2 frequency tables (5 categories), we needed sample sizes $n_a$ = _____ and $n_b$ = _____.
 • To keep the Type II error below 5 % (power greater than 95 %) when using Fisher’s exact test for
   2×2 frequency tables (2 categories), we needed sample sizes $n_a$ = _____ and $n_b$ = _____.
 • To keep the Type II error below 5 % (power greater than 95 %) when using Fisher’s angular
   transformation test for 2×2 frequency tables (2 categories), we needed sample sizes $n_a$ = _____
   and $n_b$ = _____.
 • To keep the Type II error below 5 % (power greater than 95 %) when using the Mann-Whitney test,
   we needed sample sizes $n_a$ = _____ and $n_b$ = _____.
 • The analysed tests can be ranked according to their power in the following order:
     1. (the most powerful) _____________________;
     2. _____________________;
     3. _____________________;
     4. _____________________;
     5. _____________________;
     6. _____________________
 • The power of some of the analysed tests can be improved in the case of one-sided hypothesis
   testing:
     1. _____________________;
     2. _____________________
 • The observed behaviour [will be the same / can differ] in simulations with another probability
   distribution in the population and other sizes of samples.

output of the software used. Draw diagrams according to the obtained data (see the examples in
figure 1 and figure 2). Analyse your results and draw conclusions according to the templates in
table 2.
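   A possible driver loop for filling in table 1, assuming the type_one_error_rate() sketch from section 3 (the sample sizes and the number of trials here are illustrative; take the actual step and precision from your teacher’s instructions):

```python
if __name__ == "__main__":
    # estimate the Type I error for a range of equal sample sizes
    for n in [4, 5, 6, 7, 8, 9, 10, 50, 100, 200]:
        rate = type_one_error_rate(n, n, trials=100_000)
        print(f"n_a = n_b = {n:3d}: null hypothesis rejected "
              f"in {100 * rate:5.2f} % of trials")
```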

4.3. Learning research on the power of the non-parametric criteria
We continue to address students. Let us analyse the power of the tests. You remember that,
according to the rules, we reject the null hypothesis if the test allows us to. But we never say
that we accept the null hypothesis; we should say that we cannot reject it. Now we should
understand the reason for this rule. Use a null hypothesis significance test for two samples
with different probability distributions obtained by different random generators with




Figure 2: Accuracy of the Type I error estimation in the Mann-Whitney and Chi-square tests for samples
of equal sizes $n_a = n_b = 5 \ldots 200$.




Figure 3: Power of Fisher’s angular transformation, Chi-square, Mann-Whitney and Fisher’s exact
tests for samples of equal sizes with uniform probability distributions in the ranges [−0.05; 0.95] and
[0.05; 1.05].




Figure 4: Power of Fisher’s angular transformation, Chi-square, Mann-Whitney and Fisher’s exact tests
for samples of equal sizes with uniform probability distributions in the ranges [−0.1; 0.9] and [0.1; 1.1].


given different parameters (according to your individual variant). Store the results in the form of
table 1. The form is the same, but now we know that the null hypothesis is false, so the data in
the table will show the power of the tests. Organise your data using diagrams and show the
theoretical probability distribution of your samples, using the information about your random
generator. We used uniform random generators with different means to obtain the
examples (see figures 3, 4 and 5). Analyse your results and draw conclusions according
to the templates in table 3.
   As we can see, the main problem in using null hypothesis significance testing is the unknown
power of the tests in practical tasks. The power of the tests will be low if the null hypothesis is false
but the differences between the compared populations are small. The tests give us no mechanism to
estimate the power in such cases. So statistical modelling is a more appropriate method of data
analysis, because the model makes it possible to estimate the data distribution and confidence
intervals.


5. Conclusions
A statistical model for simulating null hypothesis significance testing has been built. Fisher’s
angular transformation, Chi-square, Mann-Whitney and Fisher’s exact tests were analysed.
Appropriate software has been developed; it enabled us to propose new illustrative materials
describing the limitations of the analysed tests.
   Learning research in inductive statistics has been suggested on the basis of statistical
modelling. These didactic materials can be useful for master’s and PhD students in pedagogics.



Figure 5: Power of Fisher’s angular transformation, Chi-square, Mann-Whitney and Fisher’s exact
tests for samples of equal sizes with uniform probability distributions in the ranges [−0.15; 0.85] and
[0.15; 1.15].


The suggested methods contain new views on the use of null hypothesis significance testing. We
stress that collapsing data into a smaller number of categories decreases the efficiency of the tests
and gives no advantage in the accuracy with which the significance level is provided.
   We suggest shifting the emphasis in Ukrainian statistical education, including PhD studies,
from null hypothesis significance testing to statistical modelling as a modern and effective
method of proving scientific hypotheses. We ground this on the results of our simulations presented
in this paper, the possibilities of modern information and communication technologies, the literature
review and the opinion of the American Statistical Association.
   The field of further research is the development of courseware for teaching inductive
statistics based on statistical modelling. The study of null hypothesis significance tests should
be considered an auxiliary, simplified method.


References
 [1] O. H. Kolgatin, L. S. Kolgatina, N. S. Ponomareva, E. O. Shmeltser, A. D. Uchitel, Sys-
     tematicity of students’ independent work in cloud learning environment of the course
     “Educational Electronic Resources for Primary School” for the future teachers of primary
     schools, in: S. Semerikov, V. Osadchyi, O. Kuzminska (Eds.), Proceedings of the Sym-
     posium on Advances in Educational Technology, AET 2020, University of Educational
     Management, SciTePress, Kyiv, 2022.



 [2] L. I. Bilousova, L. S. Kolgatina, O. H. Kolgatin, Computer simulation as a method of
     learning research in computational mathematics, CEUR Workshop Proceedings 2393
     (2019) 880–894.
 [3] L. I. Bilousova, O. H. Kolgatin, L. S. Kolgatina, O. H. Kuzminska, Introspection as a
     condition of students’ self-management in programming training, in: S. Semerikov, V. Os-
     adchyi, O. Kuzminska (Eds.), Proceedings of the Symposium on Advances in Educational
     Technology, AET 2020, University of Educational Management, SciTePress, Kyiv, 2022.
 [4] S. O. Semerikov, I. O. Teplytskyi, Y. V. Yechkalo, A. E. Kiv, Computer Simulation of
     Neural Networks Using Spreadsheets: The Dawn of the Age of Camelot, CEUR Workshop
     Proceedings 2257 (2018) 122–147.
 [5] S. O. Semerikov, I. O. Teplytskyi, Y. V. Yechkalo, O. M. Markova, V. N. Soloviev, Computer
     Simulation of Neural Networks Using Spreadsheets: Dr. Anderson, Welcome Back, CEUR
     Workshop Proceedings 2393 (2019) 833–848.
 [6] O. Markova, S. Semerikov, M. Popel, CoCalc as a learning tool for neural network simu-
     lation in the special course “Foundations of mathematic informatics”, CEUR Workshop
     Proceedings 2104 (2018) 388–403.
 [7] Y. O. Modlo, S. O. Semerikov, Xcos on Web as a promising learning tool for Bachelor’s of
     Electromechanics modeling of technical objects, CEUR Workshop Proceedings 2168 (2018)
     34–41.
 [8] S. A. Khazina, Y. S. Ramskyi, B. S. Eylon, Computer modeling as a scientific means of
     training prospective physics teachers, in: EDULEARN16 Proceedings, 8th International
     Conference on Education and New Learning Technologies, IATED, 2016, pp. 7699–7709.
     doi:10.21125/edulearn.2016.0694.
 [9] H. M. Kravtsov, Methods and technologies for the quality monitoring of electronic educa-
     tional resources, CEUR Workshop Proceedings 1356 (2015) 311–325.
[10] A. E. C. Sotos, S. Vanhoof, W. V. den Noortgate, P. Onghena, How confident are students
     in their misconceptions about hypothesis tests?, Journal of Statistics Education 17 (2009).
     doi:10.1080/10691898.2009.11889514.
[11] R. L. Wasserstein, A. L. Schirm, N. A. Lazar, Moving to a World Beyond “p < 0.05”, The
     American Statistician 73 (2019) 1–19. doi:10.1080/00031305.2019.1583913.
[12] B. B. McShane, D. Gal, A. Gelman, C. Robert, J. L. Tackett, Abandon Statistical Significance,
     The American Statistician 73 (2019) 235–245. doi:10.1080/00031305.2018.1527253.
[13] O. Kolgatin, Computer-based simulation of stochastic process for investigation of effi-
     ciency of statistical hypothesis testing in pedagogical research, Journal of Information
     Technologies in Education (ITE) (2016) 007–014. URL: http://ite.kspu.edu/index.php/ite/
     article/view/101. doi:10.14308/ite000582.
[14] R. L. Wasserstein, N. A. Lazar, The ASA Statement on p-Values: Context, Process, and
     Purpose, The American Statistician 70 (2016) 129–133. doi:10.1080/00031305.2016.
     1154108.
[15] K. M. Lang, S. J. Sweet, E. M. Grandfield, Getting beyond the Null: Statistical Modeling as
     an Alternative Framework for Inference in Developmental Science, Research in Human
     Development 14 (2017) 287–304. doi:10.1080/15427609.2017.1371567.
[16] D. M. Jamie, Using computer simulation methods to teach statistics: A review of the litera-
     ture, Journal of Statistics Education 10 (2002). doi:10.1080/10691898.2002.11910548.



[17] P. Flusser, D. Hanna, Computer simulation of the testing of a statistical hypothesis,
     Mathematics and Computer Education 25 (1991) 158. URL: https://www.learntechlib.org/
     p/144840.
[18] D. W. Taylor, E. G. Bosch, CTS: A clinical trials simulator, Statistics in Medicine 9 (1990)
     787–801. doi:10.1002/sim.4780090708.
[19] D. R. Bradley, R. L. Hemstreet, S. T. Ziegenhagen, A simulation laboratory for statistics,
     Behavior Research Methods, Instruments, and Computers 24 (1992) 190–204. URL: https:
     //link.springer.com/content/pdf/10.3758/BF03203496.pdf. doi:10.3758/BF03203496.
[20] C. Ricketts, J. Berry, Teaching statistics through resampling, Teaching Statistics 16 (1994)
     41–44. doi:10.1111/j.1467-9639.1994.tb00685.x.
[21] R. B. D’Agostino, W. Chase, A. Belanger, The appropriateness of some common procedures
     for testing the equality of two independent binomial populations, The American Statistician
     42 (1988) 198–202. URL: http://www.jstor.org/stable/2685002.
[22] O. H. Kolgatin, Informatsionnyye tekhnologii v nauchno-pedagogicheskikh issle-
     dovaniyakh (Information technologies in educational researches), Upravlyayushchiye
     Sistemy i Mashiny (Control Systems and Machines) 255 (2015) 66–72.
[23] J. P. Verma, Data Analysis in Management with SPSS Software, Springer, India, 2013.
     doi:10.1007/978-81-322-0786-3.
[24] Y. V. Gubler, A. A. Genkin, Primeneniye Neparametricheskikh Metodov Statistiki v Mediko-
     Biologicheskikh Issledovaniyakh (Application of Nonparametric Methods of Statistics in
     Biomedical Research), Meditsina, Leningradskoye otdeleniye, Leningrad, 1973.
[25] J. Berkson, In dispraise of the exact test: Do the marginal totals of the 2x2 table contain
     relevant information respecting the table proportions?, Journal of Statistical Planning and
     Inference 2 (1978) 27–42. doi:10.1016/0378-3758(78)90019-8.
[26] D. Liddell, Practical tests of 2 × 2 contingency tables, Journal of the Royal Statistical
     Society. Series D (The Statistician) 25 (1976) 295–304. doi:10.2307/2988087.
[27] G. K. Kanji, 100 Statistical Tests, SAGE Publications, London - Thousand Oaks - New Delhi,
     2006.
[28] K. J. Preacher, Calculation for Fisher’s exact test, 2021. URL: http://quantpsy.org/fisher/
     fisher.html.
[29] Y. Fong, Y. Huang, Modified Wilcoxon-Mann-Whitney test and power against strong null,
     The American Statistician 73 (2019) 43–49. doi:10.1080/00031305.2017.1328375.
[30] A. Marx, C. Backes, E. Meese, H.-P. Lenhof, A. Keller, EDISON-WMW: Exact dynamic
     programing solution of the Wilcoxon-Mann-Whitney test, Genomics, Proteomics and
     Bioinformatics 14 (2016) 55–61. doi:10.1016/j.gpb.2015.11.004.
[31] M. P. Fay, M. A. Proschan, Wilcoxon-Mann-Whitney or t-test? On assumptions for
     hypothesis tests and multiple interpretations of decision rules, Statistics Surveys 4 (2010)
     1–39. doi:10.1214/09-SS051.
[32] Y. V. Sidorenko, Metody Matematicheskoy Obrabotki v Psikhologii (Methods of Mathe-
     matical Processing in Psychology), Rech, St. Petersburg, 2002. URL: https://www.sgu.ru/
     sites/default/files/textdocsfiles/2014/02/19/sidorenko.pdf.
[33] P. Billiet, The Mann-Whitney U-test – analysis of 2-between-group data with a quantitative
     response variable, 2003. URL: https://psych.unl.edu/psycrs/handcomp/hcmann.PDF.



