<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Pre-processing and Visualization for Machine Learning Models and its Applications in Education</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantin Borodkin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marat Nurtas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aizhan Altaibek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yevgeniya Daineko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Temirlan Otepov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Information Technology University</institution>
          ,
          <addr-line>Manas St. 34/1, Almaty, 050040</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research study explores the role and degree of influence of data pre-processing techniques in the development and application of machine learning models for solving prediction tasks within the domain of education. Effective data visualization techniques are essential for understanding trends, patterns, and relationships within the data, aiding in feature selection, model evaluation, and interpretation. The study deals with various techniques for improving data quality, such as data cleaning, handling missing values, and data selection. We assume that data quality and the use of different preprocessing techniques can have a significant impact on the performance and quality of some machine learning models.</p>
      </abstract>
      <kwd-group>
        <kwd>Data pre-processing</kwd>
        <kwd>data analysis</kwd>
        <kwd>machine learning</kwd>
        <kwd>smart education</kwd>
        <kwd>data-driven education</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the modern world, digitalization is advancing on an enormous scale. Both state-owned and private
companies are moving their business online. Consequently, a huge amount of information is
generated digitally by users every day, and the volume of information grows rapidly each year.
All this information must be transported, stored and processed. There is also no single data
format: data can be stored in different forms and structures. This requires huge
computing power and the use of the latest technologies in the field of data engineering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. At the
same time, failures occur, leading to inaccuracies in the data or a deterioration in its quality.
      </p>
      <p>
        The performance and efficiency of machine learning algorithms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are intrinsically linked to
the characteristics and quality [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] of the input data. In the era of big data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as the volume and
variety of available data sources continue to expand, the importance of preparing this data
becomes increasingly evident.
      </p>
      <p>
        Private companies are now keenly interested in adopting the latest technologies based on
machine learning and artificial intelligence [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for their needs. Startups in artificial intelligence and machine learning
are opening at a rapid pace, huge sums are being invested in these industries,
and this trend will only continue. Key advantages of automation using artificial
intelligence and machine learning algorithms include increased productivity, time and cost
efficiency, reduction of human errors, faster business decision-making, forecasting of
customer preferences, and maximized sales [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Large
companies that need to analyze and work with a large amount of data maintain entire teams that
monitor and maintain data quality. This once again demonstrates the importance of data quality for
downstream use.
      </p>
      <p>
        In addition, data preprocessing cannot yet be fully automated: it is a rather complex process
that may involve different techniques and algorithms, and it must take into account the specifics of the
data and the task in order to select appropriate methods and achieve the best results [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>There are articles in which researchers have investigated the impact of data quality on
various machine learning models in different fields. In this paper, however, the research is
carried out in relation to the field of education and related data.</p>
      <p>
        When predicting diabetes using machine learning models, the authors managed to improve
the effectiveness of the model using data preprocessing techniques such as imputing missing
values and feature selection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In the study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] about working with neural networks to predict the state of the indoor
environment, the authors conclude that separate forecasting of several variables without data
preprocessing can give the same accurate forecasts as simultaneous forecasting with data
preprocessing, however, the computational costs of training several neural networks for separate
forecasting should be taken into account.
      </p>
      <p>
        In another article [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], scientists decided to find out how data processing will affect machine
learning models in prediction tasks. As a result, it turned out that data processing can have a
strong positive impact on the results and quality of the forecast, but it can also have a negative
impact on the effectiveness of the forecast using machine learning methods.
      </p>
      <p>
        The authors of another paper [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] studied the capabilities of six different algorithms, as well as
data preprocessing methods, in the task of classifying electroencephalogram signals to
detect driver drowsiness using machine learning. They concluded that the choice of
algorithm has a higher impact on the results than preprocessing, although data preprocessing
also improves modeling results. In situations where data preparation is not possible,
tree-based machine learning algorithms are preferable.
      </p>
      <p>
        Another work [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] used machine learning algorithms to predict air pollution. The impact of
data preprocessing and feature selection was also evaluated. As a result, when using these
methods, better accuracy and efficiency of the models were achieved.
      </p>
      <p>
        In another study [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] about preprocessing of near-infrared spectra, the researchers concluded
that data preprocessing has a big impact on small data sets. With an increase in the amount of
data, preprocessing methods lose their effectiveness. Models trained on a large amount of data
are more accurate.
      </p>
      <p>
        One more article [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] investigated the impact of data processing in corporate data analysis.
Various preprocessing methods, as well as various machine learning algorithms, were reviewed.
As a result, it was empirically proved that some of the preprocessing algorithms have a significant
impact on the accuracy of forecasting.
      </p>
      <p>
        In the case of processing unstructured medical data [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the researchers also analyzed the
most influential methods of data preprocessing. As a result, it was found out that for the analysis
of handwritten text, the most effective stages were normalization and correction of errors.
      </p>
      <p>Machine learning in the field of education continues to evolve, and new applications for
these technologies keep emerging. This research study examines the use of machine learning to
analyze and predict students' exam grades based on indicators of their life and family
circumstances. Machine learning can also be used for research, management automation,
improving online learning, personalizing learning, and building smart applications.</p>
      <p>
        Data visualization [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] is also an important step in the process of solving a problem using
machine learning algorithms. Visualization is applied at various stages in the process of solving
the problem.
      </p>
      <p>Despite the clear importance of data preprocessing, a comprehensive understanding of its
real-world impact remains a dynamic area of research and application. The complexity of this
issue arises from the interplay of diverse preprocessing techniques, the unique characteristics of
different datasets, and the specifics of machine learning models. As such, the impact of data
preprocessing is not one-size-fits-all but is, instead, context-dependent.</p>
      <p>Moreover, there is limited insight into how various preprocessing methods influence the
resilience of machine learning models in the face of noisy data and outliers. This dearth of knowledge
can impede the broader adoption of best practices in data preprocessing, limiting the potential of
machine learning in real-world applications.</p>
      <p>In this regard, the topic of the impact of data quality on machine learning models in education
needs more research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem statement</title>
      <p>This research study assesses the impact of data quality and various data preparation algorithms
on the primary quality metrics of machine learning models. It also evaluates their performance
in forecasting tasks within the field of education while analyzing how data visualization
contributes to problem-solving using machine learning models. We endeavor to provide valuable
insights and empirical evidence that guide data scientists, researchers, and practitioners in
making informed decisions regarding data preprocessing, ultimately enhancing the effectiveness
of machine learning applications across domains.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data pre-processing and analysis</title>
      <p>To investigate the issue addressed in this research study, publicly available datasets were used.</p>
      <p>The dataset [17], which is used for the prediction task with machine learning algorithms,
contains various information about students. It is an extended version of the original
“Students Exam Scores” dataset [18]. It contains a large number of columns and records,
and it also includes data-quality issues such as missing values and uninformative columns.
This dataset is used to train machine learning models to predict student grades. The
dataset's dimensions are 30641 rows by 14 columns. The target variable predicted in this
study is the student's score on the math exam.</p>
      <p>Main attributes of this dataset:
1. Gender: the gender of the student (male or female)
2. Race/Ethnicity: students are categorized into groups based on their race or ethnicity,
labeled as groups A, B, C, D, E
3. Parental Education: the highest level of education achieved by the student's parents,
with categories such as "high school", "some college", "associate's degree",
"bachelor's degree", "master's degree"
4. Lunch Type: the type of lunch the student receives, "standard" or "free/reduced"
5. Test Preparation Course: whether the student completed a test preparation course,
"completed" or "none"
6. Parent Marital Status: married/single/widowed/divorced
7. Practice Sport: frequency of the student's sports activities, never/sometimes/regularly
8. Is First Child: whether the student is the first child in the family, yes/no
9. Number of Siblings: from 0 to 7
10. Transport Means: the transport the student uses to get to the place of study,
school bus/private
11. Weekly Study Hours: number of hours of self-study during the week
12. Math Score: the score the student achieved on the math portion of the exam
13. Reading Score: the score the student achieved on the reading portion of the exam
14. Writing Score: the score the student achieved on the writing portion of the exam.</p>
      <p>Most of the variables are categorical, which means that during data preparation they
will need to be encoded numerically for training the machine learning models.</p>
      <p>This dataset was generated in order to study the relationship between student demographics,
preparation, and academic performance.</p>
      <p>The dataset allows researchers and data scientists to explore various aspects of student
performance and understand how demographic factors and preparation influence exam scores.
It can be used for tasks like predictive modeling, clustering, and statistical analysis. Researchers
often use this dataset to examine disparities in student performance based on demographic
attributes and to develop insights into factors that can improve student outcomes.</p>
      <p>The dataset is valuable for educational research. It is an excellent resource for exploring
educational data and conducting various analyses.</p>
      <p>At the stage of studying data, it is useful to use visualization tools in order to better understand
the task and ways to solve the problem.</p>
      <p>Analyzing the ethnicity column using a pie chart [19] reveals insights into the diversity of the
student population and allows for the exploration of potential disparities in academic
performance related to race and ethnicity. By race, students are distributed as follows:</p>
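The shares behind such a pie chart come straight from the column's value counts. Below is a minimal sketch with made-up sample data (the group labels follow the dataset description; the values and counts are hypothetical):

```python
import pandas as pd

# Hypothetical sample of the Race/Ethnicity column; the real dataset has groups A-E.
race = pd.Series(["group C", "group C", "group B", "group D", "group C", "group A"])

# value_counts(normalize=True) gives each group's share of students.
shares = (race.value_counts(normalize=True) * 100).round(1)
print(shares["group C"])  # 50.0
```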
      <p>The "ParentEduc" column provides information about the highest level of education achieved
by the parents of the students in the dataset. Analyzing this column with a bar chart [20]
reveals insights into the educational background of the students' parents and its potential
influence on student performance. We can see that only a smaller share of the students'
parents had higher education.</p>
      <p>Further, in the process of data analysis, two fields were added that better characterize the
overall performance of students. The first field is the student's total score across all three
exams; the second is the total score as a percentage of the maximum (the maximum for each exam
is 100). It is worth noting that the dataset has an almost equal gender distribution: 15424
females and 15217 males. Thus, we can estimate the overall average academic performance by
various attributes, for example, gender:</p>
      <p>Using the correlation matrix [21], it is possible to determine how strongly the numerical
variables correlate with each other. Based on the matrix in Figure 4, all types of tests have a
strong positive correlation with each other; for example, students who did well on the written
exam also got a good score in mathematics. Negative values in the matrix indicate an inverse
relationship between two variables; there are no large negative values in these data. Values
close to 0 mean that the variables are weakly correlated, with no direct or inverse
relationship between them. Total Score and Total Pct are essentially the same indicator used
for data analysis, so their correlation is 1.</p>
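A correlation matrix of this kind can be computed directly with pandas; a minimal sketch with hypothetical scores (the column names mirror the dataset description, the values are made up):

```python
import pandas as pd

# Hypothetical math and reading scores for five students.
df = pd.DataFrame({
    "MathScore":    [70, 80, 90, 60, 75],
    "ReadingScore": [72, 78, 88, 65, 74],
})

# Pearson correlation of all numeric columns; the diagonal is always 1.
corr = df.corr()
print(corr.loc["MathScore", "ReadingScore"] > 0.9)  # True: strong positive correlation
```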
      <p>Further, in the process of data analysis, some patterns also emerged that may affect the
methods of solving the problem. The mean overall score among all students was 204 points.
Students of race “E” were more successful on all types of tests. They scored an average of 18 points
more in all subjects combined. The level of education of parents also has an impact on students'
academic performance. On average, students whose parents had a master's degree received 20
points more. Students who preferred a standard lunch set scored 29 points more than students
with a free lunch. Students who completed preparatory courses received 20 points more than
others. Also, students who studied more hours a week on average had slightly higher academic
performance compared to the rest. The other features did not have such a big impact on student
performance.</p>
      <p>A graphical representation of these patterns is shown in Figure 5.</p>
      <p>Analyzing the histograms [22] in Figure 6 below, we can notice a significant difference in
academic performance between male and female students in different subjects. Based on this
information, male students show the best performance in mathematics, while female students
perform better in the writing and reading exams.</p>
      <p>Next, it is necessary to assess the quality of the available data: checking for duplicated
data and missing values, and analyzing outliers.</p>
      <p>Since we do not have a unique student ID, we need to check the data for duplicate rows across
all columns. Duplication of data was not detected.</p>
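The full-row duplicate check described above can be sketched with pandas (the sample frame is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender":    ["male", "female", "male"],
    "MathScore": [70, 85, 70],
})

# Without a unique student ID, a duplicate means a full match across all columns.
# Here rows 0 and 2 are identical, so exactly one duplicate is flagged.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)  # 1
```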
      <p>Next, outliers [24] among the numerical variables were checked using the interquartile
range [23]. The interquartile range is computed as

IQR = Q3 − Q1,   (1)

where Q3 is the upper (75th) quartile, the value above which 25% of the data falls, and Q1 is
the lower (25th) quartile, the value above which 75% of the data falls. Typically, values
outside the range (Q1 − 1.5·IQR, Q3 + 1.5·IQR) are considered outliers.</p>
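The IQR rule translates directly into pandas; a minimal sketch (the sample scores are made up, with 5 as a deliberate outlier):

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Flag values outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

scores = pd.Series([55, 60, 62, 65, 70, 72, 75, 78, 80, 5])
mask = iqr_outliers(scores)
print(int(mask.sum()))  # 1 (only the value 5 falls below the lower fence)
```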
      <p>The number of outliers turned out to be insignificant: 90 rows for Reading Score and 106 rows
for Writing Score. Moreover, they cannot strictly be called outliers, since they are genuine
exam results of students.</p>
      <p>There was a certain amount of missing data in the dataset: for each column, no more than
11% of the total amount of data. In total, 11398 rows contain at least one missing value,
which is 37.2% of all rows. Most of the missing data was in categorical columns.</p>
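Statistics of this kind (per-column share of gaps, share of rows with at least one gap) can be computed as follows; the sample frame is hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ParentEduc": ["high school", np.nan, "bachelor's degree", np.nan],
    "MathScore":  [70.0, 85.0, np.nan, 60.0],
})

# Percentage of missing values per column.
per_column = df.isna().mean() * 100

# Percentage of rows with at least one missing value.
rows_with_gaps = df.isna().any(axis=1).mean() * 100
print(rows_with_gaps)  # 75.0
```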
      <p>During the analysis, it turned out that there was one extra column containing only a number.
It was not described in the documentation and appears to be simply the record number in the
dataset; it carried no useful information. Therefore, this column was deleted.</p>
      <p>Further, in the process of data preparation, it is necessary to convert categorical variables
into numerical form. To do this, we assigned a unique numeric value to each category. For
example, the student's gender field is encoded as follows: 1 - male, 2 - female. The other
fields were processed in the same way.</p>
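The category-to-number mapping described above can be sketched with pandas. The 1 = male / 2 = female coding is taken from the text; the LunchType values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender":    ["male", "female", "female", "male"],
    "LunchType": ["standard", "free/reduced", "standard", "standard"],
})

# Explicit mappings keep the encoding reproducible and documented.
df["Gender"] = df["Gender"].map({"male": 1, "female": 2})
df["LunchType"] = df["LunchType"].map({"standard": 1, "free/reduced": 2})
print(df["Gender"].tolist())  # [1, 2, 2, 1]
```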
      <p>Thus, in the process of data analysis using visualization tools, it was possible to detect
interesting and useful patterns and also problems in the data, which in the future can help in
solving certain tasks with this dataset.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methods and research</title>
      <p>• Dataset 4 - 1000 empty values were randomly added to the numerical columns that have a
strong correlation with the target variable. The empty values were then replaced by the mean
value of the column.</p>
      <p>• Dataset 5 - columns that the analysis showed to have no strong impact on students' math
exam results were removed: 'TransportMeans', 'NrSiblings', 'IsFirstChild',
'ParentMaritalStatus'.</p>
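The mean-imputation step used for Dataset 4 is a one-liner in pandas; a sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ReadingScore": [60.0, np.nan, 90.0]})

# Replace missing numeric values with the column mean: (60 + 90) / 2 = 75.
df["ReadingScore"] = df["ReadingScore"].fillna(df["ReadingScore"].mean())
print(df["ReadingScore"].tolist())  # [60.0, 75.0, 90.0]
```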
      <p>Further, the datasets were divided into sets for training and prediction in a ratio of 80% to
20%. All types of models were trained on these sets and the results were entered into a common
table for further analysis.</p>
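An 80%/20% split of this kind is typically done with scikit-learn's train_test_split; a minimal sketch (the feature name is hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"WklyStudyHours": range(100)})
y = pd.Series(range(100))

# test_size=0.2 reproduces the 80% training / 20% prediction split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 80 20
```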
      <p>To assess the impact of data processing techniques on the load on the computing system
during model training, the load on the Central Processing Unit (CPU) [39] was measured.
Training took place on a local machine with a processor with 4 physical cores.</p>
      <p>The following metrics were used to evaluate the accuracy of predictions.</p>
      <p>Mean Absolute Error (MAE) [40] - measures the average absolute difference between actual
where n - number of errors,   - actual values,  ̂ - predicted values.</p>
      <p>Square Root of Mean Quadratic Error (RMSE) [40] - also measures the difference between
actual and predicted values, but it penalizes larger errors more than MAE.
where n - number of errors,   - actual values,  ̂ - predicted values.</p>
      <p>Coefficient of Determination (R2) [41]</p>
      <p>measures the proportion of the variance in the
dependent variable explained by the model. It ranges from 0 to 1, where 1 means the model
perfectly fits the data, and 0 means the model doesn't explain anything.
where n - number of errors,   - actual values,  ̂ - predicted values,   - mean of actual values.</p>
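These three metrics can be computed with scikit-learn; a sketch on hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([70.0, 80.0, 90.0, 60.0])
y_pred = np.array([72.0, 78.0, 88.0, 62.0])

mae = mean_absolute_error(y_true, y_pred)                   # mean of |y - y_hat|
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))   # penalizes large errors more
r2 = r2_score(y_true, y_pred)                               # share of variance explained
print(mae, rmse, round(r2, 2))  # 2.0 2.0 0.97
```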
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>During the training of the machine learning models, the CPU load was also measured. It
turned out that some of the models put no measurable load on the computing system at all, or
the load was so small that the monitoring tools could not track it. Taking the zero-valued
indicators into account, the average load was computed for each of the five datasets.</p>
      <p>In general, the spread of values turned out to be small. The highest processor load was
observed on the second dataset, where the missing values were replaced with the mode.</p>
      <p>The lowest CPU load was produced by the 5th dataset, where the missing values were
excluded and the most informative columns for forecasting were selected. This reduced the
load on the computing system because the total amount of processed data decreased
significantly.</p>
      <p>For visualization and subsequent analysis of processor usage, results with processor usage
above zero were selected among the algorithms. Two algorithms top the CPU usage during
training: LGBM and XGB. The minimum resource consumption, excluding zero values, was observed
for the MLP algorithm. The CPU consumption of the KNN algorithm also varied noticeably
depending on the dataset, although there is no direct relationship between the amount of data
and this algorithm's resource consumption. The Lasso algorithm used significantly fewer
resources on the dataset where a new parameter was introduced for the missing values. The
remaining algorithms showed their highest load on the larger dataset in which missing values
were replaced with the mode.</p>
      <p>After training all models on the 5 subsets of data, the information was recorded in an
additional table. Since the dataset is not large, training the models did not take long. The
results are reasonably good and allow the models to predict math exam scores fairly
accurately.</p>
      <p>The spread in accuracy between the algorithms is small. The best average R2 among all
regression algorithms was obtained on dataset number 5, where rows with empty values and
uninformative columns were removed: the average R2 value is 0.827. The worst value was
obtained on dataset number 4, where inaccuracies were introduced into the numerical
variables: 0.77. Consistent with the R2 indicator, the MAE and RMSE also changed: as R2
decreased, the model errors increased.</p>
      <p>If we analyze the accuracy of the models in the context of each algorithm, then we can see on
the graphs in Figure 10 that most of them have approximately similar accuracy. Based on the
results, we can conclude that the SVR algorithm performed the worst on all sets of data. Also, the
algorithms LGBM, XGB, Gradient Boost, Random Forest, Decision Tree, MLP, Cat Boost, Linear
Regression, Ridge showed consistently high accuracy. Their R2 score on all datasets was above
0.8. The other algorithms also showed stable results with average accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>While conducting this research study, we considered the problem of the impact of data quality
and data preparation techniques on machine learning models. To do this, we found a dataset with
errors in the data, analyzed it using visualization methods, applied several different algorithms
for data pre-processing, trained models of several machine learning algorithms and compared
the main metrics with each other.</p>
      <p>As a result, it turned out that for the specified data and for the prediction task, the best result
was obtained by removing missing values and uninformative columns. Most of all, missing data
in strongly correlated numerical variables have the greatest negative impact on the results of
models, regardless of the algorithms of machine learning models. Otherwise, different algorithms
of data preprocessing showed similar results on the machine learning model in prediction tasks
regardless of the data sets processed.</p>
      <p>The CPU load for some of the algorithms varied along with the amount of data being processed.
For some of the other algorithms, the amount of data did not affect the consumption of the
processor's resources.</p>
      <p>Also, we were able to show the importance of data visualization at each of the stages of
machine learning model training.</p>
      <p>In future works, it is also possible to assess the impact on the consumption of other computing
resources, to consider the impact of data preparation methods for solving other tasks, to consider
from what percentage of inaccuracies the influence of missing data increases and also to evaluate
the influence of the parameters of individual models on the results of the accuracy of predictions.</p>
    </sec>
    <sec id="sec-7">
      <title>7. References</title>
      <p>3c2ccb108945#:~:text=Data%20visualization%20is%20the%20visual,communicate%20it
%20to%20your%20peers.
[17] Kaggle Team, Students Exam Scores: Extended Dataset, 2023. URL:
https://www.kaggle.com/datasets/desalegngeb/students-exam-scores/data.
[18] R. Kimmons, Exam Scores dataset, 2012. URL:
http://roycekimmons.com/tools/generated_data/exams.
[19] T. Moriarty, The Right Way to Make a Pie Chart, 2013. URL:
https://medium.com/eyeful/the-right-way-to-make-a-pie-chart-7852f568eaa9.
[20] K. Bhalla, Bar Charts: What they are, when to use them &amp; Guidelines for creating, 2021. URL:
https://medium.com/@komal.bhlla/bar-charts-what-they-are-when-to-use-them-guidelines-for-creating-64f0720a88d1.
[21] S. Wagavkar, Introduction to the Correlation Matrix, 2023. URL:
https://builtin.com/data-science/correlation-matrix.
[22] J. Chen, G. Scott, P. Rathburn, How a Histogram Works to Display Data, 2023. URL:
https://www.investopedia.com/terms/h/histogram.asp.
[23] S. Thomas, What Is the Interquartile Range (IQR)?, 2023. URL:
https://articles.outlier.org/what-is-the-interquartile-range.
[24] P. Flom, Outliers: An Introduction, 2019. URL:
https://towardsdatascience.com/outliers-an-introduction-e07445c8f430.
[25] LightGBM Core Team, LGBMRegressor, 2023. URL:
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html.
[26] XGBoost Core Team, Python API Reference, 2022. URL:
https://xgboost.readthedocs.io/en/stable/python/python_api.html.
[27] SKLearn Core Team, Gradient boosted trees, 2023. URL:
https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting.
[28] SKLearn Core Team, Random forests and other randomized tree ensembles, 2023. URL:
https://scikit-learn.org/stable/modules/ensemble.html#forest.
[29] SKLearn Core Team, Decision Trees, 2023. URL:
https://scikit-learn.org/stable/modules/tree.html#tree.
[30] SKLearn Core Team, MLPRegressor, 2023. URL:
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html.
[31] SKLearn Core Team, Nearest Neighbors Regression, 2023. URL:
https://scikit-learn.org/stable/modules/neighbors.html#regression.
[32] SKLearn Core Team, SVR, 2023. URL:
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html.
[33] Cat Boost Core Team, CatBoostRegressor, 2023. URL:
https://catboost.ai/en/docs/concepts/python-reference_catboostregressor.
[34] SKLearn Core Team, Linear Regression, 2023. URL:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.
[35] SKLearn Core Team, Lasso, 2023. URL:
https://scikit-learn.org/stable/modules/linear_model.html#lasso.
[36] SKLearn Core Team, Ridge Regression and Classification, 2023. URL:
https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression.
[37] SKLearn Core Team, Elastic-Net, 2023. URL:
https://scikit-learn.org/stable/modules/linear_model.html#elastic-net.
[38] S. Manikandan. (2011). Measures of central tendency: Median and mode, J Pharmacol</p>
      <p>Pharmacother. 2. doi: 10.4103/0976-500X.83300.
[39] H.M. Deitel, B. Deitel, Chapter 3 – The Processor, An Introduction to Information Processing
(1986) 46-71. doi: 10.1016/B978-0-12-209005-9.50009-6.
[40] T. O. Hodson, Root-mean-square error (RMSE) or mean absolute error (MAE): when to use
them or not, Geosci. Model Dev., 15 (2022): 5481–5487, doi: 10.5194/gmd-15-5481-2022.
[41] W. Rowe, Mean Square Error &amp; R2 Score Clearly Explained, 2018. URL:
https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Conejero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.C.</given-names>
            <surname>Preciado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.J.</given-names>
            <surname>Fernandez-Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.E.</given-names>
            <surname>Prieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rodriguez-Echeverria.</surname>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Towards the use of Data Engineering, Advanced Visualization techniques and Association Rules to support knowledge discovery for public policies</article-title>
          ,
          <source>Expert Systems with Applications 170</source>
          . doi: 10.1016/j.eswa.2020.114509.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
<article-title>A Beginner's Guide to Machine Learning</article-title>
          ,
          <year>2018</year>
          . URL: https://medium.com/@randylaosat/a-beginners-guide-to-machine-learning-dfadc19f6caf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
<article-title>What is Data Quality in Machine Learning</article-title>
          ,
          <year>2023</year>
          . URL: https://www.analyticsvidhya.com/blog/2023/01/the-role-of-data-quality-in-machinelearning/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Botelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Bigelow</surname>
          </string-name>
,
          <article-title>Big Data</article-title>
          ,
          <year>2022</year>
          . URL: https://www.techtarget.com/searchdatamanagement/definition/big-data.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
<string-name>Great Learning Team</string-name>
          ,
          <article-title>What is Artificial Intelligence? How Does AI Work and Future of It</article-title>
          ,
          <year>2020</year>
          . URL: https://medium.com/@mygreatlearning/what-is-artificial-intelligence-how-does-ai-work-and-future-of-it-d6b113fce9be.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
<string-name>
            <given-names>A.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          . (
          <year>2020</year>
          ).
          <article-title>Artificial Intelligence in Business: From Research and Innovation to Market Deployment</article-title>
          ,
          <source>Procedia Computer Science</source>
          . Pp.
          <fpage>2200</fpage>
          -
          <lpage>2210</lpage>
          . doi: 10.1016/j.procs.2020.03.272.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Maharana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
<string-name>
            <given-names>B.</given-names>
            <surname>Nemade</surname>
          </string-name>
          . (
          <year>2022</year>
          ).
          <article-title>A review: Data pre-processing and data augmentation techniques</article-title>
          ,
          <source>Global Transitions</source>
          ,
          <volume>3</volume>
          . Pp.
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          . doi: 10.1016/j.gltp.2022.04.020.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.C.</given-names>
            <surname>Olisah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
<string-name>
            <given-names>M.</given-names>
            <surname>Smith</surname>
          </string-name>
          . (
          <year>2022</year>
          ).
          <article-title>Diabetes mellitus prediction and diagnosis from a data preprocessing and machine learning perspective</article-title>
          ,
          <source>Computer Methods and Programs in Biomedicine</source>
          <volume>220</volume>
          . doi: 10.1016/j.cmpb.2022.106773.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
<string-name>
            <given-names>R.</given-names>
            <surname>Ooka</surname>
          </string-name>
          . (
          <year>2021</year>
          ).
          <article-title>Influence of data preprocessing on neural network performance for reproducing CFD simulations of non-isothermal indoor airflow distribution</article-title>
          ,
          <source>Energy and Buildings</source>
          <volume>230</volume>
          . doi: 10.1016/j.enbuild.2020.110525.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
<string-name>
            <given-names>M.</given-names>
            <surname>Xie</surname>
          </string-name>
          . (
          <year>2015</year>
          ).
          <article-title>An empirical analysis of data preprocessing for machine learning-based software cost estimation</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>67</volume>
          . Pp.
          <fpage>108</fpage>
          -
          <lpage>127</lpage>
          . doi: 10.1016/j.infsof.2015.07.004.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Farhangi</surname>
          </string-name>
          . (
          <year>2022</year>
          ).
          <article-title>Investigating the role of data preprocessing, hyperparameters tuning, and type of machine learning algorithm in the improvement of drowsy EEG signal modeling</article-title>
          ,
<source>Intelligent Systems with Applications</source>
          ,
          <volume>15</volume>
          . doi: 10.1016/j.iswa.2022.200100.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Aksangur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eren</surname>
          </string-name>
          ,
<string-name>
            <given-names>C.</given-names>
            <surname>Erden</surname>
          </string-name>
          . (
          <year>2022</year>
          ).
          <article-title>Evaluation of data preprocessing and feature selection process for prediction of hourly PM10 concentration using long short-term memory models</article-title>
          ,
          <source>Environmental Pollution</source>
          <volume>311</volume>
          . doi: 10.1016/j.envpol.2022.119973.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schoot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kapper</surname>
          </string-name>
          ,
<string-name>
            <given-names>G. H.</given-names>
            <surname>van Kollenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Postma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>van Kessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. C.</given-names>
            <surname>Buydens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          . (
          <year>2020</year>
          ).
          <article-title>Investigating the need for preprocessing of near-infrared spectroscopic data as a function of sample size</article-title>
          ,
          <source>Chemometrics and Intelligent Laboratory Systems</source>
          <volume>204</volume>
          . doi: 10.1016/j.chemolab.2020.104105.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Crone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lessmann</surname>
          </string-name>
          ,
<string-name>
            <given-names>R.</given-names>
            <surname>Stahlbock</surname>
          </string-name>
          . (
          <year>2006</year>
          ).
          <article-title>The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing</article-title>
          ,
          <source>European Journal of Operational Research</source>
          ,
          <volume>173</volume>
          . Pp.
          <fpage>781</fpage>
          -
          <lpage>800</lpage>
          . doi: 10.1016/j.ejor.2005.07.023.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kashina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Lenivtceva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Kopanitsa</surname>
          </string-name>
          . (
          <year>2020</year>
          ).
          <article-title>Preprocessing of unstructured medical data: the impact of each preprocessing stage on classification</article-title>
          ,
<source>Procedia Computer Science</source>
          ,
          <volume>178</volume>
          . Pp.
          <fpage>284</fpage>
          -
          <lpage>290</lpage>
          . doi: 10.1016/j.procs.2020.11.030.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.V.</given-names>
            <surname>Balaji</surname>
          </string-name>
          ,
<article-title>What is Data Visualization and Why Is It Important?</article-title>
          ,
          <year>2020</year>
          . URL: https://medium.com/analytics-vidhya/what-is-data-visualization-and-why-is-it-important-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>