<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explainable and Interpretable Dry Beans Classification using Soft Voting Classifier</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Belayneh Dejene</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gizachew Setegn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Selamawit Belay</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Debark University, Department of Computer Science</institution>
          ,
          <addr-line>Debark, 90.</addr-line>
          <country country="ET">Ethiopia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Gondar, Department of Information Science</institution>
          ,
          <addr-line>Gondar, 196</addr-line>
          ,
          <country country="ET">Ethiopia</country>
        </aff>
      </contrib-group>
      <fpage>24</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>Dry beans, integral to the Fabaceae family, boast global significance with their diverse genetic heritage tracing back to their dissemination from America centuries ago. This study endeavors to develop an explainable dry bean classification model using a soft voting classifier, juxtaposing its performance against classic and ensemble machine learning algorithms. Data preprocessing ensured suitability for classification algorithms, with feature selection employing information gain and variance inflation factors. The class imbalance was addressed via SMOTE + Tomek methods. Evaluation metrics encompassed accuracy, precision, recall, and F1-score. XGBoost led with 92.5065% accuracy, while soft voting classifiers (LGBM, XGB, CatBoost, RF, and DT) closely followed at 92.691%. The soft voting classifier proved optimal for dry bean classification, aiding in model interpretation and decision-making processes.</p>
      </abstract>
      <kwd-group>
        <kwd>Classification</kwd>
        <kwd>Dry bean</kwd>
        <kwd>Explainable</kwd>
        <kwd>Machine learning</kwd>
        <kwd>voting classifier</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        reveal this color variation. For this reason, it is economically and technically vital to build an
automated technique that detects and categorizes seed features rapidly and repeatably. It is also
difficult for a human operator to assess or handle the seeds without specialized tools or automated
software procedures. The main problem dry bean producers and marketers face is ascertaining
good seed quality. Lower seed quality leads to lower quality of produce. Seed quality is the key
to bean cultivation in terms of yield and disease. Today, the inspection of the quality of
seeds, fruits, and vegetables, along with the examination and categorization of seeds and grains, is
performed worldwide to meet these demands with the help of machine learning and computer
vision [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This is why we use a soft voting classifier and compare it with individual algorithms
to classify dry beans. In recent years, machine learning algorithms have been used in the inspection,
classification, prediction, and segmentation of food product quality. Classification techniques are
becoming increasingly popular as machine learning applications in fields such as medicine,
biostatistics, bioinformatics, agriculture, and business [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Machine learning is a subfield of artificial intelligence that
enables computers to learn from existing data and estimate unknown targets.
Seed quality is influential in crop production. Seed classification is important for both producers and
marketers to realize the value of sustainable agricultural systems. By applying predictive analysis
to agricultural data, significant decisions can be taken and classifications can be made.
      </p>
      <p>
        Beyond building the classification model, establishing its explainability and interpretability
provides professionals with insights into how the classifications are made, fostering trust
in the model's decisions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. An explainable machine learning model makes professionals more
likely to trust and adopt it, because they can understand and interpret the reasoning behind its
recommendations, addressing the black-box nature of the algorithms. Several
studies have been conducted to detect the quality of dry beans using various machine learning
techniques. For example, the studies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] addressed dry bean classification. The previous research
on dry bean classification has largely neglected the crucial aspect of explainability and
interpretability in their models. Instead, researchers predominantly focused on employing various
algorithms without addressing the black box nature inherent in these methods. Classic machine
learning approaches were commonly utilized, often with default parameter settings, despite evidence
suggesting that optimizing these parameters could enhance classification performance [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Additionally, while some studies attempted to tackle class imbalance issues, they typically employed
simplistic oversampling methods, which could lead to the generation of redundant data. Advanced
techniques for addressing class imbalance were rarely explored. Furthermore, previous research
overlooked feature selection methods, which could potentially improve model efficiency and
interpretability. The absence of studies utilizing explainable techniques to handle black-box models,
as well as the scarcity of research employing soft voting classifiers and tuned parameters,
underscored the need for this study. Motivated by these gaps, this study endeavors to develop an
explainable and interpretable classification model for dry beans. It seeks to utilize soft voting
classifiers, a technique not extensively explored in previous research, and compare its performance
with individual machine learning algorithms. By incorporating explainable and interpretable
methods, this study aims to classify dry beans accurately while providing insights into the
decision-making process, thus facilitating evidence-based policies and interventions in the selection of
appropriate dry bean classes.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Several studies such as [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] investigated dry bean classification using machine
learning algorithms. However, most of the previous researchers did not consider the explainability
and interpretability of the dry bean classification model. Most of these studies
developed a classification model by handling the class imbalance problem on the whole dataset and
without tuning relevant parameters. They also did not
apply any feature selection methods; they developed the classification model using all the
features in the dataset. M. Koklu and I. A. Ozkan [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] developed multi-class dry bean classifiers using
MLP, SVM, kNN, and DT classification models. The overall correct classification rates were
91.73%, 93.13%, 87.92%, and 92.52% for MLP, SVM, kNN, and DT, respectively. The
SVM classification model had the highest performance, with accuracies for the Barbunya, Bombay,
Cali, Dermason, Horoz, Seker, and Sira bean varieties of 92.36%, 100.00%, 95.03%, 94.36%, 94.92%, 94.67%,
and 86.84%, respectively. However, these researchers did not consider the explainability and
interpretability of the dry bean classification model. G. Słowiński [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] tried to classify dry beans
using machine learning techniques: Multinomial Bayes, Support Vector Machines, Decision Trees,
Random Forests, and a Voting Classifier. The overall accuracies obtained ranged from 88.35%
to 93.61%. However, this researcher did not consider the explainability and interpretability of the dry
bean classification model. M. Salauddin Khan et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] aimed to construct a multiclass dry bean
classification model using the eight most popular classifiers and compare their performances. The
algorithms used were LR, NB, KNN, DT, RF, XGB, SVM, and MLP, with balanced and imbalanced
classes. The XGB classifier performed better than the other classifiers on both the balanced and
imbalanced dry bean datasets within each class, achieving accuracies of 93.0% and 95.4% on the
imbalanced and balanced classes, respectively. The overall performance is better than the earlier studies;
however, the researchers did not consider the explainability and interpretability of the dry bean
classification model. They developed the model without tuning its parameters, which
risks overfitting. Moreover, they handled the class imbalance problem on the whole dataset
before splitting it, and therefore evaluated the model on synthetically augmented data.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Data collection methods</title>
        <p>To conduct this study, we used the publicly available dataset in the Kaggle repository. The
extracted dataset consists of a total of 13,611 grains of 7 different registered dry bean varieties, with
17 features including the class label (see Table 1 below for the dataset description).</p>
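As a minimal sketch of loading this dataset, assuming a local copy of the Kaggle/UCI release (the file name and the exact column spellings below are assumptions and should be adjusted to the downloaded file):

```python
import pandas as pd

# The 16 morphological features named in this paper; spellings follow the
# public Dry Bean dataset release and are assumptions, not verified here.
FEATURES = [
    "Area", "Perimeter", "MajorAxisLength", "MinorAxisLength", "AspectRation",
    "Eccentricity", "ConvexArea", "EquivDiameter", "Extent", "Solidity",
    "roundness", "Compactness", "ShapeFactor1", "ShapeFactor2",
    "ShapeFactor3", "ShapeFactor4",
]

def load_beans(path="Dry_Bean_Dataset.csv"):
    """Return the feature matrix and the class labels (hypothetical path)."""
    df = pd.read_csv(path)
    return df[FEATURES], df["Class"]
```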
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data preprocessing methods</title>
        <p>
          Data preparation involves data selection, data cleaning, data integration, feature selection, handling
imbalances, and data transformation to make it available to extract value from those data [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ][
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
In this subsection, we detected missing values, removed redundancies, detected outliers,
and handled class imbalance problems in the dataset using statistical methods.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.2.1. Data cleaning</title>
        <p>
          This is a way of removing noise, inconsistencies, redundancy, and missing values so that the
model can be developed carefully. Without cleaning the collected data, we cannot obtain accurate results [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ][13]. In the
dataset there are no missing values, so we did not apply any missing-value handling
methods. From the data, we removed 68 redundant records by dropping duplicates. Most
of the variables have a high proportion of outliers, including Area, Perimeter, Minor Axis Length,
Eccentricity, Convex Area, EquivDiameter, and ShapeFactor4. To handle these outliers, we used
interquartile range and boxplot methods.
        </p>
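A sketch of the cleaning steps above, assuming a pandas DataFrame; whether the paper capped or removed outliers is not stated, so clipping to the IQR fences is shown as one possible treatment:

```python
import pandas as pd

def clean(df):
    # Drop exact duplicate records (the study reports removing 68 of them).
    df = df.drop_duplicates().reset_index(drop=True)
    # IQR rule: cap each numeric column at [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    # Clipping is an assumption; removal of outlier rows is an alternative.
    num = df.select_dtypes("number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    for c in num.columns:
        lo, hi = q1[c] - 1.5 * iqr[c], q3[c] + 1.5 * iqr[c]
        df[c] = df[c].clip(lo, hi)
    return df
```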
      </sec>
      <sec id="sec-3-4">
        <title>3.2.2. Data transformation</title>
        <p>In this step, data are transformed and consolidated into forms appropriate for mining by performing
summary or aggregation operations [14][15]. In this dataset, only the class label needs to be
transformed for mining purposes; the remaining features need no transformation and were used
as they are. To transform the class label, we used label encoding and encoded the classes into
numeric values as ’DERMASON’ = 0, ’SIRA’ = 1, ’SEKER’ = 2, ’HOROZ’ = 3, ’CALI’ = 4, ’BARBUNYA’ = 5,
and ’BOMBAY’ = 6.</p>
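The label encoding above can be written down directly as a mapping (this reproduces the paper's stated codes; the helper function itself is illustrative):

```python
# Class-label encoding as stated in Section 3.2.2 of the paper.
CLASS_CODES = {
    "DERMASON": 0, "SIRA": 1, "SEKER": 2, "HOROZ": 3,
    "CALI": 4, "BARBUNYA": 5, "BOMBAY": 6,
}

def encode_labels(labels):
    """Map class-name strings to their numeric codes."""
    return [CLASS_CODES[name] for name in labels]
```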
      </sec>
      <sec id="sec-3-5">
        <title>3.2.3. Feature selection</title>
        <p>In this step, we checked the importance of all the features using information gain (see
Fig. 1 below). Of the 16 features, the last three (ShapeFactor4, Solidity, and Extent)
were the least important, but this does not mean that they are not valuable for the model. We checked
the multicollinearity of the features using the variance inflation factor, which
showed that all of the features were significant to the model. We therefore did not drop
any of them and used all 16 features to develop the classification model.</p>
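The two checks described above can be sketched as follows, assuming a pandas feature matrix `X` and an encoded label vector `y`; information gain is computed here via scikit-learn's mutual information estimator, and VIF via statsmodels:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from statsmodels.stats.outliers_influence import variance_inflation_factor

def rank_and_vif(X: pd.DataFrame, y):
    """Return features ranked by information gain, and each feature's VIF."""
    gain = pd.Series(
        mutual_info_classif(X, y, random_state=0), index=X.columns
    ).sort_values(ascending=False)
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    return gain, vif
```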
      </sec>
      <sec id="sec-3-6">
        <title>3.3. Handling class imbalance</title>
        <p>By nature, the class distribution of the collected data is imbalanced (see Figure 2 below). To overcome
the imbalanced class distribution problem, we can add samples to or remove samples from the data
set [16]. Sampling can be achieved by under-sampling (randomly removing samples from the majority
class), by oversampling the minority class, or by combining over- and under-sampling techniques
[16][17]. The extracted dataset's class label has 7 values, some of which occur far less often
than others (see Figure 2 below). In the class distribution, the “BOMBAY” class has the fewest instances
compared with the other classes. To conduct this research, we used the synthetic minority
over-sampling technique (SMOTE) + Tomek links to handle the class imbalance of the class labels
of the dataset. The main reason we use SMOTE + Tomek is that it avoids the loss of valuable
information [16][17]. SMOTE + Tomek combines SMOTE's ability to generate
synthetic data for the minority class with Tomek's ability to remove, from the majority class, the
data identified as Tomek links [18][19].
In model building, the researcher needs training and testing datasets to train and
evaluate the machine appropriately [20][21]. To conduct this study, we used stratified splitting
to divide the whole dataset into training and test data at an 80:20 ratio.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.4. Parameter tuning</title>
        <p>In machine learning and deep learning, the performance of an algorithm
depends heavily on the selection of hyperparameters, which has always been a crucial step in the
process [22][23][24]. To improve the performance of each algorithm, a
collection of hyperparameters was tuned using grid search methods. Grid search is a commonly
used approach to hyperparameter tuning that methodically builds and evaluates a
model for each combination of algorithm parameters specified in a grid [24]. Here, we performed the
grid search with GridSearchCV to select the tuned parameters for each homogeneous ensemble machine
learning algorithm.</p>
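A minimal GridSearchCV sketch for one of the ensemble learners; the grid values below are illustrative placeholders, not the grids the paper actually searched:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; the paper does not list its exact search space.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

def tune_rf(X_tr, y_tr):
    """Exhaustively evaluate each grid point with 5-fold cross-validation."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=5,
        scoring="accuracy",
        n_jobs=-1,
    )
    search.fit(X_tr, y_tr)
    return search.best_estimator_, search.best_params_
```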
      </sec>
      <sec id="sec-3-8">
        <title>3.5. Classification model</title>
        <p>
          In this study, to construct a dry bean classification model we have used a soft voting classifier in
both the balanced and the unbalanced dataset. To compare that the soft voting classifier can perform
better than other machine learning algorithms, another model was developed using decision tree
algorithms and other ensemble learning classifiers namely random forest, catboost XGBoost, and
LGBM classifiers. To improve each algorithm's performance rate, a collection of hyperparameters
has been tuned using grid search methods. The performance of each classification model was
evaluated using accuracy, precision, recall, and F1- score.
3.6.
To enhance the explainability of the classification model, we have employed various feature
relevance explanation techniques like Local Interpretable Model-agnostic Explanations (LIME) and
Shapley Additive Explanation (SHAP) to highlight the most influential features and regions in the
input data, and to explain the quality of the inner functioning of deep learning models and decisions
by calculating the influence of each input variable and producing relevant scores. Global
interpretability techniques, such as feature importance analysis or rule extraction, are employed to
reveal the underlying patterns and decision rules learned by the model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
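The soft voting setup can be sketched with scikit-learn's VotingClassifier. The paper combines LGBM, XGBoost, CatBoost, RF, and DT; as a hedge against those external libraries, sklearn-native stand-ins are shown below, and the boosting libraries' classifiers plug into the same `estimators` list:

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier

def make_soft_voter(seed=42):
    """Soft voting: average the members' predicted class probabilities."""
    return VotingClassifier(
        estimators=[
            # Stand-in for LGBM/XGBoost/CatBoost; substitute their
            # classifiers here when those packages are available.
            ("gb", GradientBoostingClassifier(random_state=seed)),
            ("rf", RandomForestClassifier(random_state=seed)),
            ("dt", DecisionTreeClassifier(random_state=seed)),
        ],
        voting="soft",
    )
```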
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Result and discussion</title>
      <p>Experiments were carried out to develop a dry bean classification model using a soft voting
classifier and to compare it with other classic and ensemble machine learning algorithms. To
construct a classification model for dry beans, we conducted two experiments, on the imbalanced data
and on the balanced data, using a soft voting classifier, RF, CatBoost, XGB, LGBM, and DT. Each
experiment was conducted using the 16 features and the parameters tuned with grid search
(see Table 2). The task is multiclass classification because the dataset by nature has seven
class labels. In these experiments, we evaluated all the classification models using the accuracy,
precision, recall, and F1-score evaluation metrics. Finally, we explained the models using the LIME
and SHAP feature relevance explanation techniques.</p>
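The four reported metrics can be computed for any fitted classifier as below; macro averaging over the seven classes is an assumption, since the paper does not state which average it reports:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
)

def evaluate(model, X_te, y_te):
    """Compute the four metrics reported in the paper (macro-averaged)."""
    pred = model.predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, average="macro"),
        "recall": recall_score(y_te, pred, average="macro"),
        "f1": f1_score(y_te, pred, average="macro"),
    }
```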
      <p>Experiment 1: Imbalanced dataset.
This experiment was conducted on the imbalanced dataset, without applying any class
imbalance handling methods. We developed models using DT, RF, CatBoost, XGB, LGBM,
and a soft voting classifier, and evaluated them using accuracy, precision, recall,
and F1-score (see Table 3 below).</p>
      <p>[Table 3: precision of each model, ranging from 0.920002 to 0.940888.]
As we see from Table 3, the XGBoost algorithm achieves the best accuracy,
precision, and F1-score, at 0.928756, 0.940888, and 0.938008 respectively, while for recall the
CatBoost algorithm performs best, at 0.939268. Among the soft voting classifiers, the
soft voting of LGBM, CatBoost, XGB, RF, and DT performs better than soft voting
over the other combinations of algorithms; the voting combinations that contain CatBoost and
XGBoost perform better.</p>
      <p>Experiment 2: Balanced dataset.
This experiment was conducted by balancing the training set only, using the SMOTE + Tomek
method, and developing models with DT, RF, CatBoost, XGB, LGBM, and soft voting
classifiers. We evaluated these models using accuracy, precision, recall, and F1-score (see
Table 4 below).
This experiment shows that handling the imbalance problem is not always a
good solution for obtaining better performance.</p>
      <sec id="sec-4-1">
        <title>Model comparison</title>
        <p>The researchers compared the performance of the algorithms for classifying the dry bean, using a
soft voting classifier and other classic and ensemble machine learning algorithms on both the
imbalanced and balanced datasets. The dataset has seven classes. Overall
accuracy, precision, recall, and F1-score were used as the evaluation metrics for model comparison.
According to the overall performance, the classification algorithm that registered the highest
performance is selected as the best algorithm for the dry bean classification model. As
indicated in Table 3 and Table 4, the experiments were conducted on classification algorithms
for classifying the dry bean. The XGB algorithm registered the highest accuracy, 92.8756%, on the
imbalanced dataset, and the soft voting classifier of LGBM, CatBoost, XGB, RF, and
DT achieved an accuracy of 92.6541% on the imbalanced dataset. The soft voting classifier of
LGBM, XGB, CatBoost, RF, and DT performs best after the XGBoost algorithm, with overall
accuracy, precision, recall, and F1-score of 92.691%, 94.0701%, 93.913%, and 93.986% respectively. The
decision tree algorithm registered the lowest performance on both the imbalanced and the
balanced datasets (see Table 3 and Table 4). Therefore, the XGBoost algorithm is selected as the best
classifier compared with the other classic and ensemble machine learning algorithms, and the soft
voting classifier of LGBM, XGB, CatBoost, RF, and DT is selected as the best voting classifier
compared with the other voting classifiers.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Model explainability</title>
        <p>To enhance the explainability of the classification model, we employed several techniques. We
explained and interpreted the classification model developed with each algorithm to build
trust in how it achieves its results. The explainable AI approach with the LIME and SHAP frameworks
was implemented to understand how the model predicts the final results. To explain the model, we
randomly selected rows 100, 150, 200, 250, and 300 of the dataset; these rows were chosen
at random, and any other rows could equally be selected.</p>
        <p>Fig. 3. Model explanation with LIME for row 100. Fig. 5. Model explanation with LIME for row 150.
Figures 3, 4, 5, 6, and 7 depict interpretations of the XGBoost model using the LIME
explainable AI method for classifying specific types of dry beans. In each case, the model assigns
the bean to its class with 100% predicted probability. The key findings from
each interpretation are:
Class 'BOMBAY' (Figure 3): the model identifies dry beans as 'BOMBAY' based on specific features
such as perimeter, shape factors, minor axis length, convex area, and area. For instance, the beans
are classified as 'BOMBAY' when Perimeter &gt; 0.83, ShapeFactor1 &lt;= 0.78, MinorAxisLength &gt; 0.91,
ConvexArea &gt; 0.94, and Area &gt; 0.94.</p>
        <p>Class 'SEKER' (Figure 4): The model correctly classifies dry beans as 'SEKER' by considering features
like shape factors, minor axis length, and compactness. For instance, beans are categorized as 'SEKER'
when ShapeFactor4 &gt; 0.33, ShapeFactor1 &lt; -0.15, MinorAxisLength &lt; -0.24, ShapeFactor3 &gt; 0.45, and
Compactness &gt; 0.44.</p>
        <p>Class 'HOROZ' (Figure 5): Dry beans are accurately classified as 'HOROZ' based on features such as
roundness, perimeter, convex area, equivalent diameter, and area. For example, beans are classified
as 'HOROZ' when roundness &lt;= -0.82, Perimeter &gt; 0.24 &amp; &lt;= 0.83, ConvexArea &gt; -0.21 &amp; &lt;= 0.19,
EquivDiameter &gt; -0.23 &amp; &lt;= 0.19, and Area &gt; -0.21 &amp; &lt;= 0.20.
Class 'SIRA' (Figure 6): The model identifies dry beans as 'SIRA' considering attributes like perimeter,
roundness, minor axis length, shape factors, and shape factor 3. For instance, beans are classified as
'SIRA' when Perimeter &gt; 0.24 &amp; &lt;= 0.83, roundness &gt; -0.22 &amp; &lt;= -0.12, MinorAxisLength &gt; -0.24 &amp;
&lt;= 0.09, ShapeFactor1 &gt; -0.15 &amp; &lt;= -0.08, and ShapeFactor3 &gt; -0.65 &amp; &lt;= -0.64.</p>
        <p>Class 'BARBUNYA' (Figure 7): Dry beans are correctly classified as 'BARBUNYA' based on features
like roundness, perimeter, minor axis length, shape factor 1, and convex area. For example, beans are
categorized as 'BARBUNYA' when roundness &lt;= -0.82, Perimeter &gt; 0.83, MinorAxisLength &gt; 0.91,
ShapeFactor1 &lt;= -0.78, and ConvexArea &gt; 0.94.</p>
        <p>These interpretations provide insights into how the model makes its predictions, highlighting the
specific features that are influential in classifying different types of dry beans.</p>
        <p>Figures 8, 9, 10, 11, and 12 below show the decisions generated by the XGBoost model for the
randomly selected rows 100, 150, 200, 250, and 300, respectively. Based on these decisions,
the class values for rows 100, 150, 200, 250, and 300 are 6, 2, 3, 1, and 5
respectively; for the class names, see Section 3.2.2.
Figure 13 below shows the importance of each feature for each class in the
classification model. Based on the results above, we decided that XGBoost is the best
model for classifying the dry beans, so we explained the XGBoost model using the SHAP
explainable AI method, which explains the model through feature relevance. The figure
below shows the importance of each feature for each class.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Recommendation</title>
      <p>Dry beans belong to the diverse Fabaceae family, sometimes referred to as Leguminosae, and they
are the most important and most widely produced pulse in the world. The dry bean originated in the
Americas; its wide genetic diversity worldwide arose because, in the 15th and 16th centuries, it was
transported to Europe and Africa and quickly spread to the rest of the globe. There are numerous
genetic varieties of dry beans, and it is the most produced of the edible legume crops in
the world. According to the Turkish Standards Institution, dry beans are classified as Barbunya,
Battal, Bombay, Calı, Dermason, Horoz, Tombul, Selanik, and Seker based on their botanical
characteristics. This study aimed to develop an explainable and interpretable classification model for
dry beans using a soft voting classifier and to compare its performance with other classic and ensemble
machine learning algorithms. The data source for this research is a publicly available dataset on
Kaggle. After the data preprocessing tasks, out of 13,611 instances with 16 features and one
class label, 13,543 instances with 16 features were used for developing the classification model, and
after handling class imbalance using SMOTE + Tomek, 7,655 instances were used for the model. We
checked the multicollinearity of each feature using variance inflation factors to assess the
significance of each feature, and we concluded that all the features were significant. The proposed
model was constructed using soft voting classifiers, decision trees, random forests, extreme gradient
boosting, CatBoost, and LGBM algorithms on the balanced and unbalanced datasets. To conduct
this study, we carried out a total of twelve experiments. The performances of the models were evaluated
using the accuracy, precision, recall, and F1-score evaluation metrics. We also explained the
classification model using the LIME and SHAP feature relevance explanation techniques to enhance
its explainability and interpretability by addressing the black-box nature of the
algorithms. In this study, the best classification model was identified by the accuracy of the
developed classification models: XGBoost was selected as the best algorithm for classifying the
dry bean, with 92.5065% accuracy on the balanced dataset. Finally, the
researchers recommend that future work develop a dry bean classification model that includes
additional features of the dry bean, such as 3D features or the suture axis of the bean. Future
researchers can also build a dry bean classification model using other advanced algorithms to
improve performance and develop a mobile application.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[13] S. B. Kotsiantis and D. Kanellopoulos, “Data preprocessing for supervised learning,” Int. J. …, vol. 1, no. 2, pp. 1–7, 2006, doi: 10.1080/02331931003692557.
[14] S. Manikandan, “Data transformation,” J. Pharmacol. Pharmacother., vol. 1, no. 2, p. 126, 2010, doi: 10.4103/0976-500x.72373.
[15] J. W. Osborne, “Notes on the use of data transformations,” Pract. Assessment, Res. Eval., vol. 8, no. 6, 2003.
[16] I. Journal and C. Science, “Class Imbalance Problem in Data Mining: Review,” vol. 2, no. 1, 2013.
[17] R. P. Ribeiro, “SMOTE for Regression,” 2013, doi: 10.1007/978-3-642-40669-0.
[18] E. F. Swana, W. Doorsamy, and P. Bokoro, “Tomek Link and SMOTE Approaches for Machine Fault Classification with an Imbalanced Dataset,” Sensors, vol. 22, no. 9, 2022, doi: 10.3390/s22093246.
[19] “Imbalanced Classification in Python: SMOTE-Tomek Links Method | by Raden Aurelius Andhika Viadinugroho | Towards Data Science.” Accessed: Mar. 30, 2023. [Online]. Available: https://towardsdatascience.com/imbalanced-classification-in-python-smote-tomek-linksmethod-6e48dfe69bbc
[20] “Training, Validation and Testing Data Explained | Applause.” Accessed: Aug. 16, 2021. [Online]. Available: https://www.applause.com/blog/training-data-validation-data-vs-test-data
[21] M. K. Uçar, M. Nour, H. Sindi, and K. Polat, “The Effect of Training and Testing Process on Machine Learning in Biomedical Datasets,” Math. Probl. Eng., vol. 2020, 2020, doi: 10.1155/2020/2836236.
[22] M. J. Healy, “Statistics from the inside. 15. Multiple regression (1),” Arch. Dis. Child., vol. 73, no. 2, pp. 177–181, 1995, doi: 10.1136/adc.73.2.177.
[23] R. G. Mantovani, A. L. D. Rossi, E. Alcobaça, J. C. Gertrudes, S. B. Junior, and A. C. P. de L. F. de Carvalho, “Rethinking Default Values: a Low Cost and Efficient Strategy to Define Hyperparameters,” 2020. [Online]. Available: http://arxiv.org/abs/2008.00025
[24] B. H. Shekar and G. Dagnew, “Grid search-based hyperparameter tuning and classification of microarray cancer data,” 2019 2nd Int. Conf. Adv. Comput. Commun. Paradig. (ICACCP), pp. 1–8, 2019, doi: 10.1109/ICACCP.2019.8882943.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Desole</surname>
          </string-name>
          , “Dry Bean Dataset Analysis,” Math. Mach. Learn.,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sathe</surname>
          </string-name>
          , “
          <article-title>Dry bean protein functionality</article-title>
          ,
          <source>” Crit. Rev. Biotechnol.</source>
          , vol.
          <volume>22</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>223</lpage>
          ,
          <year>2002</year>
          , doi: 10.1080/07388550290789487.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bassett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cichy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thompson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Morris</surname>
          </string-name>
          , “
          <article-title>Bean split ratio for dry bean canning quality and variety analysis</article-title>
          ,
          <source>” IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work.</source>
          , vol. 2019-June, pp.
          <fpage>2665</fpage>
          -
          <lpage>2668</lpage>
          ,
          <year>2019</year>
          , doi: 10.1109/CVPRW.2019.00323.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koklu</surname>
          </string-name>
          and
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Ozkan</surname>
          </string-name>
          , “
          <article-title>Multiclass classification of dry beans using computer vision and machine learning techniques</article-title>
          <source>,” Comput. Electron. Agric.</source>
          , vol.
          <volume>174</volume>
          , no. May, p.
          <fpage>105507</fpage>
          ,
          <year>2020</year>
          , doi: 10.1016/j.compag.2020.105507.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Słowiński</surname>
          </string-name>
          , “
          <article-title>Dry beans classification using machine learning</article-title>
          ,
          <source>” CEUR Workshop Proc.</source>
          , vol.
          <volume>2951</volume>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moshinsky</surname>
          </string-name>
          , “Dry Bean Classification,”
          <source>Nucl. Phys.</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>104</fpage>
          -
          <lpage>116</lpage>
          ,
          <year>1959</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Salauddin Khan</surname>
          </string-name>
          et al., “
          <article-title>Comparison of multiclass classification techniques using dry bean dataset</article-title>
          ,
          <source>” Int. J. Cogn. Comput. Eng.</source>
          , vol.
          <volume>4</volume>
          , pp.
          <fpage>6</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2023</year>
          , doi: 10.1016/j.ijcce.2023.01.002.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. E. D.</given-names>
            <surname>Love</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matthews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          , “
          <article-title>Explainable Artificial Intelligence (XAI): Precepts, Methods, and Opportunities for Research in Construction</article-title>
          ,” pp.
          <fpage>1</fpage>
          -
          <lpage>58</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Dejene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Abuhay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Bogale</surname>
          </string-name>
          , “
          <article-title>Predicting the level of anemia among Ethiopian pregnant women using homogeneous ensemble machine learning algorithm</article-title>
          ,
          <source>” BMC Med. Inform. Decis. Mak.</source>
          , vol.
          <volume>22</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          ,
          <year>2022</year>
          , doi: 10.1186/s12911-022-01992-6.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          , “
          <article-title>Data Preprocessing Techniques for Data Mining</article-title>
          ,”
          <source>Science</source>
          , p.
          <fpage>6</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dymond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Coger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Serafetinides</surname>
          </string-name>
          , “
          <article-title>Data preprocessing applied to human average visual evoked potential P100-N140 amplitude, latency, and slope</article-title>
          ,
          <source>” Psychiatry Res.</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>315</fpage>
          -
          <lpage>322</lpage>
          ,
          <year>1980</year>
          , doi: 10.1016/0165-1781(80)90061-X.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Son</surname>
          </string-name>
          , “
          <article-title>Data cleaning and Data preprocessing</article-title>
          ,”
          <year>2011</year>
          , [Online]. Available: http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>