Iris database - Effectiveness of selected classifiers

Paulina Hałatek, Katarzyna Wiltos and Mariusz Wróbel
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, Poland

Abstract
Machine learning and artificial intelligence are crucial tools in the vast majority of fields, but mainly in computer science and technology. Classifiers play a vital role in this area, especially in predicting the class membership of a sample under consideration. An example of the practical use of classifiers is the spam filter for email messages. The following paper aims to determine the most efficient of the selected classifiers - kNN, Soft set, and Naive Bayes - on the Iris database. Different versions of each classifier have been considered: for kNN, the performance of various metrics was compared; for the Soft set, two approaches to establishing intervals during classification; and for Naive Bayes, the normal and triangular distributions. The most effective versions of the classifiers were selected for the final comparison.

Keywords
Artificial intelligence, Iris, classifiers, kNN, Soft set, Naive Bayes

IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
paulhal456@student.polsl.pl (P. Hałatek); katawil756@student.polsl.pl (K. Wiltos); mariwro279@student.polsl.pl (M. Wróbel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Artificial intelligence is an important aspect of today's world. It enables communication between human and machine: thanks to it, we are able to teach a computer how to process given data and obtain a response from it. This is called machine learning.

Machine learning is a field of study that uses diverse algorithms to analyse given data. Examples include:

1. variety recognition,
2. weather and disease prediction,
3. puzzle or sudoku solvers,
4. building a movie recommendation system.

This is only a small fraction of the immeasurable possibilities in this field of study. Machine learning models are used to learn the patterns in data. Machine learning algorithms can be used, for example, to gather information about data, split the data into two parts, and try to identify an unknown sample as an element of the data. The methods which perform this identification are called classifiers. We have various types of classifiers applicable to different tasks. In machine learning models, neural-network-based ideas are very efficient in complex data analysis. In [4] it was presented how to use them in low-dimensional data feature learning. The idea presented in [8] proposed a neural network for analytical purposes on data recorded from a high-speed train. There are also very efficient, yet simple in construction, classifiers based on approaches sourced in data analytics. In [1] a soft set model was proposed for approximate reasoning from input data. A model based on the kNN classifier for big data analytics was presented in [7]. In decision processes we also very often use Bayesian approaches, which analyse the probability of possible situations. In [2] a wildfire risk assessment based on remote sensing data was presented. Transmission of sensor readings for classifiers is an important topic, and there are many interesting models to support this process [5]. The classification model also depends on the type of the input information. In [6] it was discussed how to use a Bayesian model for text classification. These kinds of processes also need efficient data aggregation to improve the efficiency of the classifier [3].

In our paper we compare three classifiers - kNN, Soft set, and Naive Bayes - and decide which of them is the most accurate and effective. We try different methods for each classifier to obtain the most reliable results. Each of these classifiers differs in some aspects, but what connects them is that their main purpose is to identify a given sample by learning from a database, using different types of identification. In this paper we bring each classifier closer: we explain each of the three classifiers, describe which identification techniques we used, and draw an overall conclusion about which one is the best classifier. We decided to use the Iris database to compare the accuracy of these classifiers and identify the most suitable one.

2. Iris database

Before we started working on our base, we made sure that our Iris database did not contain null or NaN values. We created an additional class called DataProcessing, which helped us to shuffle, normalize, and split the database by dividing it into 70% as a training set and 30% as a validation set (a minimal sketch of such a helper is shown below Figure 2). As can be seen in Figure 1, the Setosa class is significantly separated from the rest of the classes. This results in a high proportion of correctly recognized objects for this class.

Figure 1: Iris dataset graphs
Figure 2: Iris dataset information
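The paper does not show the code of the DataProcessing class, so the following is only a minimal Python sketch of such a helper, written under the assumption that each record is a (features, label) pair; the class and method names mirror the description above but are otherwise illustrative.

import random

class DataProcessing:
    """Shuffles a dataset and splits it into a 70% training set and a 30% validation set."""

    def __init__(self, rows):
        # rows: list of (features, label) pairs, e.g. ([5.1, 3.5, 1.4, 0.2], "Setosa")
        self.rows = list(rows)

    def shuffle(self, seed=None):
        # Shuffle in place so that the later split is random.
        random.Random(seed).shuffle(self.rows)

    def split(self, train_ratio=0.7):
        # Returns (training set, validation set) in the given ratio.
        cut = int(len(self.rows) * train_ratio)
        return self.rows[:cut], self.rows[cut:]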
2.1. Normalization

For all classifiers we normalized the given database by taking all values from a specific column, determining the lowest and highest value in that column, and changing all values in the column according to the formula given below:

\[
newValue[x][y] = \frac{oldValue[x][y] - min}{max - min} \qquad (1)
\]

where
x - the current row,
y - the current column,
min - the minimal value in the current column,
max - the maximal value in the current column.
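A minimal Python sketch of the min-max normalization from Equation (1), assuming the data is given as a list of rows of numeric feature values (the function name is ours, not from the original implementation):

def normalize_columns(rows):
    """Min-max normalization of every feature column, following Equation (1)."""
    columns = list(zip(*rows))                 # transpose: one tuple per column
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    # Assumes max > min in every column, i.e. no constant columns.
    return [
        [(value - mins[y]) / (maxs[y] - mins[y]) for y, value in enumerate(row)]
        for row in rows
    ]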
3. Methods

3.1. kNN

3.1.1. Formulas

In order to function properly, the kNN algorithm needs functions that calculate the distance from the object for which we are looking for a class to the objects of classes already known to us. It is on the basis of this distance that the kNN algorithm decides to which class a given object may belong. There are many ways to calculate distances, each with its pros and cons. When calculating distances, we can, for example, use one of the known metrics, e.g. the Euclidean distance:

\[
\|x - x_i\|^2 = \sum_{j=1}^{n} (x_j - x_{ij})^2 \qquad (2)
\]

or the Minkowski distance:

\[
L_m(x, y) = \Big( \sum_{i=1}^{n} |x_i - y_i|^m \Big)^{\frac{1}{m}} \qquad (3)
\]

In our algorithm, we chose the Minkowski metric.

3.1.2. Algorithm

The kNN (k-Nearest Neighbors) classifier is one of the most important non-parametric classification methods. The kNN algorithm does not create an internal representation of the training data, but looks for a solution only when a testing pattern appears. It assigns an object to a given class by checking to which representatives the object has the shortest distance. The algorithm works as follows. First, a sample is taken from the validation set. Next, for the given sample, the distance to each object in the test set is calculated. Then a list is created containing each test set object and its distance to the sample, and this list is sorted from the shortest distance to the longest. After that, the k objects from this list with the shortest distance to the sample are analyzed. At the end, the sample is assigned to the class with the most objects among them.

Algorithm 1: kNN algorithm
Input: test set, validation set, k, m
Output: the class to which the sample may belong

while i < len(validation set) do
    while j < len(test set) do
        calculate the Minkowski distance from test object j to sample i and add the result to the list of distances
        j++
    sort the list of distances in ascending order
    take the k objects with the smallest distance and return the class x with the most objects
Return from the dictionary the variety with the highest probability.
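A minimal Python sketch of the kNN procedure described above, using the Minkowski distance from Equation (3); the training data is assumed to be a list of (features, label) pairs and the function names are ours:

from collections import Counter

def minkowski(a, b, m=2):
    """Minkowski distance between two feature vectors (Equation 3)."""
    return sum(abs(x - y) ** m for x, y in zip(a, b)) ** (1 / m)

def knn_classify(train, sample, k=3, m=2):
    """Returns the majority label among the k training objects nearest to the sample."""
    distances = sorted((minkowski(features, sample, m), label) for features, label in train)
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]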
3.2. Soft sets

The soft set, as a mathematical model, offers a tool for analysing vaguely defined objects. Soft set theory is a generalisation of fuzzy set theory that was introduced in 1999 by Dmitri Molodtsov.

3.2.1. Formulas

There are a few ways in which a soft set may be implemented, for example including weights or not. In this case, weights were included. Pearson correlation coefficients were calculated to properly choose the most appropriate weight values for the particular characteristics:

\[
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \qquad (4)
\]

where
x_i - characteristic value for i = 0, 1, ..., n,
x̄ - mean value of the particular characteristic,
y_i - value of the compared characteristic for i = 0, 1, ..., n,
ȳ - mean value of the compared characteristic,
r - Pearson correlation coefficient value.

3.2.2. Algorithm

Prior to classification, the data was prepared through shuffling, normalizing, and splitting the database into a test set and a validation set in the ratio of 70 to 30. In the developed implementation of the soft set, the minimum and maximum values for each species in the test set were calculated. On their basis, the middle values of the species ranges were determined. The classifier considers samples from the validation set together with the selected characteristic weights.

In the first approach, for each sample the distance from the centre of each interval is calculated and the minimum value is chosen, which determines the sample classification (a short code sketch of this approach is given after Algorithm 3).

Figure 3: First approach algorithm visualization

Algorithm 2: Soft set algorithm - first approach
Input: test set, validation set, weight
Output: the class to which the sample was classified

Create a nested list with the iris type name and the minimal and maximal values of each iris type in the test set.
Create a nested list with the iris type name and the centre value of each iris type, based on the minimal and maximal values.
centers ← centre value of each iris type
for index < len(validation set) do
    row ← list of trait values of one sample
    sampleType ← iris type name
    sampleValue ← 0
    i ← 0
    for t in row do                        ▷ add all trait values for the sample
        sampleValue += t * weight[i]
        i += 1
    for value in centers do                ▷ calculate distances
        centre = value[0]
        distance = |centre − sampleValue|
    Choose the minimal distance and the corresponding iris type.
    Return the classified type.

In the second approach, overlapping intervals are considered and a mean value is calculated to create new intervals. Based on the new intervals, each sample is classified according to these measures.

Figure 4: Second approach algorithm - calculating the mean value for overlapping intervals
Figure 5: Second approach algorithm - determined intervals

Algorithm 3: Soft set algorithm - second approach
Input: test set, validation set, weight
Output: the class to which the sample was classified

Create a sorted list of all minimal and maximal values of each iris type from the test set.
Create a nested list of new ranges for each iris type, where the mean value was calculated for the overlapping sets and taken as the new edge value of the set.
sortedValues ← list of each iris type range
newRanges ← calculated iris type ranges
for index < len(validation set) do
    row ← list of trait values of one sample
    sampleType ← iris type name
    sampleValue ← 0
    i ← 0
    for t in row do
        sampleValue += t * weight[i]
        i += 1
    if sampleValue ∈ newRanges[0] then
        classified as Setosa
    else if sampleValue ∈ newRanges[1] then
        classified as Versicolor
    else if sampleValue ∈ newRanges[2] then
        classified as Virginica
    Return the classified type.
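A minimal Python sketch of the first soft-set approach (Algorithm 2): the weighted sum of traits is computed per class on the test set, the centre of each class interval is taken as (min + max) / 2, and a sample is assigned to the class with the nearest centre. The data layout and function names are our assumptions, not taken from the original implementation.

def class_centres(train, weight):
    """Centre of the weighted-sum interval of each iris type (first soft-set approach)."""
    sums = {}
    for features, label in train:
        value = sum(t * w for t, w in zip(features, weight))
        sums.setdefault(label, []).append(value)
    return {label: (min(vals) + max(vals)) / 2 for label, vals in sums.items()}

def soft_set_classify(centres, sample, weight):
    """Assigns the sample to the class whose interval centre is closest to its weighted sum."""
    value = sum(t * w for t, w in zip(sample, weight))
    return min(centres, key=lambda label: abs(centres[label] - value))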
3.3. Naive Bayes

The Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. At the beginning, it reduces the database by splitting the Iris database into three smaller databases according to variety. After that, the classifier assigns the initial probability of a given species appearing in the database. Next, it takes a sample and computes a probability for each reduced database, using one of the two considered distribution formulas. Subsequently, it multiplies the initial probability by all partial probabilities (one for each attribute of the reduced database). At the end, it compares which of the three possible probabilities is the highest.

3.3.1. Formulas

Normal distribution:

\[
P(a_i \mid V) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(s_{a_i} - \mu)^2}{2\sigma^2}\right) \qquad (5)
\]

Triangular distribution:

\[
P(a_i \mid V) =
\begin{cases}
0, & s_{a_i} < \mu - \sqrt{6}\sigma \\
\dfrac{s_{a_i} - \mu}{6\sigma^2} + \dfrac{1}{\sqrt{6}\sigma}, & \mu - \sqrt{6}\sigma \le s_{a_i} \le \mu \\
-\dfrac{s_{a_i} - \mu}{6\sigma^2} + \dfrac{1}{\sqrt{6}\sigma}, & \mu \le s_{a_i} \le \mu + \sqrt{6}\sigma \\
0, & s_{a_i} > \mu + \sqrt{6}\sigma
\end{cases} \qquad (6)
\]

where
a_i - the current attribute in the reduced database,
s_{a_i} - the value of the current attribute of the sample,
σ - the standard deviation of the attribute,
μ - the mean of the attribute in the reduced database,
V - the current reduced variety database.

Counting the probability:

\[
P(Variety) = P(Init) \cdot \prod_{i=1}^{4} P(a_i \mid Variety) \qquad (7)
\]

3.3.2. Algorithm

To simplify how Naive Bayes actually works, we explain the procedure on the Iris database. At the beginning, the Naive Bayes algorithm takes two parameters: the first is a test set and the second is a validation set. Afterwards, it splits the test set into three reduced databases according to variety. Subsequently, it calculates an initial probability by dividing the number of elements in a reduced database by the number of all elements in the main database. Next, for a given sample, it calculates a partial probability for each attribute in each reduced database. To do so, it takes the list of values of each attribute and calculates their mean and standard deviation. After that, it calls a distribution function, passing the sample's current attribute value, the standard deviation, and the mean. Next, it multiplies the initial probability by the four partial probabilities. At the end, it returns the variety with the highest probability.

Algorithm 4: Naive Bayes algorithm
Input: test set, sample
Output: the class to which the sample may belong

Make three reduced databases according to their varieties.
Make a list of the attribute names in the reduced databases.
Make an empty dictionary.
i ← 0
while i < len(reducedDatabases) do
    initProb ← len(reducedDatabases[i]) / len(dataBase)
    probability ← initProb
    j ← 0
    while j < len(attributes) do
        get a list of all values in column[j]
        calculate the mean and standard deviation from the list
        partialProb ← DistributionFunction(sample[j], standardDeviation, mean)
        probability ← probability * partialProb
    add a new record to the dictionary
Return from the dictionary the variety with the highest probability.
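A minimal Python sketch of the Naive Bayes procedure with the normal distribution of Equation (5) and the product of Equation (7); as above, the training data is assumed to be a list of (features, label) pairs and the function names are ours:

import math
from collections import defaultdict

def gaussian(x, mean, std):
    """Normal density from Equation (5)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / math.sqrt(2 * math.pi * std ** 2)

def naive_bayes_classify(train, sample):
    """Returns the variety with the highest probability according to Equation (7)."""
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)
    scores = {}
    for label, rows in by_class.items():
        probability = len(rows) / len(train)             # initial probability P(Init)
        for j, value in enumerate(sample):
            column = [row[j] for row in rows]
            mean = sum(column) / len(column)
            # Assumes each attribute varies within every class, i.e. std > 0.
            std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
            probability *= gaussian(value, mean, std)    # partial probability P(a_i | Variety)
        scores[label] = probability
    return max(scores, key=scores.get)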
4. Experiments

In order to properly analyze the individual classifiers, we perform a series of tests that allow us to select the best classifier using the confusion matrix. The confusion matrix is used to assess the quality of a binary classification. It describes how well the classifier classified the given samples and also gives us several characteristics of the classifier, such as:

1. Accuracy
\[
\frac{TP + TN}{TP + TN + FP + FN} \qquad (8)
\]

2. Sensitivity
\[
\frac{TP}{TP + FN} \qquad (9)
\]

3. Precision
\[
\frac{TP}{TP + FP} \qquad (10)
\]

4. F1 Score
\[
\frac{2TP}{2TP + FP + FN} \qquad (11)
\]

5. Specificity
\[
\frac{TN}{TN + FP} \qquad (12)
\]

For this purpose we build one confusion matrix based on one of the three classes, here Setosa. Based on that class, it is determined how well all three classifiers classified the given samples; the sketch below the matrix shows how the above measures are computed from its entries.

                Setosa   Versicolor   Virginica
    Setosa        TP         FP           FP
    Versicolor    FN         TN           FN
    Virginica     FN         FN           TN
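The five measures of Equations (8)-(12) can be computed directly from the confusion-matrix counts; a minimal Python sketch (the function name is ours):

def metrics(tp, fp, tn, fn):
    """Classifier quality measures from Equations (8)-(12)."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "precision":   tp / (tp + fp),
        "f1":          2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),
    }

# For example, the counts reported later in Table 1 (TP=16, FP=0, TN=27, FN=2)
# give accuracy 0.96, sensitivity 0.89, precision 1.00, F1 0.94 and specificity 1.00.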
4.1. kNN

After testing, we noticed that the results for any k are very similar to each other. This may be due to the fact that the Setosa class is significantly distant from the other two classes, which means that there is a very high probability that the objects closest to a Setosa sample will also be Setosa. Below are the results for k equal to 1, 2, 3, and 4, respectively.

Table 1: Results for k = 1
    k   TP   FP   TN   FN
    1   16    0   27    2

    AC     SEN    PRE    F1     SPE
    0.96   0.89   1.00   0.94   1.00

Table 2: Results for k = 2
    k   TP   FP   TN   FN
    2   19    0   22    4

    AC     SEN    PRE    F1     SPE
    0.91   0.83   1.00   0.90   1.00

Table 3: Results for k = 3
    k   TP   FP   TN   FN
    3   14    0   29    2

    AC     SEN    PRE    F1     SPE
    0.96   0.88   1.00   0.93   1.00

Table 4: Results for k = 4
    k   TP   FP   TN   FN
    4   18    0   26    1

    AC     SEN    PRE    F1     SPE
    0.98   0.95   1.00   0.97   1.00

4.2. Naive Bayes

We considered two distribution formulas in our article. In this section we decide which of the two is the best for our database.

For the Normal distribution function, the results are:

Table 5: Results for Setosa class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       0.88          1.00        0.93   1.00

    TP   FP   TN   FN
    14    0   29    2

Table 6: Results for Versicolor class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       0.89          1.00        0.94   1.00

    TP   FP   TN   FN
    16    0   27    2

Table 7: Results for Virginica class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       1.00          0.87        0.93   0.94

    TP   FP   TN   FN
    13    2   30    3

For the Triangular distribution function, the results are:

Table 8: Results for Setosa class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.91       0.78          1.00        0.88   1.00

    TP   FP   TN   FN
    14    0   27    4

Table 9: Results for Versicolor class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.91       0.83          0.94        0.88   0.96

    TP   FP   TN   FN
    15    1   26    3

Table 10: Results for Virginica class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.91       0.92          0.80        0.86   0.91

    TP   FP   TN   FN
    12    3   29    1

After analyzing the above tables, we decided that the Normal distribution is the best distribution function, and it will be used in the final test. At the end we performed 100 tests and obtained the following results:

Table 11: Results for Setosa class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.95       0.87          1.00        0.93   1.00

Table 12: Results for Versicolor class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.95       0.93          0.94        0.93   0.97

Table 13: Results for Virginica class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.95       0.93          0.92        0.93   0.96

4.3. Soft sets

The determined values of the Pearson correlation coefficients for each characteristic of the iris flowers allowed choosing the feature weights for optimal classifier accuracy. The association between individual features and their influence on classifier efficiency were considered. Based on these factors, different weights were applied to select the most suitable solution.

Figure 6: Pearson correlation coefficients for iris characteristics

Considering the obtained correlation values, it was concluded that the most important characteristics are, in descending order: petal-length, petal-width, sepal-length, and sepal-width. According to these observations, successively assigning different weight values, the best results were observed with the weight w = [0.1, 0, 0.5, 0.4].
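The weight choice above was guided by the Pearson correlation coefficients of Equation (4). A minimal Python sketch of that coefficient, assuming two numeric columns of equal length (for example a feature column and a numerically encoded class label); the function name is ours:

def pearson(xs, ys):
    """Pearson correlation coefficient between two columns (Equation 4)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (sum((x - mean_x) ** 2 for x in xs) ** 0.5) * \
                  (sum((y - mean_y) ** 2 for y in ys) ** 0.5)
    return numerator / denominator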
The analysis of the results for both the first and the second algorithm showed that the first algorithm is the more effective soft set implementation. After performing 100 tests, the following results were obtained.

Table 14: Statistical results for first approach
    Accuracy   F1     Sensitivity   Precision   Specificity
    0.96       0.94   0.92          0.96        0.98

Table 15: Statistical results for second approach
    Accuracy   F1     Sensitivity   Precision   Specificity
    0.96       0.93   0.92          0.95        0.97

For the most efficient implementation, the following results were obtained for the individual types of iris flowers after performing 100 tests.

Table 16: Results for Setosa class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.98       0.95          1.00        0.97   1.00

    TP   FP   TN   FN
    18    0   26    1

Table 17: Results for Versicolor class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.98       0.93          1.00        0.97   1.00

    TP   FP   TN   FN
    14    0   30    1

Table 18: Results for Virginica class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.98       1.00          0.92        0.96   0.97

    TP   FP   TN   FN
    12    1   32    0

5. Conclusion

In order to establish the most efficient classifier, the prepared implementations were compared on the same partition of the Iris database. In this comparison, the following results were statistically calculated.

Results for kNN:

Table 19: Results for Setosa class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       0.87          1.00        0.93   1.00

    TP   FP   TN   FN
    13    0   30    2

Table 20: Results for Versicolor class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       0.89          1.00        0.94   1.00

    TP   FP   TN   FN
    17    0   26    2

Table 21: Results for Virginica class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       1.00          0.87        0.93   0.94

    TP   FP   TN   FN
    13    2   30    0

Results for Naive Bayes:

Table 22: Results for Setosa class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       0.87          1.00        0.93   1.00

    TP   FP   TN   FN
    13    0   30    2

Table 23: Results for Versicolor class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       0.89          1.00        0.94   1.00

    TP   FP   TN   FN
    17    0   26    2

Table 24: Results for Virginica class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.96       1.00          0.87        0.93   0.94

    TP   FP   TN   FN
    13    2   30    0

Results for Soft sets:

Table 25: Results for Setosa class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.98       0.93          1.00        0.96   1.00

    TP   FP   TN   FN
    13    0   31    1

Table 26: Results for Versicolor class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.98       0.93          1.00        0.96   1.00

    TP   FP   TN   FN
    17    0   27    1

Table 27: Results for Virginica class
    Accuracy   Sensitivity   Precision   F1     Specificity
    0.98       1.00          0.93        0.97   0.97

    TP   FP   TN   FN
    14    1   30    0

Table 28: Results
    Classifier    Measure   Setosa   Versicolor   Virginica
    kNN           ACC       0.96     0.96         0.96
                  SEN       0.87     0.89         1.00
                  PRE       1.00     1.00         0.87
                  F1        0.93     0.94         0.93
                  SPE       1.00     1.00         0.94
                  TP        13       17           13
                  FP        0        0            2
                  TN        30       26           30
                  FN        2        2            0
    Naive Bayes   ACC       0.96     0.96         0.96
                  SEN       0.87     0.89         1.00
                  PRE       1.00     1.00         0.87
                  F1        0.93     0.94         0.93
                  SPE       1.00     1.00         0.94
                  TP        13       17           13
                  FP        0        0            2
                  TN        30       26           30
                  FN        2        2            0
    Soft sets     ACC       0.98     0.98         0.98
                  SEN       0.93     0.93         1.00
                  PRE       1.00     1.00         0.93
                  F1        0.96     0.96         0.97
                  SPE       1.00     1.00         0.97
                  TP        13       17           14
                  FP        0        0            1
                  TN        31       27           30
                  FN        1        1            0

Through analysis of the attained results for all classes and all classifiers, it can be noted that the level of accuracy is the highest for the soft set classifier. The values of the other statistically obtained characteristics also reach the highest levels for this classifier. It is worth mentioning that the obtained results for the particular characteristics of the kNN and Naive Bayes classifiers are similar.

Based on the obtained results, it can be concluded that the soft set implementation classifies most effectively. All of the classifiers were implemented correctly, and the results of the best classifier differ from the others only by a few percentage points.

The work and effort applied to completing this article are practical and applicable. This research offered an opportunity to learn and expand knowledge about the different approaches to building and assessing the chosen classifiers, as well as about the process of identifying the best solution. The analysis also allowed acquiring practical experience in implementing machine learning algorithms.

In the future, the project could be extended with further analysis of other classifiers, for instance by rebuilding the current classifiers in a more advanced way and selecting even more efficient solutions.

References

[1] Akram, M., Ali, G., Butt, M. A., Alcantud, J. C. R. (2021). Novel MCGDM analysis under m-polar fuzzy soft expert sets. Neural Computing and Applications, 33(18), 12051-12071.
[2] Chen, W., Zhou, Y., Zhou, E., Xiang, Z., Zhou, W., Lu, J. (2021). Wildfire risk assessment of transmission-line corridors based on Naïve Bayes network and remote sensing data. Sensors, 21(2), 634.
[3] Dong, W., Wozniak, M., Wu, J., Li, W., Bai, Z. (2022). De-noising aggregation of graph neural networks by using principal component analysis. IEEE Transactions on Industrial Informatics.
[4] Dong, W., Wu, J., Bai, Z., Hu, Y., Li, W., Qiao, W., Woźniak, M. (2021). MobileGCN applied to low-dimensional node feature learning. Pattern Recognition, 112, 107788.
[5] Rani, P., Verma, S., Kaur, N., Wozniak, M., Shafi, J., Ijaz, M. F. (2021). Robust and secure data transmission using artificial intelligence techniques in ad-hoc networks. Sensors, 22(1), 251.
[6] Ruan, S., Chen, B., Song, K., Li, H. (2022). Weighted Naïve Bayes text classification algorithm based on improved distance correlation coefficient. Neural Computing and Applications, 34(4), 2729-2738.
[7] Shokrzade, A., Ramezani, M., Tab, F. A., Mohammad, M. A. (2021). A novel extreme learning machine based kNN classification method for dealing with big data. Expert Systems with Applications, 183, 115293.
[8] Siłka, J., Wieczorek, M., Wozniak, M. (2022). Recurrent neural network model for high-speed train vibration prediction from time series. Neural Computing and Applications, 1-14.
[9] https://c3.ai/glossary/data-science/classifier/
[10] https://www.sas.com/en_th/insights/articles/big-data/artificial-intelligence-machine-learning-deep-learning-and-beyond.html
[11] https://www.sciencedirect.com/topics/computer-science/machine-learning
[12] https://www.sciencedirect.com/science/article/pii/S0898122199000565
[13] https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
[14] https://monkeylearn.com/blog/what-is-a-classifier/