                Decision tree approach for IRIS database classification

                                    Marcelina Lachowicz
                                  Institute of Mathematics
                             Silesian University of Technology
                           Kaszubska 23, 44-100 Gliwice, Poland
                                Email: marcelina.lac@o2.pl


   Abstract—Data classification is one of the important topics in
information technology. There are many methods used in this
process. This article presents the classification of iris flowers by
the use of a decision tree. In the system, a procedure was
implemented that uses an open data set to classify the given
types of flowers by the numerical features describing them.
Results show that the proposed model is very simple but also
efficient in numerical classification.

                         I. INTRODUCTION

   In information processing, a very important role is played by
data analysis and decision processes. From information hidden
in data we can gain knowledge about various things. In general,
we can use various methods to process the data. Artificial
intelligence gives us many interesting approaches to data
science.
   In general, input data is organized into smaller groups called
classes, in which the final classification is done. By the use of
this kind of decision-making process we can estimate many
things. In [1] a method was implemented to estimate energetic
efficiency. In [2], [3] predictions were made for wind farming,
while in [4], [5] cancer classification from medical images was
done. Among the methods of data science, decision trees are
used very often. Some of the first approaches to using decision
trees as classifiers were presented in [6], where toxic hazard
estimation was done using decision trees. In [7] decision trees
were used to support remote sensing in classifying land images.
Recently, many optimized decision tree structures have been
developed for specific examples of input data.
   In [8] it was presented how to join a decision tree with a bee
algorithm for faster data classification. In [9] decision trees
were joined with Bayesian methods to efficiently classify
smoking cessation. There are many approaches where decision
trees give very good results. An interesting survey of various
decision tree approaches and their implementations was
presented in [10].
   In this article I show how to implement a simple Python
procedure based on a decision tree classifier.
   The implemented idea was used to recognize iris flowers from
an open data set. Results show that the proposed implementation
works well, returning very good results.

   ©2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

                     II. RONALD AYLMER FISHER

   Ronald Aylmer Fisher was born on February 17, 1890 in
London, and died on July 29, 1962 in Adelaide. He was a
British geneticist and statistician. He graduated from Gonville
and Caius College at the University of Cambridge, worked as a
professor of eugenics at the London School of Economics in
1933-1943 and as a professor of genetics at the University of
Cambridge (1943-1957). Anders Hald described him as "a
genius who almost created the foundations of contemporary
statistics", and Richard Dawkins as "the greatest heir of
Darwin"; he was also a member of the Royal Society in
London. Fisher created, among others, the maximum likelihood
method, analysis of variance (ANOVA) and linear discriminant
analysis. He also dealt with methods of hypothesis verification
using statistical methods (in anthropology, genetics, ecology)
and was one of the creators of modern mathematical statistics.
He is also known for developing the design of experiments at
the Rothamsted Institute of Agricultural Research (1919-1933)
and as the author of Statistical Methods for Research Workers
(1925) and Statistical Methods and Scientific Inference (1956).

                    III. PROPOSED CLASSIFIER

   The principle of the proposed decision model is based on a
decision tree.
   1) Decision Tree: A decision tree is a (graphical) method of
supporting the decision-making process. It is used in decision
theory and in machine learning. In the second application, it
helps in acquiring knowledge based on examples.
   Algorithm: it works recursively for each node. We have to
decide whether the node will be:
   1) a leaf - we end this recursive call,
   2) a branch node - we branch according to the values that
      the given attribute takes, and for each child node we
      create a recursive call to the algorithm with the list of
      attributes reduced by the attribute just selected.
A minimal sketch of this recursion is given below.
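   To make the recursion concrete, the following minimal sketch (my
illustration, not the procedure used in the experiments) builds such
a tree over categorical attributes; for brevity the branching attribute
is simply the first one remaining, where a real learner would choose
it by a criterion such as information gain.

from collections import Counter

def majority_class(labels):
    # most frequent class among the examples reaching this node
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    # 1) leaf: all examples agree, or no attributes are left to split on
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": majority_class(labels)}
    # 2) branch node: split on an attribute and recurse on each subset,
    #    with the attribute list reduced by the attribute just selected
    attr = attributes[0]
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in set(row[attr] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        children[value] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset],
                                     remaining)
    return {"split on": attr, "children": children}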



   Building a tree: the tree consists of nodes (decisions and states
of nature) and branches. Rectangles are decisions, and states of
nature are circles. We start with the root. At the very beginning
we have the first given attribute, sepal length (figure 1), then we
analyze the second variable, sepal width (figure 2). In this way,
we continue the construction of the whole tree.

         Fig. 1. First step in proposed decision tree reasoning.

         Fig. 2. Second step in proposed decision tree reasoning.

A. Coding

from sklearn import tree

# read the feature vectors, one comma-separated flower per line
iris = open("iris.txt", "r")
rows = []
for line in iris:
    data = line.split(",")
    rows.append(data)
list2 = [[float(column) for column in row]
         for row in rows]
print(list2)

# read the class labels corresponding to the flowers
iris_res = open("iris-result.txt", "r")
res = []
for line in iris_res:
    data = line.split(",")
    res.append(data)
res2 = [[int(column) for column in row]
        for row in res]
print(res2)

# build the decision tree classifier and fit it to the data
pr = tree.DecisionTreeClassifier()
pr = pr.fit(list2, res2)

   The code was written in Python. As you can see, we first load
the data of the IRIS database into the program. Then we call the
function that trains our classifier, provided by a library that
Python has:

sklearn.tree.DecisionTreeClassifier
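   Once fitted, the tree can be queried for new flowers; a short
usage sketch (the sample values are hypothetical, not taken from the
experiments):

# predict the class of one new flower from its four measurements,
# using the classifier 'pr' fitted in the listing above
sample = [[5.1, 3.5, 1.4, 0.2]]
print(pr.predict(sample))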




                       IV. THE IRIS DATABASE

   The IRIS database contains a set of iris flower measurements
and was first made available by Ronald Fisher in 1936. It is one
of the most well-known collections and, as we will see in a
moment, it is also very simple. Each iris is described by 4
measurements: the length and width of its sepals and petals.
There are three types of flowers:

  •   Versicolor - this flower is found in North America and
      grows up to a height of 80 centimeters. The leaves of this
      plant are more than one centimeter wide, and the roots
      form large and thick clumps. A well-developed plant has
      6 blue petals and blooms from May to July, while large
      seeds can be observed appearing in autumn.
  •   Setosa - this flower is found in Canada, Russia, north-east
      Asia, China, Korea, southern Japan and Alaska. The plant
      has half-green leaves, tall branched stems and purple-blue
      flowers similar to lavender (there are also pink and white
      varieties). The roots are shallow, large and rapidly
      spreading.



  •   Virginica - this flower is native to North America. The
      leaves are 1 to 3 centimeters wide and sometimes longer
      than the flower stalk. The plant has 2 to 4 erect or
      arching, bright green leaves. The roots spread
      underground. The seeds are light brown and differently
      shaped, and are borne in three-part fruit capsules. The
      petals vary in color from dark purple to pinkish-white.
      These plants bloom from April to May and have from 2
      to 6 flowers.

                        V. OUR EXAMPLE

   Our database has 150 records (50 for each type of flower), of
which 120 were used to train the classifier and 30 (10 for each
species) were used to test it.
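   The split itself is not spelled out above; a sketch of how it could
be reproduced with scikit-learn (the stratification into 10 test flowers
per species and the fixed random_state are my assumptions):

from sklearn import tree
from sklearn.model_selection import train_test_split

# flatten the one-column label lists from the earlier listing
labels = [row[0] for row in res2]
train_data, test_data, train_labels, test_labels = train_test_split(
    list2, labels, test_size=30, stratify=labels, random_state=0)
pr = tree.DecisionTreeClassifier().fit(train_data, train_labels)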
The question then arises: do the data groups we have received
correspond to the three species of iris? To see this, let us look at
the error matrix.
                                                                                  (TPR)
A. Confusion Matrix

   The confusion matrix is the basic tool used to assess the
quality of a classification. In our table, we consider three classes
of abstraction, as a result of which we get a 3 x 3 matrix. The
table of errors arises from the intersection of the predicted class
and the class actually observed. Our matrix is presented in
figure 3. Now I want to analyze the resulting measures to show
how the proposed classification works.
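   Assuming the fitted tree pr and the held-out test_data and
test_labels from the split sketch above, the matrix itself can be
computed with scikit-learn:

from sklearn.metrics import confusion_matrix

# rows: true species; columns: predicted species
predicted = pr.predict(test_data)
print(confusion_matrix(test_labels, predicted))
# perfect classification prints a diagonal matrix with 10 per species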



   1) Analysis of the Confusion Matrix: Terminology and
derivations from a confusion matrix:
   1) For Versicolor
      • sensitivity, recall, hit rate or true positive rate (TPR)
        TPR = TP/P = TP/(TP + FN) = 10/(10 + 0) = 10/10 = 1
        TPR = 1 - FNR
      • specificity, selectivity or true negative rate (TNR)
        TNR = TN/N = TN/(TN + FP) = 20/(20 + 0) = 20/20 = 1
        TNR = 1 - FPR
      • precision or positive predictive value (PPV)
        PPV = TP/(TP + FP) = 10/(10 + 0) = 10/10 = 1
        PPV = 1 - FDR
      • negative predictive value (NPV)
        NPV = TN/(TN + FN) = 20/(20 + 0) = 20/20 = 1
        NPV = 1 - FOR
      • miss rate or false negative rate (FNR)
        FNR = FN/P = FN/(FN + TP) = 0/(0 + 10) = 0/10 = 0
        FNR = 1 - TPR
      • fall-out or false positive rate (FPR)
        FPR = FP/N = FP/(FP + TN) = 0/(0 + 20) = 0/20 = 0
        FPR = 1 - TNR
      • false discovery rate (FDR)
        FDR = FP/(FP + TP) = 0/(0 + 10) = 0/10 = 0
        FDR = 1 - PPV
      • false omission rate (FOR)
        FOR = FN/(FN + TN) = 0/(0 + 20) = 0/20 = 0
        FOR = 1 - NPV
      • accuracy (ACC)
        ACC = (TP + TN)/(P + N) = (TP + TN)/(TP + TN + FP + FN)
            = (10 + 20)/(10 + 20 + 0 + 0) = 30/30 = 1
      • F1 score - harmonic mean of precision and sensitivity
        F1 = 2 * (PPV * TPR)/(PPV + TPR) = 2TP/(2TP + FP + FN)
           = 20/(20 + 0 + 0) = 20/20 = 1
      • Matthews correlation coefficient (MCC)
        MCC = (TP * TN - FP * FN) /
              sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
            = (10 * 20 - 0 * 0)/sqrt(10 * 10 * 20 * 20) = 200/200 = 1
      • informedness or Bookmaker Informedness (BM)
        BM = TPR + TNR - 1 = 1 + 1 - 1 = 1
      • markedness (MK)
        MK = PPV + NPV - 1 = 1 + 1 - 1 = 1
   2) For Setosa: this class was also classified perfectly, with the
same counts TP = 10, TN = 20, FP = 0 and FN = 0, so every
derivation above is unchanged: TPR = TNR = PPV = NPV =
ACC = F1 = MCC = BM = MK = 1 and FNR = FPR = FDR =
FOR = 0.
   3) For Virginica: the same counts hold here as well (TP = 10,
TN = 20, FP = 0, FN = 0), so every measure again takes the
same values as for the other two classes.
Legend:
   • P - condition positive - the number of real positive cases
     in the data,
   • N - condition negative - the number of real negative cases
     in the data,
   • TP - true positive,
   • TN - true negative,
   • FP - false positive,
   • FN - false negative.

 Fig. 3. Confusion matrix for classification results by the use of implemented
decision tree method.

   We present our error matrix in the form of a table in figure 3,
in which the rows correspond to the species of iris to which the
data point actually belonged, while the columns indicate the
class to which it was assigned. The elements inside the table
give the number of data points of the iris species named in the
row header that were assigned to the data group named in the
column header. Let us note the perfect agreement between data
groups and species of iris. Each point has been correctly
classified.
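   A short sketch (my illustration) that recomputes the measures
listed above directly from the per-class counts TP = 10, TN = 20,
FP = 0, FN = 0:

from math import sqrt

TP, TN, FP, FN = 10, 20, 0, 0

TPR = TP / (TP + FN)                   # sensitivity / recall
TNR = TN / (TN + FP)                   # specificity
PPV = TP / (TP + FP)                   # precision
NPV = TN / (TN + FN)
ACC = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 * TP / (2 * TP + FP + FN)
MCC = (TP * TN - FP * FN) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(TPR, TNR, PPV, NPV, ACC, F1, MCC)  # all print 1.0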



                          VI. CONCLUSIONS

   The proposed reasoning was easy to implement. Results show
that the decisions were very good and all inputs were classified
correctly. The code of the reasoning, written in Python, used an
artificial intelligence library in which the method is implemented.
   In future research I want to develop another method for data
classification based on other probabilistic methods, where the
decision between classes will be related to statistical measures.
                              REFERENCES
 [1] G. Capizzi, G. L. Sciuto, G. Cammarata, and M. Cammarata, "Ther-
     mal transients simulations of a building by a dynamic model based
     on thermal-electrical analogy: Evaluation and implementation issue,"
     Applied Energy, vol. 199, pp. 323–334, 2017.
 [2] S. Brusca, G. Capizzi, G. Lo Sciuto, and G. Susi, "A new design
     methodology to predict wind farm energy production by means of a
     spiking neural network–based system," International Journal of Numer-
     ical Modelling: Electronic Networks, Devices and Fields, vol. 32, no. 4,
     p. e2267, 2019.
 [3] G. Capizzi, G. L. Sciuto, C. Napoli, and E. Tramontana, "Advanced and
     adaptive dispatch for smart grids by means of predictive models," IEEE
     Transactions on Smart Grid, vol. 9, no. 6, pp. 6684–6691, 2017.
 [4] M. Woźniak, D. Połap, G. Capizzi, G. L. Sciuto, L. Kośmider, and
     K. Frankiewicz, "Small lung nodules detection based on local variance
     analysis and probabilistic neural network," Computer Methods and
     Programs in Biomedicine, vol. 161, pp. 173–180, 2018.
 [5] M. Wozniak, D. Polap, L. Kosmider, C. Napoli, and E. Tramontana,
     "A novel approach toward x-ray images classifier," in 2015 IEEE
     Symposium Series on Computational Intelligence. IEEE, 2015, pp.
     1635–1641.
 [6] G. Cramer, R. Ford, and R. Hall, "Estimation of toxic hazard - a decision
     tree approach," Food and Cosmetics Toxicology, vol. 16, no. 3, pp. 255–
     276, 1978.
 [7] M. A. Friedl and C. E. Brodley, "Decision tree classification of land
     cover from remotely sensed data," Remote Sensing of Environment,
     vol. 61, no. 3, pp. 399–409, 1997.
 [8] H. Rao, X. Shi, A. K. Rodrigue, J. Feng, Y. Xia, M. Elhoseny, X. Yuan,
     and L. Gu, "Feature selection based on artificial bee colony and gradient
     boosting decision tree," Applied Soft Computing, vol. 74, pp. 634–642,
     2019.
 [9] A. Tahmassebi, A. H. Gandomi, M. H. Schulte, A. E. Goudriaan, S. Y.
     Foo, and A. Meyer-Baese, "Optimized naive-Bayes and decision tree
     approaches for fMRI smoking cessation classification," Complexity, vol.
     2018, 2018.
[10] S. R. Safavian and D. Landgrebe, "A survey of decision tree classifier
     methodology," IEEE Transactions on Systems, Man, and Cybernetics,
     vol. 21, no. 3, pp. 660–674, 1991.



