=Paper=
{{Paper
|id=Vol-2472/p3
|storemode=property
|title=Decision tree approach for IRIS database classification
|pdfUrl=https://ceur-ws.org/Vol-2472/p3.pdf
|volume=Vol-2472
|authors=Marcelina Lachowicz
}}
==Decision tree approach for IRIS database classification==
Decision tree approach for IRIS database classification

Marcelina Lachowicz
Institute of Mathematics, Silesian University of Technology
Kaszubska 23, 44-100 Gliwice, Poland
Email: marcelina.lac@o2.pl

Abstract—Data classification is one of the important topics in information technology, and many methods are used in this process. This article presents the classification of iris flowers by the use of a decision tree. The implemented procedure uses an open data set to classify types of flowers by the numerical features describing them. Results show that the proposed model is very simple but also efficient in numerical classification.

I. INTRODUCTION

In information processing a very important role is played by data analysis and decision processes. From information hidden in data we can gain knowledge about various things, and in general we can use many different methods to process the data. Artificial intelligence gives us many interesting approaches to data science.

In general, input data is organized into smaller groups called classes, in which the final classification is done. By the use of this kind of decision-making process we can estimate many things. In [1] a method was implemented to estimate energetic efficiency. In [2], [3] predictions were made for wind farming, while in [4], [5] cancer classification from medical images was done. Among the methods of data science, decision trees are used very often. Some of the first approaches to using decision trees as classifiers were presented in [6], where toxic hazard estimation was done using decision trees. In [7] decision trees were used to support remote sensing in classifying land images. Recently, many optimized decision tree structures have been developed for specific examples of input data. In [8] it was presented how to join a decision tree with a bee algorithm for faster data classification. In [9] decision trees were joined with Bayesian methods to efficiently classify smoking cessation. There are many approaches in which decision trees give very good results; an interesting survey of various decision tree approaches and their implementations was presented in [10].

In this article I show how to implement a simple Python procedure based on a decision tree classifier. The implemented idea was used to recognize iris flowers from an open data set. Results show that the proposed implementation works well, returning very good results.

©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

II. RONALD AYLMER FISHER

Ronald Aylmer Fisher was born on February 17, 1890 in London, and died on July 29, 1962 in Adelaide. He was a British geneticist and statistician. He graduated from Gonville and Caius College at the University of Cambridge, worked as a professor of eugenics at the London School of Economics in 1933-1943 and as a professor of genetics at the University of Cambridge (1943-1957), and was a member of the Royal Society in London. Anders Hald described him as "a genius who almost single-handedly created the foundations of contemporary statistics", and Richard Dawkins as "the greatest heir of Darwin". Fisher created, among others, the method of maximum likelihood, analysis of variance (ANOVA) and linear discriminant analysis. He also dealt with methods of hypothesis verification using statistical methods (in anthropology, genetics and ecology) and was one of the creators of modern mathematical statistics. He is also known for the development of the design of experiments at the Rothamsted Institute of Agricultural Research (1919-1933) and as the author of Statistical Methods for Research Workers (1925) and Statistical Methods and Scientific Inference (1956).

III. PROPOSED CLASSIFIER

The principle of the proposed decision model is based on a decision tree.

1) Decision Tree: A decision tree is a (graphical) method of supporting the decision-making process. It is used in decision theory and in machine learning. In the second application it helps in acquiring knowledge based on examples.

Algorithm: it works recursively for each node. We have to decide whether the node will be:
1) a leaf - we end this recursive call, or
2) a branch node that splits according to the values that the given attribute takes - for each child node we create a recursive call to the algorithm with the list of attributes reduced by the attribute just selected.

Building a tree: the tree consists of nodes (decisions and states of nature) and branches. Rectangles are decisions, and states of nature are circles. We start with the root. At the very beginning we take the first given attribute, sepal length (figure 1), then we analyze the second variable, sepal width (figure 2). In this way we continue the construction of the whole tree.

Fig. 1. First step in proposed decision tree reasoning.

Fig. 2. Second step in proposed decision tree reasoning.
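The recursion just described can be outlined in a few lines of Python. This is only an illustrative sketch of the idea, not the classifier used in this paper (which relies on sklearn, as shown in the next subsection); the majority-vote leaves, the mean-value threshold and all names (build_tree, rows, labels, attributes) are assumptions made for the example.

def build_tree(rows, labels, attributes):
    # Leaf node: every sample carries the same label, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": max(set(labels), key=labels.count)}
    attr = attributes[0]                                # naive attribute choice
    threshold = sum(r[attr] for r in rows) / len(rows)  # mean value as split point
    left = [i for i, r in enumerate(rows) if r[attr] <= threshold]
    right = [i for i, r in enumerate(rows) if r[attr] > threshold]
    if not left or not right:                           # degenerate split, stop here
        return {"leaf": max(set(labels), key=labels.count)}
    rest = attributes[1:]                               # attribute list reduced
    return {"attr": attr, "threshold": threshold,
            "le": build_tree([rows[i] for i in left], [labels[i] for i in left], rest),
            "gt": build_tree([rows[i] for i in right], [labels[i] for i in right], rest)}

For the iris data, build_tree(samples, species, [0, 1, 2, 3]) would grow a tree over the four measurements, splitting first on attribute 0 (sepal length) and then on attribute 1 (sepal width), mirroring figures 1 and 2.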
A. Coding

from sklearn import tree

# read the four measurements of every flower
iris = open("iris.txt", "r")
iris_rows = []
for line in iris:
    data = line.split(",")
    iris_rows.append(data)
list2 = [[float(column) for column in row] for row in iris_rows]
print(list2)

# read the expected class of every flower
iris_res = open("iris-result.txt", "r")
res = []
for line in iris_res:
    data = line.split(",")
    res.append(data)
res2 = [[int(column) for column in row] for row in res]
print(res2)

# build the decision tree from the training data
pr = tree.DecisionTreeClassifier()
pr = pr.fit(list2, res2)

The code was written in Python. As you can see, we first introduce the data of the IRIS database into the program, and then we call the learning function provided by the Python library sklearn.tree: DecisionTreeClassifier.
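Once fitted, the tree can label new measurements. The short usage sketch below relies on the standard sklearn predict() call; the sample values are invented for illustration and do not come from the paper.

# one hypothetical flower: sepal length, sepal width, petal length, petal width
sample = [[5.1, 3.5, 1.4, 0.2]]
print(pr.predict(sample))  # prints the predicted class index, e.g. [1]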
IV. THE IRIS DATABASE

The IRIS database contains a set of iris flower measurements and was first made available by Ronald Fisher in 1936. It is one of the most well-known collections and, as we will see in a moment, it is also very simple. Each iris is described by 4 measurements, the width and length of the petals and of the sepals. There are three types of flowers:

- Versicolor - this flower is found in North America and grows up to a height of 80 centimeters. The leaves of this plant are more than one centimeter wide, and the roots form large and thick clumps. A well-developed plant has 6 blue petals and blooms from May to July, while large seeds can be observed appearing in autumn.

- Setosa - this flower is found in Canada, Russia, north-east Asia, China, Korea, southern Japan and Alaska. The plant has half-green leaves, tall branched stems and purple-blue flowers similar to lavender (there are also pink and white flowers). The roots are shallow, large and rapidly spreading.

- Virginica - this flower is native to North America. The leaves are 1 to 3 centimeters wide and sometimes longer than the flower stalk. The plant has 2 to 4 erect or arching, bright green stems. The roots spread underground. The seeds are light brown and differently shaped, and are born in three-part fruit capsules. The petals vary in color from dark purple to pinkish-white. These plants bloom from April to May and have from 2 to 6 flowers.

V. OUR EXAMPLE

Our database has 150 data points (50 for each type of flower), of which 120 were used to train the classifier and 30 (10 for each species) to test it. The question then arises: do the data groups we have obtained correspond to the three species of iris? To see this, let us look at the error matrix.

A. Confusion Matrix

The confusion matrix is the basic tool used to assess the quality of a classification. In our case we consider three classes of abstraction, as a result of which we get a 3 x 3 matrix. This table of errors arises from the intersection of the predicted class and the class actually observed. Our matrix is presented in figure 3. Now I want to analyze the measures of the results to show how the proposed classification works.
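As a side note, the 3 x 3 matrix of figure 3 can be reproduced with the standard confusion_matrix function from sklearn.metrics. In this minimal sketch, y_true and y_pred are hypothetical stand-ins for the 30 test labels and the tree's predictions (perfect classification, as observed in our experiment):

from sklearn.metrics import confusion_matrix

y_true = [0]*10 + [1]*10 + [2]*10   # 10 test samples per species
y_pred = list(y_true)               # every sample classified correctly
print(confusion_matrix(y_true, y_pred))
# [[10  0  0]
#  [ 0 10  0]
#  [ 0  0 10]]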
Fig. 3. Confusion matrix for classification results by the use of the implemented decision tree method.

1) Analysis of the Confusion Matrix: Terminology and derivations from a confusion matrix, evaluated one-vs-rest for each species (so for each class P = 10 and N = 20 in our test set):

1) For Versicolor
- sensitivity, recall, hit rate or true positive rate (TPR): TPR = TP / P = TP / (TP + FN) = 10 / (10 + 0) = 1; TPR = 1 - FNR
- specificity, selectivity or true negative rate (TNR): TNR = TN / N = TN / (TN + FP) = 20 / (20 + 0) = 1; TNR = 1 - FPR
- precision or positive predictive value (PPV): PPV = TP / (TP + FP) = 10 / (10 + 0) = 1; PPV = 1 - FDR
- negative predictive value (NPV): NPV = TN / (TN + FN) = 20 / (20 + 0) = 1; NPV = 1 - FOR
- miss rate or false negative rate (FNR): FNR = FN / P = FN / (FN + TP) = 0 / (0 + 10) = 0; FNR = 1 - TPR
- fall-out or false positive rate (FPR): FPR = FP / N = FP / (FP + TN) = 0 / (0 + 20) = 0; FPR = 1 - TNR
- false discovery rate (FDR): FDR = FP / (FP + TP) = 0 / (0 + 10) = 0; FDR = 1 - PPV
- false omission rate (FOR): FOR = FN / (FN + TN) = 0 / (0 + 20) = 0; FOR = 1 - NPV
- accuracy (ACC): ACC = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN) = (10 + 20) / 30 = 1
- F1 score, the harmonic mean of precision and sensitivity: F1 = 2 (PPV TPR) / (PPV + TPR) = 2TP / (2TP + FP + FN) = 20 / (20 + 0 + 0) = 1
- Matthews correlation coefficient (MCC): MCC = (TP TN - FP FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) = (10 * 20 - 0 * 0) / sqrt(10 * 10 * 20 * 20) = 200 / 200 = 1
- informedness or Bookmaker Informedness (BM): BM = TPR + TNR - 1 = 1 + 1 - 1 = 1
- Markedness (MK): MK = PPV + NPV - 1 = 1 + 1 - 1 = 1

2) For Setosa and 3) For Virginica the one-vs-rest counts are identical (TP = 10, TN = 20, FP = 0, FN = 0), so every measure listed above takes exactly the same value as for Versicolor: TPR = TNR = PPV = NPV = ACC = F1 = MCC = BM = MK = 1 and FNR = FPR = FDR = FOR = 0.

Legend:
- P - condition positive - the number of real positive cases in the data,
- N - condition negative - the number of real negative cases in the data,
- TP - true positive,
- TN - true negative,
- FP - false positive,
- FN - false negative.

We present our error matrix in the form of a table in figure 3, in which the rows correspond to the species of iris to which a data point actually belonged, while the columns tell the class to which it was assigned. The elements inside the table specify the number of data points of the species given in the row header that were assigned to the group given in the column header. Let us note the perfect agreement between the data groups and the species of iris: each point has been correctly classified. These measures can also be verified with a few lines of Python, as sketched below.
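A small illustrative script follows; the variable names are ours, introduced for the example, and the counts are the one-vs-rest values used in the text:

from math import sqrt

TP, TN, FP, FN = 10, 20, 0, 0  # one-vs-rest counts for each species
P, N = TP + FN, TN + FP        # condition positive / condition negative
TPR = TP / P                   # sensitivity (recall)
TNR = TN / N                   # specificity
PPV = TP / (TP + FP)           # precision
NPV = TN / (TN + FN)
ACC = (TP + TN) / (P + N)
F1 = 2 * TP / (2 * TP + FP + FN)
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
BM = TPR + TNR - 1             # informedness
MK = PPV + NPV - 1             # markedness
print(TPR, TNR, PPV, NPV, ACC, F1, MCC, BM, MK)  # every value equals 1.0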
VI. CONCLUSIONS

The proposed reasoning was easy to implement. Results show the decisions were very good and all inputs were classified correctly. The code of the reasoning, written in Python, used a library for artificial intelligence in which the method was already coded. In future research I want to develop another method for data classification based on other probabilistic methods, where the decision between classes will be related to statistical measures.

REFERENCES

[1] G. Capizzi, G. L. Sciuto, G. Cammarata, and M. Cammarata, "Thermal transients simulations of a building by a dynamic model based on thermal-electrical analogy: Evaluation and implementation issue," Applied Energy, vol. 199, pp. 323-334, 2017.
[2] S. Brusca, G. Capizzi, G. Lo Sciuto, and G. Susi, "A new design methodology to predict wind farm energy production by means of a spiking neural network-based system," International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, vol. 32, no. 4, p. e2267, 2019.
[3] G. Capizzi, G. L. Sciuto, C. Napoli, and E. Tramontana, "Advanced and adaptive dispatch for smart grids by means of predictive models," IEEE Transactions on Smart Grid, vol. 9, no. 6, pp. 6684-6691, 2017.
[4] M. Woźniak, D. Połap, G. Capizzi, G. L. Sciuto, L. Kośmider, and K. Frankiewicz, "Small lung nodules detection based on local variance analysis and probabilistic neural network," Computer Methods and Programs in Biomedicine, vol. 161, pp. 173-180, 2018.
[5] M. Wozniak, D. Polap, L. Kosmider, C. Napoli, and E. Tramontana, "A novel approach toward x-ray images classifier," in 2015 IEEE Symposium Series on Computational Intelligence. IEEE, 2015, pp. 1635-1641.
[6] G. Cramer, R. Ford, and R. Hall, "Estimation of toxic hazard - a decision tree approach," Food and Cosmetics Toxicology, vol. 16, no. 3, pp. 255-276, 1976.
[7] M. A. Friedl and C. E. Brodley, "Decision tree classification of land cover from remotely sensed data," Remote Sensing of Environment, vol. 61, no. 3, pp. 399-409, 1997.
[8] H. Rao, X. Shi, A. K. Rodrigue, J. Feng, Y. Xia, M. Elhoseny, X. Yuan, and L. Gu, "Feature selection based on artificial bee colony and gradient boosting decision tree," Applied Soft Computing, vol. 74, pp. 634-642, 2019.
[9] A. Tahmassebi, A. H. Gandomi, M. H. Schulte, A. E. Goudriaan, S. Y. Foo, and A. Meyer-Baese, "Optimized naive-bayes and decision tree approaches for fMRI smoking cessation classification," Complexity, vol. 2018, 2018.
[10] S. R. Safavian and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660-674, 1991.