=Paper=
{{Paper
|id=Vol-2951/paper3
|storemode=property
|title=Dry Beans Classification Using Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-2951/paper3.pdf
|volume=Vol-2951
|authors=Grzegorz Słowiński
|dblpUrl=https://dblp.org/rec/conf/csp/Slowinski21a
}}
==Dry Beans Classification Using Machine Learning==
Dry Beans Classification Using Machine Learning

Grzegorz Słowiński
University of Technology and Economics, ul. Jagiellońska 82f, 03-301 Warsaw, Poland

Abstract
A dataset containing over 13k samples of dry bean geometric features is analysed using machine learning (ML) and deep learning (DL) techniques with the goal of automatically classifying the bean species. First, the original dataset was reduced to eliminate redundant features (too strongly correlated with, and echoing, others). Then the dataset was visualised and analysed with machine learning techniques: Multinomial Naive Bayes, Support Vector Machines, Decision Tree, Random Forest, Voting Classifier and Artificial Neural Network. The overall accuracies obtained were in the range 88.35-93.61%.

Keywords: machine learning, deep learning, classification of dry beans

29th International Workshop on Concurrency, Specification and Programming (CS&P'21). EMAIL: grzegorz.slowinski@uth.edu.pl. ORCID: 0000-0001-9770-5063.

1. Introduction

Classification of dry beans is of some economic importance, and manual classification is labour intensive. In [1], over 13 000 samples of dry beans of 7 different species were photographed and their geometry was measured via computer vision techniques. The set was then analysed with several machine learning (data science) and deep learning (artificial neural network) techniques. The overall accuracy obtained was 87.92-93.13%, depending on the technique used. The dataset used in [1] has been published in the UCI machine learning repository [2].

This work analyses the same dataset using slightly different techniques. The data dimensionality has been reduced, slightly better accuracies have been achieved, and a discussion and comparison with [1] has been carried out.

2. Tools

The entire analysis was done using Python and its data science and ML libraries: numpy, pandas, matplotlib, seaborn, scikit-learn and keras. Google Colab, a free cloud version of the Jupyter notebook, was used. The reader can find the Python scripts at [3].

3. Preliminary analysis and visualisation of the dataset

The dataset under study consists of 13611 samples. Each sample consists of 16 geometrical features and a label identifying the species of the bean. The species are: Barbunya, Bombay, Cali, Dermason, Horoz, Seker, and Sira. The features are: Area, Perimeter, MajorAxisLength, MinorAxisLength, AspectRatio, Eccentricity, ConvexArea, EquivDiameter, Extent, Solidity, Roundness, Compactness, ShapeFactor1, ShapeFactor2, ShapeFactor3, and ShapeFactor4. A detailed explanation of how the features were calculated is presented in [1].

The geometrical data carry no information about the bean colour. From a practical point of view this is unfortunate, as different dry bean species tend to vary in colour. On the other hand, it makes little difference if we treat the dry beans classification problem simply as an exercise in building and comparing machine learning models.

3.1. Correlation analysis and feature reduction

Correlation analysis has shown that several of the features are strongly (positively or negatively) correlated. This is because essentially all of them are geometric measures of the same beans. The decision was taken to drop some features so that no pair of retained features has a correlation above 0.9 (or below -0.9). The expected benefits of this decision are: 1) a significant reduction of the computational complexity, 2) a lower risk of overfitting, and 3) ease of visualisation. The disadvantage is a limited risk of losing some valuable information and, as a result, a decrease in accuracy.
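The correlation-based reduction described above can be sketched in a few lines of pandas. This is a minimal sketch, not the author's actual script (the actual scripts are available at [3]): it assumes the dataset has been exported to a CSV file named Dry_Bean_Dataset.csv with a Class label column, and the greedy "drop the second feature of every too-correlated pair" rule is only one possible way to arrive at a reduced set, so it need not reproduce exactly the 8 features selected below.

```python
import pandas as pd

# Assumed file name; the UCI archive ships the data as a spreadsheet,
# here we assume it has been exported to CSV with a "Class" label column.
df = pd.read_csv("Dry_Bean_Dataset.csv")

features = df.drop(columns=["Class"])
corr = features.corr()

# Greedily drop one feature from every pair whose absolute correlation
# exceeds 0.9, keeping the first feature of the pair.
threshold = 0.9
to_drop = set()
cols = corr.columns
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a not in to_drop and b not in to_drop and abs(corr.loc[a, b]) > threshold:
            to_drop.add(b)

kept = [c for c in cols if c not in to_drop]
print("Dropped:", sorted(to_drop))
print("Kept:", kept)
```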
Thus, in this work it was decided to limit the feature set to these 8 members: MajorAxisLength, MinorAxisLength, AspectRatio, Extent, Solidity, Roundness, ShapeFactor2 and ShapeFactor4, and to exclude: Area, Perimeter, Eccentricity, ConvexArea, EquivDiameter, Compactness, ShapeFactor1 and ShapeFactor3. The issue of high correlations among some features was not addressed in [1]. The correlations between pairs of the selected features are listed in Table 1.

Table 1
Correlation between selected bean features

|                 | MajorAxisLength | MinorAxisLength | AspectRatio | Extent  | Solidity | Roundness | ShapeFactor2 | ShapeFactor4 |
|-----------------|-----------------|-----------------|-------------|---------|----------|-----------|--------------|--------------|
| MajorAxisLength | 1.0000          | 0.8261          | 0.5503      | -0.0781 | -0.2843  | -0.5964   | -0.8592      | -0.4825      |
| MinorAxisLength | 0.8261          | 1.0000          | -0.0092     | 0.1460  | -0.1558  | -0.2103   | -0.4713      | -0.2637      |
| AspectRatio     | 0.5503          | -0.0092         | 1.0000      | -0.3702 | -0.2678  | -0.7670   | -0.8378      | -0.4493      |
| Extent          | -0.0781         | 0.1460          | -0.3702     | 1.0000  | 0.1914   | 0.3444    | 0.2380       | 0.1485       |
| Solidity        | -0.2843         | -0.1558         | -0.2678     | 0.1914  | 1.0000   | 0.6072    | 0.3436       | 0.7022       |
| Roundness       | -0.5964         | -0.2103         | -0.7670     | 0.3444  | 0.6072   | 1.0000    | 0.7828       | 0.4721       |
| ShapeFactor2    | -0.8592         | -0.4713         | -0.8378     | 0.2380  | 0.3436   | 0.7828    | 1.0000       | 0.5299       |
| ShapeFactor4    | -0.4825         | -0.2637         | -0.4493     | 0.1485  | 0.7022   | 0.4721    | 0.5299       | 1.0000       |

The data were visualised with a pair-plot, presented in Figure 1. It shows that the Bombay species is trivial to classify, as its beans are significantly bigger than the others. The classification of the other species seems to be much more difficult, and we can expect more errors there.

Figure 1: The selected features of dry beans (pair-plot).

4. Machine Learning techniques used and results

In this work the following techniques were used: Multinomial Naive Bayes, Support Vector Classifier, Decision Tree, Random Forest, Voting Classifier, and Artificial Neural Network (Multilayer Perceptron, MLP).

The full dataset was divided into training and test subsets: 80% of the samples were used for training and 20% for testing. Dividing the available samples into training and test subsets is crucial for a correct methodology. The aim of all ML or DL methods is to achieve a "generalization" ability, so it is important to check the accuracy on new samples, ones that have not been used during training. Otherwise, there is a serious risk that the model will suffer from overfitting. Overfitted models perform very well on the training data but much worse on new data. Overfitting, as one of the most important issues in ML, is widely discussed in ML handbooks [4-6].

4.1. Multinomial Naive Bayes classifier

Naive Bayes models are based on Bayes' theorem. They are extremely fast and simple, but their performance is usually limited. They can be used as a baseline for classification problems (see [4], p. 382). The overall accuracy obtained with the Multinomial Naive Bayes classifier was 64.30%. The problem is to classify into 7 different classes, so blind (random) classification should give an accuracy of about 1/7 = 14.29%; in general, random classification gives an accuracy of about 1/(number of classes). Even this simple model therefore performs about 50 percentage points better than the random approach.
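A minimal sketch of this baseline experiment follows (it reuses df from the previous snippet; the feature names follow the paper and may need adjusting to the exact column names in the published file, and the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# The 8 features retained after the correlation analysis
# (names as used in the paper; the published file may spell some differently).
selected = ["MajorAxisLength", "MinorAxisLength", "AspectRatio", "Extent",
            "Solidity", "Roundness", "ShapeFactor2", "ShapeFactor4"]

X = df[selected]   # all geometric features are non-negative,
y = df["Class"]    # which MultinomialNB requires

# 80% / 20% train/test split, as described above (seed chosen arbitrarily).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

nb = MultinomialNB().fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))
```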
4.2. Support Vector Classifier

Support vector machines (SVM), which can be used as regressors or classifiers, are considered very powerful and flexible algorithms. On the other hand, they may need a lot of computing power (see [4], p. 405). The SVM principle is to partition the classes by "drawing a line" (or plane) in a way that maximises the margin between the classes. As straight lines (or planes) do not usually produce the best solution, SVC can apply different kernels (polynomial, radial and others). SVC is explained in more detail in [4, 5].

SVCs with different kernels were tried. Table 2 presents the parameters used and the accuracy obtained. The results are quite similar for all kernels. The accuracy can be further improved to some extent (tenths of a percent, maybe 1%) by increasing C, but this also significantly increases the training time.

Table 2
The parameters for SVM (other parameters have default values)

| Kernel type          | C parameter | Approx. computing time* | Overall accuracy |
|----------------------|-------------|-------------------------|------------------|
| Linear function      | 10^5        | 41 s                    | 91.55%           |
| Polynomial, degree=3 | 10^5        | 21 s                    | 91.26%           |
| Radial basis         | 10^7        | 34 s                    | 92.18%           |

* Computation was done on Colab: Intel(R) Xeon(R) CPU @ 2.20 GHz and 12.69 GB RAM.

4.3. Decision Tree

A decision tree (DT) belongs to the class of so-called non-parametric algorithms. The term non-parametric can be misleading: a decision tree does have parameters, but their number is not fixed. During the learning phase, a decision tree tries to find the questions that best partition the dataset in order to reduce impurity (measured by the Gini index or information entropy). The great advantage of decision trees is that they are extremely intuitive. On the other hand, a decision tree has no limit on its degrees of freedom, so it overfits easily (if the user is not aware of that). The splits made by a decision tree are always orthogonal (made on one feature at a time), so a decision tree is very sensitive to data rotation (see [5], p. 188).

In [1] the authors created a decision tree with a depth of 4 (4 questions max) and 9 leaves. We decided to limit our decision tree to a depth of 5 and at most 16 leaves, in order to get a tree of a size similar to the one obtained in [1]. Figure 2 shows the decision tree obtained under these limits. Its overall accuracy is 88.35%. Preliminary tests showed that a better accuracy of about 92.3% could be obtained with a bigger decision tree; however, the bigger the decision tree, the less intuitive it becomes and the more difficult it is to visualise.

Figure 2: Visualisation of the obtained decision tree, produced with the plot_tree method from scikit-learn.

In another experiment with a big decision tree, we set max depth = 10 and max leaf nodes = 30. The accuracy improved and reached 91.59% (Table 3).

Table 3
Decision tree parameters and performance

| Decision tree | Max depth | Max leaf nodes | Overall accuracy |
|---------------|-----------|----------------|------------------|
| small         | 5         | 16             | 88.35%           |
| big           | 10        | 30             | 91.59%           |

It can be seen that, to obtain an accuracy similar to that reported in [1] with a decision tree, the tree would have to be much bigger (losing the main advantage of decision trees, i.e. their intuitive interpretation). One should keep in mind that in this work the number of features has been reduced from 16 to 8. The excluded features were highly correlated with the retained ones (being other geometrical measures of the same beans), so they carried little additional information. However, they present this information in a slightly different, "rotated" manner, making the task easier for a decision tree. To see this better, assume that in some dataset we have two features A and B, and the classification is obvious but depends on the ratio A/B, which is not explicitly present in the dataset. This can be hard for a decision tree to solve. Adding an extra column A/B adds no new information to the dataset, but it helps the decision tree considerably. It can be supposed that, in the dry bean case, the 8 removed features contained little extra information, but they presented essentially the same information in a way more convenient for the decision tree.
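The configurations from Tables 2 and 3 could be reproduced roughly as follows (a sketch reusing the train/test split from the previous snippet; with such large C values, SVC training on unscaled features can be slow, and the exact accuracies and timings will not match the tables precisely):

```python
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree

# SVC configurations from Table 2 (other parameters left at their defaults).
svms = {
    "linear":      SVC(kernel="linear", C=1e5),
    "poly deg. 3": SVC(kernel="poly", degree=3, C=1e5),
    "rbf":         SVC(kernel="rbf", C=1e7),
}
for name, clf in svms.items():
    clf.fit(X_train, y_train)
    print("SVC", name, "accuracy:", clf.score(X_test, y_test))

# Decision trees from Table 3: a "small", easy-to-read tree and a "big" one.
small_tree = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=16).fit(X_train, y_train)
big_tree = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=30).fit(X_train, y_train)
print("small tree accuracy:", small_tree.score(X_test, y_test))
print("big tree accuracy:", big_tree.score(X_test, y_test))

# Visualisation corresponding to Figure 2.
plot_tree(small_tree, feature_names=selected, filled=True)
plt.show()
```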
4.4. Random Forest

The random forest idea is as follows: take many decision trees (introducing some randomness, so that the trees differ) and let them vote. The classification decision taken by a random forest is thus the decision taken by the most numerous group of decision trees in a random set of trees. Usually a random forest performs better than a single decision tree; however, a random forest is considered a "black-box" model that is very hard to interpret.

A random forest of 150 decision trees was created, with no restrictions applied to the trees. The accuracy obtained was 93.61%, the best so far and better than the best accuracy reported in [1]. In addition, the training process was fast and took about 2 s, which is 10-20 times faster than for SVC.

4.5. Voting Classifier

The idea of "voting", which is used by default in random forests, can be applied to any set of classifiers. There are 2 main ways of voting: "hard" (straightforward, direct voting) and "soft" (the votes are weighted depending on how confident the classifier is about its choice). As in the case of a random forest, there is a good chance that the voting result will be more accurate than any particular classifier.

A hard voting classifier was implemented using 3 classifiers described above: the radial-kernel SVC, the "big" decision tree, and the random forest. The obtained accuracy was 92.80%, so in this case it is worse than the random forest alone. This gives us a clue that voting should be used carefully, and preferably with models exhibiting similar performance; otherwise "stupid" models can outvote "smart" models. It seems that this flaw of democracy does not only apply to human societies, but is more universal in nature.
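Both ensembles can be sketched as follows (reusing the data split from the earlier snippets; the random seed is an assumption, as the paper does not state one):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Random forest of 150 unrestricted trees.
forest = RandomForestClassifier(n_estimators=150, random_state=42)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))

# Hard voting over the radial-kernel SVC, the "big" decision tree and the forest.
voting = VotingClassifier(
    estimators=[("svc", SVC(kernel="rbf", C=1e7)),
                ("tree", DecisionTreeClassifier(max_depth=10, max_leaf_nodes=30)),
                ("forest", RandomForestClassifier(n_estimators=150, random_state=42))],
    voting="hard")
voting.fit(X_train, y_train)
print("Voting classifier accuracy:", voting.score(X_test, y_test))
```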
4.6. Artificial Neural Network

Besides the (shallow) machine learning / data science methods presented above, a deep learning technique, a so-called dense artificial neural network, has also been tried. For an artificial neural network the data need additional treatment. First, the bean species names were labelled with the numbers 0-6, and these numbers were then encoded as "one-hot" vectors. The reason for using "one-hot" encoding is well explained, for example, in [4] p. 376 or [6] pp. 190-194. The other operation is scaling, i.e. standardisation of the training data: each feature is centred around zero (by subtracting the average) and normalised (by dividing by the standard deviation). Standardisation is said to ease the training process and tends to improve performance ([5], p. 72).

Two ANN architectures were tried. The first one is similar to the one described in [1], except that the input layer size in our case is 8, not 16. The network has 3 hidden layers with 17, 12 and 3 neurons, respectively, using the Rectified Linear Unit (ReLU) activation function. The output layer has 7 neurons, one for each bean species, with the sigmoid activation function. The optimiser was RMSprop and the loss function was the categorical cross entropy. The validation set was 20% of the training set. The network was trained for 24 epochs. The architecture of this network and the training process are presented in Figure 3.

Figure 3: The architecture (left) and the training process (right) of the first ANN.

The overall accuracy was 92.58%. In an attempt to improve it, another, "bigger" ANN was tried. Besides the normal layers, a dropout layer was added. A dropout layer is only active during training and randomly "cuts off" (sets to zero) some of its inputs; dropout layers are expected to reduce the risk of overfitting ([6], p. 109). The architecture of this network and its training are shown in Figure 4. The same optimiser and loss function were used (RMSprop and the categorical cross entropy, respectively), and the network was also trained for 24 epochs. The overall accuracy was 92.77%, so the improvement was small.

Figure 4: The architecture (left) and the training process (right) of the second ANN.
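A sketch of the first network in Keras, following the description above (the use of scikit-learn's LabelEncoder and StandardScaler and the default batch size are assumptions of this sketch; it reuses the data split from the earlier snippets):

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow import keras
from tensorflow.keras import layers

# Integer-encode the species names (0-6), then one-hot encode them.
encoder = LabelEncoder()
y_train_oh = keras.utils.to_categorical(encoder.fit_transform(y_train), num_classes=7)
y_test_oh = keras.utils.to_categorical(encoder.transform(y_test), num_classes=7)

# Standardise each feature: zero mean, unit variance (fitted on training data only).
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 8 inputs, hidden layers of 17, 12 and 3 ReLU neurons, 7 sigmoid outputs.
model = keras.Sequential([
    layers.Input(shape=(8,)),
    layers.Dense(17, activation="relu"),
    layers.Dense(12, activation="relu"),
    layers.Dense(3, activation="relu"),
    layers.Dense(7, activation="sigmoid"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(X_train_s, y_train_oh, epochs=24, validation_split=0.2)
print(model.evaluate(X_test_s, y_test_oh))
```

The second, "bigger" network would additionally insert a Dropout layer among the dense layers; its exact layer sizes and dropout rate are given only in Figure 4, so they are not reproduced here.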
5. Results and Conclusions

The dry beans dataset has been analysed with different machine learning and deep learning techniques; Table 4 summarises the results. It can be seen that, in general, bean classification is a relatively simple task in terms of the necessary computing power: all training times were shorter than 1 minute on a free Google Colab machine.

The accuracy of Naive Bayes is much worse than that of the other methods. This is not surprising, as this method is known to be fast but not very accurate, and is suggested as a preliminary method to check whether there is "something" in the data rather than as a full analysis. The other methods give accuracies in a relatively close range of 88.3-93.6%. This is comparable to [1], where the accuracy was in the range 87.9-93.13%. The authors of [1] trained their models with all 16 features; here, some strongly correlated features were eliminated. The results show that this elimination has not decreased the accuracy. The only technique that seems to have suffered from the elimination to some extent is the decision tree. This is because a decision tree operates on one feature at a time and cannot "combine" features; see also the discussion in Section 4.3.

One can also see that the models vary strongly in terms of computing time. SVC and the ANN (and also the Voting Classifier, because it includes an SVC) are the slowest learners. The random forest seems to be the best method in this case, as it performs best and its training time is also reasonable.

Confusion matrices provide a convenient way to visualise the results in more detail and to compare actual values with predicted ones. The confusion matrix for the random forest classifier (the best performer) is presented in Figure 5. The most frequent mistakes were between Dermason and Sira (38 + 44). On the other hand, Bombay was classified perfectly, which is not surprising, as Bombay beans are significantly bigger than the other species.

Figure 5: Test subset confusion matrix for the random forest classifier.

The dry beans dataset appeared to be an interesting dataset for demonstrating and comparing ML techniques. Two ideas for further research are:
1) A deeper insight into how, and whether, the elimination of correlated features influences the ML training process. This study shows that there is little, if any, performance decrease; one may investigate whether the elimination also reduces the training time and by how much.
2) Instead of the "manual" feature reduction done in this work, one may try to use PCA (principal component analysis) to reduce the dimensionality of the data and analyse its influence on model performance (accuracy and training time).

6. References

[1] M. Koklu, I. A. Ozkan, Multiclass classification of dry beans using computer vision and machine learning techniques, Computers and Electronics in Agriculture 174 (2020) 105507.
[2] Dry Bean Dataset, UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset, accessed 23.06.2021.
[3] Colab notebook containing the computation scripts for this work: https://colab.research.google.com/drive/11X6VevSMybenGkRqK1Xj_1EJmU3vomFB?usp=sharing
[4] J. VanderPlas, Python Data Science Handbook, O'Reilly, 2017.
[5] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, O'Reilly, 2019.
[6] F. Chollet, Deep Learning with Python, Manning Publications, 2018.