=Paper=
{{Paper
|id=Vol-3118/p05
|storemode=property
|title=Water Potability Classification using Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-3118/p05.pdf
|volume=Vol-3118
|authors=Patryk Rozynek,Michal Rozynek
|dblpUrl=https://dblp.org/rec/conf/icyrime/RozynekR21
}}
==Water Potability Classification using Neural Networks==
Patryk Rozynek¹, Michal Rozynek¹

¹ Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, Poland

Abstract

Nowadays, the Internet of Things and intelligent systems are becoming very popular and in demand. These tools can be used for smart and fast analysis of different kinds of data. In this paper, we focus on the automatic analysis of water quality by the use of artificial intelligence methods. As the main tools for this analysis, the k-nearest neighbors algorithm and an artificial neural network were used. Both methods were tested, and the results are discussed in terms of selecting the better tool for the task of water potability classification.

Keywords

water potability, artificial neural network, kNN algorithm

1. Introduction

Today, the Internet of Things (IoT) is one of the most important areas for developing and practically implementing smart solutions. The last years in particular have shown that smart devices can be used in many kinds of management systems. One such area is water management and automatic evaluation and analysis. In [1], a system for water management was shown in which sustainable networks were analyzed. Similar solutions were presented in [2, 11]; the authors of that research use deep machine learning, such as a convolutional neural network, for the analysis of water pollution for agricultural irrigation resources.

Machine learning solutions are used not only in water management but also to detect and classify different objects on water. One such task is to analyze ships passing through certain areas. In [3, 4, 9], two solutions that take images on a river and use them for classification purposes were presented. Both solutions show practical potential for implementation, based on the real case studies performed. Machine learning solutions in the IoT analyze either whole data [5, 6, 7, 8, 10] or extracted features [12, 13, 14, 15]. In both cases, the results show great accuracy for these approaches. In this paper, we propose a solution for analyzing water by the use of two tools: the k-nearest neighbors (kNN) algorithm and an artificial neural network.

2. Methodology

The algorithm classifies whether water with the given properties is safe to drink. We used two classification methods: the kNN algorithm and an artificial neural network. The code of the algorithms was written in Python. Most of the article compares these algorithms to see which one is more suitable for use in our project. The project was implemented using a database available on the website https://www.kaggle.com/.

The kNN algorithm is based on counting the distance between the given sample and each object in the training set. We used the Minkowski metric to calculate the distance:

\[ D(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^r \right)^{1/r} \qquad (1) \]

In this way, we obtain a table of distances from our sample. We then sort it so that the shortest distances are at the front. The classification decision is made by a vote of the 'k' neighbors, i.e., the 'k' first objects in the distance table. The result is the value of the class that received more votes. If there is the same number of votes for both variants of the class, the algorithm chooses the first of them. Therefore, it is recommended that the number of neighbors be odd.

3. Artificial neural network

A neural network is software modeled after the operation of neurons in the human brain. It consists of three types of layers: the input layer (which collects data and passes it on), the hidden layers (where the connections between neurons are sought and the learning process takes place), and the output layer (which collects conclusions and analysis results). Typically, a neural network is made up of many layers. The first layer receives the raw input data, much as, for example, the optic nerves of a human pass on recorded images. Each subsequent layer receives data resulting from the processing in the previous layer. What the last layer produces is the so-called system output.
A neural network functions like the human brain: each neuron carries out its own simple calculation, and the network made up of all the neurons multiplies the potential of those calculations. Neural networks used in artificial intelligence are organized on the same principle, but with one exception: to perform a specific task, the connections between the neurons can be adjusted accordingly.

(ICYRIME 2021 @ International Conference of Yearly Reports on Informatics Mathematics and Engineering, online, July 9, 2021. Contact: patrroz599@student.polsl.pl (P. Rozynek); patrroz599@student.polsl.pl (M. Rozynek). CEUR Workshop Proceedings 34–39.)

Searching for information is the process of searching a specific set of documents for those relating to the subject or object indicated in the query, or containing facts necessary for the user. This process has not been precisely and finitely defined by patterns, standards, or algorithms; it is largely based on heuristics, in this case defined as a set of rules and guidelines that may or may not lead to the right solution.

For each neuron, the sum of the products of the previous neurons' outputs and the associated synapses (weights) is calculated. The result is then passed on to the activation function. The formula for calculating the value of a neuron is

\[ o = f\left( \sum_{i=0}^{n} x_i \cdot w_i \right) \qquad (2) \]

where o is the output, w are the weights, and x are the outputs of the previous neurons. The activation function can be any function; it is often taken to be the hyperbolic tangent. After all the values for the neurons in the hidden layer have been calculated, the output layer is computed in the same way. This layer has as many neurons as there are classes.
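Equation (2) can be illustrated with a short sketch (a minimal illustration, not the authors' listing; the function names are ours, and tanh is used as the activation, as suggested above):

```python
import math

def neuron_output(inputs, weights):
    """Value of one neuron: o = f(sum_i x_i * w_i), with f = tanh (Eq. 2)."""
    return math.tanh(sum(x * w for x, w in zip(inputs, weights)))

def layer_output(inputs, weight_rows):
    """One dense layer: each row of weights feeds one neuron of the layer."""
    return [neuron_output(inputs, row) for row in weight_rows]

# A hidden layer with two neurons processing a three-feature sample.
hidden = layer_output([0.5, -1.0, 2.0], [[0.1, 0.2, 0.3], [0.4, -0.5, 0.6]])
# The output layer would be computed from `hidden` in the same way,
# with one neuron per class.
print(hidden)
```

Stacking `layer_output` calls, one per layer, reproduces the forward pass described above.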
Then the global error is calculated based on the values from this layer. If the error is smaller, the weights that were used are saved and the previous ones are forgotten.

Heuristics is a method of finding solutions for which there is no guarantee of finding the optimal, or often even a correct, solution. Such solutions are used, for example, when the full algorithm is too expensive for technical reasons, or when it is unknown. The method is also used to find approximate solutions, from which the final result is then calculated using the full algorithm. The latter application primarily concerns cases where heuristics are used to direct the full algorithm toward the optimal solution, reducing the program runtime in the typical case without sacrificing the quality of the solution.

4. Description

Pseudocode: part of the code in Python. The line numbers below refer to the two listings, (1) and (2):

• Lines 2–3 of (1): after getting the input data, the training set is copied to a new object.
• Line 4 of (1): a distance table is created for the training set.
• Lines 5–6 of (1): Minkowsky is called for each sample in the training set. The first argument is the example set of water parameters, and the second is another sample from the training set.
• Lines 3–5 of (2): the distance is counted.
• Line 6 of (2): the distance between the sample and the comparison object is returned to the distance table.
• Line 7 of (1): the algorithm sorts the list of distances so that the most similar cases are at the top of the list. The same reordering is performed on the training set.
• Lines 8–9 of (1): now comes the vote. The k first records on the distance list are taken, and each neighbor votes for the class it belongs to.
• Line 10 of (1): the class with the most votes is returned. In the event of a tie, the first class, Potable, is returned.

A sample program execution is shown in Fig. 1.

Figure 1: An example of the algorithm execution during the research.
Figure 2: Graph showing the effectiveness of the algorithm depending on the number of neighbors.
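The steps above can be assembled into a runnable sketch. This is our reconstruction, since the two listings themselves are not reproduced here; the helper names and the toy data are ours, while the tie rule returning Potable follows the description:

```python
def minkowski(x, y, r=2):
    """Minkowski distance between two parameter vectors (Eq. 1)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def classify(sample, training_set, k=3):
    """Classify a water sample by a vote of its k nearest neighbours."""
    # Distance table: one (distance, label) entry per training object.
    table = [(minkowski(sample, params), label) for params, label in training_set]
    # Sort so that the most similar cases are at the top of the list.
    table.sort(key=lambda entry: entry[0])
    # Vote of the k first records; a tie falls back to the first class, Potable.
    votes = [label for _, label in table[:k]]
    potable = votes.count('Potable')
    return 'Potable' if potable >= len(votes) - potable else 'Not potable'

# Toy data: (pH, hardness) pairs with made-up values.
train = [([7.0, 150.0], 'Potable'),
         ([6.8, 160.0], 'Potable'),
         ([3.1, 400.0], 'Not potable')]
print(classify([6.9, 155.0], train, k=3))  # → Potable
```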
5. Experiments

5.1. KNN

The kNN algorithm achieved the results presented in Fig. 2. Based on Fig. 2, the accuracy of the performed method is very similar across settings: for the three analyzed numbers of neighbors, k ∈ {1, 2, 3}, the accuracy stayed at the same level of 50%. We can say that the number of neighbors (for a small number of parameters k) is irrelevant for this classification task. But more importantly, why is the effectiveness so low? The database we used contains drinking and non-drinking water records with similar parameters. For example, Fig. 3 shows potable and non-potable water samples: as can be seen, for an example trait there is practically no difference in values between potable and non-potable water. The next example is pictured in Fig. 4: the Hardness trait for drinking water has a wider range of values, but the average values are very close to each other. To sum up, for a small number of parameters k the kNN algorithm does not give promising results; an accuracy on the level of 50% cannot be used in a practical implementation.

Figure 3: There is no difference between potable and non-potable water for an example trait.
Figure 4: 0 - non-potable water, 1 - potable water. The average values are very similar.

5.2. Neural Network

The set was divided into a training set and a validation set in the proportion 1:10. As a result, the program counted faster and the results were almost the same. To check which neural network is the best for our program, we created 15 architectures of artificial neural networks with different numbers of hidden layers and different numbers of neurons in these layers. Three of these nets were deep and the rest were shallow. Each network was trained 1000 times. Additionally, the adopted parameters were the same for all networks and amounted to: φ1 = 0.1, φ2 = 0.9, and Ω drawn at random from ⟨0.2; 0.85⟩.

Figure 5: Accuracy.
Figure 6: Training Time.
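The parameters φ1, φ2 and the randomly drawn Ω, together with the talk of particles and a best global particle in the conclusion, suggest that the network weights were trained with a particle-swarm-style optimizer. Under that reading, one update step takes the standard PSO form (a sketch with our own names, not the authors' code):

```python
import random

def pso_step(position, velocity, personal_best, global_best,
             phi1=0.1, phi2=0.9, omega_range=(0.2, 0.85)):
    """One particle-swarm update of a particle's weight vector."""
    omega = random.uniform(*omega_range)  # inertia Ω drawn at random, as above
    new_velocity = [
        omega * v
        + phi1 * random.random() * (pb - x)  # pull toward the particle's own best
        + phi2 * random.random() * (gb - x)  # pull toward the best global particle
        for x, v, pb, gb in zip(position, velocity, personal_best, global_best)
    ]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

# With phi2 much larger than phi1, particles are pulled mainly toward
# the best global particle, matching the behaviour noted in the conclusion.
pos, vel = pso_step([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, -1.0])
```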
6. Conclusion

The obtained results and the comparison of the different architectures are presented in Figs. 5, 6, 7 and 8. Based on these results, the artificial neural network shows much better results than the compared kNN.

Figure 7: Accuracy.
Figure 8: Training Time.

After performing the tests and comparing the two classification methods, it can easily be stated that the results obtained with an artificial neural network are closer to the truth. An obstacle in the kNN classification was that the values of each of the water properties were too similar: no matter how many neighbors there were, the effect was the same. It can be said that the algorithm guesses the result. No data manipulation, such as deleting one feature, gave better results. In fact, it is also difficult to judge whether the water is drinkable; with the characteristics listed in the database, it is not possible to determine whether the water is potable with the kNN algorithm.

When comparing the results of the deep networks with the shallow ones, it can be said that they are not very different. Two of the deep nets had the best two accuracies, but the third was below average. Unfortunately, training a network with 5 hidden layers and 6 neurons in each of them took almost an hour. In this type of network, the global error changed less frequently. This is because the network has more neurons and therefore more weights, and it is more difficult for the network to change the weights so that the next calculated global error is better. Some of the weights may have changed for the better, but the network did not keep them because the error was not any smaller. The addition of the ability to remove some particles from the swarm made it possible to create new particles with random weights at startup. This is a good way to improve results without investing much time, especially since with the parameters φ1 and φ2 at 0.1 and 0.2 the particles tend toward the best global particle rather than toward their own best positions. The best results were obtained with a network with 5 hidden layers and 2 neurons in each layer; it reached an accuracy of 59.32% and took almost 20 minutes to train.

References

[1] H. M. Ramos, A. McNabola, P. A. López-Jiménez, M. Pérez-Sánchez, Smart water management towards future water sustainable networks, Water 12 (2020) 58.
[2] H. Chen, A. Chen, L. Xu, H. Xie, H. Qiao, Q. Lin, K. Cai, A deep learning CNN architecture applied in smart near-infrared analysis of water pollution for agricultural irrigation resources, Agricultural Water Management 240 (2020) 106303.
[3] D. Połap, M. Włodarczyk-Sielicka, N. Wawrzyniak, Automatic ship classification for a riverside monitoring system using a cascade of artificial intelligence techniques including penalties and rewards, ISA Transactions (2021).
[4] T. Hyla, N. Wawrzyniak, Identification of vessels on inland waters using low-quality video streams, in: Proceedings of the 54th Hawaii International Conference on System Sciences, 2021, p. 7269.
[5] V. Nourani, E. Foroumandi, E. Sharghi, D. Dąbrowska, Ecological-environmental quality estimation using remote sensing and combined artificial intelligence techniques, Journal of Hydroinformatics 23 (2021) 47–65.
[6] M. Woźniak, M. Wieczorek, J. Siłka, D. Połap, Body pose prediction based on motion sensor data and recurrent neural network, IEEE Transactions on Industrial Informatics 17 (2020) 2101–2111.
[7] Y. Shao, J. C.-W. Lin, G. Srivastava, D. Guo, H. Zhang, H. Yi, A. Jolfaei, Multi-objective neural evolutionary algorithm for combinatorial optimization problems, IEEE Transactions on Neural Networks and Learning Systems (2021).
[8] C. Napoli, F. Bonanno, G. Capizzi, Exploiting solar wind time series correlation with magnetospheric response by using an hybrid neuro-wavelet approach, Proceedings of the International Astronomical Union 6 (S274) (2010) 156–158.
[9] R. Brociek, G. De Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of COVID-19 by means of touch detection for retail stores, CEUR Workshop Proceedings 3092 (2021) 89–94.
[10] G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, G. Susi, A spiking neural network-based long-term prediction system for biogas production, Neural Networks 129 (2020) 271–279.
[11] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic RGB inference based on facial emotion recognition, CEUR Workshop Proceedings 3092 (2021) 66–74.
[12] G. Lo Sciuto, G. Capizzi, R. Shikler, C. Napoli, Organic solar cells defects classification by using a new feature extraction algorithm and an EBNN with an innovative pruning algorithm, International Journal of Intelligent Systems 36 (2021) 2443–2464.
[13] D. Połap, M. Woźniak, Meta-heuristic as manager in federated learning approaches for image processing purposes, Applied Soft Computing (2021) 107872.
[14] D. Połap, Analysis of skin marks through the use of intelligent things, IEEE Access 7 (2019) 149355–149363.
[15] G. Caggiano, C. Napoli, C. Coretti, et al., Mold contamination in a controlled hospital environment: a 3-year surveillance in southern Italy, BMC Infectious Diseases 14, 595 (2014).
[16] S. Acciarito, A. Cristini, L. Di Nunzio, G. M. Khanal, G. Susi, An aVLSI driving circuit for memristor-based STDP, in: 2016 12th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME 2016), art. no. 7519503.