=Paper= {{Paper |id=Vol-3118/p05 |storemode=property |title=Water Potability Classification using Neural Networks |pdfUrl=https://ceur-ws.org/Vol-3118/p05.pdf |volume=Vol-3118 |authors=Patryk Rozynek,Michal Rozynek |dblpUrl=https://dblp.org/rec/conf/icyrime/RozynekR21 }} ==Water Potability Classification using Neural Networks== https://ceur-ws.org/Vol-3118/p05.pdf
Water Potability Classification using Neural Networks
Patryk Rozynek1 , Michal Rozynek1
1
    Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND


                                             Abstract
                                             Nowadays, the Internet of Things and intelligent systems are becoming very popular and sought after. These tools can be
                                             used for smart and fast analysis of different kinds of data. In this paper, we focus on the automatic analysis of water quality
                                             using artificial intelligence methods. As the main tools for this analysis, the k-nearest neighbor algorithm and an artificial
                                             neural network were used. Both methods were tested, and the results are discussed in terms of selecting the best tool for the
                                             task of water potability classification.

                                             Keywords
                                             water potability, artificial neural network, kNN algorithm



1. Introduction

Today, the Internet of Things (IoT) is one of the most important areas for developing and practically implementing smart solutions. In particular, recent years have shown that smart things can be used everywhere for different kinds of management and systems. One such area is water management and automatic evaluation/analysis. In [1], a system for water management was shown in which sustainable networks were analyzed. Moreover, similar solutions were shown in [2, 11]. The authors of this research use deep machine learning, such as a convolutional neural network, for the analysis of water pollution for agricultural irrigation resources. Machine learning solutions are used not only in water management, but also to detect and classify different objects on water. One such task is to analyze ships passing through certain areas. In [3, 4, 9], two solutions that take images on a river and use them for classification purposes were presented. Both solutions show practical potential for implementation, based on the real case studies performed. All machine learning solutions in IoT either use the whole data [5, 6, 7, 8, 10] for analysis or use extracted features [12, 13, 14, 15]. In both cases, the results show great accuracy for these approaches. In this paper, we propose a solution for analyzing water using two tools: the k-nearest neighbor (kNN) algorithm and an artificial neural network.

ICYRIME 2021 @ International Conference of Yearly Reports on Informatics Mathematics and Engineering, online, July 9, 2021
patrroz599@student.polsl.pl (P. Rozynek); patrroz599@student.polsl.pl (M. Rozynek)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Methodology

The algorithm classifies whether water with the given properties is safe to drink. We used two classification methods: the kNN algorithm and an artificial neural network. The code of the algorithms was written in Python. Most of the article compares these algorithms to see which one is more suitable for use in our project. The project was implemented using a database available on the website https://www.kaggle.com/. The kNN algorithm is based on computing the distance between the given sample and each object in the training set. We used the Minkowski metric to calculate the distance:

                   D(x, y) = ( Σ_{i=1}^{m} |x_i − y_i|^r )^{1/r}            (1)

In this way, we obtain a table of distances from our sample. We then sort it so that the shortest distances are at the front. The classification decision is made by the votes of the 'k' neighbors, that is, the 'k' first objects in the distance table. The result is the value of the class that received more votes. If there is the same number of votes for both variants of the class, the algorithm chooses the first of them and votes for it. Therefore, it is recommended that the number of neighbors be odd.

3. Artificial neural network

A neural network is software modeled after the operation of neurons in the human brain. It consists of three types of layers: input (which collects data and passes it on), hidden (where the connections between neurons are searched for and the learning process takes place), and output (which collects the conclusions and analysis results). Typically, a neural network is made up of many layers. The first layer - as in the case of images recorded, for example, by the optic nerves of a human - receives the raw input data. Each subsequent layer receives data resulting from the processing in the previous layer. What the last layer produces is the so-called system output. A neural network functions like the human brain: each neuron carries out its own simple calculations, and the network made up of all the neurons multiplies the potential of those calculations. Neural networks used in artificial intelligence are organized on the same principle - but with one exception: to perform a specific task, the connections between neurons can be adjusted accordingly. Searching for information



Patryk Rozynek et al. CEUR Workshop Proceedings                                                                      34–39




is the process of searching a specific set of documents relating to the subject or object indicated in the query, or containing facts needed by the user. However, this process has not been precisely and finitely defined by patterns, standards, or algorithms, and it is largely based on heuristics, in this case defined as a set of rules and guidelines that may or may not lead to the right solution. For each neuron, the sum of the products of the previous neurons' values and the associated synapses (weights) is calculated. The result is then passed on to the activation function. The formula for calculating the value of a neuron:

                   o = f( Σ_{i=0}^{n} x_i · w_i )             (2)

where o is the output, w are the weights, and x are the values of the previous neurons. The activation function can be any function; it is often taken to be the hyperbolic tangent. After all the values for the neurons in the hidden layer have been calculated, the output layer is computed in the same way. This layer has as many neurons as there are classes. Then the global error is calculated based on the values from this layer. If the error is smaller, the weights that were used are saved and the previous ones are forgotten.
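The per-neuron computation in Eq. (2), with a hyperbolic tangent activation, can be sketched as follows. This is a minimal illustration, not the authors' code, and all input and weight values are made up:

```python
import math

def neuron_output(inputs, weights):
    """Weighted sum of the inputs passed through a tanh activation, as in Eq. (2)."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(total)

def layer_output(inputs, weight_matrix):
    """Compute a whole layer: one row of weights per neuron in the layer."""
    return [neuron_output(inputs, w) for w in weight_matrix]

# Example: a hidden layer with 2 neurons fed by 3 inputs (illustrative values).
hidden = layer_output([0.5, 1.0, -0.2], [[0.1, 0.4, -0.3], [0.2, -0.1, 0.5]])
```

The output layer is computed the same way, with the hidden-layer values as its inputs.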
Heuristics is a method of finding solutions for which there is no guarantee of finding the optimal, or often even a correct, solution. Such solutions are used, for example, when the full algorithm is too expensive for technical reasons or when it is unknown. The method is also used to find approximate solutions, on the basis of which the final result is calculated using the full algorithm. The latter application primarily concerns cases where heuristics are used to direct the full algorithm towards the optimal solution, reducing the program runtime in the typical case without sacrificing the quality of the solution.

4. Description

Description of the Python code, by line number; (1) refers to the main kNN listing and (2) to the Minkowski distance listing:

     • 2 - 3 (1) After getting the input data, the training set is copied to a new object.
     • 4 (1) A distance table is created for the training set.
     • 5 - 6 (1) Minkowski is called for each sample in the training set. The first argument is an example set of water parameters, and the second is another sample from the training set.
     • 3 - 5 (2) The distance is counted.
     • 6 (2) The distance between the sample and the comparison object is returned to the distance table.
     • 7 (1) The algorithm sorts the list of distances so that the most similar cases are at the top of the list. The same changes are performed on the training set.
     • 8 - 9 (1) Now comes the vote. The k first records on the distance list are taken. Each neighbor votes for the class it belongs to.
     • 10 (1) The class with the most votes is returned. If the votes are tied, the first class, Potable, is returned.







                                                                  Figure 2: Graph showing the effectiveness of the algorithm
                                                                  depending on the number of neighbors




Figure 1: An example of the algorithm execution during the
research



Sample program execution is shown in Fig. 1.
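The steps described above can be sketched in Python as follows. This is a minimal reconstruction, not the authors' original listing; the water-parameter vectors and labels are made-up values, with Potable encoded as 1:

```python
def minkowski(x, y, r=2):
    """Minkowski distance between two samples, as in Eq. (1)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def knn_classify(sample, training_set, labels, k=3, r=2):
    """Classify `sample` by a majority vote among its k nearest neighbors.

    On a tie, the first class (Potable) wins, as described in the paper.
    """
    # Build the distance table and sort it together with the labels.
    distances = [minkowski(sample, row, r) for row in training_set]
    ranked = sorted(zip(distances, labels), key=lambda pair: pair[0])
    # Each of the k nearest neighbors votes for its own class.
    votes = [label for _, label in ranked[:k]]
    potable = votes.count(1)
    non_potable = votes.count(0)
    return 1 if potable >= non_potable else 0  # tie goes to Potable

# Illustrative call with made-up water-parameter vectors (e.g. pH, hardness).
training = [[7.0, 200.0], [6.5, 150.0], [8.1, 310.0], [5.9, 120.0]]
labels = [1, 0, 1, 0]
prediction = knn_classify([7.2, 210.0], training, labels, k=3)
```

An odd k avoids ties in the two-class case, which is why the paper recommends it.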


5. Experiments
5.1. KNN
The KNN algorithm achieved the results presented in Fig. 2. Based on Fig. 2, the accuracy of the method is very similar in all cases. For the three analyzed numbers of neighbors, k ∈ {1, 2, 3}, the accuracy stayed at the same level, namely 50%. We can say that the number of neighbors (for small values of k) is irrelevant to this classification task. But more importantly, why is the effectiveness so low? The database we used contains drinking and non-drinking water records with very similar parameters. For example, Fig. 3 shows potable and non-potable water samples. As can be seen, the example trait has very similar values for potable and non-potable water. The next example is shown in Fig. 4.

Figure 3: There is no difference between potable and non-potable water for an example trait

Figure 4: 0 - non-potable water, 1 - potable water. The average values are very similar

The Hardness trait for drinking water has a wider range of values, but the average values are very close to each other. To sum up, the kNN algorithm with a small number of neighbors k does not give promising results. An accuracy at the level of 50% cannot be used in a practical implementation.

5.2. Neural Network

The set was divided into a training set and a validation set in a 1:10 proportion. As a result, the program computed faster and the results were almost the same. To check which neural network is best for our program, we created 15 artificial neural network architectures with different numbers of hidden layers and different numbers of neurons in these layers. Three of these nets were deep and the rest were shallow. Each network was trained 1000 times. Additionally, the adopted parameters








Figure 5: Accuracy




Figure 6: Training Time



were the same for all networks and amounted to: 𝜑1 = 0.1, 𝜑2 = 0.9, Ω = random ∈ [0.2; 0.85].
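The paper does not spell out the training procedure, but the parameters 𝜑1, 𝜑2 and Ω, together with the particle and global-best terminology used in the conclusion, suggest a particle swarm optimization (PSO) style update of the network weights. A minimal sketch under that assumption follows; all names and numeric values here are illustrative, not the authors' code:

```python
import random

def pso_step(position, velocity, personal_best, global_best,
             omega, phi1, phi2):
    """One PSO update for a particle (here, a vector of network weights).

    omega is the inertia weight; phi1 and phi2 scale the pull towards the
    particle's personal best and the swarm's global best, respectively.
    """
    new_velocity = [
        omega * v
        + phi1 * random.random() * (pb - x)
        + phi2 * random.random() * (gb - x)
        for v, x, pb, gb in zip(velocity, position, personal_best, global_best)
    ]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

# Illustrative update of a 3-weight particle with the paper's parameter ranges.
omega = random.uniform(0.2, 0.85)   # Ω drawn at random from [0.2; 0.85]
pos, vel = pso_step([0.1, -0.3, 0.5], [0.0, 0.0, 0.0],
                    [0.2, -0.2, 0.4], [0.0, 0.0, 0.6],
                    omega, phi1=0.1, phi2=0.9)
```

With a large 𝜑2 relative to 𝜑1, particles are pulled mostly towards the global best, which matches the behavior described later in the conclusion.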
The obtained results and a comparison of the different architectures are presented in Figs. 5, 6, 7 and 8. Based on these results, the artificial neural network shows much better results than kNN.

6. Conclusion

After performing the tests and comparing the two classification methods, it can easily be stated that the results of the artificial neural network are closer to the truth. An obstacle in the KNN classification was that the values of each of the water properties were too similar. No matter how many neighbors there were, the effect was the same. It can be said that the algorithm guesses the result. No data manipulation, such as deleting one feature, gave







Figure 7: Accuracy




Figure 8: Training Time



better results. In fact, it is also difficult to judge whether the water is drinkable at all: with the characteristics listed in the database, it is not possible to determine whether the water is potable using the KNN algorithm. When comparing the results of the deep networks with the shallow ones, it can be said that they are not very different. Two of the deep nets had the two best accuracies, but the third is below average. Unfortunately, training a network with 5 hidden layers and 6 neurons in each of them took almost an hour. In this type of network, looking at the global error, it changed less frequently. This is because the network has more neurons and therefore more weights. It is more difficult for the network to change the weights so that the next calculated global error is better. Some of the weights may have changed for the better, but the network has not caught them because the error was not any smaller. The addition of the ability to remove some particles from the swarm made it possible to create new particles with random weights at startup. This is a good way to improve results without investing much time. Especially if the







parameters 𝜑1 and 𝜑2 are 0.1 and 0.2, the particles tend towards the best global particle rather than their own best positions. The best results were obtained with a network with 5 hidden layers and 2 neurons in each layer; it reached 59.32% accuracy and took almost 20 minutes to train.

References

 [1] H. M. Ramos, A. McNabola, P. A. López-Jiménez, M. Pérez-Sánchez, Smart water management towards future water sustainable networks, Water 12 (2020) 58.
 [2] H. Chen, A. Chen, L. Xu, H. Xie, H. Qiao, Q. Lin, K. Cai, A deep learning CNN architecture applied in smart near-infrared analysis of water pollution for agricultural irrigation resources, Agricultural Water Management 240 (2020) 106303.
 [3] D. Połap, M. Włodarczyk-Sielicka, N. Wawrzyniak, Automatic ship classification for a riverside monitoring system using a cascade of artificial intelligence techniques including penalties and rewards, ISA Transactions (2021).
 [4] T. Hyla, N. Wawrzyniak, Identification of vessels on inland waters using low-quality video streams, in: Proceedings of the 54th Hawaii International Conference on System Sciences, 2021, p. 7269.
 [5] V. Nourani, E. Foroumandi, E. Sharghi, D. Dąbrowska, Ecological-environmental quality estimation using remote sensing and combined artificial intelligence techniques, Journal of Hydroinformatics 23 (2021) 47–65.
 [6] M. Woźniak, M. Wieczorek, J. Siłka, D. Połap, Body pose prediction based on motion sensor data and recurrent neural network, IEEE Transactions on Industrial Informatics 17 (2020) 2101–2111.
 [7] Y. Shao, J. C.-W. Lin, G. Srivastava, D. Guo, H. Zhang, H. Yi, A. Jolfaei, Multi-objective neural evolutionary algorithm for combinatorial optimization problems, IEEE Transactions on Neural Networks and Learning Systems (2021).
 [8] C. Napoli, F. Bonanno, G. Capizzi, Exploiting solar wind time series correlation with magnetospheric response by using an hybrid neuro-wavelet approach, Proceedings of the International Astronomical Union 6 (S274) (2010) 156–158.
 [9] R. Brociek, G. D. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of COVID-19 by means of touch detection for retail stores, CEUR Workshop Proceedings 3092 (2021) 89–94.
[10] G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, G. Susi, A spiking neural network-based long-term prediction system for biogas production, Neural Networks 129 (2020) 271–279.
[11] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic RGB inference based on facial emotion recognition, CEUR Workshop Proceedings 3092 (2021) 66–74.
[12] G. Lo Sciuto, G. Capizzi, R. Shikler, C. Napoli, Organic solar cells defects classification by using a new feature extraction algorithm and an EBNN with an innovative pruning algorithm, International Journal of Intelligent Systems 36 (2021) 2443–2464.
[13] D. Połap, M. Woźniak, Meta-heuristic as manager in federated learning approaches for image processing purposes, Applied Soft Computing (2021) 107872.
[14] D. Połap, Analysis of skin marks through the use of intelligent things, IEEE Access 7 (2019) 149355–149363.
[15] G. Caggiano, C. Napoli, C. Coretti, et al., Mold contamination in a controlled hospital environment: a 3-year surveillance in southern Italy, BMC Infectious Diseases 14 (2014) 595.
[16] S. Acciarito, A. Cristini, L. Di Nunzio, G. M. Khanal, G. Susi, An a-VLSI driving circuit for memristor-based STDP, in: 2016 12th Conference on Ph.D. Research in Microelectronics and Electronics, PRIME 2016, art. no. 7519503.