=Paper=
{{Paper
|id=Vol-3118/p03
|storemode=property
|title=Hybridization of K Nearest Neighbors Classifier with
Cuckoo Search Algorithm
|pdfUrl=https://ceur-ws.org/Vol-3118/p03.pdf
|volume=Vol-3118
|authors=Katarzyna Prokop
|dblpUrl=https://dblp.org/rec/conf/icyrime/Prokop21
}}
==Hybridization of K Nearest Neighbors Classifier with
Cuckoo Search Algorithm==
Hybridization of K Nearest Neighbors Classifier with
Cuckoo Search Algorithm
Katarzyna Prokop
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND
Abstract
Artificial intelligence methods are one of the most used algorithms in big data and Internet of Things solutions. Therefore, a
very important aspect is to create new algorithms and improve existing ones. In this paper, the proposition of a hybrid method
for classifying elements in certain datasets is presented. The proposed method joins properties of πΎ Nearest Neighbors
Algorithm (a classifier) and Cuckoo Search Algorithm (a heuristic algorithm). The proposed model was presented, tested, and
discussed on a selected dataset of Iris Flowers. The effectiveness of the method is compared for different variants of input
parameters to show the efficiency of the proposition.
Keywords
data classification, knn, heuristic, clustering
1. Introduction propriate hypotheses. They are defined as methods of
searching solutions that have no guarantee of finding
In everyday life, people are constantly faced with the the optimal solution [5, 6]. Heuristics are used when
need to make choices. Depending on the method of se- conventional methods are too costly, e.g. due to the com-
lection, the decisions turn out to be more or less accurate putational complexity or when the appropriate algorithm
in relation to a given problem. It is often justified to use is not fully known. Therefore, it is possible that heuristics
certain mathematical techniques in the decision-making would solve the problem incorrectly. In the last years,
process for example to assess the probability that a choice heuristic algorithms were improved by applied paral-
will bring the expected result [1, 2, 3]. lel computing to decrease time complexities [7, 8] and
The main intention of the work is to propose a hybrid in many applications like classifications [9, 10, 11] and
method for classifying an object into one of the existing clustering[12, 13, 14].
groups from a specific data set. Such grouping of data is The effectiveness of the proposed method has been
related to the concept of clustering, which is the ordering tested for different sets of input parameters.
of database records into collections based on established
guidelines [4].
The idea of clustering implies the separation of ele- 2. Cuckoo Search Algorithm
ments into clusters with similar characteristics. Simul-
taneously objects belonging to one group should be less Cuckoo Search Algorithm [15] is an example of a stochas-
similar to elements from other clusters. This process is tic optimization method β in other words an optimization
unsupervised, based on measurements of the value of process whose essential element is the randomization of
the similarity metric between items without making any certain parameters [16]. Owing to the fact of using ran-
other assumptions. However, the clustering process does dom values the solution is quickly obtained. Cuckoo
not guarantee the ordering of the data while maintaining Search Algorithm aim is to find an answer for a given
the actual relations between the elements because there problem by imitating the behavior of some species of
are many possible divisions of the data set. cuckoos.
The aim of the hybrid classification method described The cuckoo is a medium-sized migratory bird in the
in the paper is to assign an object to the appropriate cuckoo Cuculidae family. It is representative of the nest-
group using a heuristic algorithm. The term βheuristicβ ing parasites which means animals that use the parents of
is derived from the Greek word heuresis which means other species to take care of their offspring. Cuckoos drop
invention and from the word heureka which stands for their eggs into foreign nests, where their progeny grows
βI have foundβ. In ancient times heuristics were con- up [17]. Over the generations cuckoos have learned to
cerned with solving problems based on formulating ap- adapt to their surroundings, making themselves similar
to other bird species to increase the chance to survive in
ICYRIME 2021 @ International Conference of Yearly Reports on a foreign nest. This ability is called mimicry [17], which
Informatics Mathematics and Engineering, online, July 9, 2021 can be also defined by masking. Cuckoos become similar
" ktn.prokop@gmail.com (K. Prokop)
Β© 2021 Copyright for this paper by its authors. Use permitted under Creative to the host species even before they hatch. The host is of-
CEUR
Workshop
Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
Proceedings
http://ceur-ws.org
ISSN 1613-0073
ten unable to distinguish the cuckoos from his offspring
18
Katarzyna Prokop CEUR Workshop Proceedings 18β25
due to the mimicry, which results in him raising these where 0 means throwing away the intruder from the nest,
individuals like his own hatchlings. However, when he 1 saving the cuckoo, π is a random value in the range
discovers the intruder, throws it out of the nest without (0, 1), generated separately for each individual and the
hesitation. minimum of the function π (Β·) is searched. If the function
Every cuckoo in the algorithm is interpreted as a point π (Β·) were to be maximised, the function π(Β·, Β·)would take
in n-dimensional space. The individual is evaluated re- the form:
garding the value of the fitness function based on taken
coordinates. Thanks to that, the best cuckoos solving a
given problem can be selected. All birds are in the nest,
which is the set of the currently best-adapted points. Ad- (4)
ditionally, the movement of cuckoos takes place. It is When the final set of the best solutions is known and
set by a given equation of displacement. Another im- all the actions characteristic of the algorithm were per-
portant element of the algorithm is the so-called βhost formed a certain number of times (iteratively), the last
decisionβ, when the cuckoo can be thrown away from the step is to find the best individual. This is done by compar-
nest and replaced with another, newly hatched bird. This ing the values of the fitness function for all the cuckoos
operation depends on the comparison of the values of in the nest.
the fitness function for both individuals and the assumed The structure of the algorithm is presented in pseu-
probability detection of an intruder in the nest. docode 1.
The mathematical model of the cuckoo search algo-
rithm can be presented in the steps described below. Algorithm 1 Cuckoo Search Algorithm.
The algorithm starts by generating an initial cuckoo
population. It is assumed that the cuckoo is the point of
n coordinates (depending on the size of the problem, i.e.
the number of variables of the fitness function π (Β·)):
(1)
where π₯1 , ..., π₯π a re chosen e.g. randomly from a given
interval. Then each individual is displaced using LΓ©vyβs
movement. The flight of the cuckoos is modeled as the
following function:
(2)
where π₯π is the appropriate coordinate and π and π are the
initially determined coefficients. Therefore, the Cuckoo
Search Algorithm will depend on these parameters. In
addition, the value of displacement depends on the fixed
step πΌ > 0. Then the value πΌΒ·π(π₯; π; π) be added to every
coordinate to move the cuckoo. In nature, many animal
species use this type of movement strategy when they
do not remember or have no information about where
the food can be found. 3. K Nearest Neighbors Algorithm
Another operation that goes into the algorithm is βhost
The K Nearest Neighbors Algorithm is a simple classi-
decisionβ. Detection of intruder birds takes place with
fier, which is based on finding π elements in a given
a certain fixed probability π β (0, 1). Removal of a bird
database as similar as possible to the tested element, i.e.
takes place if and only if the individual π¦ which is com-
its so-called neighbors [18]. Then it is assumed that the
pared with individual π₯ is better suited to the fitness
neighbors by βvoice of the majorityβ will settle to which
function. The decision process can be written as the
group the new record will belong. The result of the clas-
following function π(Β·, Β·):
sification is the group that contains the largest number
of neighbors. The similarity of records is determined
by measuring the distance between individual elements,
which can be treated as vectors. For this measurement,
(3) the Euclidean or the Manhattan metric can be used.
19
Katarzyna Prokop CEUR Workshop Proceedings 18β25
Suppose that π and π are records with π characteristics, given database allows us to determine the configuration
between which the distance is measured. The elements of that will assign the class with the highest probability of
the dataset are interpreted as vectors. Then the Euclidean correctness.
distance function can be defined as:
4. Hybrid classification method
Heuristic algorithms can be used in the process of data
(5) clustering. The idea of this concept is based on combin-
ing the classic classifier with the heuristic. This paper
and the Manhattan distance function can be described
presents a hybrid classification method that combines the
by the formula:
K Nearest Neighbors Algorithm with the Cuckoo Search
Algorithm.
Let π¦ = π¦1 , ..., π¦π be a π-dimensional vector of un-
known group. The task of the method is to assign a vector
(6) to the appropriate cluster. The idea is that cuckoos, which
are points in πdimensional space, can be identified with
These functions satisfy the condition of the identity records of the database with π features. Then the initial
of indiscernibles, symmetry, and triangle inequality thus population is generated by assigning one specific element
represent metrics. from the dataset to a single cuckoo. A population is there-
The K Nearest Neighbors Algorithm is an example fore a set of a size equal to the size π of the considered
of a lazy classification method [18]. It does not gather database, and the cuckoo coordinates will reflect the val-
any information about a given problem. A solution is ues of records features. As a consequence the ππ‘β cuckoo
selected in real-time, directly after passing along a vector storing values from the ππ‘β record where π = 1, ..., π
to classification. can be represented as
Let π₯π = (π₯π1 , ..., π₯ππ ) be the ππ‘β record with π features
for π = 1, ..., π where π is the number of all vectors in
the dataset. It is clear that π β₯ π. An element given to be (7)
classified is described by π¦ = (π¦1 , ..., π¦π ). The distance The cuckoos are displaced by Levyβs movement ac-
measure π was assumed. Then the K Nearest Neighbors cording to the formula (2). This is followed by a decision
Algorithm may be represented by the pseudocode 2. process called βhost decisionβ, described as a function (3).
In this case, the π fitness function is the distance function
Algorithm 2 K Nearest Neighbors Algorithm. (Euclidean, described by the formula (5) or Manhattan,
described by the function (6)). Its role is to measure the
distance between the cuckoo and the considered vector
π¦ The result of the βhost decisionβ is to save the best
specimens in the nest, i.e. records with the shortest dis-
tance from π¦. Such refinement of the population occurs
a certain number of times iteratively. The next step is to
find the best cuckoo in a final population.
The above operations are performed exactly π times to
get π of the best individuals, one from each population.
These cuckoos represent the π nearest neighbors of the K
Nearest Neighbors Algorithm. The last step is to select an
appropriate cluster for the π¦ vector based on information
about groups stored by neighbors. It is decided by the
βmajority voteβ, which means that the vector π¦ is assigned
to the most frequently appearing cluster.
The concept of using the Cuckoo Search Algorithm in
classification is presented by the pseudocode 3.
5. Experiments
Depending on the established distance measure and
the number of neighbors, it is possible to obtain differ- The hybrid classification method was tested in the
ent classification results. Testing various variants for a decision-making process, the purpose of which was to
20
Katarzyna Prokop CEUR Workshop Proceedings 18β25
Algorithm 3 Application of the cuckoo algorithm in Table 1
classification. Arithmetic mean of the πππ coefficient for π‘ = 100, πΌ =
0.5, π = 0.2, π = 0.2.
method, according to the following formula:
(8)
where ππ is the number of correct matches and ππ‘ is
the number of all tested items. The final results of the
obtained correctness have been rounded to two decimal
places.
The experiments were performed for eight different
sets of input parameters π‘, πΌ, π, π. Depending on the used
match the corresponding class to the object from the distance function (the tested variants are the Euclidean
database. A dataset of iris flowers was used for this distance (5) and the Manhattan distance (6)) and the num-
task. These data were shared in 1936 by the British biolo- ber of π nearest neighbors of the K Nearest Neighbors
gist and statistician Ronald Fisher [19]. The set contains Algorithm, the effectiveness was compared. The program
150 records, each of which consists of 5 attributes that was launched five times for each of the integers π in the
determine the length of the plot of the flower cup, the range [1, 15] using the Euclidean distance, and then the
width of the plot, the length of the petal and its width, same was repeated using the Manhattan distance. The
as well as the name of the species. Therefore, the first arithmetic mean of the correctness coefficient for the
four attributes are expressed by numerical values, while obtained effects (expressed by the formula (8)) was calcu-
the species name is presented in words. The collection lated for each value of π and the used type of the distance
contains data on three species of irises: Iris setosa, Iris function. The results are presented in the form of a table
virginica, Iris versicolor. for each parameter set, respectively.
To test the method, the program was created. Its task The first set of input parameters was π‘ = 100, πΌ =
was to randomly select 30 records to be checked, which 0.2, π = 0.2, π = 0.2 regardless of the value of the π
is 20% of all objects in the database. For testing, these parameter, the Euclidean distance was 100% effective in
objects were stripped of the fifth attribute (species name), matching the iris species to the tested record. In the case
which was remembered by the program to later deter- of the Manhattan distance, the obtained values of the πππ
mine whether the match using the hybrid classification coefficient were slightly lower, but the arithmetic mean
method was successful. The test set was subjected to the for each value of π did not fall below 74%. The maximum
method so that each record was matched with a class value was 90%. Detailed results are presented in the table
(species of the iris). On the output, the program returns 1.
information about the obtained πππ correctness of the Then πΌ was increased to value 0.7, leaving the rest of
the parameters unchanged. The program efficiency was
21
Katarzyna Prokop CEUR Workshop Proceedings 18β25
Table 2 Table 3
Arithmetic mean of the πππ coefficient for π‘ = 100, πΌ = Arithmetic mean of the πππ coefficient for π‘ = 100, πΌ =
0.7, π = 0.2, π = 0.2. 0.2, π = 0.5, π = 0.5.
Table 4
again flawless for the Euclidean distance. For the Manhat- Arithmetic mean of the πππ coefficient for π‘ = 100, πΌ =
tan distance, the arithmetic means of the πππ coefficient 0.7, π = 0.5, π = 0.5.
assumed values from 72% to 86%, which is presented in
the table 2. It follows that for a higher value of the πΌ
parameter, grouping using the Manhattan distance pro-
duced slightly worse results.
In the next step, the effectiveness of the method was
tested for the number of iterations π‘ = 100 and the
coefficients πΌ = 0.2, π = 0.5, π = 0.5. Concerning
the first test, the π and π parameters have been changed.
The use of the Euclidean distance provided 100% of the
correctness of the obtained results for all π values. For
the Manhattan distance, the program achieved average
effectiveness ranging between 76.66% and 89.33%, which
can be read from the table 3. There was no significant
increase in the correctness of the matches πππ to the
change in the values of the input parameters.
In the fourth test, the πΌ parameter was increased to
0.7 again, leaving the other parameters unchanged com-
pared to the previous test. The results are shown in the
table 4. The arithmetic mean of the πππ coefficient for tance was the smallest 78% and the largest 87.33%. A
the Euclidean distance in all cases was 100%. For the much smaller range of the obtained results was observed
Manhattan distance, this value varied between 78.67% compared to the previous tests.
and 91.33%. The latter number represents the highest av- In the sixth variant, the parameter set from the fifth
erage performance achieved so far using the Manhattan test was modified by increasing the πΌ parameter to 0.75.
distance. Compared to the previous test, the modification The results are presented in the table 6. The program
of the πΌ parameter positively influenced the operation unmistakably allocated the species to the irises using the
of the program. Euclidean distance for all the considered π values. Using
The next four tests were performed for the number of the Manhattan distance, an average πππ between 74% and
iterations π‘= 250. At the beginning, the following param- 91.33% was obtained. Such a high value did not occur in
eter values were adopted: πΌ = 0.05, π = 0.7, π = 0.33. the previous test, when πΌ = 0.05, but compared to that
The results were placed in the table 5. Again, the method variant of parameters, the range of the arithmetic mean
was 100% effective for the Euclidean distance. The arith- of the πππ coefficient in this sample is larger.
metic mean of the πππ coefficient for the Manhattan dis- In the penultimate test, the πΌ arameter was returned to
22
Katarzyna Prokop CEUR Workshop Proceedings 18β25
Table 5 Table 7
Arithmetic mean of the πππ coefficient for π‘ = 250, πΌ = Arithmetic mean of the πππ coefficient for π‘ = 250, πΌ =
0.05, π = 0.7, π = 0.33. 0.05, π = 1, π = 1.
Table 6 Table 8
Arithmetic mean of the πππ coefficient for π‘ = 250, πΌ = Arithmetic mean of the πππ coefficient for π‘ = 250, πΌ =
0.75, π = 0.7, π = 0.33. 0.75, π = 1, π = 1.
0.05, while π and π where assumed to be π = π = 1. The tance produced slightly worse results. The efficiency
result of the program is shown in the table 7. Using the of the program using this distance function, depending
Euclidean distance again ensured that the method was on the different values of π,ranged between 70.67% and
100% effective, while for the Manhattan distance the mean 87.33%. A fairly low value of the coefficient for π= 2
of the πππ ranged from 75.33% to 88.67%. Modifications to can be noticed. This is the lowest average value of πππ
π and π have not been observed to significantly improve obtained during method testing. Modifying the πΌ param-
program operation. eter did not bring any significant benefits in terms of the
The last test was carried out for the number of iter- methodβs efficiency.
ations π‘= 250, while for the remaining parameters the
values were π = π = 1, as in the previous test, and the
πΌ parameter was increased to 0.75. Table 8 presents the
6. Conclusions
results obtained during this experiment. The program The conducted tests showed that the hybrid classification
matched all records with the species correctly when it method solves the problem of matching the appropriate
used the Euclidean distance. Using the Manhattan dis-
23
Katarzyna Prokop CEUR Workshop Proceedings 18β25
Figure 1: Arithmetic mean of the method effectiveness depending on the variant of input parameters.
species for the iris without mistakes when the Euclidean waspas decision-making approach, Journal of En-
distance is used in the K Nearest Neighbors Algorithm. terprise Information Management (2021).
When using the Manhattan distance function, the user [3] D. PoΕap, M. WΕodarczyk-Sielicka, N. Wawrzyniak,
can expect most records to be classified correctly, but Automatic ship classification for a riverside mon-
there are few mistakes. The lowest mean value of the itoring system using a cascade of artificial intelli-
πππ correctness coefficient obtained was 70.67%, which gence techniques including penalties and rewards,
roughly means that 21 valid matches were made per 30 ISA transactions (2021).
of the records under test. For the Manhattan distance [4] P. Kulczycki, M. Charytanowicz, Kompletny al-
several times one of the highest effectiveness of the pro- gorytm gradientowej klasteryzacji, Sterowanie i
gram was noted for π=13. However, the differences are automatyzacja: aktualne problemy i ich rozwiaza-
not large enough to clearly state that this is the rule. It nia (2008) 312β321.
was not observed that any of the tested values of π sig- [5] A. V. Aho, J. E. Hopcroft, J. D. Ullman, Algorytmy i
nificantly improved the effectiveness of the method, but struktury danych, Helion, 2003.
in most cases, the use of the π parameter with the value [6] M. WoΕΊniak, D. PoΕap, G. Borowik, C. Napoli, A
from the set 1, 2, 3 resulted in a decrease in the program first attempt to cloud-based user verification in dis-
effectiveness. The graph 1 shows the arithmetic mean tributed system, 2015, p. 226 β 231. doi:10.1109/
of the obtained results for eight trials with specific input APCASE.2015.47.
parameters. [7] D. PoΕap, K. KΔsik, M. WoΕΊniak, R. DamaΕ‘eviΔius,
Parallel technique for the metaheuristic algorithms
using devoted local search and manipulating the
References solutions space, Applied Sciences 8 (2018) 293.
[8] M. WoΕΊniak, D. PoΕap, C. Napoli, E. Tramontana,
[1] U. Ahmed, J. C.-W. Lin, G. Srivastava, P. Fournier-
Graphic object feature extraction system based on
Viger, A transaction classification model of fed-
cuckoo search algorithm, Expert Systems with Ap-
erated learning, in: International Conference on
plications 66 (2016) 20 β 31. doi:10.1016/j.eswa.
Industrial, Engineering and Other Applications of
2016.08.068.
Applied Intelligent Systems, Springer, 2021, pp. 509β
[9] G. De Magistris, S. Russo, S. J. Roma, P. and,
518.
C. Napoli, An explainable fake news detector based
[2] M. Alrasheedi, A. Mardani, A. R. Mishra, P. Rani,
on named entity recognition and stance classifica-
N. Loganathan, An extended framework to evalu-
tion applied to covid-19, Information (Switzerland)
ate sustainable suppliers in manufacturing compa-
13 (2022). doi:10.3390/info13030137.
nies using a new pythagorean fuzzy entropy-swara-
[10] D. PoΕap, M. WoΕΊniak, R. DamaΕ‘eviΔius,
R. MaskeliuΜnas, Bio-inspired voice evalua-
24
Katarzyna Prokop CEUR Workshop Proceedings 18β25
tion mechanism, Applied Soft Computing 80 (2019)
342β357.
[11] G. Lo Sciuto, G. Capizzi, S. Coco, R. Shikler, Geo-
metric shape optimization of organic solar cells for
efficiency enhancement by neural networks, Lec-
ture Notes in Mechanical Engineering (2017) 789 β
796. doi:10.1007/978-3-319-45781-9_79.
[12] G. Capizzi, G. Lo Sciuto, M. WoΕΊniak, R. DamaΕ‘e-
viΔius, A clustering based system for automated oil
spill detection by satellite remote sensing, Lecture
Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lec-
ture Notes in Bioinformatics) 9693 (2016) 613 β 623.
doi:10.1007/978-3-319-39384-1_54.
[13] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari,
M. Panella, M. Re, A. Rosato, S. Span, A paral-
lel hardware implementation for 2-d hierarchical
clustering based on fuzzy logic, IEEE Transactions
on Circuits and Systems II: Express Briefs 68 (2020)
1428β1432.
[14] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wa-
jda, Automatic rgb inference based on facial emo-
tion recognition, volume 3092, 2021, p. 66 β 74.
[15] X.-S. Yang, S. Deb, Cuckoo search via lΓ©vy flights,
in: 2009 World congress on nature & biologically
inspired computing (NaBIC), Ieee, 2009, pp. 210β
214.
[16] G. Capizzi, C. Napoli, S. Russo, M. WoΕΊniak, Lessen-
ing stress and anxiety-related behaviors by means
of ai-driven drones for aromatherapy, volume 2594,
2020, p. 7 β 12.
[17] J. Sudyka, Genetyczne zagadki specjalizacji u
kukuΕki cuculus canorus, Ornis Polonica 53 (2012).
[18] Q. Kuang, L. Zhao, A practical gpu based knn al-
gorithm, in: Proceedings. The 2009 International
Symposium on Computer Science and Computa-
tional Technology (ISCSCI 2009), Citeseer, 2009, p.
151.
[19] R. A. Fisher, The use of multiple measurements in
taxonomic problems, Annals of eugenics 7 (1936)
179β188.
25