=Paper= {{Paper |id=Vol-2386/paper10 |storemode=property |title=Two Step Density-Based Object-Inductive Clustering Algorithm |pdfUrl=https://ceur-ws.org/Vol-2386/paper10.pdf |volume=Vol-2386 |authors=Volodymyr Lytvynenko,Irina Lurie,Jan Krejci,Mariia Voronenko,Nataliіa Savina,Mohamed Ali Taif |dblpUrl=https://dblp.org/rec/conf/momlet/LytvynenkoLKVST19 }} ==Two Step Density-Based Object-Inductive Clustering Algorithm== https://ceur-ws.org/Vol-2386/paper10.pdf

Two Step Density-Based Object-Inductive Clustering
Algorithm

Volodymyr Lytvynenko1[0000-0002-1536-5542], Irina Lurie1[0000-0001-8915-728X], Jan Krejci2
[0000-0003-4365-5413]
, Mariia Voronenko1[0000-0002-5392-5125], Nataliіa Savina3[0000-0001-8339-1219],
Mohamed Ali Taif 1[0000-0002-3449-6791]
1
Kherson National Technical Uneversity, Kherson, Ukraine,
2
Jan Evangelista Purkyne University in Usti nad Labem, Czech Republic,
3
National University of Water and Environmental Engineering,
Rivne, Ukraine,
immun56@gmail.com,lurieira@gmail.com,mary_voronenko@i.ua,
taifmohamedali@gmail.com, jan.krejci@ujep.cz,
n.b.savina@nuwm.edu.ua

Abstract. The article includes the results of study into the practical implementa-
tion of two-step DBSCAN and OPTICS clustering algorithms in the field of ob-
jective clustering of inductive technologies. The architecture of the objective
clustering technology was developed founded on the two-step clustering algo-
rithm DBSCAN and OPTICS. The accomplishment of the technology includes
the simultaneous data’s clustering on two subsets of the same power by the
DBSCAN algorithm, which involve the same number of pairwise objects simi-
lar to each other with the subsequent correction of the received clusters by the
OPTICS algorithm. The finding the algorithm’s optimal parameters was carried
out based on the clustering quality criterion's maximum value of a complex bal-
ance, which is rated as the geometric average of the Harrington desirability in-
dices for clustering quality criteria (internal and external).

Keywords: Clustering, Density-based clustering, Objective clustering, Induc-
tive clustering, clustering quality criteria, Two Step Clustering, DBSCAN,
OPTICUS

1 Introduction

Clustering is the primary method for extracting data. The task of clustering is a spe-
cial case of the task of learning without a teacher and reduces to break the set of data
objects into subsets so that the elements of one subset are significantly different in
some set of properties from the elements of all other subsets. Clustering can be a pre-
processing step in other data extraction applications.
There are many different clustering algorithms. Some of them divide the set into a
known amount of clusters, but some of them automatically select the amount of clus-
ters.
The density-based algorithm is a highly efficient and simple algorithm [1]. Differ-
ent methods are best suited for different databases. In this paper, we consider the
DBSCAN and OPTICS clustering algorithms, which are used to find clusters of vari-
ous shapes, densities and sizes in spatial data sets with noise.
The clustering algorithm, Named DBSCAN, (Density Based Spatial Clustering of
Applications with Noise) was proposed in [1]. It is based on the assumption that the
density of points, which are located inside the clusters, is greater than behind the clus-
ters. This algorithm allows finding nonlinearly separable clusters of arbitrary shape. It
can detect clusters completely encircled, but not connected with other clusters. It does
not need specification of the amount of clusters, distinguishes noise and is resistant to
outliers.
However, the DBSCAN algorithm is not without flaws. The boundary points that
can be reached from more than one cluster can belong to any of these clusters, which
rely on the order of viewing the points.
OPTICS clustering algorithm (Ordering points to identify the clustering structure)
as well as DBSCAN allows finding clusters in data space based on density and was
proposed in [2]. However, unlike DBSCN, this algorithm uses the distance between
neighboring objects to obtain the availability field, which is used to separate clusters
of different densities from noise, which solves the problem of finding content clusters
in data that have different densities. To do this, the data is ordered, so that the spatial-
ly close points become adjacent in the ordering. For each point, a special distance is
stored that represents the density that should be taken for the cluster so that the points
belong to the same cluster. The result of this procedure is presented in the form of a
dendrogram.
Algorithms based on density are highly efficient and simple algorithms [1]. Differ-
ent methods are best suited for different databases. Here we are dealing with
DBSCAN and OPTICS, which are used to find clusters of various shapes, densities
and sizes in spatial data sets with noise.
The idea underlying this algorithm is that inside each cluster there is a typical den-
sity of points (objects), which is noticeably higher than the density outside the cluster,
as well as the density in areas with noise lower than the density of each cluster.
On the other hand, inductive clustering methods [3] allow for inaccurate noisy data
and short samples, using the minimal amount of the chosen quadratic criterion, to find
a non-physical model (decision rule), the accuracy of which is less than the structure
of the full physical model.
Examining the set of candidate models by external criteria is necessary only for
non-physical models. In case of small dispersion of interference, it is advisable to use
internal search criteria. With increasing interference, it is advisable to move to non-
parametric algorithms. The use of inductive clustering methods is advisable because
they almost always ensure that the optimal amount of clusters is found that is ade-
quate for the noise level in the data sample.
The main idea of this work is to combine the density algorithms DBSCAN and
OPTICS, which allow you to recognize clusters of various shapes, as well as define
content clusters for data with different densities and in the form of a two-step algo-
rithm and an inductive clustering method that will significantly improve the accuracy
when recognition of complex objects. It is assumed that by combining these methods,
it is possible to solve some of the problems listed above with a sufficiently high re-
sult.
The aim of the work is to develop a methodological basis for constructing hybrid
inductive cluster-analysis algorithms for isolating (clustering) objects with complex
non-linear forms with high recognition accuracy and resolution.

2 Review of the Literature

The classification of several clustering algorithms by their categories is presented in
[4]. Each of them has its advantages and disadvantages. The choice of an appropriate
clustering algorithm is definited by the type of data being examined and the purpose
of the current task.
Non-parametric algorithms capable of distinguishing clusters of arbitrary shape al-
so allow obtaining a hierarchical representation of data.
The approach used by these algorithms for non-parametric density estimation is
that the density is characterized by the number of nearby elements [5]. Thus, the prox-
imity of a pair of elements is determined by the amount of common neighboring ele-
ments. The most prominent representative of this approach is the DBSCAN clustering
algorithm [6].
Its basic idea is that if an element in the radius keeps a specified amount (MinPts)
of neighboring elements, then all its “neighbors” are placed in the same cluster with
it. Elements that do not have a sufficient number of "neighbors" and are not included
in any cluster belong to "noise". DBSCAN allows you to select clusters of complex
shape and cope with the choices and "noise" in the data. The disadvantages of the
algorithm are the complexity of setting parameter values (and MinPts) [7] and the
difficulty in identifying clusters with significantly different densities.
The OPTICS algorithm [8] is a generalization of DBSCAN, where elements are
ordered into a spanning tree so that the spatially close elements are close together. In
this case, there is no need to carefully adjust the appropriate parameter, and the result
is a hierarchical result [5]. One of the major drawbacks of the existing clustering algo-
rithms is the reproducibility error. The basic idea for solving this problem was pro-
posed in [9].
In [10,11], the authors showed that a decrease in reproducibility error can be
achieved through the use of inductive modeling methods for complicated systems,
which are a logical prolongation of group data processing methods. The issues of
creating a methodology for analyzing inductive systems as a tool for analytical plan-
ning of engineering research are considered in [12]. In [13], the authors first proposed
a hybrid inductive clustering algorithm based on DBSCAN.
The work [14] presents the results of computational experiments using objective
cluster inductive technology of multidimensional high-dimensional data. The authors
showed that the implementation of this technology based on some clustering algo-
rithm involves determining the affinity function between objects, clusters, and ob-
jects, and clusters at the first stage. Then we need to share the investigated data into
two subsets of the same power, which contain the same number of pairs of similar
objects. The formation of quality criteria for the clustering of internal, external and
complex balance should be carried out at the next stage. Optimal clustering is deter-
mined on the basis of the extreme values of the criteria used in the sequential enumer-
ation of admissible clustering.
The article [15] describes the study’s results of the practical accomplishment of the
DBSCAN clustering algorithm within the objective clustering of inductive technolo-
gy. In this paper, the finding of the optimal parameters of the algorithm was per-
formed by use of the complex criterion maximum value for the quality of clustering,
which is calculated as the geometric average of the indicators of the desirability of
Harrington for external and internal criteria for the quality of clustering.
The work [16] investigated the problem of clustering complex data of inductive ob-
jective clustering technology. A practical accomplishment of the hybrid data cluster-
ing model based on the integrated use of R and KNIME software tools has been im-
plemented. The model performance was evaluated using various types of data. The
simulation results showed the high efficiency of the proposed technology. It is shown
that the proposed method allows reducing the reproducibility error value because the
final decision on determining the optimal parameters of the clustering algorithm is
made on the basis of parallel analysis of clustering results obtained on equally power-
ful data sets taking into account the difference in clustering results obtained on these
subsets.
In this article, we describe a hybrid model of an objective cluster inductive tech-
nology founded on the two-step clustering algorithm DBSCAN and OPTICS. The
practical implementation of the proposed model was performed on R.

3 Problem Statement

The formulation of the clustering problem is as follows: let X is the set of objects, Y
is the set of amounts (names, labels) of clusters. The function of the distance between
objects   x, x   is also set. It is necessary to divide the sample into subsets that do
not overlap (clusters), so that each cluster composes of objects  that are close in
metric, and the objects of different clusters are significantly different. In addition,
each object xi  X m corresponds to a cluster number yi . In this case, the clustering
algorithm can be considered as a function a : X  Y that assigns a cluster number
y  Y to any object x  Y . In some cases, the set Y is known in advance, but more
often the task is to find the optimal amount of clusters according to one or another
criterion of the quality of clustering [3].
In inductive clustering methods, the cluster model is selected using the minimum
external balance criterion, which characterizes the quality of clustering of the corre-
sponding model on two identical power sets.
Formally, the optimal inductive clustering model can be presented as:


M : R  K  | e  e0 ,   0 
 
CR
 opt  (1)
RK  is the result of clustering, e is the error of clustering on the training and test
samples,  is the time interval of the clustering process, CR is the set of internal and
external criteria for assessing the quality of clustering.

4 Materials and Methods

4.1 DBSCAN Сlustering Аlgorithm
The idea underlying the algorithm is that inside each cluster there is a typical density
of points (objects), which is noticeably higher than the density outside the cluster, as
well as the density in areas with noise below the density of any of the clusters. For
each point of the cluster, its neighborhood of a given radius must contain at least a
certain amount of points, this amount of points is specified by a threshold value.
Most algorithms that produce a flat partition create clusters in the form close to
spherical, since they minimize the interval of documents to the center of the cluster
[17].
DBSCAN authors have shown experimentally that their algorithm is capable of
recognizing clusters of different shapes. The basic idea behind the algorithm lies in
the fact that within each cluster there is a typical density of points (objects) that is
noticeably higher than the outside density of the cluster, as well as the density in areas
with noise below the density of any of the clusters. Even more precisely, for each
point of the cluster, its neighborhood of a given radius must contain some amount of
points, this amount of points is given by the limit values [18].
The basis of this algorithm is several definitions [18]:
  is the vicinity of the object is called the outskirts of the radius  of some
object;
 the root object is named an object  is the neighborhood of which contains
some minimum amount of MinPts objects;
 the object p is directly tightly accessible from the object q if p located in 
is the neighborhood q and q is the root object;
 the object p is the tight reachable from the object q for the given  and the
parameter MinPts, if there is a sequence of objects p, , p , where p  q and
p  p such that p  1 is directly densely achievable with p , 1  i  n ;
 the object p is tightly connected to the object q when given  and MinPts, if
there is an object o that p is the same as q the available volume from o .

To search for clusters, the DBSCAN algorithm checks  is the neighborhood of each
object. If  is the neighborhood of the object p contains more points than MinPts,
then a new cluster with a root object p created. Then, DBSCAN iterative collects
objects directly tightly reachable from the root objects, which can lead to the union of
several tight reachable clusters. The process is completed when no new object can be
added to one cluster.
Although the DBSCAN algorithm does not need the pre-specified amount of clus-
ters received, it will be necessary to specify parameters values  and MinPts that
directly affect the clustering result. The optimal values of these parameters are diffi-
cult to determine, especially for multidimensional data spaces. In addition, the distri-
bution of data in such spaces is often asymmetric, which does not allow them to be
used for clustering of global density parameters.
The work of the DBSCAN algorithm is as follows.
Enter: the set of objects S, Eps and MinPt.
An object can be in one of three states:
1. Not noted.
2. It is noted that no cluster is the internal object.
3. Attributed to some cluster.
Step 1. To set all the elements of the set S flag S "not marked". Assign the current
cluster C j to a zero number, j  0 . The set of noise points Noise = 0.
Step 2. For each si  S such flag  si  = "not marked", execute:
Step 3. Flag  si  = "not marked";
Step 4 Ni  N Eps  si   q  S dist  si , q   Eps
Step 5. If si  MinPt , then Noise  Noise  si 
Otherwise the number of the next cluster j  j  1 ;

EXPANDCLUSTER si , Ni , C j , Eps, MinPt ; 
 
Exit: The set of clusters C  C j .
EXPANDCLUSTER
Login: The current object si , its eps neighbor Ni , the current cluster Ni and
Eps, MinPt .
Step 1 C j  C j  si  ;
Step 2. For all points sk  Ni :
Step 3. If the flag  sk  = "not marked", then
Step 4 flag  sk  = "marked";
Step 5. N ik  N Eps  sk  ;
Step 6. If N ik  MinPt , then Ni  Ni  Nik ;
Step 7. If ∄ p : sk  C p , p  1,  C  , those C j  C j  sk  ;
Exit: cluster C j .

As the research shows [18], the considered clustering algorithm has a number of ad-
vantages that make it possible to use this method for working with clusters of differ-
ent nature (forms); the application of this algorithm allows you to work with large-
scale samples and allows you to work with n-dimensional objects (these are objects
whose attributes are more than 3 if the function is appropriately selected for calculat-
ing the distance (in the general case it is possible to use the Markov metric) However,
a significant disadvantage is a rather laborious procedure for determining the required
parameters for the correct operation of the algorithm. More detailed descriptions and
drawbacks of the DBSCAN algorithm are shown in Table 1.

Table 1. Advantages and disadvantages of the DBSCAN algorithm

Advantages Disadvantages
1. DBSCAN can find arbitrary clus- 1. DBSCAN is not a fully deterministic algo-
ters. rithm: the boundary points that can be ac-
2. DBSCAN has a notion of noise and cessed from more than one cluster may be
is resistant to emissions, i.e. all part of another cluster, depending on the
emissions are made in a separate order of data processing.
cluster. 2. The quality of DBSCAN operation depends
3. Does not need a priori task of the on the distance used. The most commonly
amount of clusters, in contrast to used Euclidean distance; But for multidi-
the K-mean algorithm. mensional data, this indicator can be almost
4. Uses only two parameters and is useless due to the so-called "curse of di-
basically not sensitive to the order- mension", which makes it difficult to find
ing of points in the database. the nearest value for ε. This effect is also
5. Allows working with samples of located in any other algorithm based on the
large dimensional data. Euclidean distance.
6. Defining the parameters MinPts 3. DBSCAN cannot copy the data to a large
and ε allow working with n- difference in density, since the combination
dimensional objects provided that MinPts ε cannot be selected appropriately
an appropriate function is selected for all clusters.
for the calculation of the distance. 4. If the scale and data are not understandable,
it may be very difficult to choose a signifi-
cant distance from the threshold ε.
5. Significant drawback is a very laborious
procedure for determining the required pa-
rameters for the correct algorithm proce-
dure.

In the general case, the DBSCAN algorithm has a quadratic computational complexi-
ty due to the search for the Eps is neighborhood. However, the authors of the algo-
rithm used a special data structure for this purpose R * are trees, as a result, the search
for Eps is neighborhood for one point O (log n). The total computational complexity
of DBSCAN is O (n * log n) [19].

4.2 OPTICS Сlustering Аlgorithm

The concept of the OPTICS algorithm [8] is similar to DBSCAN, but the algorithm is
designed to get rid of one of the main weaknesses of the DBSCAN algorithm is the
problem of finding content clusters in data that has different densities.
To do this, the database points are (linearly) ordered so that the spatially close
points become adjacent in the ordering. In addition, for each point, a special distance
is stored that represents the density that should be taken for the cluster, so that the
points belong to the same cluster. This is presented in the form of a dendrogram.
In this case, there is no need to carefully adjust the appropriate parameter, and the
result is a hierarchical result [5]. However, the parameter is specified in the algorithm
as the maximum radius considered. Ideally, it can be set very large, but this leads to
exorbitant computational costs.
OPTICS density algorithm [8] also allows you to select a hierarchical structure and
clusters of complex shape. The data is ordered into a spanning tree so that the spatial-
ly close elements are located nearby. In this case, the hierarchy is represented in the
form of a reachability diagram, on which the reachability distances for the constructed
sequence of elements are marked. The peaks in the diagram correspond to the divi-
sions between the clusters, and their height to the distance. If necessary, a dendrogram
can be easily constructed from a reachability diagram. Since for each element only
adjacent elements in a limited radius ε are considered, the OPTICS algorithm can be
implemented with computational complexity, which is not sufficient for processing
large data arrays.
DBSCAN requires two parameters, the optimal values of which are difficult to de-
termine. Therefore, an OPTICS algorithm was proposed in [8], which makes it possi-
ble to order the initial set and simplify the clustering process. In accordance with it, a
reachability diagram is constructed, thanks to which, with a fixed MinPts value, it is
possible to process not only the specified value e, but also all e *