<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Forecasting via Distributed Density-Based Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Corizzo</string-name>
          <email>roberto.corizzo@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianvito Pio</string-name>
          <email>gianvito.pio@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Ceci</string-name>
          <email>michelangelo.ceci@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donato Malerba</string-name>
          <email>donato.malerba@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CINI - Consorzio Interuniversitario Nazionale per l'Informatica - Bari</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bari Aldo Moro Department of Computer Science - Via Orabona</institution>
          ,
          <addr-line>4 - 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The generation of massive amounts of data in di erent forms (such as activity
logs and measurements produced by sensor networks) increased the need of novel
data mining algorithms which are capable to build accurate models e ciently
and in a distributed fashion. In the recent years, several approaches able to
distribute the workload to several machines have been proposed for the classical
clustering, classi cation and regression tasks. However, they often su er from
strong limitations in their applicability, i.e. they are limited to data organized
in a speci c structure, are able to analyze only low-dimensional data, or su er
from overhead and scalability issues when the number of instances and attributes
increase considerably. For example, to the best of our knowledge, in the literature
there are very few density-based clustering algorithms (mostly inspired by the
well known algorithm DBSCAN) able to work with an arbitrary number of
features and with good scalability performances. Moreover, most of them are
exploitable only for pure clustering purposes. At this respect, in this paper we
propose a novel density-based clustering algorithm, implemented in the Apache
Spark framework, which is able to handle large-scale and high-dimensional data
by exploiting the Locality Sensitive Hashing (LSH). Moreover, the proposed
approach is able to exploit the identi ed clusters, built on historical data, to
forecast the values assumed by a target variable in the future. Therefore, the
proposed method can be adopted also for forecasting purposes.</p>
      <p>The rest of the paper is structured as follows: in Section 2, we brie y
review existing methods to solve the classical density-based clustering task as well
as recently proposed clustering methods which are able to handle large-scale
data in parallel on multiple processors. In Section 3, we propose our distributed
density-based clustering solution, which is able to work on high-dimensional and
large-scale data, and describe our approach to exploit the identi ed clusters for
forecasting purposes. In Section 4, we report the results of our experimental
evaluation, showing that the proposed method is able to obtain accurate predictions
and appears very e cient in dealing massive amounts of high-dimensional data.
Finally, in Section 5 we draw some conclusions and outline some future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background on Density-based Clustering</title>
      <p>
        Density-based clustering was introduced in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] with the algorithm DBSCAN.
This approach is able to identify arbitrarily shaped clusters (not only spherical,
as partitioning-based approaches) without requiring the number of clusters to
be extracted as an input parameter. However, it requires two other parameters,
i.e., eps and minP ts. Since several concepts about density-based algorithms are
common to those adopted in this paper, here we recall some useful basic notions:
{ The neighborhood of an object p is de ned as N (p) = fq j dist(p; q) &lt; epsg;
{ An object p is a core object w.r.t. eps and minP ts if jN (p)j minP ts;
{ An object p is directly density-reachable from an object q if p 2 N (q) and q
is a core object ;
{ An object pw is density-reachable from an object p1 if there exists a chain
of objects p1; p2; : : : ; pw, such that for each pair of objects hpi, pi+1i, pi+1 is
directly density-reachable from pi w.r.t. eps and minP ts;
{ An object p is density-connected to an object q if there is an object o such
that both p and q are density-reachable from o w.r.t. eps and minP ts;
{ A cluster is a non-empty subset of objects, such that for each pair of objects
hp; qi, p is density-connected to q;
{ Non-core objects belonging to at least one cluster are called border objects,
whereas objects not belonging to any cluster are called noise objects.
The algorithm DBSCAN starts with an arbitrary object o and, if it is a core
object, retrieves all the objects which are density-reachable from o w.r.t. eps
and minP ts. This procedure returns a cluster and the algorithm proceeds with
the next unclustered object. This algorithm has been proved to identify accurate
and arbitrary shaped clusters and is almost independent of the order of the
analysis of the objects. Several variants have been proposed in the literature,
aiming at facing di erent limitations of the original DBSCAN (e.g., [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ] for the
estimation of the value of the input parameters) or at extending its applicability,
for example, to streams of data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or to spatio-temporal data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In order to be able to process large datasets, other works focused on the
computational complexity, which is originally O(n2 m) (where n is the number
of objects and m is the number of features), dominated by the computation of the
neighborhood of each node. An example can be found in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], where the authors
proposed an approximated variant of DBSCAN, based on Locality Sensitive
Hashing (LSH) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] which requires O(n m) steps to compute the neighborhood
of the nodes. Still focusing on the possibility to process large datasets, in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
a parallel implementation of DBSCAN for MapReduce has been proposed. This
method is based on the partitioning of objects according to each dimension, thus
it su ers from a computational viewpoint with high-dimensional data. For this
reason, the authors limit their experiments to 2D datasets. The same limitation
a ects other existing methods implemented in the Apache Spark framework [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In this context, the algorithm proposed here attacks the issues raised by
high-dimensional and large-scale datasets through an approach which is
computationally e cient and distributed in all its steps and that, inspired by works on
predictive clustering [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], can be exploited for forecasting purposes.
      </p>
    </sec>
    <sec id="sec-3">
      <title>The Proposed Method</title>
      <p>In this Section, we describe our distributed density-based clustering method,
which is able to work with high-dimensional and large-scale datasets and that can
be exploited for forecasting purposes. In Figure 1 we show the general work ow
of the proposed approach, while in the following subsections we describe in detail
each step of the method, brie y analyzing their computational complexity and
showing that they can be easily parallelized in the Apache Spark framework.
3.1</p>
      <sec id="sec-3-1">
        <title>Computation of the Neighborhood Graph and Core Objects</title>
        <p>
          Given a dataset D of n objects described by m features (m 1 descriptive
attributes and one target attribute), the rst goal is to identify groups of similar
objects according to their features. Most clustering algorithms, including
densitybased algorithms, strongly rely on the computation of similarity/distance
measures among pairs of objects. This task has an inherent cost of O(n2 m), which
can be reduced signi cantly only by resorting to approximated strategies. In this
paper, inspired by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we adopt the Locality Sensitive Hashing (LSH), which
is able to identify an approximation of the neighborhood graph (where nodes
are linked when their similarity is greater than 1-eps) among the instances in
O(n m), preserving the neighborhood relationships. In particular, LSH adopts
a hash function which maps similar objects (according to cosine similarity) to
the same bucket with high probability. Objects falling in the same bucket of a
given object are considered as belonging to its neighborhood. Here, we exploit a
parallel implementation of LSH available for the Apache Spark framework1.
        </p>
        <p>Once the neighborhood graph has been computed, we identify the set of core
objects, according to the de nitions provided in the previous section, that is, we
identify the set cores = fo s.t. jN (o)j minP tsg. Let E be the set of edges
of the neighborhood graph identi ed by LSH, represented as a distributed data
structure of pairs of nodes. This step can be implemented as an aggregateByKey
operation (to nd the set of neighbors for each node) followed by a lter
operation (to select only those nodes o such that jN (o)j minP ts). Therefore,
this step is fully parallelizable and, since each edge of the neighborhood graph
is analyzed once, its computational complexity is O(jEj).</p>
        <sec id="sec-3-1-1">
          <title>1 https://github.com/soundcloud/cosine-lsh-join-spark</title>
          <p>Fig. 1. Work ow of the proposed method.
3.2</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Distributed Density-based Clustering</title>
        <p>Given the set of core nodes and the neighborhood graph, our density-based
clustering algorithm exploits the property of density-connection, described in
Section 2, to simultaneously identify multiple density-based clusters. It is
noteworthy that in this phase we do not need node attributes, therefore we can
distribute over the available machines only the neighborhood graph, making the
proposed algorithm e cient also in terms of memory requirements. An intuitive
pseudo-code description of our clustering method is shown in Algorithm 1.</p>
        <p>In detail, the algorithm initially assigns a di erent cluster ID to each core
node, and the I D = 0 to boundary and noise nodes (lines 1-10). This step can
be implemented in a distributed fashion by a single map operation over the set
of nodes, therefore its complexity is O(n). Subsequently, each core node sends
a message containing its cluster ID to its neighbouring nodes having a lower
cluster ID (lines 15-27). Each node receives multiple messages from its adjacent
Algorithm 1 Density-based clustering exploiting the neighborhood graph.
Require:</p>
        <p>G = hV; Ei: the neighborhood graph, where V are the nodes and E are the edges representing
the neighborhood relationship between pairs of nodes;
cores: set of core nodes from which to start the propagation of cluster IDs;
labelChangeRate: minimum percentage of label changes to perform a new iteration.
Ensure:</p>
        <p>the updated graph G, where nodes V are associated to their cluster ID.
fEach node receives multiple messages from (core, clustered) neighboring nodesg
for all e = hsrc; dsti 2 E do
if src 2 cores and src:clusterID &gt; dst:clusterID then
dst:messages dst:messages [ fsrc:clusterIDg
changes changes + 1
end if
end for
nodes and aggregates them by taking the highest cluster ID (lines 29-32)2. We
repeat this process until the cluster assignments appear stable, i.e. when the
messages exchanged among nodes through the edges of the neighborhood graph
are less then a given percentage labelChangeRate of the whole set of edges. This
process is coherent with the general concepts at the basis of classical
densitybased clustering approaches, since clusters can only be built from core nodes and
since the propagation of cluster IDs is performed only from core nodes.</p>
        <p>
          This iterative process can be easily parallelized, since each edge of the graph
can be analyzed independently of the other to evaluate whether a message has
to be sent. Moreover, the nal aggregation step of the received messages can be
performed by a single reduce operation. Therefore, the computational cost of this
phase is dominated by O(u jEj), where u is the number of performed iterations.
We also emphasize the fact that, contrary to existing methods (e.g., [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) our
approach does not require a nal merging procedure necessary to aggregate the
results computed on di erent machines, which is usually executed on a single
driver machine.
3.3
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Exploiting Clusters for Forecasting Purposes</title>
        <p>
          Inspired by predictive clustering approaches [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we exploit the k identi ed
clusters for forecasting purposes (see Figure 2). In particular, we re-associate
each node to its features through a join operation and compute a representative
m-dimensional vector (including the target attribute3) for each cluster, by
performing a column-wise average of the vectors associated to objects falling in the
cluster. This operation can be easily parallelized by aggregating the clustering
result over the cluster IDs and by performing a single map operation to obtain
the representative vectors for each cluster. Since each object can fall in at most
one cluster, the complexity of this operation is O(n m). The prediction of the
value of the target attribute for unlabeled instances can then be performed by
2 This is only an implementation choice. Indeed, we could keep the lowest cluster ID
without any change in the results, since the density-connection is symmetric.
3 Here we assume that the target attribute is numerical. However, we can handle also
categorical attributes by adopting a strategy based on majority voting.
        </p>
        <p>Fig. 2. Outline of the forecasting step.
comparing their (m 1)-dimensional feature vector (excluding the target
attribute, since it is unknown for unlabeled instances) to all the representative
vectors associated to the clusters, in order to nd the closest cluster. The value
assumed by the target attribute of the closest representative vector is nally
associated to the unlabeled instance. Also this step can be easily parallelized
through a map operation over the set of unlabeled instances which compares
their vector to all the k representative vectors. Therefore, the computational
complexity of the prediction phase is O(k m) for each unlabeled instance.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        We performed the experiments by adopting the self-adaptive online training
strategy, where data are represented as a time-based sliding window [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] of size
d (past d days of historical data). We considered di erent values of d (30; 60 and
90), and set the forecasting horizon as one-day-ahead. Each con guration was
run ve times, with di erent random picks of test days. Results are evaluated in
terms of the average Root Mean Square Error (RMSE).
      </p>
      <p>The considered approaches, each run in its best con guration, are:</p>
      <p>Our method, optimized by means of a grid search on a separate set of days,
in order to identify the best values of the parameters eps and minP ts;</p>
      <p>K-Means algorithm, implemented in the Apache Spark framework, extended
with our forecasting strategy. A grid search has been performed using a separate
set of days, in order to identify the best value of the parameter k;</p>
      <p>
        ARIMA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a pure forecasting method which automatically selects the best
forecasting model, on the basis of the Akaike Information Criterion (AIC);
      </p>
      <p>AVG, a baseline approach which predicts the value of the target attribute as
the average value assumed over the training instances.</p>
      <p>The considered datasets are described in the following:</p>
      <p>PVItaly. This dataset contains data about the energy production of
photovoltaic power plants, aggregated hourly, collected at regular intervals by sensors
located on 17 plants in Italy, from January 1st 2012 to May 4th 2014.</p>
      <p>PVNREL. This dataset originally consists of simulated photovoltaic data
of 6,000 plants, aggregated hourly, for the year 2006. The experiments were
performed on a reduced version of the dataset, consisting of 48 plants belonging
to the 16 States with the highest Global Horizontal Irradiation.</p>
      <p>
        Bike Sharing (BS). This dataset4 contains data of the Capital bikeshare
system, aggregated hourly, about the process of rental and returning of bikes,
from/to possibly di erent positions, collected in 2011 and 2012 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Descriptive attributes of PVItaly and PVNREL regard environmental and
meteorological factors, whereas the target attribute is the hourly energy production
(pre-processing steps applied to these datasets can be found in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). Descriptive
attributes of Bike Sharing regard weather and seasonal information, whereas the
target attribute is the hourly count of rented bikes. See Table 1 for additional
information about the dataset characteristics and the considered parameters.
      </p>
      <sec id="sec-4-1">
        <title>4 https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset</title>
        <p>In the results shown in Table 2, we can observe that the simple baseline approach
AVG actually appears comparable to ARIMA in the datasets PVNREL and BS.
However, they are strongly outperformed by our method and the extended
version of K-Means. Our density-based method leads to the best results in PVItaly
and PVNREL, whereas it appears comparable to K-Means in BS. Moreover,
it can be observed that, on PVNREL and BS, the proposed method is able to
exploit the possible availability of more instances in the training phase. At this
respect, in order to evaluate whether we can handle a massive amount of data,
without incurring in time complexity issues, we performed a scalability test with
the large version of the dataset PVNREL, on a cluster of 4 worker machines
and one driver machine, each with 4 cores and 32GB of RAM, by progressively
increasing the number of edges in the neighborhood graph up to 130 millions.
In Figure 3, we can observe that i) the proposed algorithm scales linearly with
respect to the number of edges, therefore it is able to deal massive amounts of
data, and ii) the possible overhead introduced by the distribution of data to
(and by the collection of the results from) di erent machines appears negligible.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, we proposed a distributed density-based clustering approach,
which can be fruitfully exploited for forecasting purposes. The algorithm,
implemented in the Apache Spark framework, is fully parallel, does not require any
nal merging procedure and is computationally e cient. Experimental results
show that it is able to obtain good predictive performance and that it scales
linearly with respect to the number of edges in the neighborhood graph. For future
work, we intend to extend the forecasting capabilities to the multi-target setting
and to perform more extensive comparisons with existing parallel solutions.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to acknowledge the support of the European Commission through
the projects MAESTRA - Learning from Massive, Incompletely annotated, and
Structured Data (Grant Number ICT-2013-612944) and TOREADOR -
Trustworthy Model-aware Analytics Data Platform (Grant Number H2020-688797).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aggarwal</surname>
          </string-name>
          , C.C.,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          .,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.S.:</given-names>
          </string-name>
          <article-title>A framework for clustering evolving data streams</article-title>
          .
          <source>In: Proc. of VLDB -</source>
          Volume
          <volume>29</volume>
          . pp.
          <volume>81</volume>
          {
          <fpage>92</fpage>
          .
          <string-name>
            <given-names>VLDB</given-names>
            <surname>Endowment</surname>
          </string-name>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ankerst</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breunig</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>J.: OPTICS</given-names>
          </string-name>
          :
          <article-title>Ordering Points to Identify the Clustering Structure</article-title>
          .
          <source>In: ACM SIGMOD '99</source>
          . pp.
          <volume>49</volume>
          {
          <issue>60</issue>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Birant</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kut</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <string-name>
            <surname>ST-DBSCAN</surname>
          </string-name>
          :
          <article-title>An algorithm for clustering spatial{temporal data</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          <volume>60</volume>
          (
          <issue>1</issue>
          ),
          <volume>208</volume>
          {
          <fpage>221</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Box</surname>
            ,
            <given-names>G.E.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jenkins</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Time Series Analysis, Forecasting and Control</article-title>
          . HoldenDay,
          <string-name>
            <surname>Incorporated</surname>
          </string-name>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>R.J.G.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moulavi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
          </string-name>
          , J.:
          <article-title>Hierarchical density estimates for data clustering, visualization, and outlier detection</article-title>
          .
          <source>ACM Trans. Knowl. Discov. Data</source>
          <volume>10</volume>
          (
          <issue>1</issue>
          ), 5:
          <issue>1</issue>
          {5:
          <fpage>51</fpage>
          (Jul
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ceci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corizzo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fumarola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malerba</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rashkovska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Predictive modeling of pv energy production: How to set up the learning task for a better prediction? IEEE Transactions on Industrial Informatics PP(</article-title>
          <year>99</year>
          ),
          <volume>1</volume>
          {
          <issue>1</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Charikar</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          :
          <article-title>Similarity estimation techniques from rounding algorithms</article-title>
          .
          <source>In: Proc. of ACM symposium on Theory of computing</source>
          . pp.
          <volume>380</volume>
          {
          <fpage>388</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cordova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moh</surname>
            ,
            <given-names>T.S.:</given-names>
          </string-name>
          <article-title>Dbscan on resilient distributed datasets</article-title>
          .
          <source>In: High Performance Computing &amp; Simulation (HPCS)</source>
          . pp.
          <volume>531</volume>
          {
          <fpage>540</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          , et al.:
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>In: Kdd</source>
          . vol.
          <volume>96</volume>
          (
          <issue>34</issue>
          ), pp.
          <volume>226</volume>
          {
          <issue>231</issue>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fanaee-T</surname>
          </string-name>
          , H.,
          <string-name>
            <surname>Gama</surname>
          </string-name>
          , J.:
          <article-title>Event labeling combining ensemble detectors and background knowledge</article-title>
          .
          <source>Progress in Arti cial Intelligence</source>
          pp.
          <volume>1</volume>
          {
          <issue>15</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gama</surname>
          </string-name>
          , J.:
          <article-title>Knowledge Discovery from Data Streams. Chapman and Hall / CRC Data Mining and Knowledge Discovery Series</article-title>
          , CRC Press (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , J.:
          <article-title>MR-DBSCAN: an e cient parallel density-based clustering algorithm using MapReduce</article-title>
          .
          <source>In: Parallel and Distributed Systems (ICPADS)</source>
          . pp.
          <volume>473</volume>
          {
          <fpage>480</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Stojanova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Appice</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Dzeroski, S.:
          <article-title>Network regression with predictive clustering trees</article-title>
          .
          <source>Data Min. Knowl. Discov</source>
          .
          <volume>25</volume>
          (
          <issue>2</issue>
          ),
          <volume>378</volume>
          {
          <fpage>413</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.J.:</given-names>
          </string-name>
          <article-title>A linear DBSCAN algorithm based on LSH</article-title>
          .
          <source>In: Int. Conf. on Machine Learning and Cybernetics</source>
          . vol.
          <volume>5</volume>
          , pp.
          <volume>2608</volume>
          {
          <issue>2614</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>