<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Safe­Zones for Monitoring Distributed Streams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Ben­David Technion</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Assaf Schuster Technion</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Daniel Keren Haifa University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In many emerging applications, the data which has to be monitored is of very high volume, dynamic, and distributed, making it infeasible to collect the distinct data streams to a central node and process them there. Often, the monitoring problem consists of determining whether the value of a global function, which depends on the union of all streams, crossed a certain threshold. A great deal of effort is directed at reducing communication overhead by transforming the monitoring of the global function to the testing of local constraints, checked independently at the nodes. Recently, geometric monitoring (GM) proved to be very useful for constructing such local constraints for general (non-linear, non-monotonic) functions. Alas, in all current variants of geometric monitoring, the constraints at all nodes share an identical structure and are, thus, unsuitable for handling heterogeneous streams, which obey different distributions at the distinct nodes. To remedy this, we propose a general approach for geometric monitoring of heterogeneous streams (HGM), which defines constraints tailored to fit the distinct data distributions at the nodes. While optimally selecting the constraints is an NP-hard problem, we provide a practical solution, which seeks to reduce running time by hierarchically clustering nodes with similar data distributions and then solving more, but simpler, optimization problems. Experiments are provided to support the validity of the proposed approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        For a few years now, processing and monitoring of
distributed streams has been emerging as a major effort in
data management, with dedicated systems being developed
for the task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This paper deals with threshold queries
over distributed streams, which are defined as “retrieve all
items x for which f (x) ≤ T ”, where f () is a scoring
function and T some threshold. Such queries are the building
block for many algorithms, such as top-k queries, anomaly
detection, and system monitoring. They are also applied in
important data processing and data mining tools, including
feature selection, decision tree construction, association rule
mining, and computing correlations. Another important
application is data classification, which is often also achieved
by thresholding a function, such as the output of a neural
net or support vector machine.
      </p>
      <p>
        Geometric monitoring (GM) [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ] has been recently
proposed for handling such threshold queries over distributed
data. While a more detailed presentation is deferred until
Section 2, we note that GM can be applied to the
important case of scoring functions f () evaluated at the average of
dynamic data vectors v1(t), . . . , vn(t), maintained at n
distributed nodes. Here, vi(t) is an m-dimensional data
vector, often denoted as local vector, at the i-th node Ni at
time t (often t will be suppressed). In a nutshell, each node
monitors a convex subset, often referred to as the node’s
safe-zone, of the domain of these data vectors, as opposed
to their range. What is guaranteed in GM is that the global
function f () will not cross its specified threshold as long as
all data vectors lie within their corresponding safe-zones.
Thus, each node remains silent as long as its data vector
lies within its safe zone. Otherwise, in case of a safe-zone
breach, communication needs to take place in order to check
if the function has truly crossed the given threshold.
      </p>
      <p>
        The geometric technique can support any scoring function
f (), evaluated at the average of the dynamic data vectors.
Thus, f () is not assumed to obey some simple property (e.g.,
linearity or monotonicity). To add to the generality of the
technique [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ], the vi vector can contain not only the raw
data, but any function (i.e., norm, logarithm, power,
variance, etc) computed over the data of Ni. Thus, GM allows
the monitoring of functions that are far more complex and
general than simple aggregates.
      </p>
      <p>A crucial component for reducing the communication
required by the geometric method is the design of the safe-zone
in each node. Nodes remain silent as long as their local
vectors remain within their safe-zone. Thus, good safe-zones
increase the probability that nodes will remain silent, while
also guaranteeing correctness: a global threshold violation
cannot occur unless at least one node’s local vector lies
outside the corresponding node’s safe-zone.</p>
      <p>However, prior work on geometric monitoring has failed to
take into account the nature of heterogeneous data streams,
in which the data distribution of the local vectors at
different nodes may vary significantly. This has led to a uniform
treatment of all nodes, independently of their
characteristics, and the assignment of identical safe-zones (i.e., of the
same shape and size) to all nodes.</p>
      <p>As we demonstrate in this paper, designing safe-zones that
take into account the data distribution of nodes can lead
to efficiently monitoring threshold queries at a fraction
(requiring an order of magnitude fewer messages) of what prior
techniques achieve. However, designing different safe-zones
for the nodes is by no means an easy task. In fact, the
problem is NP-hard (proof omitted due to lack of space). We
thus propose a more practical solution that hierarchically
clusters nodes, based on the similarity of their data
distributions, and then seeks to solve many small (and easier)
optimization problems.</p>
      <p>The contributions of this work are:
• We formulate a far more general safe-zone assignment
problem than those which were treated so far. Instead of
constructing one safe-zone which is common to all nodes,
we seek to fit each node with a safe-zone that suits its
data distribution.
• We present a practical solution, which uses hierarchical
clustering of the nodes to construct the safe-zones, while
applying various geometric and computational tools.
• The resulting safe-zones were tested on real data, where
we demonstrate that: (i) the hierarchical clustering
approach dramatically reduces the running time of the
safezone assignment process, (ii) our techniques may result
in one order of magnitude (or even larger) improvements
in communication over previous geometric monitoring
methods, even for a small number of nodes.</p>
      <p>Outline. We survey prior work on the geometric approach
in Section 2. In Section 3 we formulate our optimization
problem, which involves the design of safe-zones at the nodes.
Section 4 presents our algorithmic framework. Experiments
are resented in Section 5. Lastly, conclusions are offered.</p>
      <p>Hereafter we denote our proposed method for geometric
monitoring of heterogeneous streams as HGM, in contrast
to prior work on geometric monitoring that is denoted GM.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Space limitations allow us to only survey previous work
on geometric monitoring (GM). We now describe some basic
ideas and concepts of the GM technique, which was
introduced and applied to monitor distributed data streams in [
        <xref ref-type="bibr" rid="ref2 ref7">2,
7</xref>
        ].
      </p>
      <p>
        As described in Section 1, each node Ni maintains a local
vector vi, while the monitoring function f () is evaluated
at the average v of the vi vectors. Before the monitoring
process, each node Ni is assigned a subset of the data space,
denoted as Si – its safe-zone – such that, as long as the local
vectors are inside their respective safe-zones, it is guaranteed
that the global function’s value did not cross the threshold;
thus the node remains silent as long as its local vector vi is
inside Si. If vi ∈/ Si (local violation), a violation recovery
(“balancing”) algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] can be applied.
      </p>
      <p>
        For details and scope of GM see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and the survey in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Recently, GM was successfully applied to detecting outliers
in sensor networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], extended to prediction-based
monitoring [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and applied to other monitoring problems [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10,
11</xref>
        ].
      </p>
      <p>Basic definitions relating to GM. A basic construct is
the admissible region, defined by A , {v|f (v) ≤ T }. Since
the value we wish to monitor is f v1+.n..+vn , any viable
assignment of safe-zones must satisfy
n
^ (vi ∈ Si) → v = (v1 + ... + vn)/n ∈ A. This guarantees
i=1
that as long as all nodes are silent, the average of the vi
vectors remains in A and, therefore, the function has not
crossed the threshold. The question is, of course, how to
determine the safe-zone Si of each node Ni; in a sense to be
made precise in Section 3, it is desirable for the safe-zones
to be as large as possible.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] it was proved that all existing variants of GM share
the following property: each of them defines some convex
subset C of A (different methods induce different C’s), such
that each safe-zone Si is a translation of C – that is, there
exist vectors ui (1 ≤ i ≤ n) such that Si = {ui + c|c ∈ C}
n
and X ui = 0. This observation unifies the distinct variants
i=1
n
of GM, and also allows to easily see why ^ (vi ∈ Si) implies
i=1
that v ∈ C ⊆ A – it follows immediately from the fact that
convex subsets are closed under taking averages and from
the fact that the ui vectors sum to zero.
      </p>
      <p>Here we assume that C is given; it can be provided by
any of the abovementioned methods (obviously, if A itself
is convex, we just choose C = A). We propose to extend
previous work in a more general direction. Our goal here is
to handle a basic problem which haunts all the existing GM
variants: the shapes of the safe-zones at different nodes are
identical. Thus, if the data is heterogeneous across the
distinct streams (an example is depicted in Figure 1), meaning
that the data at different nodes obeys different distributions,
existing GM algorithms will perform poorly, causing many
local violations that do not correspond to global threshold
crossing (“false alarms”).</p>
      <p>C
S1
N1</p>
      <p>N2</p>
      <p>S2</p>
      <p>In this paper, we present a more general approach that
allows to assign differently shaped safe-zones to different nodes.
Our approach requires tackling a difficult optimization
problem, for which practical solutions need to be devised.</p>
    </sec>
    <sec id="sec-3">
      <title>3. SAFE­ZONE DESIGN AS AN OPTIMIZA­</title>
    </sec>
    <sec id="sec-4">
      <title>TION PROBLEM</title>
      <p>We now seek to formulate an optimization problem, whose
solution defines the safe-zones at all nodes. The safe-zones
should satisfy the following properties:
Correctness: If Si denotes the safe-zone at node Ni, we
mnust have: S2
ie^=v1er(yvi t∈hrSeis)ho→ld (cvr1os+sin..g. +byvnf)(/vn) w∈illCr.esTulhtisinenasusarefes-ztohnaet N1 S1 N2
breach in at least one node.</p>
      <p>
        Expansiveness: Every safe-zone breach (local violation) S1  S2
triggers communication, so the safe-zones should be as “large”
as possible. We measure the “size” of a safe-zone Si by its
probability volume, defined as SRi pi(v)dv where pi is the pdf 2 C
of the data at node Ni. Probabilistic models have proved
useful in predicting missing and future stream values in var- Figure 2: Schematic example of HGM safe-zone
asious monitoring and processing tasks [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ], including signment for two nodes, which also demonstrates the
previous geometric methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and their incorporation in advantage over previous work. The convex set C
our algorithms proved useful in monitoring real data (Sec- is a square, and the pdf at the left (right) node is
tion 5). To handle these two requirements, we formulate a uniform over a rectangle elongated along the
horconstrained optimization problem as follows: izontal (vertical) direction. HGM can handle this
Given: (1) probability distribution functions p1, . . . , pn at n nodes case by assigning the two rectangles S1, S2 as
safe(2) A convex subset C of the admissible region A zones, which satisfies the correctness requirement
(since their Minkowski average is equal to C). GM
MaximizeSR1p1dv1 · ... ·SRnpndvn (expansiveness) (Figure 1) will perform poorly in this case.
      </p>
      <p>Subject to S1⊕···⊕Sn ⊆ C (correctness)</p>
      <p>
        n
where S1⊕·n··⊕Sn = v1+·n··+vn |v1 ∈ S1, . . . , vn ∈ Sn , or the
Minkowski sum [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] of S1, . . . , Sn, in which every element
is divided by n (the Minkowski average). Introducing the
Minkowski average is necessary in order to guarantee
correctness, since vi must be able to range over the entire
safezone Si. Note that instead of using the constraint S1⊕...⊕Sn ⊆
      </p>
      <p>n
A, we use S1⊕...⊕Sn ⊆ C. This preserves correctness, since</p>
      <p>n
C ⊆ A. The reason we chose to use C is that typically
it’s much easier to check the constraint for the Minkowski
average containment in a convex set; this is discussed in
Section 4.4.</p>
      <p>To derive the target function R p1dv1 · ... · R pndvn, which</p>
      <p>
        S1 Sn
estimates the probability that the local vectors of all nodes
will remain in their safe-zones, we assumed that the data is
not correlated between nodes (hence we multiply the
individual probabilities), as it was the case in the experiments
in Section 5 (see also [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the discussion therein). If
the data is correlated, the algorithm is essentially the same,
with the expression for the probability that data at some
node breaches its safe-zone modified accordingly.
      </p>
      <p>Note that correctness and expansiveness have to reach a
“compromise”: figuratively speaking, the correctness
constraint restricts the size of the safe-zones, while the
probability volume increases as the safe-zones become larger. This
trade-off is central in the solution of the optimization
problem.</p>
      <p>The advantage of the resulting safe-zones is demonstrated
by a schematic example (Figure 2), in which C and the
stream pdfs are identical to those in Figure 1. In HGM,
however, the individual safe-zones can be shaped very differently
from C, allowing a much better coverage of the pdfs, while
adhering to the correctness constraint. Intuitively speaking,
nodes can trade “geometric slack” between them; here S1
trades “vertical slack” for “horizontal slack”.</p>
    </sec>
    <sec id="sec-5">
      <title>CONSTRUCTING THE SAFE­ZONES</title>
      <p>
        We now briefly describe the overall operation of the
distributed nodes. The computation of the safe-zones is
initially performed by a coordinator node, using a process
described in this section. This process is performed
infrequently, since there is no need to change the safe-zones of
a node unless a global threshold violation occurs. As
described in Section 3, the input to the algorithm is: (1) The
probability distribution functions p1, . . . , pn at the n nodes.
These pdfs can be of any kind (e.g., Gaussian [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], random
walk [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], uniform, etc). (2) A convex subset C of the
admissible region A.
      </p>
      <p>Given this input, the coordinator applies the algorithm
described in Sections 4.1 to 4.5 to compute S1...Sn which
solve the optimization problem defined in Section 3. Then,
node Nk is assigned Sk.
4.1</p>
    </sec>
    <sec id="sec-6">
      <title>Solving the Optimization Problem</title>
      <p>In order to efficiently solve our optimization problem, we
need to answer several questions:
• What kinds of shapes to consider for candidate
safezones? This is discussed in Section 4.2.
• The target function is defined as the product of integrals
of the respective pdfs on the candidate safe-zones. Given
candidate safe-zones, how do we efficiently compute the
target function? This is discussed in Section 4.3.
• Given candidate safe-zones, how do we efficiently test if
their Minkowski average lies in C? This is discussed in
Section 4.4.
• As we will point out, the number of variables to optimize
over is very large, with this number increasing with the
number of nodes. It is well-known that the computational
cost of general optimization routines increases at a
superlinear rate with the number of variables. To remedy this
issue, we propose in Section 4.5 a hierarchical clustering
approach, which uses a divide-and-conquer algorithm to
reduce the problem to that of recursively computing
safezones for small numbers of nodes.
4.2</p>
    </sec>
    <sec id="sec-7">
      <title>Shape of Safe­Zones to Consider</title>
      <p>The first step in solving an optimization problem is
determining the parameters to optimize over. Here, the space
of parameters is huge – all subsets of the Euclidean space
are candidates for safe-zones. For one-dimensional (scalar)
data, intervals provide a reasonable choice for safe-zones,
but for higher dimensions no clear candidate exists.</p>
      <p>To achieve a practical solution, we choose the safe-zones
from a parametric family of shapes, denoted by S. This
family of shapes should satisfy the following requirements:
• It should be broad enough so that its members can
reasonably approximate every subset which is a viable
candidate for a safe-zone.
• The members of S should have a relatively simple shape.</p>
      <p>
        In practice, this means that they are polytopes with a
restricted number of vertices, or can be defined by a small
number of implicit equations (e.g., polynomials [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]).
• It should not be too difficult to compute the integral of
the various pdfs over members of S (Section 4.3).
• It should not be too difficult to compute, or bound, the
      </p>
      <p>
        Minkowski average of members of S (Section 4.4).
The last two conditions allow efficient optimization. If
computing the integrals of the pdf or the Minkowski average are
time consuming, the optimization process may be lengthy.
We thus considered and applied in our algorithms various
polytopes (such as triangles, boxes, or more general
polytope) as safe-zones; this yielded good results in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The choices of S applied here have provided good results
in terms of safe-zone simplicity and effectiveness. However,
the challenge of choosing the best shape for arbitrary
functions and data distributions is quite formidable, and we plan
to continue studying it in the future.
4.3</p>
    </sec>
    <sec id="sec-8">
      <title>Computing the Target Function</title>
      <p>The target function is defined as the product of integrals
of the respective pdfs on the candidate safe-zones.
Typically, data is provided as discrete samples. The integral can
be computed by first approximating the discrete samples by
a continuous pdf, and then integrating it over the safe-zone.
We used this approach, fitting a GMM (Gaussian Mixture
Model) to the discrete data and integrating it over the
safezones, which were defined as polytopes. To accelerate the
computation of the integral, we used Green’s Theorem to
reduce a double integral to a one-dimensional integral over the
polygon’s boundary, for the two-dimensional data sets in the
experiments. For higher dimensions, the integral can also be
reduced to integrals of lower dimensions, or computed using
Monte-Carlo methods.
4.4</p>
    </sec>
    <sec id="sec-9">
      <title>Checking the Constraints</title>
      <p>
        A simple method to test the Minkowski sum constraint
relies on the following result [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]:
      </p>
      <p>Lemma 1. If P and Q are convex polytopes with vertices
{Pi}, {Qj}, then P ⊕ Q is equal to the convex hull of the set
{Pi + Qj}.</p>
      <p>Now, assume we wish to test whether the Minkowski average
of P and Q is contained in C. Since C is convex, it contains
the convex hull of every of its subsets; hence it suffices to
test whether the points (Pi + Qj)/2 are in C, for all i, j.
If not all points are inside C, then the constraint violation
can be measured by the maximal distance of a point (Pi +
Qj)/2 from C’s boundary. The method easily generalizes to
more polytopes: for three polytopes it is required to test the
average of all triplets of vertices, etc.
4.5</p>
    </sec>
    <sec id="sec-10">
      <title>Hierarchical Clustering</title>
      <p>
        While the algorithms presented in Sections 4.3-4.4 reduce
the running time for computing the safe-zones, our
optimization problem still poses a formidable difficulty. For
example, fitting octagonal safe-zones [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to 100 nodes with
two-dimensional data requires to optimize over 1,600
variables (800 vertices in total, each having two coordinates),
which is quite high. To alleviate this problem, we first
organize the nodes in a hierarchical structure, which allows us to
then solve the problem recursively (top-down) by reducing
it to sub-problems, each containing a much smaller number
of nodes.
      </p>
      <p>
        We first perform a bottom-up hierarchical clustering of
the nodes. To achieve this, a distance measure between
nodes needs to be defined. Since a node is represented by
its data vectors, a distance measure should be defined
between subsets of the Euclidean space. We apply the method
in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which defines the distance between sets by the L2
distance between their moment vectors (vectors whose
coordinates are low-order moments of the set). The moments
have to be computed only once, in the initialization stage.
The leaves of the cluster tree are individual nodes, and the
inner vertices can be thought of as “super nodes”, each
containing the union (Minkowski average) of the data of nodes
in the respective sub-tree. Since the moments of a union of
sets are simply the sum of the individual sets’ moments, the
computation of the moment for the inner nodes is very fast.
      </p>
      <p>After the hierarchical clustering is completed, the
safezones are assigned top-down: first, the children of the root
are assigned safe-zones under the constraint that their
Minkowski average is contained in C. In the next level,
the grandchildren of the root are assigned safe-zones under
the constraint that their Minkowski average is contained in
their parent nodes’ safe-zones, etc. The leaves are either
individual nodes, or clusters which are uniform enough and
can all be assigned safe-zones with identical shapes.
5.</p>
    </sec>
    <sec id="sec-11">
      <title>EXPERIMENTS</title>
      <p>
        HGM was implemented and compared with the GM method,
as described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is the most recent variant of
previous work on geometric monitoring that we know of. We
are not aware of other algorithms which can be applied to
monitor the functions treated here (the ratio queries in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
deal with accumulative ratios and not instantaneous ones as
in our experiments).
5.1
      </p>
    </sec>
    <sec id="sec-12">
      <title>Data, Setup and Monitored Functions</title>
      <p>5.1.1</p>
      <sec id="sec-12-1">
        <title>Data and Monitored Functions</title>
        <p>
          Our data consists of air pollutant measurements taken
from “AirBase – The European Air Quality Database” [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ],
measured in micrograms per cubic meter. Nodes correspond
to sensors at different geographical locations. The data at
different nodes greatly varies in size and shape and is
irregular as a function of time. The monitored functions
were chosen due to their practical importance, and also as
they are non-linear and non-monotonic and, thus, cannot
be handled by most existing methods. In Section 5.2
results are presented for monitoring the ratio of NO to NO2,
which is known to be an important indicator in air
quality analysis [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. An example of monitoring a quadratic
function in three variables is also presented (Section 5.3);
quadratic functions are important in numerous applications
(e.g., the variance is a quadratic function in the variables,
and a normal distribution is the exponent of a quadratic
function, hence thresholding it is equivalent to thresholding
the quadratic).
5.1.2
        </p>
      </sec>
      <sec id="sec-12-2">
        <title>Choosing the Family of Safe­Zones</title>
        <p>
          To solve the optimization problem, it is necessary to define
a parametric family of shapes S from which the safe-zones
will be chosen. Section 4.2 discusses the properties this
family should satisfy. In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the suitability of some families of
polytopes is studied for the simpler, but related, problem of
finding a safe-zone common to all nodes. The motivations
for choosing S here were:
• Ratio queries (Section 5.2) – the triangular safe-zones
(Figure 3) have the same structure, but not size or
location, as C, and are very simple to define and apply.
• Quadratic function (Section 5.3) – here we allowed
general polytopes, and tested the results for increasing
numbers of vertices. The model selected was with 12 vertices,
in which the target function to optimize was “saturated”
(i.e. adding more vertices increased the value by less than
0.1%).
5.1.3
        </p>
      </sec>
      <sec id="sec-12-3">
        <title>Optimization Parameters and Tools</title>
        <p>
          The triangular safe-zones (Section 5.2) have two degrees of
freedom each (Mi and βi, see Figure 3), hence for n nodes
we have 2n parameters to optimize over. The safe-zones
in Section 5.3 require 36 parameters each. In all cases we
used the Matlab routine fmincon to solve the optimization
problem [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. To compute the integral of the pdf on the
safe-zones, data was approximated by a Gaussian Mixture
Model (GMM), using a Matlab routine [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
5.2
        </p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>Ratio Queries</title>
      <p>
        This set of experiments concerned monitoring the ratio
between two pollutants, NO and NO2, measured in distinct
sensors. Each of the n nodes holds a vector (xi, yi) (the two
concentrations), and the monitored function is PP xyii (in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
ratio is monitored but over aggregates over time, while here
we monitor the instantaneous ratio for the current readings).
An alert must be sent whenever this function is above a
threshold T (taken as 4 in the experiments), and/or when
the NO2 concentration is above 250. The admissible region
A is a triangle, therefore convex, so C = A. The safe-zones
tested were triangles of the form depicted in Figure 3, a
choice motivated by the shape of C. The half-planes method
(Section 4.4) was used to test the constraints. An example
with four nodes, which demonstrates the advantage of
allowing different safe-zones at the distinct nodes, is depicted
in Figure 4.
      </p>
      <p>Improvement Over Previous GM Work. We
compared HGM with GM in terms of the number of produced
local violations. In Figure 5, the number of safe-zone
violations is compared for various numbers of nodes. HGM
results in significantly fewer local violations, even for a small
number of nodes. As the number of nodes increases, the
benefits of HGM over GM increase. For a modest network
size of 10 nodes, HGM requires less than an order of
magnitude fewer messages than GM.
5.3</p>
    </sec>
    <sec id="sec-14">
      <title>Monitoring a Quadratic Function</title>
      <p>Another example consists of monitoring a quadratic
function with more general polyhedral safe-zones in three
variables (Figure 6). The data consists of measurements of three
pollutants (NO, NO2, SO2), and the safe-zones are
polyhedra with 12 vertices. The admissible region A is the ellipsoid
depicted in pink; since it is convex, C = A. As the extent of
the data is far larger than A, the safe-zones surround the
regions in which the data is denser. Here we did not compare
to previous methods.</p>
    </sec>
    <sec id="sec-15">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>An approach for minimizing communication while
monitoring threshold queries over heterogeneous distributed streams
was presented. It is formulated as an optimization problem
of a geometric and probabilistic flavor, whose solution
assigns each node a “safe-zone” with the property that a node
may remain silent as long as its data vector is in its safe-zone.
While the problem is known to be difficult, a practical
solution using a hierarchical clustering algorithm is presented
and implemented for two and three dimensional data,
allowing to achieve substantial improvement over previous work,
while using rather simple safe-zones which also reduce the
computational effort at the nodes.</p>
    </sec>
    <sec id="sec-16">
      <title>ACKNOWLEDGMENT</title>
      <p>This work was partially supported by the European
Commission under ICT-FP7-LIFT-255951 (Local Inference in
Massively Distributed Systems).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          , U. C¸ etintemel, M. Cherniack,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lindner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ryvkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tatbul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xing</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          , “
          <article-title>The design of the borealis stream processing engine</article-title>
          ,” in CIDR,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sharfman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuster</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Keren</surname>
          </string-name>
          , “
          <article-title>A geometric approach to monitoring threshold functions over distributed data streams,” ACM Trans</article-title>
          . Database Syst., vol.
          <volume>32</volume>
          , no.
          <issue>4</issue>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burdakis</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Deligiannakis</surname>
          </string-name>
          , “
          <article-title>Detecting outliers in sensor networks using the geometric approach</article-title>
          ,” in ICDE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Giatrakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deligiannakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Sharfman, and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuster</surname>
          </string-name>
          , “
          <article-title>Prediction-based geometric monitoring over distributed data streams</article-title>
          ,” in SIGMOD,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Keren</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sharfman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Livne</surname>
          </string-name>
          , “
          <article-title>Shape sensitive geometric monitoring</article-title>
          ,
          <source>” IEEE Trans. Knowl. Data Eng.</source>
          , vol.
          <volume>24</volume>
          , no.
          <issue>8</issue>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keren</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Sharfman, and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuster</surname>
          </string-name>
          , “
          <article-title>Distributed threshold querying of general functions by a difference of monotonic representation</article-title>
          ,
          <source>” PVLDB</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>2</issue>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sharfman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuster</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Keren</surname>
          </string-name>
          , “
          <article-title>Shape sensitive geometric monitoring</article-title>
          ,” in PODS,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cormode</surname>
          </string-name>
          , “
          <article-title>Algorithms for continuous distributing monitoring: A survey,</article-title>
          ” in AlMoDEP,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kogan</surname>
          </string-name>
          , “
          <article-title>Feature selection over distributed data streams through optimization</article-title>
          ,” in SDM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Papapetrou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Deligiannakis</surname>
          </string-name>
          , “
          <article-title>Sketch-based querying of distributed sliding-window data streams</article-title>
          ,
          <source>” PVLDB</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>10</issue>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Samoladas</surname>
          </string-name>
          , “
          <article-title>Sketch-based geometric monitoring of distributed stream queries</article-title>
          ,
          <source>” PVLDB</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and W. Hong, “
          <article-title>Model-driven data acquisition in sensor networks</article-title>
          ,” in VLDB,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Phillips</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Jestes</surname>
          </string-name>
          , “
          <article-title>Efficient threshold monitoring for distributed probabilistic data,” in</article-title>
          <string-name>
            <surname>ICDE</surname>
          </string-name>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kanagal</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , “
          <article-title>Online filtering, smoothing and probabilistic modeling of streaming data,” in</article-title>
          <string-name>
            <surname>ICDE</surname>
          </string-name>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Serra</surname>
          </string-name>
          , “
          <article-title>Image analysis</article-title>
          and mathematical morphology,” in Academic Press, London,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramamritham</surname>
          </string-name>
          , “
          <article-title>Handling non-linear polynomial queries over dynamic data</article-title>
          ,” in ICDE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Keren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Cooper</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Subrahmonia</surname>
          </string-name>
          , “
          <article-title>Describing complicated objects by implicit polynomials,”</article-title>
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell., vol.
          <volume>16</volume>
          , no.
          <issue>1</issue>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fogel</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Halperin</surname>
          </string-name>
          , “
          <article-title>Exact and efficient construction of minkowski sums of convex polyhedra with applications,” Computer-Aided Design</article-title>
          , vol.
          <volume>39</volume>
          , no.
          <issue>11</issue>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ar</surname>
          </string-name>
          , “
          <article-title>Content based retrieval of vrml objects: an iterative and interactive approach</article-title>
          ,”
          <source>in Proceedings of the sixth Eurographics workshop on Multimedia</source>
          <year>2001</year>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramamritham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Mohania</surname>
          </string-name>
          , “
          <article-title>Ratio threshold queries over distributed data sources</article-title>
          ,” in ICDE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] “
          <article-title>The european air quality database</article-title>
          ,” in http://tinyurl.com/ct9bh7x.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurpius</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          , “
          <article-title>Gas-phase chemistry dominates o3 loss to a forest, implying a source of aerosols and hydroxyl radicals to the atmosphere</article-title>
          ,
          <source>” Geophysical Research Letters</source>
          , vol.
          <volume>30</volume>
          , no.
          <issue>7</issue>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] http://tinyurl.com/kxssfgl.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>DCPR</surname>
          </string-name>
          <article-title>(Data Clustering and Pattern Recognition) Toolbox</article-title>
          , http://tinyurl.com/nxospq2.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>