An Edge-centric Ensemble Scheme for Queries
                    Assignment

              Kostas Kolomvatsos and Christos Anagnostopoulos

 Department of Computing Science, University of Glasgow, G12 8RZ, Glasgow, UK
      {kostas.kolomvatsos, christos.anagnostopoulos}@glasgow.ac.uk


      Abstract. The new era of the Internet of Things (IoT) reveals new po-
      tentials for the management of numerous devices. Such devices produce
      data streams that are guided to the Cloud for further processing. How-
      ever, any processing in the Cloud, even if it is supported by increased
      computational resources, suffers from increased latency. For minimizing
      the latency, we can perform data processing at the edge of the network,
      i.e., at the edge nodes. The aim is to provide analytics and build knowl-
      edge on top of the collected data in the minimum time. In this paper,
      we deal with the problem of allocating queries, defined for producing
      knowledge, to a number of edge nodes. The aim is to further reduce
      the latency by allocating queries to nodes that exhibit low load (the
      current and the estimated), thus, they can provide the final response
      in the minimum time. However, before the allocation, we should decide
      the computational burden that a query will add. The allocation is con-
      cluded by the assistance of an ensemble similarity scheme responsible to
      deliver the complexity class for each query. The complexity class, thus,
      can be matched against the current load of every edge node. We discuss
      our scheme and through a large set of simulations and the adoption of
      benchmarking queries, we reveal the potentials of the proposed model
      supported by numerical results.

      Keywords: Edge nodes · Queries assignment · Ensemble similarity scheme


1    Introduction

In the era of the Internet of Things (IoT), numerous devices form a vast infras-
tructure while being capable of performing simple processing tasks and exchange
of data. One can identify a research challenge related to the connection of such
devices with the network and three locations of data processing. Data can be
processed at the devices, at the edge of the network (Edge/Fog) or at the Cloud.
As we move to the upper layers of this architecture (from the devices to the
Cloud), we observe improved computational resources, however, the latency in-
creases as well. Current research efforts (e.g., [4], [10], [27]) focus on the data
streams management at the edge to reduce the latency experienced by end users.
Hence, the power of data processing and knowledge production is transferred to
the edge nodes instead of relying on the Cloud or a central data warehouse.
2      K. Kolomvatsos and C. Anagnostopoulos

    A number of Edge Nodes (ENs) can be considered as the distributed repos-
itories where queries can be executed to export meaningful analytics. ENs vary
from simple routers to complete servers placed at various locations. ENs are con-
nected with a number of devices and act as the repository of the data reported
by them. ENs are responsible to execute the queries and report the result to the
requested entity. The efficient management of the incoming queries as well as the
provided responses will characterize the success of the supported applications.
Usually, applications demand a response in the minimum time to provide high
quality services to end users. The most important challenge is that ENs receive
queries from multiple requestors and the collected data are quickly updated over
time. This makes imperative the need for intelligent methods that will be able
to manage the numerous requests (in the form of queries) and the large volumes
of the collected data.


    In this paper, we deal with the problem of query allocation to the appro-
priate ENs. Queries are reported through streams into a set of entities called
Query Controllers (QCs). QCs are located at the Cloud and they have direct
connection with multiple ENs. Our aim is to efficiently allocate every query to
the appropriate EN in order to get the final response in the minimum time. This
is a multi-dimensional problem involving queries and ENs characteristics (e.g.,
query type, ENs location, ENs load, the collected data). In the current effort,
we focus on ‘matching’ queries with ENs. We propose a method for classifying
queries into a set of complexity classes that depict the burden that a query will
cause to an EN. Hence, we can compare the requirements of the query with the
ENs’ load and decide if the specific allocation is productive. A question arises:
‘Why do not we rely on the selection of ENs with the lowest load? ’. The response
is two-fold, i.e., (i) we want a mechanism to estimate the computational burden
that a query will add to the selected ENs being difficult to classify a query in
a specific complexity class; (ii) we cannot be based on the current minimum
load as ENs receive queries from multiple QCs, thus, the load is continuously
updated. We propose models for both problems; an ensemble similarity scheme
for the estimation of the complexity class based on historical queries and the
decision related to the selection of ENs based on a ‘combined’ view of their
current and future loads. The following list reports on the contributions of our
work: (i) we provide a modeling process for different types of queries; (ii) we
provide an ensemble similarity scheme for concluding the complexity class; (iii)
we provide a ENs selection model based on their current and the future load;
(iv) we provide experimental evaluation of our ensemble similarity scheme.


   The paper is organized as follows. Section 2 presents the related work while
Section 3 discusses the problem under consideration. Section 4 presents the pro-
posed ensemble similarity scheme and the adopted decision making technique.
Section 5 reports on the experimental evaluation of our mechanism while Section
6 concludes our paper by giving insights in our future research directions.
                  An Edge-centric Ensemble Scheme for Queries Assignment         3

2   Prior Work

In the IoT infrastructure, the collection of data in multiple locations is a com-
mon approach. Data are geospatialy distributed with multiple nodes hosting the
data. The management of these numerous nodes is a very challenging task. A
set of efforts try to reveal opportunities for the management of the distributed
nodes/data. For instance, Dragon [19] tries to efficiently identify nodes that can
reply to user requests based on static criteria describing nodes themselves or
their data. In such settings, the important is to have a view on the nodes char-
acteristics as well as the available data. However, IoT nodes exhibit different
characteristics not only in the hardware but also in the software (e.g., their
middleware). In [16], the authors propose a Distributed Data Service (DDS)
providing functionalities for collecting and processing data. The main target is
to enable multiple and distinct IoT middleware systems to share common data
services, thus, to cover interoperability issues.
    In any case, the execution of queries, in parallel, can increase the perfor-
mance of the applications. This advantage is provided on top of the separation
of data in a number of partitions. The separation of data may ‘emerge’ as a
natural consequence, e.g., when streams report data in high rates at various lo-
cations. Multiple efforts try to handle the problem of proposing algorithms for
separating the available data. In [5], the authors adopt a sliding window ap-
proach. Streams are partitioned on the fly taking into consideration the query
semantics. A multi-route optimizer is proposed in [8]. The optimizer tries to ex-
ploit the intra- and inter-stream correlations to produce effective partitions. The
authors in [35] propose the separation of streams into a set of sub-streams over
which query operators are executed in parallel. Another effort that focuses on
splitting functions is reported in [13]. The proposed partitioning functions are
characterized by a set of properties, i.e, balance properties (e.g., memory, pro-
cessing, communication balance), structural properties (e.g., compactness, fast
lookup), and adaptation properties (e.g., fast computation, minimal migration).
    In addition, various models, originated in the database community, have been
proposed for delivering the similarity between SQL queries. Queries can be rep-
resented at the intentional [32] or at the extensional level [29]. Other techniques
involve Information Retrieval (IR) models, i.e., queries can be depicted by vec-
tors of features [2], a set of fragments [1] or graphs [31]. Example schemes deal
with the inner product of vectors [29], the cosine distance [29] or the Jaccard
coefficient [9]. Other more ‘sophisticated’ solutions focus on the adoption of Sup-
port Vector Machines (SVMs) [33], [34]. SVMs aim to learn the ranking function
applied on queries. This way, we are able to sort the queries and get the top-k
of them. Most existing top-k query processing algorithms like [7], [17] assume
that the ranking function is defined over absolute attribute values or they are
monotonic. The exploitation of the similarity can involve index structures (e.g.,
B-trees) to access the scoring of a sub-region. Other efforts e.g., [36], focus on
relaxing the monotonicity assumption to include functions whose scores can be
bounded in the given attribute value range.
4       K. Kolomvatsos and C. Anagnostopoulos

    In our past research, we also deal with the allocation of queries to a set of
processors. In [20], we propose a time-optimized scheme for selecting the appro-
priate processor(s) through the use of the Odds algorithm. With the proposed
model, we try to result the optimal allocation, in the minimum time. In [21], we
present a Q-learning scheme to calculate the reward retrieved for an allocation.
In [22], we extend the work presented in [21] and incorporate into the learning
process a load balancer comparing it with a clustering model that creates groups
of processors with similar characteristics. The missing contributions in our past
research activities that we cover with the current work are: (i) in our previous
models, we do not adopt any specific similarity technique for ‘matching’ queries
with the available processors; (ii) Our past efforts do not deal with an ensemble
scheme that identifies the complexity class of queries; (iii) our past models are
mostly ‘static’ meaning that they are applied on top of static values without
taking into consideration the continuous update of processors characteristics;
(iv) our previous research efforts require a training phase that increases the la-
tency in the provision of the final response especially when adopted in dynamic
environments.


3    Problem Definition and High Level Description

Data Processing
                    at the Edge of the Network. We consider a set of ENs, i.e.,
EN = en1 , en2 , . . . , en|EN | placed at various locations (e.g., in a city or around
the Globe). A set of IoT nodes (e.g., smartphones, sensors) are ‘connected’ with
every EN to send their data. ENs, on top of the collected data, can build and
extend the produced contextual knowledge supporting decision making. A Query
Processor (QP) is adopted     in every EN to respond to any incoming query. Let
the set of QPs be QP = qp1 , qp2 , . . . , qp|EN | . QPs are ‘invoked’ through the
mediation of Cloud where a number of QCs are present. QCs receive queries,
‘invoke’ the appropriate QPs, get their responses and return the final result. We
consider two types of applications, i.e., (i) applications that demand responses
in real time; (ii) applications that do not define any time constraints. In our
research, we focus on the former type.
    In each EN, a dataset is formulated by the collected data defining a geo-
distributed local data repository. Each dataset Di , present at the ith EN, stores
multivariate data, i.e., vectors in the form x = hx1 , x2 , . . . , xl i where l is the
number of dimensions. Di s are updated over time as streams produced by IoT
devices report data at high rates. In our research, we cannot have any view on
the data present in every dataset and we do not adopt any separation algorithm
for the collected data.
    In the upper layer (i.e., the Fog/Cloud), there is a number of QCs that have
to process queries reported through streams Qi = {q1 , q2 , . . .}. QCs perform the
selection of the appropriate ENs/QPs and the final aggregation of the ‘partial’ re-
sponses. As partial response, we define the response retrieved by an EN/QP that
should be aggregated with the remaining results reported by other ENs/QPs. In
our research, we provide models for the management of the ecosystem of QCs -
                    An Edge-centric Ensemble Scheme for Queries Assignment                   5

ENs/QPs (see Figure 1) and define techniques for the efficient allocation of the
incoming queries. Our current effort tries to ‘match’: (a) queries q1 , q2 , . . . with;
(b) the available QPs qp1 , qp2 , . . . , qp|EN | .


                  Fig. 1. The generic architecture under consideration.

    Matching Queries and Processors. Every EN/QP exhibits specific char-
acteristics C qp = {cqp         qp
                           1 , c2 , . . .}, e.g., C
                                                    qp
                                                       = {load, speed, language, ef f ectiveness}.
A detailed discussion on the QPs characteristics can be found in [25]. Some of
them are: (i) the input language; (ii) the types of the performed optimizations;
(iii) the optimization timing; (iv) the effectiveness of processors as depicted by
statistics; (v) the decision sites; (vi) the exploitation of the network topology;
(vii) the exploitation of replicated fragments; (viii) the use of semi-joins. We
propose to extend the aforementioned list and incorporate more ‘dynamic’ char-
acteristics that are related to high level features like the load and the speed
of each QP. Such characteristics are delivered as a more detailed view of the
performance depicting the current state of each QP. In our work, we focus on
the load β as an indication of their ability to quickly perform the execution of a
query.
    We consider that every QP maintains a queue where the incoming queries
are placed and wait for processing. The size of the queue is adopted to deliver
β which represents the percentage of the maximum load that can be afforded
by the corresponding QP. Without loss of generality, β is defined in [0,1] (a
maximum queue size Qmax is adopted for such purposes). When β → 1 means
that the corresponding QP exhibits a high load. The load is directly ‘connected’
with the throughput of each QP and the velocity in which queries arrive in the
discussed queue. A complex query (e.g., a join query - see below) may demand
for more time and resources to be answered compared with a simple query (e.g.,
a select query). Usually, a complex query requires a high number of steps (a
discussion on the query execution plans and the required steps can be found in
the upcoming sections). The ‘classification’ of the complexity of a query and its
‘combination’ with the load of a QP is the main subject of the current work.
Future extensions involve the modeling of more QPs’ characteristics as well as a
complex ‘matching’ scheme for delivering the final allocation.
    Every query reported to a QC also has a set of characteristics depicted by
C q = {cq1 , cq2 , . . .}, e.g., C q = {class, deadline, type, size}. According to [14],
‘generic’ characteristics are: (i) the type of the query (e.g., repetitive, ad-hoc);
6       K. Kolomvatsos and C. Anagnostopoulos

(ii) the query shape; (iii) the size of the query (e.g., simple, medium, complex).
Based on C q , specific execution plans could be defined in the form of a processing
tree [14]. We propose to extend the aforementioned list and incorporate more
characteristics that depict the complexity and the need for instant response.
Such characteristics affect queries’ execution in terms of the resources required
to produce the final response. In the current work, we focus on the query class θ;
it depicts the complexity of a query. θ is aligned with the complexity performed
by the operations required for producing the final result. For instance, the op-
erations required by a select query may be easier than the operations required
by a Cartesian product. Various research efforts deal with the complexity of
queries [3], [28], [30].

    Delivering the Query Complexity Class. Our aim is to combine β with
θ and support the decision if a query can be efficiently executed in a QP. Ini-
tially, we have to assign the incoming query to a complexity class θ (i.e., a typical
classification task). However, it is difficult to ‘match’ any query against a single
class due to the increased number of constraints that could be adopted in each
query statement. The final complexity should be defined based not only on quan-
titative (e.g., number of constraints / conditions) but also on qualitative (e.g.,
type of operations / constraints) characteristics. For handling this complicated
process, we propose a ‘fuzzy’ approach and define a Fuzzy Classification Process
(FCP). The FCP is the process of grouping individuals having the same charac-
teristics into the same fuzzy set based on a membership function that indicates
whether a query is a member of a class. The FCP tells us the membership of
a query in each of the pre-defined classes, thus, we could be able to estimate
the computational burden that will be added to the selected QP. A dataset of
historical queries (i.e., a set of tuples) together with their corresponding classes
is available for the FCP. The same class may be involved in multiple tuples, thus,
in multiple queries. The queries dataset if defined by database experts and its
creation is not the focus of the current work.

    For evaluating θ for qj , we adopt Information Retrieval (IR) and Data Mining
(DM) techniques. We prefer to adopt fast similarity techniques to deliver the final
result in real time instead of adopting a classification approach that requires a
training phase. Our model can be easily executed, even if the queries dataset
is updated; this is not the case in the majority of the classification
                                                                          algorithms
(the training phase should be executed again). The set Θ = θ1 , θ2 , . . . , θ|Θ|
depicts the pre-defined classes where a query can/should be classified. Let the
queries dataset be QD with tuples in the form hsk , θk i , ∀k ∈ {1, 2, . . . , |QD |}.
sk represents the query’s statement and θk ∈ Θ. An example query statement
could be {Select price from stocks where id =0 RBS 0 }. The function f gets the
qj and based on QD delivers a vector that depicts the ‘similarity’ of qj with
every class in Θ, i.e., f (qj ; QD ) → qs ∈ R|Θ| . qs contains values in [0,1] forming
the
 basis of our FCP. An example vector         could be qs = h0.2, 0.8, 0.3i for Θ =
                                          2   s
  θ1 = O(nlogn), θ2 = O(n), θ3 = O(n . q shows that the qj ‘belongs’ by 20%
to the first complexity class, by 80% to the second and by 30% to the third.
                      An Edge-centric Ensemble Scheme for Queries Assignment                       7

Based on qs , we can estimate θ and match it with QPs characteristics (i.e., β in
this effort).
    For calculating qs , we can be based on various efforts that deliver the simi-
larity between queries. The interested reader can refer in [23] for more details.
We propose the use of an ensemble scheme for evaluating the final similarity be-
tween qj and every tuple hsk , θk i in QD . The ensemble model aims at avoiding
the disadvantages of each individual metric. We process all theavailable tuples
in QD classified to θk . The ensemble scheme adopts the set E = e1 , e2 , . . . , e|E|
of similarity metrics. For instance, E could involve the Hamming distance [26],
the Jaccard coefficient [1], the Cosine similarity [26] or any other metric. Any
distance metric available in the corresponding literature could be transformed to
depict the similarity between qj and hsk , θk i. For instance, if ed is the Euclidean
                                                                              1
distance between qj and a tuple, their similarity can be calculated by 1+e      d
                                                                                  .


4    Allocating Queries to Processors
The Ensemble Scheme. The adopted similarity metrics are applied on each
tuple classified to θk aggregated to a successive step for the finalization of qks ,
i.e., the final similarity of qj with θk . Formally the ‘2D aggregation’ is calculated
as follows: qks = Ω(ω {ei (qj , hsk , θk i)} , ∀i, ∀ hsk , θk i. ω realizes the envisioned
ensemble similarity scheme while the aggregation operator Ω produces the qks
through multiple ω values.
     For ω, we consider that every single result (i.e., ei (qj , hsk , θk i) represents the
membership of qj to a ‘virtual’ fuzzy set. We have |E| membership degrees that
should be combined to get the final similarity for the each tuple. For instance, if
we get e1 = 0.2, e2 = 0.5 and e3 = 0.3, qj ‘belongs’ to the e1 fuzzy set by 0.2, to
the e2 by 0.5 and to the e3 by 0.3. ω is a fuzzy aggregation operator, a |E|-place
function ω : [0, 1]|E| → [0, 1]) that takes into consideration the membership to
every fuzzy set and returns the final value. Aggregation operators are well studied
in various efforts [12], [15]. Through a high number of experiments [12], [15],
a number of aggregators are identified to exhibit the best performance, i.e.,
the Einstein product, the algebric product, the Hamacher product [15] and the
Schweizer-Sklar metric [12].
     In the proposed model, we adopt the Hamacher product as it gives us more
opportunities to ‘tune’ the result through the parameter α ≥ 0. The final ω for a
                                           ė·ë
tuple is defined as: ωi = a+(1−a)(            ė+ë−ė·ë) where ė and ë are two similarity values.
As those values may be distributed in [0,1], i.e., similarity metrics may ‘disagree’,
we propose the use of the top-n similarity values based on their significance
level. The Significance Level (SL) depicts if a value is ‘representative’ for many
other results. We borrow the idea from the density based clustering where the
centroids are points being connected with many other objects. We propose the
use of the radius γ and calculate the SL for each similarity result as follows:
                        1
SLei =                                  , ∀j, where δ1 and δ2 are parameters adopted to
            1+e−(δ1 |d(ei ,ej )≤γ|−δ2 )
smooth the sigmoid function. With the sigmoid function, we want to eliminate
the SL of values with a low number of ‘neighbors’ in the radius γ. Finally, the
8        K. Kolomvatsos and C. Anagnostopoulos

results are sorted in descending order of the SL and the top-n of them are
processed with the Hamacher product to deliver the final aggregated similarity
value.
    The Ω operator builds on top of the ω values produced for each tuple in
QD classified in θk . Let ω1 , ω2 , . . . , ωm are those values. For their aggregation,
                                                          1 Pm α  α1
we rely on a Quasi-Arithmetic mean, i.e., qks = m               i=1 ωi    where α is a
parameter that ‘tunes’ the function. When α = 1, the function is the arithmetic
mean, when α = 2, it is the quadratic mean and so on. After calculating the
final values for each θk , we get qs = Ω1 , Ω2 , . . . , Ω|Θ| .
    The Matching Process. The next step is to estimate the required process-
ing steps to conclude the response for the qj , thus, to identify its computational
burden. We consider an additional vector Ts = T1 , T2 , . . . , T|Θ| which repre-
sents a ‘typical’ number of processing steps (an upper bound) for each class.
                                                                         P|Θ|
The expected number of processing steps for qj is defined by TE = i=1 Ωi Ti .
Recall that β depicts the current load of a processor, thus, 1 − β depicts the
room for ‘hosting’ additional queries. The most common execution approach is
the creation of an execution tree where the required steps are connected 1 . A
statistical study for the average required steps T E in various query execution
plans can assist us to define the room for additional queries in QPs. TE should be
compared with T̂E = (1 − β) T E to identify if qj can be executed in the specific
QP. When TE ≤ T̂E , we assign a reward r1 to the specific QP, otherwise, r1
corresponds to a penalty.
    In addition, we want to incorporate in the decision process, our view on
the future load of QPs. Hence, we maintain historical β values and apply a
single linear estimator to identify the future load as in [22]. The idea is to see if
the current (through T̂E ) and future load can support the execution of qj . For
the latest W β observations βt−1 , βt−2 , . . . , βt−W , we estimate the future load
β̂ through the linear combination of βt−k , k = 1, 2, . . . , W with real-valued ak
coefficients. The set {ak } is estimated to minimize the error between β̂ and β.
In our effort,
               we
                  adopt the Levinson-Durbin algorithm [24], [11]. Based on β̂,
if TE ≤ 1 − β̂ T E , we assign a reward equal to r2 to the specific processor,
otherwise, r2 corresponds to a penalty.
                                                      P|R|
    The ith QP gets a reward/penalty equal to ri = j=1 sgn(ri )ri where |R| is
the number of rewards and sgn(ri ) is the positive sign if the ri deals with a re-
ward; otherwise, it is the negative sign. For each QP, we calculate the probability
                                                                      e ri
of allocation pi delivered by the softmax function [6], i.e., pi = P|EN   | r . qj is
                                                                         i=1
                                                                             i e
allocated in the processors that their probability exceeds a pre-defined threshold
pT . This secures the optimal allocation based on the Probability Ranking Princi-
ple [18], i.e., if QPs are ordered by decreasing pi , then the model’s effectiveness
is the best to be gotten for the qj .


1
    https://docs.oracle.com/database/121/TGSQL/tgsql sqlproc.htm#TGSQL186
                  An Edge-centric Ensemble Scheme for Queries Assignment          9

5   Experimental Evaluation

Experimentation Setup. We report on the performance of the proposed scheme
through a large set of simulations. Our simulator is written in Java and manages
a number of queries retrieved by a real dataset. We rely on two benchmarking
datasets, i.e., TPC-DS and TPC-H (http://www.tpc.org/). TPC-DS is the de-
facto industry standard benchmark for measuring the performance of decision
support solutions. The TPC-H is a decision support benchmark that consists of
a suite of business oriented ad-hoc queries. For each of the adopted queries, we
define its class as described in [30] where a survey of databases experts define
their view on the complexity of every query. We classify our evaluation queries
in six (6) classes.
    We adopt three (3) performance metrics: (i) the time required for conclud-
ing a query allocation depicted by Ψ measured in seconds. The lower the Ψ is
the more efficient the model becomes; (ii) the difference between β of the first
selected QP compared to the lowest β observed in the group of QPs Ξ. Ξ → 0
means that the proposed model selects the best possible QP; (iii) the difference
of β between the lowest value in the group and the average β of the top-n se-
lected QPs Φ. We want Φ close to zero which means that the proposed model
selects the best possible QPs. Ξ and Φ depict the correct matching between the
qj and the available QPs. When we meet ‘ties’, i.e., QPs with the same pi , we
experiment with two scenarios: Scenario A. the random selection of the QPs
for the final allocation; Scenario B. the selection of the lowest possible β. In the
later case, there is the risk of possible bottlenecks (consider the scenario where
all the QCs select QPs with the lowest load when ‘ties’ arise).
    We get |EN | ∈ {10, 100, 1000, 10000} and consider β following: (a) the Uni-
form; (b) the Gaussian distributions. With the Uniform distribution, we simulate
a very dynamic environment where β continuously changes. The Gaussian distri-
bution assumes a ‘smooth’ environment where abrupt changes in β are absent.
In each simulation, we randomly select a query and apply the proposed model.
The adopted parameters are as follows: γ = 0.1, δ1 = 5.0, δ2 = 7.0, a = 1.5,
α = 10.0, W = 20.0, r1 = r2 = 10.0.
    Performance Assessment. Initially, we report on the complexity of the
proposed scheme which depends on: (i) the complexity of the ensemble similar-
ity model; (ii) the complexity of the QPs selection process. The first complexity
is affected by the |Θ| and the |QD | (the size of the dataset). Hence, the complex-
ity for (i) is O(|Θ)| · |QD |. In addition, when we produce the similarity values
with every metric in E, we require O(|E|2 + m) to calculate the Ω value. O(|E|2 )
is required to produce the SL for each metric and, additionally, O(m) to produce  
the Ω. Hence, the final complexity of our scheme is O |Θ| · |QD | · |E|2 + m .
In Figure 2, we plot the complexity of our scheme. At the left, we observe that
a combination of a high number of training queries with a high number of |Θ|
increases the computational time. At the right, we see that the number of simi-
larity metrics does not mainly affect the complexity. However, when Θ remains
low (e.g., below 20), the required time for concluding an allocation is low as well.
10      K. Kolomvatsos and C. Anagnostopoulos


              108                                     108
               4                                       8

                3                                      6

                2                                      4

                1                                      2

                0                                      0
             4000                                    100
                                               100                                        200
                    2000                                    50                      150
                                         50                                   100
                                                                         50
                               0   0                             0   0


                    Fig. 2. The complexity of the proposed scheme.


     In Table 5, we present the conclusion time for various numbers of QPs (Ψ
metric). The distribution of β does not affect the result; The adoption of the
Uniform mainly results lower conclusion time than the Gaussian distribution. In
any case, our results are below 0.3 seconds no matter the |EN |. This depicts the
efficiency of our model and its ability to support real time decisions.

                           Table 1. The conclusion time of our model.

                                                     Ψ
                                        |EN | Uniform Gaussian
                                         10    0.008   0.008
                                         100   0.012   0.010
                                        1,000 0.055    0.370
                                       10,000 0.251    0.276


   In Figure 3, we depict our results for the Ξ metric. We observe that, as
natural, the Scenario A leads to a higher difference with the lowest β than the
Scenario B. The random selection of a QP, in the case of ties, does not secure the
optimality of the selection but it focuses only in the ‘load balancing’ aspect of
the problem. We also observe that the difference is high as |EN | → 10, 000, The
higher the |EN | is, the higher the difference becomes. These results stand for the
Scenario A. In the Scenario B, we see that the increased |EN | positively affects
the performance as Ξ approaches zero. In the Scenario B, our model relies on
the minimum β, however, under the risk of bottlenecks if this decision is adopted
by the majority of the QCs.
   In Figure 4, we present our results for the Φ metric. Now, the difference is
higher than in the Ξ case. The reason is that the remaining selected QPs in the
top-n list and their load negatively affect the statistics. In any case, the load of
the selected QPs remains low in Scenario B.

6    Conclusions and Future Work
The efficient management of queries adopted to provide analytics comes, more in-
tensively, into scene in the IoT era. Queries should be immediately and efficiently
                 An Edge-centric Ensemble Scheme for Queries Assignment       11


                       Fig. 3. Our results for the Ξ metric.


                   Fig. 4. Performance results for the Φ metric.


responded to support high quality services. In this paper, we discuss a setting
where queries are set into the Cloud and responded in multiple edge nodes. We
propose a model for depicting the complexity of a query and an allocation pro-
cess to the edge nodes. The complexity class defines the computational burden
that a query imposes to a node and it is delivered by an ensemble similarity
scheme. Our model does not impose any training process and does not require
an increased time to deliver the final result. Our evaluation process reveals the
pros of the model and through numerical results confirms the increased perfor-
mance. Our future research plans involve the incorporation of more parameters
into the decision making process. For instance, we can take into consideration
the deadline defined for the final execution of a query or the statistics of data
hosted in each edge node. This way, we will be capable of providing a mecha-
nism fully adapted to the queries and nodes characteristics together with the
requirements defined by end users.


Acknowledgment

This work is funded by the EU/H2020 Marie Sklodowska-Curie (MSCA-2016)
under the INNOVATE project; Grant#745829.
12      K. Kolomvatsos and C. Anagnostopoulos

References
1. Aligon, J., et al., ’Mining preferences from OLAP query logs for proactive person-
   alization’, in Proc. of ADBIS, 2011.
2. Antara, G., et al., ’Plan selection based on query clustering’, in Proc. of VLDB
   Endowment, 2002.
3. Artail, H., et al., ’SQL query space and time complexity estimation for multidi-
   mensional queries’, Int. Journal of Int. Inf. and Database Systems, 2(4), 2008, pp.
   460–480.
4. Aujla, G. S., et al., ’Optimal Decision Making for Big Data Processing at Edge-
   Cloud Environment: An SDN Perspective’, IEEE TII, vol. 14(2), 2018, pp. 778–789.
5. Balkensen, C., Tatbul, N., ’Scalable Data Partitioning Techniques for Parallel Slid-
   ing Window Processing over Data Streams’, in Proc. of 8th Int. Workshop on Data
   Management for Sensor Networks, 2011.
6. Bishop, C., ’Pattern Recognition and Machine Learning’, Springer, 2006.
7. Bruno, N., Gravano, L., Marian, A., ’Evaluating top-k queries over web-accessible
   databases’, In ICDE, 2002.
8. Cao, L., Rundensteiner, E. A., ’High Performance Stream Query Processing with
   Correlation-Aware Partitioning’, VLDB Endowment, 7(4), 2013, pp. 265–276.
9. Chatzopoulou, G., et al., ’The QueRIE system for personalized query recommenda-
   tions’, IEEE Data Eng. Bull., 34(2), 2011, pp. 55-60.
10. Dias, M., et al., ’Distributed Data Stream Processing and Edge Computing: A
   Survey on Resource Elasticity and Future Directions’, NCA, 103, 2018, pp. 1–17.
11. Durbin, J., ’The fitting of time series models’, Rev. Inst. Int. Stat., vol. 28, 1960,
   pp. 233-243.
12. Farahbod, F., Eftekhari, N., ’Comparison of Different T-Norm Operators in Clas-
   sification Problems’, IJFLS, vol.2(3), 2012.
13. Gedik, B., ’Partitioning Functions for Stateful Data Parallelism in Stream Pro-
   cessing’, VLDB Journal, vol. 23(4), 2014, pp. 517–539.
14. Hameurlain, A., Morvan, F., ’Evolution of Query Optimization Methods’, TL-
   SDK-CS, 2009, pp. 211–242.
15. Hossain, K., Raihan, Z., Hashem, M., ’On Appropriate Selection of Fuzzy Aggrega-
   tion Operators in Medical Decision Support System’, In Proc. of the 8th Int. Conf.
   on Comp. and Inf. Technology, 2005.
16. Huacarpuma, R. C., at el., ’Distributed Data Service for Data Management in
   Internet of Things Middleware’, Sensors, vol. 17, 2017.
17. Hwang, S., Chang, K., ’Optimizing access cost for top-k queries over web sources:
   A unified cost-based approach’, in ICDE, 2005.
18. Jones, S., et al., ’A probabilistic model of information retrieval: development and
   comparative experiments’, Inf. Processing and Management, 36(6), 2000, pp. 779–
   808.
19. Kolcun, R., McCann, J. A., ’Dragon: Data Discovery and Collection Architecture
   for Distributed IoT’, Int. Conf. on IoT, 2014.
20. Kolomvatsos, K., ’An Intelligent Scheme for Assigning Queries’, Springer Applied
   Intelligence, doi.org/10.1007/s10489-017-1099-5, 2018.
21. Kolomvatsos, K., Hadjiefthymiades, S., ’Learning the Engagement of Query Pro-
   cessors for Intelligent Analytics’, Applied Intelligence Journal, vol. 46(1), 96-112,
   2017, pp. 1–17.
22. Kolomvatsos, K., Anagnostopoulos, C., ’Reinforcement Machine Learning for Pre-
   dictive Analytics in Smart Cities’, Informatics, 4(16), 2017.
                   An Edge-centric Ensemble Scheme for Queries Assignment            13

23. Kul, G., et al., ’Similarity Measures for SQL Query Clustering’, IEEE TKDE,
   2018.
24. Levinson, N., ’The Wiener RMS error criterion in filter design and prediction’,
   Journal of Math. Phys., vol. 25, 1947, pp. 261-278.
25. Ozgu, M. T., Valduriez, P., ’Overview of Query Processing’, Principles of Dis-
   tributes Database Systems, 2011, pp. 205–220.
26. Pandit, S., Gupta, S., ’A Comparative Study on Distance Measuring Approaches
   for Clustering’, Int. Journal of Research in Computer Science, 2(1), 2011, pp. 29–31.
27. Satoh, I., ’Edge Data Processing’, in Proc. of the 30th WAINA, 2016.
28. Simon, M., Pataki, N., ’SQL Code Complexity Analysis’, Proc. of the 8th Int.
   Conf. of Applied Informatics, 2010.
29. Stefanidis, K., Drosou, M., Pitoura, E., ’You may also like results in relational
   databases’, in Proc. of PersDB, 2009.
30. Vashistha, A., Jain, S., ’Measuring Query Complexity in SQLShare Workload’,
   Proc. of the Int. Conf. on Management of Data, 2016.
31. Yang, X., Procopiuc, C. M., Srivastava, D., ’Recommending join queries via query
   log analysis’, in IEEE ICDE, 2009.
32. Yao, Q., An, A., Huang, X., ’Finding and analyzing database user sessions’, in
   Proc. DASFAA, 2005.
33. Yu, H., et al., ’Exact indexing for support vector machines’, in proc. of the 2011
   ACM SIGMOD, 2011, pp. 709–720.
34. Zazhir, J., El Qadi, A., Bellatreche, L., ’Identifying SQL Queries Similarity Using
   SVM’, in Proc. of ICONIP, 2015.
35. Zeitler, E., Risch, T., ’Scalable Splitting of Massive Data Streams’, in DASFAA
   2010, vol 5982, Springer, 2010.
36. Zhang, Z., et al., ’Boolean + ranking: Querying a database by k-constrained opti-
   mization’, in Proc. ACM SIGMOD, 2006.