An Edge-centric Ensemble Scheme for Queries Assignment Kostas Kolomvatsos and Christos Anagnostopoulos Department of Computing Science, University of Glasgow, G12 8RZ, Glasgow, UK {kostas.kolomvatsos, christos.anagnostopoulos}@glasgow.ac.uk Abstract. The new era of the Internet of Things (IoT) reveals new po- tentials for the management of numerous devices. Such devices produce data streams that are guided to the Cloud for further processing. How- ever, any processing in the Cloud, even if it is supported by increased computational resources, suffers from increased latency. For minimizing the latency, we can perform data processing at the edge of the network, i.e., at the edge nodes. The aim is to provide analytics and build knowl- edge on top of the collected data in the minimum time. In this paper, we deal with the problem of allocating queries, defined for producing knowledge, to a number of edge nodes. The aim is to further reduce the latency by allocating queries to nodes that exhibit low load (the current and the estimated), thus, they can provide the final response in the minimum time. However, before the allocation, we should decide the computational burden that a query will add. The allocation is con- cluded by the assistance of an ensemble similarity scheme responsible to deliver the complexity class for each query. The complexity class, thus, can be matched against the current load of every edge node. We discuss our scheme and through a large set of simulations and the adoption of benchmarking queries, we reveal the potentials of the proposed model supported by numerical results. Keywords: Edge nodes · Queries assignment · Ensemble similarity scheme 1 Introduction In the era of the Internet of Things (IoT), numerous devices form a vast infras- tructure while being capable of performing simple processing tasks and exchange of data. One can identify a research challenge related to the connection of such devices with the network and three locations of data processing. Data can be processed at the devices, at the edge of the network (Edge/Fog) or at the Cloud. As we move to the upper layers of this architecture (from the devices to the Cloud), we observe improved computational resources, however, the latency in- creases as well. Current research efforts (e.g., [4], [10], [27]) focus on the data streams management at the edge to reduce the latency experienced by end users. Hence, the power of data processing and knowledge production is transferred to the edge nodes instead of relying on the Cloud or a central data warehouse. 2 K. Kolomvatsos and C. Anagnostopoulos A number of Edge Nodes (ENs) can be considered as the distributed repos- itories where queries can be executed to export meaningful analytics. ENs vary from simple routers to complete servers placed at various locations. ENs are con- nected with a number of devices and act as the repository of the data reported by them. ENs are responsible to execute the queries and report the result to the requested entity. The efficient management of the incoming queries as well as the provided responses will characterize the success of the supported applications. Usually, applications demand a response in the minimum time to provide high quality services to end users. The most important challenge is that ENs receive queries from multiple requestors and the collected data are quickly updated over time. This makes imperative the need for intelligent methods that will be able to manage the numerous requests (in the form of queries) and the large volumes of the collected data. In this paper, we deal with the problem of query allocation to the appro- priate ENs. Queries are reported through streams into a set of entities called Query Controllers (QCs). QCs are located at the Cloud and they have direct connection with multiple ENs. Our aim is to efficiently allocate every query to the appropriate EN in order to get the final response in the minimum time. This is a multi-dimensional problem involving queries and ENs characteristics (e.g., query type, ENs location, ENs load, the collected data). In the current effort, we focus on ‘matching’ queries with ENs. We propose a method for classifying queries into a set of complexity classes that depict the burden that a query will cause to an EN. Hence, we can compare the requirements of the query with the ENs’ load and decide if the specific allocation is productive. A question arises: ‘Why do not we rely on the selection of ENs with the lowest load? ’. The response is two-fold, i.e., (i) we want a mechanism to estimate the computational burden that a query will add to the selected ENs being difficult to classify a query in a specific complexity class; (ii) we cannot be based on the current minimum load as ENs receive queries from multiple QCs, thus, the load is continuously updated. We propose models for both problems; an ensemble similarity scheme for the estimation of the complexity class based on historical queries and the decision related to the selection of ENs based on a ‘combined’ view of their current and future loads. The following list reports on the contributions of our work: (i) we provide a modeling process for different types of queries; (ii) we provide an ensemble similarity scheme for concluding the complexity class; (iii) we provide a ENs selection model based on their current and the future load; (iv) we provide experimental evaluation of our ensemble similarity scheme. The paper is organized as follows. Section 2 presents the related work while Section 3 discusses the problem under consideration. Section 4 presents the pro- posed ensemble similarity scheme and the adopted decision making technique. Section 5 reports on the experimental evaluation of our mechanism while Section 6 concludes our paper by giving insights in our future research directions. An Edge-centric Ensemble Scheme for Queries Assignment 3 2 Prior Work In the IoT infrastructure, the collection of data in multiple locations is a com- mon approach. Data are geospatialy distributed with multiple nodes hosting the data. The management of these numerous nodes is a very challenging task. A set of efforts try to reveal opportunities for the management of the distributed nodes/data. For instance, Dragon [19] tries to efficiently identify nodes that can reply to user requests based on static criteria describing nodes themselves or their data. In such settings, the important is to have a view on the nodes char- acteristics as well as the available data. However, IoT nodes exhibit different characteristics not only in the hardware but also in the software (e.g., their middleware). In [16], the authors propose a Distributed Data Service (DDS) providing functionalities for collecting and processing data. The main target is to enable multiple and distinct IoT middleware systems to share common data services, thus, to cover interoperability issues. In any case, the execution of queries, in parallel, can increase the perfor- mance of the applications. This advantage is provided on top of the separation of data in a number of partitions. The separation of data may ‘emerge’ as a natural consequence, e.g., when streams report data in high rates at various lo- cations. Multiple efforts try to handle the problem of proposing algorithms for separating the available data. In [5], the authors adopt a sliding window ap- proach. Streams are partitioned on the fly taking into consideration the query semantics. A multi-route optimizer is proposed in [8]. The optimizer tries to ex- ploit the intra- and inter-stream correlations to produce effective partitions. The authors in [35] propose the separation of streams into a set of sub-streams over which query operators are executed in parallel. Another effort that focuses on splitting functions is reported in [13]. The proposed partitioning functions are characterized by a set of properties, i.e, balance properties (e.g., memory, pro- cessing, communication balance), structural properties (e.g., compactness, fast lookup), and adaptation properties (e.g., fast computation, minimal migration). In addition, various models, originated in the database community, have been proposed for delivering the similarity between SQL queries. Queries can be rep- resented at the intentional [32] or at the extensional level [29]. Other techniques involve Information Retrieval (IR) models, i.e., queries can be depicted by vec- tors of features [2], a set of fragments [1] or graphs [31]. Example schemes deal with the inner product of vectors [29], the cosine distance [29] or the Jaccard coefficient [9]. Other more ‘sophisticated’ solutions focus on the adoption of Sup- port Vector Machines (SVMs) [33], [34]. SVMs aim to learn the ranking function applied on queries. This way, we are able to sort the queries and get the top-k of them. Most existing top-k query processing algorithms like [7], [17] assume that the ranking function is defined over absolute attribute values or they are monotonic. The exploitation of the similarity can involve index structures (e.g., B-trees) to access the scoring of a sub-region. Other efforts e.g., [36], focus on relaxing the monotonicity assumption to include functions whose scores can be bounded in the given attribute value range. 4 K. Kolomvatsos and C. Anagnostopoulos In our past research, we also deal with the allocation of queries to a set of processors. In [20], we propose a time-optimized scheme for selecting the appro- priate processor(s) through the use of the Odds algorithm. With the proposed model, we try to result the optimal allocation, in the minimum time. In [21], we present a Q-learning scheme to calculate the reward retrieved for an allocation. In [22], we extend the work presented in [21] and incorporate into the learning process a load balancer comparing it with a clustering model that creates groups of processors with similar characteristics. The missing contributions in our past research activities that we cover with the current work are: (i) in our previous models, we do not adopt any specific similarity technique for ‘matching’ queries with the available processors; (ii) Our past efforts do not deal with an ensemble scheme that identifies the complexity class of queries; (iii) our past models are mostly ‘static’ meaning that they are applied on top of static values without taking into consideration the continuous update of processors characteristics; (iv) our previous research efforts require a training phase that increases the la- tency in the provision of the final response especially when adopted in dynamic environments. 3 Problem Definition and High Level Description Data Processing  at the Edge of the Network. We consider a set of ENs, i.e., EN = en1 , en2 , . . . , en|EN | placed at various locations (e.g., in a city or around the Globe). A set of IoT nodes (e.g., smartphones, sensors) are ‘connected’ with every EN to send their data. ENs, on top of the collected data, can build and extend the produced contextual knowledge supporting decision making. A Query Processor (QP) is adopted  in every EN to respond to any incoming query. Let the set of QPs be QP = qp1 , qp2 , . . . , qp|EN | . QPs are ‘invoked’ through the mediation of Cloud where a number of QCs are present. QCs receive queries, ‘invoke’ the appropriate QPs, get their responses and return the final result. We consider two types of applications, i.e., (i) applications that demand responses in real time; (ii) applications that do not define any time constraints. In our research, we focus on the former type. In each EN, a dataset is formulated by the collected data defining a geo- distributed local data repository. Each dataset Di , present at the ith EN, stores multivariate data, i.e., vectors in the form x = hx1 , x2 , . . . , xl i where l is the number of dimensions. Di s are updated over time as streams produced by IoT devices report data at high rates. In our research, we cannot have any view on the data present in every dataset and we do not adopt any separation algorithm for the collected data. In the upper layer (i.e., the Fog/Cloud), there is a number of QCs that have to process queries reported through streams Qi = {q1 , q2 , . . .}. QCs perform the selection of the appropriate ENs/QPs and the final aggregation of the ‘partial’ re- sponses. As partial response, we define the response retrieved by an EN/QP that should be aggregated with the remaining results reported by other ENs/QPs. In our research, we provide models for the management of the ecosystem of QCs - An Edge-centric Ensemble Scheme for Queries Assignment 5 ENs/QPs (see Figure 1) and define techniques for the efficient allocation of the incoming queries. Our current effort tries to ‘match’: (a) queries q1 , q2 , . . . with; (b) the available QPs qp1 , qp2 , . . . , qp|EN | . Fig. 1. The generic architecture under consideration. Matching Queries and Processors. Every EN/QP exhibits specific char- acteristics C qp = {cqp qp 1 , c2 , . . .}, e.g., C qp = {load, speed, language, ef f ectiveness}. A detailed discussion on the QPs characteristics can be found in [25]. Some of them are: (i) the input language; (ii) the types of the performed optimizations; (iii) the optimization timing; (iv) the effectiveness of processors as depicted by statistics; (v) the decision sites; (vi) the exploitation of the network topology; (vii) the exploitation of replicated fragments; (viii) the use of semi-joins. We propose to extend the aforementioned list and incorporate more ‘dynamic’ char- acteristics that are related to high level features like the load and the speed of each QP. Such characteristics are delivered as a more detailed view of the performance depicting the current state of each QP. In our work, we focus on the load β as an indication of their ability to quickly perform the execution of a query. We consider that every QP maintains a queue where the incoming queries are placed and wait for processing. The size of the queue is adopted to deliver β which represents the percentage of the maximum load that can be afforded by the corresponding QP. Without loss of generality, β is defined in [0,1] (a maximum queue size Qmax is adopted for such purposes). When β → 1 means that the corresponding QP exhibits a high load. The load is directly ‘connected’ with the throughput of each QP and the velocity in which queries arrive in the discussed queue. A complex query (e.g., a join query - see below) may demand for more time and resources to be answered compared with a simple query (e.g., a select query). Usually, a complex query requires a high number of steps (a discussion on the query execution plans and the required steps can be found in the upcoming sections). The ‘classification’ of the complexity of a query and its ‘combination’ with the load of a QP is the main subject of the current work. Future extensions involve the modeling of more QPs’ characteristics as well as a complex ‘matching’ scheme for delivering the final allocation. Every query reported to a QC also has a set of characteristics depicted by C q = {cq1 , cq2 , . . .}, e.g., C q = {class, deadline, type, size}. According to [14], ‘generic’ characteristics are: (i) the type of the query (e.g., repetitive, ad-hoc); 6 K. Kolomvatsos and C. Anagnostopoulos (ii) the query shape; (iii) the size of the query (e.g., simple, medium, complex). Based on C q , specific execution plans could be defined in the form of a processing tree [14]. We propose to extend the aforementioned list and incorporate more characteristics that depict the complexity and the need for instant response. Such characteristics affect queries’ execution in terms of the resources required to produce the final response. In the current work, we focus on the query class θ; it depicts the complexity of a query. θ is aligned with the complexity performed by the operations required for producing the final result. For instance, the op- erations required by a select query may be easier than the operations required by a Cartesian product. Various research efforts deal with the complexity of queries [3], [28], [30]. Delivering the Query Complexity Class. Our aim is to combine β with θ and support the decision if a query can be efficiently executed in a QP. Ini- tially, we have to assign the incoming query to a complexity class θ (i.e., a typical classification task). However, it is difficult to ‘match’ any query against a single class due to the increased number of constraints that could be adopted in each query statement. The final complexity should be defined based not only on quan- titative (e.g., number of constraints / conditions) but also on qualitative (e.g., type of operations / constraints) characteristics. For handling this complicated process, we propose a ‘fuzzy’ approach and define a Fuzzy Classification Process (FCP). The FCP is the process of grouping individuals having the same charac- teristics into the same fuzzy set based on a membership function that indicates whether a query is a member of a class. The FCP tells us the membership of a query in each of the pre-defined classes, thus, we could be able to estimate the computational burden that will be added to the selected QP. A dataset of historical queries (i.e., a set of tuples) together with their corresponding classes is available for the FCP. The same class may be involved in multiple tuples, thus, in multiple queries. The queries dataset if defined by database experts and its creation is not the focus of the current work. For evaluating θ for qj , we adopt Information Retrieval (IR) and Data Mining (DM) techniques. We prefer to adopt fast similarity techniques to deliver the final result in real time instead of adopting a classification approach that requires a training phase. Our model can be easily executed, even if the queries dataset is updated; this is not the case in the majority of the classification  algorithms (the training phase should be executed again). The set Θ = θ1 , θ2 , . . . , θ|Θ| depicts the pre-defined classes where a query can/should be classified. Let the queries dataset be QD with tuples in the form hsk , θk i , ∀k ∈ {1, 2, . . . , |QD |}. sk represents the query’s statement and θk ∈ Θ. An example query statement could be {Select price from stocks where id =0 RBS 0 }. The function f gets the qj and based on QD delivers a vector that depicts the ‘similarity’ of qj with every class in Θ, i.e., f (qj ; QD ) → qs ∈ R|Θ| . qs contains values in [0,1] forming the  basis of our FCP. An example vector could be qs = h0.2, 0.8, 0.3i for Θ = 2 s θ1 = O(nlogn), θ2 = O(n), θ3 = O(n . q shows that the qj ‘belongs’ by 20% to the first complexity class, by 80% to the second and by 30% to the third. An Edge-centric Ensemble Scheme for Queries Assignment 7 Based on qs , we can estimate θ and match it with QPs characteristics (i.e., β in this effort). For calculating qs , we can be based on various efforts that deliver the simi- larity between queries. The interested reader can refer in [23] for more details. We propose the use of an ensemble scheme for evaluating the final similarity be- tween qj and every tuple hsk , θk i in QD . The ensemble model aims at avoiding the disadvantages of each individual metric. We process all theavailable tuples in QD classified to θk . The ensemble scheme adopts the set E = e1 , e2 , . . . , e|E| of similarity metrics. For instance, E could involve the Hamming distance [26], the Jaccard coefficient [1], the Cosine similarity [26] or any other metric. Any distance metric available in the corresponding literature could be transformed to depict the similarity between qj and hsk , θk i. For instance, if ed is the Euclidean 1 distance between qj and a tuple, their similarity can be calculated by 1+e d . 4 Allocating Queries to Processors The Ensemble Scheme. The adopted similarity metrics are applied on each tuple classified to θk aggregated to a successive step for the finalization of qks , i.e., the final similarity of qj with θk . Formally the ‘2D aggregation’ is calculated as follows: qks = Ω(ω {ei (qj , hsk , θk i)} , ∀i, ∀ hsk , θk i. ω realizes the envisioned ensemble similarity scheme while the aggregation operator Ω produces the qks through multiple ω values. For ω, we consider that every single result (i.e., ei (qj , hsk , θk i) represents the membership of qj to a ‘virtual’ fuzzy set. We have |E| membership degrees that should be combined to get the final similarity for the each tuple. For instance, if we get e1 = 0.2, e2 = 0.5 and e3 = 0.3, qj ‘belongs’ to the e1 fuzzy set by 0.2, to the e2 by 0.5 and to the e3 by 0.3. ω is a fuzzy aggregation operator, a |E|-place function ω : [0, 1]|E| → [0, 1]) that takes into consideration the membership to every fuzzy set and returns the final value. Aggregation operators are well studied in various efforts [12], [15]. Through a high number of experiments [12], [15], a number of aggregators are identified to exhibit the best performance, i.e., the Einstein product, the algebric product, the Hamacher product [15] and the Schweizer-Sklar metric [12]. In the proposed model, we adopt the Hamacher product as it gives us more opportunities to ‘tune’ the result through the parameter α ≥ 0. The final ω for a ė·ë tuple is defined as: ωi = a+(1−a)( ė+ë−ė·ë) where ė and ë are two similarity values. As those values may be distributed in [0,1], i.e., similarity metrics may ‘disagree’, we propose the use of the top-n similarity values based on their significance level. The Significance Level (SL) depicts if a value is ‘representative’ for many other results. We borrow the idea from the density based clustering where the centroids are points being connected with many other objects. We propose the use of the radius γ and calculate the SL for each similarity result as follows: 1 SLei = , ∀j, where δ1 and δ2 are parameters adopted to 1+e−(δ1 |d(ei ,ej )≤γ|−δ2 ) smooth the sigmoid function. With the sigmoid function, we want to eliminate the SL of values with a low number of ‘neighbors’ in the radius γ. Finally, the 8 K. Kolomvatsos and C. Anagnostopoulos results are sorted in descending order of the SL and the top-n of them are processed with the Hamacher product to deliver the final aggregated similarity value. The Ω operator builds on top of the ω values produced for each tuple in QD classified in θk . Let ω1 , ω2 , . . . , ωm are those values. For their aggregation,  1 Pm α  α1 we rely on a Quasi-Arithmetic mean, i.e., qks = m i=1 ωi where α is a parameter that ‘tunes’ the function. When α = 1, the function is the arithmetic mean, when α = 2, it is the quadratic mean and so on. After calculating the final values for each θk , we get qs = Ω1 , Ω2 , . . . , Ω|Θ| . The Matching Process. The next step is to estimate the required process- ing steps to conclude the response for the qj , thus, to identify its computational burden. We consider an additional vector Ts = T1 , T2 , . . . , T|Θ| which repre- sents a ‘typical’ number of processing steps (an upper bound) for each class. P|Θ| The expected number of processing steps for qj is defined by TE = i=1 Ωi Ti . Recall that β depicts the current load of a processor, thus, 1 − β depicts the room for ‘hosting’ additional queries. The most common execution approach is the creation of an execution tree where the required steps are connected 1 . A statistical study for the average required steps T E in various query execution plans can assist us to define the room for additional queries in QPs. TE should be compared with T̂E = (1 − β) T E to identify if qj can be executed in the specific QP. When TE ≤ T̂E , we assign a reward r1 to the specific QP, otherwise, r1 corresponds to a penalty. In addition, we want to incorporate in the decision process, our view on the future load of QPs. Hence, we maintain historical β values and apply a single linear estimator to identify the future load as in [22]. The idea is to see if the current (through T̂E ) and future load can support the execution of qj . For the latest W β observations βt−1 , βt−2 , . . . , βt−W , we estimate the future load β̂ through the linear combination of βt−k , k = 1, 2, . . . , W with real-valued ak coefficients. The set {ak } is estimated to minimize the error between β̂ and β. In our effort,  we  adopt the Levinson-Durbin algorithm [24], [11]. Based on β̂, if TE ≤ 1 − β̂ T E , we assign a reward equal to r2 to the specific processor, otherwise, r2 corresponds to a penalty. P|R| The ith QP gets a reward/penalty equal to ri = j=1 sgn(ri )ri where |R| is the number of rewards and sgn(ri ) is the positive sign if the ri deals with a re- ward; otherwise, it is the negative sign. For each QP, we calculate the probability e ri of allocation pi delivered by the softmax function [6], i.e., pi = P|EN | r . qj is i=1 i e allocated in the processors that their probability exceeds a pre-defined threshold pT . This secures the optimal allocation based on the Probability Ranking Princi- ple [18], i.e., if QPs are ordered by decreasing pi , then the model’s effectiveness is the best to be gotten for the qj . 1 https://docs.oracle.com/database/121/TGSQL/tgsql sqlproc.htm#TGSQL186 An Edge-centric Ensemble Scheme for Queries Assignment 9 5 Experimental Evaluation Experimentation Setup. We report on the performance of the proposed scheme through a large set of simulations. Our simulator is written in Java and manages a number of queries retrieved by a real dataset. We rely on two benchmarking datasets, i.e., TPC-DS and TPC-H (http://www.tpc.org/). TPC-DS is the de- facto industry standard benchmark for measuring the performance of decision support solutions. The TPC-H is a decision support benchmark that consists of a suite of business oriented ad-hoc queries. For each of the adopted queries, we define its class as described in [30] where a survey of databases experts define their view on the complexity of every query. We classify our evaluation queries in six (6) classes. We adopt three (3) performance metrics: (i) the time required for conclud- ing a query allocation depicted by Ψ measured in seconds. The lower the Ψ is the more efficient the model becomes; (ii) the difference between β of the first selected QP compared to the lowest β observed in the group of QPs Ξ. Ξ → 0 means that the proposed model selects the best possible QP; (iii) the difference of β between the lowest value in the group and the average β of the top-n se- lected QPs Φ. We want Φ close to zero which means that the proposed model selects the best possible QPs. Ξ and Φ depict the correct matching between the qj and the available QPs. When we meet ‘ties’, i.e., QPs with the same pi , we experiment with two scenarios: Scenario A. the random selection of the QPs for the final allocation; Scenario B. the selection of the lowest possible β. In the later case, there is the risk of possible bottlenecks (consider the scenario where all the QCs select QPs with the lowest load when ‘ties’ arise). We get |EN | ∈ {10, 100, 1000, 10000} and consider β following: (a) the Uni- form; (b) the Gaussian distributions. With the Uniform distribution, we simulate a very dynamic environment where β continuously changes. The Gaussian distri- bution assumes a ‘smooth’ environment where abrupt changes in β are absent. In each simulation, we randomly select a query and apply the proposed model. The adopted parameters are as follows: γ = 0.1, δ1 = 5.0, δ2 = 7.0, a = 1.5, α = 10.0, W = 20.0, r1 = r2 = 10.0. Performance Assessment. Initially, we report on the complexity of the proposed scheme which depends on: (i) the complexity of the ensemble similar- ity model; (ii) the complexity of the QPs selection process. The first complexity is affected by the |Θ| and the |QD | (the size of the dataset). Hence, the complex- ity for (i) is O(|Θ)| · |QD |. In addition, when we produce the similarity values with every metric in E, we require O(|E|2 + m) to calculate the Ω value. O(|E|2 ) is required to produce the SL for each metric and, additionally, O(m) to produce  the Ω. Hence, the final complexity of our scheme is O |Θ| · |QD | · |E|2 + m . In Figure 2, we plot the complexity of our scheme. At the left, we observe that a combination of a high number of training queries with a high number of |Θ| increases the computational time. At the right, we see that the number of simi- larity metrics does not mainly affect the complexity. However, when Θ remains low (e.g., below 20), the required time for concluding an allocation is low as well. 10 K. Kolomvatsos and C. Anagnostopoulos 108 108 4 8 3 6 2 4 1 2 0 0 4000 100 100 200 2000 50 150 50 100 50 0 0 0 0 Fig. 2. The complexity of the proposed scheme. In Table 5, we present the conclusion time for various numbers of QPs (Ψ metric). The distribution of β does not affect the result; The adoption of the Uniform mainly results lower conclusion time than the Gaussian distribution. In any case, our results are below 0.3 seconds no matter the |EN |. This depicts the efficiency of our model and its ability to support real time decisions. Table 1. The conclusion time of our model. Ψ |EN | Uniform Gaussian 10 0.008 0.008 100 0.012 0.010 1,000 0.055 0.370 10,000 0.251 0.276 In Figure 3, we depict our results for the Ξ metric. We observe that, as natural, the Scenario A leads to a higher difference with the lowest β than the Scenario B. The random selection of a QP, in the case of ties, does not secure the optimality of the selection but it focuses only in the ‘load balancing’ aspect of the problem. We also observe that the difference is high as |EN | → 10, 000, The higher the |EN | is, the higher the difference becomes. These results stand for the Scenario A. In the Scenario B, we see that the increased |EN | positively affects the performance as Ξ approaches zero. In the Scenario B, our model relies on the minimum β, however, under the risk of bottlenecks if this decision is adopted by the majority of the QCs. In Figure 4, we present our results for the Φ metric. Now, the difference is higher than in the Ξ case. The reason is that the remaining selected QPs in the top-n list and their load negatively affect the statistics. In any case, the load of the selected QPs remains low in Scenario B. 6 Conclusions and Future Work The efficient management of queries adopted to provide analytics comes, more in- tensively, into scene in the IoT era. Queries should be immediately and efficiently An Edge-centric Ensemble Scheme for Queries Assignment 11 Fig. 3. Our results for the Ξ metric. Fig. 4. Performance results for the Φ metric. responded to support high quality services. In this paper, we discuss a setting where queries are set into the Cloud and responded in multiple edge nodes. We propose a model for depicting the complexity of a query and an allocation pro- cess to the edge nodes. The complexity class defines the computational burden that a query imposes to a node and it is delivered by an ensemble similarity scheme. Our model does not impose any training process and does not require an increased time to deliver the final result. Our evaluation process reveals the pros of the model and through numerical results confirms the increased perfor- mance. Our future research plans involve the incorporation of more parameters into the decision making process. For instance, we can take into consideration the deadline defined for the final execution of a query or the statistics of data hosted in each edge node. This way, we will be capable of providing a mecha- nism fully adapted to the queries and nodes characteristics together with the requirements defined by end users. Acknowledgment This work is funded by the EU/H2020 Marie Sklodowska-Curie (MSCA-2016) under the INNOVATE project; Grant#745829. 12 K. Kolomvatsos and C. Anagnostopoulos References 1. Aligon, J., et al., ’Mining preferences from OLAP query logs for proactive person- alization’, in Proc. of ADBIS, 2011. 2. Antara, G., et al., ’Plan selection based on query clustering’, in Proc. of VLDB Endowment, 2002. 3. Artail, H., et al., ’SQL query space and time complexity estimation for multidi- mensional queries’, Int. Journal of Int. Inf. and Database Systems, 2(4), 2008, pp. 460–480. 4. Aujla, G. S., et al., ’Optimal Decision Making for Big Data Processing at Edge- Cloud Environment: An SDN Perspective’, IEEE TII, vol. 14(2), 2018, pp. 778–789. 5. Balkensen, C., Tatbul, N., ’Scalable Data Partitioning Techniques for Parallel Slid- ing Window Processing over Data Streams’, in Proc. of 8th Int. Workshop on Data Management for Sensor Networks, 2011. 6. Bishop, C., ’Pattern Recognition and Machine Learning’, Springer, 2006. 7. Bruno, N., Gravano, L., Marian, A., ’Evaluating top-k queries over web-accessible databases’, In ICDE, 2002. 8. Cao, L., Rundensteiner, E. A., ’High Performance Stream Query Processing with Correlation-Aware Partitioning’, VLDB Endowment, 7(4), 2013, pp. 265–276. 9. Chatzopoulou, G., et al., ’The QueRIE system for personalized query recommenda- tions’, IEEE Data Eng. Bull., 34(2), 2011, pp. 55-60. 10. Dias, M., et al., ’Distributed Data Stream Processing and Edge Computing: A Survey on Resource Elasticity and Future Directions’, NCA, 103, 2018, pp. 1–17. 11. Durbin, J., ’The fitting of time series models’, Rev. Inst. Int. Stat., vol. 28, 1960, pp. 233-243. 12. Farahbod, F., Eftekhari, N., ’Comparison of Different T-Norm Operators in Clas- sification Problems’, IJFLS, vol.2(3), 2012. 13. Gedik, B., ’Partitioning Functions for Stateful Data Parallelism in Stream Pro- cessing’, VLDB Journal, vol. 23(4), 2014, pp. 517–539. 14. Hameurlain, A., Morvan, F., ’Evolution of Query Optimization Methods’, TL- SDK-CS, 2009, pp. 211–242. 15. Hossain, K., Raihan, Z., Hashem, M., ’On Appropriate Selection of Fuzzy Aggrega- tion Operators in Medical Decision Support System’, In Proc. of the 8th Int. Conf. on Comp. and Inf. Technology, 2005. 16. Huacarpuma, R. C., at el., ’Distributed Data Service for Data Management in Internet of Things Middleware’, Sensors, vol. 17, 2017. 17. Hwang, S., Chang, K., ’Optimizing access cost for top-k queries over web sources: A unified cost-based approach’, in ICDE, 2005. 18. Jones, S., et al., ’A probabilistic model of information retrieval: development and comparative experiments’, Inf. Processing and Management, 36(6), 2000, pp. 779– 808. 19. Kolcun, R., McCann, J. A., ’Dragon: Data Discovery and Collection Architecture for Distributed IoT’, Int. Conf. on IoT, 2014. 20. Kolomvatsos, K., ’An Intelligent Scheme for Assigning Queries’, Springer Applied Intelligence, doi.org/10.1007/s10489-017-1099-5, 2018. 21. Kolomvatsos, K., Hadjiefthymiades, S., ’Learning the Engagement of Query Pro- cessors for Intelligent Analytics’, Applied Intelligence Journal, vol. 46(1), 96-112, 2017, pp. 1–17. 22. Kolomvatsos, K., Anagnostopoulos, C., ’Reinforcement Machine Learning for Pre- dictive Analytics in Smart Cities’, Informatics, 4(16), 2017. An Edge-centric Ensemble Scheme for Queries Assignment 13 23. Kul, G., et al., ’Similarity Measures for SQL Query Clustering’, IEEE TKDE, 2018. 24. Levinson, N., ’The Wiener RMS error criterion in filter design and prediction’, Journal of Math. Phys., vol. 25, 1947, pp. 261-278. 25. Ozgu, M. T., Valduriez, P., ’Overview of Query Processing’, Principles of Dis- tributes Database Systems, 2011, pp. 205–220. 26. Pandit, S., Gupta, S., ’A Comparative Study on Distance Measuring Approaches for Clustering’, Int. Journal of Research in Computer Science, 2(1), 2011, pp. 29–31. 27. Satoh, I., ’Edge Data Processing’, in Proc. of the 30th WAINA, 2016. 28. Simon, M., Pataki, N., ’SQL Code Complexity Analysis’, Proc. of the 8th Int. Conf. of Applied Informatics, 2010. 29. Stefanidis, K., Drosou, M., Pitoura, E., ’You may also like results in relational databases’, in Proc. of PersDB, 2009. 30. Vashistha, A., Jain, S., ’Measuring Query Complexity in SQLShare Workload’, Proc. of the Int. Conf. on Management of Data, 2016. 31. Yang, X., Procopiuc, C. M., Srivastava, D., ’Recommending join queries via query log analysis’, in IEEE ICDE, 2009. 32. Yao, Q., An, A., Huang, X., ’Finding and analyzing database user sessions’, in Proc. DASFAA, 2005. 33. Yu, H., et al., ’Exact indexing for support vector machines’, in proc. of the 2011 ACM SIGMOD, 2011, pp. 709–720. 34. Zazhir, J., El Qadi, A., Bellatreche, L., ’Identifying SQL Queries Similarity Using SVM’, in Proc. of ICONIP, 2015. 35. Zeitler, E., Risch, T., ’Scalable Splitting of Massive Data Streams’, in DASFAA 2010, vol 5982, Springer, 2010. 36. Zhang, Z., et al., ’Boolean + ranking: Querying a database by k-constrained opti- mization’, in Proc. ACM SIGMOD, 2006.