=Paper=
{{Paper
|id=Vol-1558/paper3
|storemode=property
|title=Which is Better for kNN Query Processing in the Cloud: Sequential or Parallel
|pdfUrl=https://ceur-ws.org/Vol-1558/paper3.pdf
|volume=Vol-1558
|authors=Chong Zhang,Xiaoying Chen,Bin Ge,Weidong Xiao
|dblpUrl=https://dblp.org/rec/conf/edbt/ZhangCGX16
}}
==Which is Better for kNN Query Processing in the Cloud: Sequential or Parallel==
Which is Better for kNN Query Processing in the Cloud: Sequential or Parallel∗ Chong Zhang1 , Xiaoying Chen2 , Bin Ge3 , Weidong Xiao4 ∗ Science and Technology on Information Systems Engineering Laboratory National University of Defense Technology, Changsha 410073, China + Collaborative Innovation Center of Geospatial Technology, China {1 leocheung8286, 2 chenxiaoying1991}@yahoo.com gebin1978@gmail.com, 4 wilsonshaw@vip.sina.com 3 ABSTRACT 1. INTRODUCTION With the development of various Cloud system, providing With the development of Cloud computing, various lay- powerful kNN query capability to DaaS (Database as a Ser- ers of computing resources are used in terms of pay-as-you- vice) is an essential requirement for many applications. In go, such as IaaS (Infrastructure as a Service), PaaS (Plat- this paper, we are interested in two opposite approaches for form as a Service) and SaaS (Software as a Service). Nowa- processing kNN query in Cloud system, parallel processing days, Database as a Service (DaaS)[2][13] is a hot topic for and sequential processing, and we want to explore the an- database community in Cloud computing era. For DaaS swer of which one performs better. For addressing such a users, it is not necessary to focus on the location of database question, we devise a new distributed indexing structure VI- instance, nor the physical storage mechanism of schema or HCO, which is characterized by fast locating Cloud nodes tables, not even data partition fashion or query processing, capability. Then parallel and sequential processing methods in one word, the inner of database is transparent to the are designed upon the structure. For parallel one, we take users. They just define the structure of table, and insertion, differential cells between two consecutive range queries into query or other operations seem similar to use a centralized consideration, and for sequential one, we elaborately design local database. However, it is possible for one table, data are an accurate message delivery algorithm. We verify our ideas spread over many computing nodes, and querying processing through experiments, which is conducted on both synthetic needs the collaboration of these nodes. And as data volume and real dataset, and the results show that VIHCO outper- increases, the database should be adaptive to the new scale forms a previous work RT-CAN, and the sequential method and new query requirement, i.e., it should be elastic. is more efficient under small k query condition and small In this paper, we focus on kNN query in DaaS, which is system size, while parallel one suits for large k and large an essential function for spatial database. Given a point scale of computing nodes. in the space, kNN query aims to find k nearest objects to the query point. This topic is addressed well in some previ- ous works[13][9][14], however, there are two opposite ideas Categories and Subject Descriptors to solve the problem in the state-of-the-art, namely, parallel H.2.4 [Database Management]: Systems - query processing processing and sequential processing. Parallel method ex- ploits the parallelism of Cloud nodes, and make them work simultaneously, while sequential one uses the vicinity rela- General Terms tionship between query point and Cloud nodes to accurately Algorithms, Measurement, Performance deliver query messages. Nevertheless, which is better for DaaS is not studied before. Hence, in this paper, we extend our work in [14] to acquire the answer. Keywords For comparing the two approaches, we use a previous work kNN query, Cloud, histogram, parallel, sequential RT-CAN as a baseline, and propose a new structure, called VIHCO (VIcinity-based Hilbert Cloud Overlay), to index ∗This work is supported by NSF of China grant 61303062 spatial data in Cloud system and to process kNN query. and 71331008. The feature of VIHCO is not only leveraged on fast look up routing table (finger table), but also highlighted on vicinity neighbors to quickly locate the nearby Cloud nodes. Based on such structure, we present the designs of parallel and sequential processing algorithms. Experiments on both syn- thetic and real dataset show that VIHCO outperforms RT- CAN, in efficiency and scalability, and the sequential method c 2016, Copyright is with the authors. Published in the Workshop Pro- is more proper under small k query condition and small sys- ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor- deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this tem size, while parallel one suits for large k and large scale paper is permitted under the terms of the Creative Commons license CC- of computing nodes. by-nc-nd 4.0 To summarize, we make the following contributions: 3. PROBLEM DEFINITION AND OVERLAY • We propose the problem of comparing performances STRUCTURE between parallel and sequential processing for kNN query in Cloud computing architecture. 3.1 Problem Definition In this paper, only two-dimensional space is considered. • We design an indexing structure VIHCO capable to A Cloud platform contains a set C of Cloud nodes, C={cni | fast locate Cloud nodes. 1 ≤ i ≤ n}, where cni denotes the i-th Cloud node with stand alone storage and computational capability. A set O • We devise parallel and sequential processing algorithms of spatial objects is spread over Cloud nodes C according to based on VIHCO, and conduct experiments to com- the underlying Cloud storing policy. And each object obj is pare their performances under different parameters. represented as (x, y, shp), where (x, y) denotes the location of obj, and shp is a set of points denoting the shape of obj. The rest of this paper is organized as follows. Section 2 If obj is a point object, shp is null. Otherwise, i.e., obj is reviews related works. Section 3 formally defines the prob- an object with spatial extent, an MBR (Minimal Bounding lem and presents VIHCO structure. Algorithms for parallel Rectangle) is used as the approximated shape of the obj, and sequential processing are presented in section 4 and 5, and (x, y) means the center of the MBR. respectively. And we experimentally evaluate VIHCO and A kNN query could be formally defined as: given a set acquire comparison answers in section 6. Finally, section 7 O of spatial objects, a kNN query (q=(xq , yq ), k), aims to concludes the paper with directions for future works. find a set Oq ⊆ O , such that |Oq |=k, and d(o, q) ≤ d(o0 , q), ∀o ∈ Oq , o0 ∈ O \ Oq , where d() is the Euclidean distance function. 2. RELATED WORKS A similar work RT-CAN solving multidimensional queries 3.2 Overlay Structure in Cloud system is proposed in [13]. Each peer firstly builds We propose an overlay structure called VIHCO (Vicinity- R-tree to index multidimensional data locally, and then se- based Hilbert Cloud Overlay), which takes advantage of lects the nodes in the level above the leaf level of R-tree to both linearization of Hilbert curve[5] and space vicinity. The publish to CAN by interacting with overlay node. RT-CAN overview of the structure is presented in Figure 1. addresses point query, range query and kNN query for multi- dimensional data in Cloud system. Later, the authors pro- key=60 posed a framework for supporting DBMS-like indexes in the cn5 Cloud[2], enabling users to define their own indexes without cn4 key=8 key=52 cn8 key=10 knowing the complicated structure of the underlying net- cn3 11 work, and three conventional indexes, namely hash indexes, 12 B+ -tree-like indexes and multi-dimensional indexes are im- key=48 cn2 key=14 cn1 plemented. The results of work [14] prove that our proposed structure DRISTIX is better than RT-CAN, and in this pa- 18 per, we extend our last work to exclusively study kNN query 42 cn6 processing. key=41 cn7 Several works from moving objects query and sensor net- key=24 26 work are also related to our work. DTI[6] is a distributed index for trajectory queries on moving objects. For each (a) (b) moving object, it builds a SkipList overlay to index trajec- tory. This method suffers from high overhead, because it Figure 1: Overview of VIHCO maintains complicated overlay connection when the number of moving objects is large. Hua Lu et al.[7] address con- The whole space is partitioned by Hilbert curve, in par- tinuous queries for monitoring moving objects constrained ticular, for a λ-order Hilbert, the space is divided into 22λ in road network in a distributed environment. However, cells, and each cell is attached a Hilbert value varied from they adopt a central server to control the whole distributed 0 to 22λ − 1. Initially, Cloud nodes are formed into over- queries, which is a bottle neck in real application. DIST[8] lay network using Chord protocol[10], a distributed hash- addresses tracking moving object in sensor network, it uses based (key-value) P2P network structure. Formally, a set quad-tree to divide space recursively and builds distributed of 2m consistent keys [0, 2m − 1] is maintained by a set of index hierarchically among sensors, and adopts efficient up- Cloud nodes, and each Cloud node is exclusively respon- date to reduce maintenance cost. However, due to lack of sible for maintaining a segment [keys , keye ] (0 ≤ keys ≤ scalability of tree structure, DIST suffers bottle neck prob- keye ≤ 2m − 1), each key of which is associated with data lem. items (value). According to the order of the segment main- A related work addresses multi-dimensional queries for tained in the key domain, each Cloud node connects to its PaaS, MD-HBase[9] uses k-d tree and quad-tree to partition predecessor node and successor node, in addition, it main- space and adopts Z-curve to convert multidimensional data tains a routing table called finger table containing m entries points into one dimension, and supports multi-dimensional where the i-th entry points the node maintaining the key range and nearest neighbor queries, which leverages a mul- keye +2i−1 . For the Cloud nodes maintaining the first and tidimensional index structure layered over HBase. However, the last segments of keys, they are also connected by each this work is limited in HBase, it could not be generalized to other as predecessor and successor, so the whole structure many Cloud system. looks like a ring. Figure 1(b) illustrates an example, node cn8 maintains key segment [9, 10], with connecting prede- est Cloud node to q, say cnq , is found, and then according to cessor cn4 and successor cn1 , for maintaining figure table, global Hilbert partition policy, cnq finds the set SCq of cells cn8 finds the nodes which maintain the key 10+20 , 10+21 , intersecting with the circle, after that, the cells with contigu- 10+22 , 10+23 , 10+24 , 10+25 , respectively, resulting cn1 , ous Hilbert values are formed into intervals, i.e., SCq ={[csi , cn7 , cn6 , cn2 . cei ] | si ≤ei }, and then cnq uses its vicinity neighbors to For simplicity, let 2λ = m, which means that each cell in search these cells, in particular, for each interval [csi , cei ], the space is mapped to a key on the ring, i.e., each Cloud if there is a cell ck being cnq ’s left, lower, right or upper node cni is responsible for maintaining a set of cells which cell, then cnq sends the interval to the corresponding vicinity are continuous in Hilbert value domain, and we call these neighbor, say cnv , to search the results, otherwise, i.e., there cells as cni ’s charge-cell s and cni is these cells’ charge-node. is no cell being cnq ’s neighbor cell, cnq uses its predecessor, The term charge here means, if an object obj spatially in- successor and/or finger table to find the proper charge-nodes tersects with cell cj , obj is maintained (stored and indexed) responsible for these cells to retrieve the results. Figure 2 by cj ’s charge-node. Figure 1 illustrates an example of VI- illustrates an example, assuming cn8 (key=10) is the nearest HCO, the whole space is partitioned by a 3-order Hilbert Cloud node to query point q, after obtaining radius r1 , cn8 (Figure 1(a)), and there are 8 Cloud nodes, namely, cn1 to finds SCq ={[6, 8], 11, [30, 32], [53, 54], 57}, note that cell cn8 , which connect to each other, forming a Chord (Figure 9 and 10 are cn8 ’s charge-cell s, so they are not contained 1(b)). Hence, for cn8 , whose key is 10, its charge-cell s are in SCq . Then, for each interval (or value) in SCq , say [6, cell 9 and 10, similarly, cn2 (key=48) is responsible from cell 8], cn8 examines that cell 6 is vicinity cell, so [6, 8] is sent 42 to 48. to cn4 , similarly, 11, [30, 32] and [53, 54] are sent to cn1 , Hilbert curve merely tries best to preserve locality of origi- cn6 and cn5 , respectively, and then cell 57 is routed through nal space, hence, some spatial relation is lost after lineariza- looking up finger table and is forwarded to cn5 . tion, e.g., in Figure 1, cell 10 and cell 53 are adjacent to After retrieving objects, cnq examines the number of re- each other in original space, but they are distinguished by 43 sults, if it is not less than k, then cnq sorts these objects in linear space, which consequently increases complexity of according to the distance to q and returns results, other- query processing. So we propose a vicinity-based approach wise, i.e., there are less than k objects, cnq increases r by δ, to improve performance. Each Cloud node not only main- a new range query (q, r+δ) is formed, and the intersecting tains finger table, but also connects to its vicinity neighbors, cells are also calculated and sent to corresponding charge- in particular, for each Cloud node cni , assuming the Chord nodes. Then the charge-nodes only search for the new com- key of it is keyi , then the vicinity neighbors are charge-nodes ing cells and return results to cnq , and again cnq examines of cell keyi ’s left, lower, right, upper cells, if the charge-node the number, and the above procedure is repeated until k of left (lower, right, upper) cell is cni itself, then continue objects are obtained. Continuing the example in Figure 2, searching the charge-node of left (lower, right, upper) of left when cn8 finds that the number of results is less than k, (lower, right, upper) cell, until the charge-node is not cni it issues a new range query with radius r2 , and the corre- itself. Take cn8 as an example in Figure 1(a), its Chord key sponding cells are sent to charge-nodes until k objects are is 10, and left, lower, right, upper cells of cn8 are 11, 9, 53, retrieved, the iteration is terminated. 31, but cell 9 is cn8 ’s charge-cell, so cell 6 is selected, and the vicinity neighbors are cn1 , cn4 , cn3 and cn6 . Algorithm 1 kNN Query Parallel Processing Input: 4. PARALLEL PROCESSING q=(xq , yq ), k In this section, parallel processing for kNN query is pre- Output: sented. The main workflow is leveraging a series of range Qlist //result list queries to accomplish kNN query. In particular, firstly, an 1: r = estimateR(k) initial radius r is calculated by k and estimation of spa- 2: cnq = f indN earestCN (q) tial objects distribution[1][11], then a range query (q, r) is 3: while true do formed, i.e., to find objects located in the circular range 4: SCq = cnq .f indIntersectingCells(q, r) centered at q, with the radius r. 5: for each interi ∈ SCq do 6: if (cnv =cnq .vN eighbor(interi ))6=null then key=60 7: cnq .sendTo(cnv , interi ) cn5 cn4 key=8 8: else key=52 cn8 key=10 9: cnq .forward(interi ) cn3 12 11 10: end if key=48 key=14 11: end for cn2 cn1 12: ret=cnq .receiveResults() r1 13: if |ret|≥k then q r2 18 14: return k nearest neighbors to q 42 cn6 key=41 15: else cn7 key=24 16: r = r+δ 26 17: end if (a) (b) 18: end while Figure 2: Parallel Processing Algorithm 1 presents our parallel processing for kNN query. Line 1 calculates the radius r, and then the nearest Cloud For processing such range query, first, the spatially near- node cnq to q is found in line 2. From line 3 to the end, range queries are issued, and the returned results are exam- query then replying are called responders. Next issue is that ined until k objects are obtained. Node cnq first finds the we consider both point objects and extent objects (the ob- intersecting cells SCq by (q, r) (line 4), and then for each jects with spatial shape), so for an extent object, it may interval in SCq , cnq uses vicinity neighbor (line 6 and 7) cross more than one cells, which means one object might be or finger table (line 8 and 9) to deliver the query messages, stored in different Cloud nodes, thus when query is sent to and then checks the number of returned results, until gets k different nodes, the results returned may contain duplicates, objects. hence the requester should detect duplicates and eliminate them. Another issue is that, when a responder cnj receives 5. SEQUENTIAL PROCESSING query message, it would be costly for cnj returning all ob- jects maintained by it, hence cnj should just return the ob- In this section, we present the sequential processing for jects with distance to q not larger than that of the current kNN query, which is a progressive procedure to obtain k ob- top element in the queue, however, such a fashion would lost jects. The difference is that, parallel processing is composed cnj forever, i.e., the rest objects in cnj have no chance to be of a series of range queries, with constantly extending radius scanned. So we propose a concept called map object, which of the circular range to retrieve results, while for sequential means this map object is just attached to the current results processing, the query is progressively delivered to the cor- returned to requester, but it is not a result for this time, it responding Cloud nodes, i.e., the Cloud nodes receive the might be a result in the following iterations, and its function query in an order according to the distance to the query is to lead the requester to send query to cnj if necessary. In point. Although for parallel processing, query messages are the following, we present the description for sequential pro- almost delivered to different Cloud nodes at the same time, cessing, assuming that, requester cnq issues a kNN query (q, and then nodes process local search simultaneously, the crit- k), the processing in requester and responders are described ical point in parallel processing is the estimation of range ra- below: dius r, i.e., with a large r, k results would be retrieved at one (1) Processing in Requester. A min-priority queue P Q is time, but some useless objects beyond k objects are also ob- used here to store three kinds of elements: b-MBR (bM), ob- tained, which promises waste of communication, while with ject (obj) and map object (mobj). The mobj is a structure a small r, some query messages would be repeatedly deliv- sent from responder, representing an obj stored in some re- ered to the same Cloud nodes, which also lower down the sponder, and is formatted as hdist, {CloudN odes}, rangei, query performance. The advantage of sequential process- where dist is the distance between q and the obj being rep- ing is that, without prior knowledge of k, the results are resented, and {CloudN odes} is a list of identifiers of Cloud incrementally obtained according to sorting function, with- nodes which are responsible for the key interval range. El- out wasting communication cost. For achieving sequential ements in P Q are sorted by distance, i.e., M IN DIST (q, process, we propose to use histogram. bM), distance between q and obj or mobj.dist. In order to 5.1 Design of Histogram break the tie that two elements have the same distance, we make a convention that obj is preferred to mobj, and mobj is We use histogram to assist sequential processing. Each preferred to bM. Initially, all bMs of cnq are enqueued into Cloud node maintains its own histogram, which is com- P Q, in addition, a counter c storing the number of glob- posed of several buckets. Each bucket is formatted with ally retrieved objs so far is set to zero. After that, the top hbid, range, numi, where bid identifies a bucket, and range element e of P Q is dequeued. denotes an interval marked with a starting value and an ending value of the Chord key domain, and num denotes • If e is a bM, cnq uses vicinity-connection or finger the number of items in range. Different buckets’ ranges table to route bM-request hq, ctopdist, intervali, to are non-overlapping with each other and all buckets’ ranges the charge-nodes (responders) of the cells in interval, constitute the key domain. where ctopdist is distance between current top element For each Hilbert cell, its charge-node calculates the MBR and q, and interval is the cell interval covered by e. (Minimum Bounding Rectangle) of objects intersecting with Then the thread for operating P Q is suspended until the cell, hence for each Cloud node, there is a set of MBRs cnq receives a reply from responders. The reply may corresponding to its charge-cell s. Relying on continuous contain both objs and mobj, and objs are added into maintaining routing tables of Cloud node[4], the MBRs are result list, and c is increased by the number of objs, able to be disseminated to other Cloud nodes, i.e., the MBR and then cnq examines that, if c is larger than or equal information is piggy-back on the routing table maintenance to k, meaning that k nearest objects are found, cnq in- message. Such a design would reduce the communication forms all responders to terminate, otherwise, i.e., c is cost, thus the whole performance would not suffer. Af- less than k, mobj is enqueued into P Q. ter constant message exchanging, each Cloud node is able to know the MBR for each Hilbert cell, and then for each • If e is an mobj, cnq sends mobj-request hq, e.dist, bucket in the histogram, a bucket MBR (b-MBR) is built, ctopdist, e.rangei to the responders e.{CloudN odes}. which summarizes all objects in the bucket. Similarly, P Q is blocked until cnq receives a reply and follows process above. 5.2 Processing Description • Otherwise, i.e., e is an obj, e is added to result list and The main tool in sequential processing is priority queue[3], c is increased by one. If c is equal to k, cnq informs all which is controlled by query issuing node, the node con- responders to terminate the processing. stantly fetches top element from the queue, and parses the element into query, and deliver the query to corresponding (2) Processing in Responder. When a responder cnj re- charge-nodes. To clearly specify such interaction, we call ceives a request, depending on the type of the request, two the issuing node as requester, and the nodes receiving the cases are discussed below: • If the request is a bM-request hq, ctopdist, intervali, Algorithm 2 kNN Query Sequential Processing for each obj stored in cnj , which is bounded in interval, Input: cnj calculates the distance between q and the obj. q=(xq , yq ), k Only the objs with the distance not larger than ctopdist Output: are added to the return list. To eliminate local dupli- Qlist //result list cates, we use a simple but effective method based on 1: Requester cnq ’s Process: the observation that, for a bM-request, the responders 2: P Q=φ, c=0 are always located in a contiguous key set and in the 3: cnq .enqueue(bMs, P Q) relationship of predecessor and successor. Assuming 4: while P Q 6= φ do that, three responders cn1 , cn2 and cn3 are located for 5: e=cnq .dequeue(P Q) a bM-request with the query key range [key1 , key2 ]. 6: if e is type of bM then In particular, [key1 , keya ) is maintained by cn1 , [keya , 7: cnq .deliver(hq, ctopdist, intervali) keyb ) is for cn2 and [keyb , key2 ] is for cn3 . After they 8: else if e is type of mobj then finds qualified objs, cn1 sends its results to its succes- 9: cnq .deliver(hq, e.dist, ctopdist, e.rangei) sor cn2 , and cn2 sends the results found by itself to 10: else cn3 as well as the ones from cn1 . Now, cn3 is able to 11: Qlist←e use the identifiers of objs to filter the duplicates among 12: end if their findings, then a refined result list is determined. 13: cnq updates Qlist and c when receives a reply After that, cn3 sends a reply to the requester, contain- 14: if c≥k then ing the refined results. To determine mobj, cn1 , cn2 15: cnq informs all related responders to terminate and cn3 follow the collaborative way above to find an 16: return Qlist object nobj (next object) which is the immediate one 17: end if to the last object in current result in ascending order 18: end while by distance. After that, cn3 sends a mobj-reply to 19: the requester, mobj=hnobj.distance, {cn1 , cn2 , cn3 }, 20: Responder cnj ’s Process: rangei. 21: msg=cnj .receive() 22: if msg is type of bM-request then • If the request is a mobj-request hq, dist, ctopdist, rangei, 23: hq, ctopdist, intervali=parse(msg) the responders search the objs with the distances be- 24: objlist=cnj .search&F ilter(q, ctopdist, interval) tween dist and ctopdist, whose associated keys are in 25: else range. Then similar to the above processing, the re- 26: hq, dist, ctopdist, rangei=parse(msg) sponders aggregate the results and one of them sends 27: objlist=cnj .search&F ilterM Obj(q, dist, ctopdist, a mobj-reply to the requester. range) We briefly describe sequential query processing in Algo- 28: end if rithm 2. From line 1 to line 18, the pseudo-codes show 29: objlist0 =duplicateEliminate(objlist) processing in requester, and the processing in responder is 30: cnj .return(objlist0 ) presented between line 20 and line 30. Figure 3 shows an example for kNN query sequential pro- cessing. Assume that cn6 in Figure 1 issues a 5-NN query mobj-request to cn4 with dist=5, ctopdist=7, and cn4 finds specified by a spatial query point q. We illustrate three near- obj5 and replies to cn6 , then c is set to 5, which means the est bMs, bM1 for the bucket with range [30, 32], containing processing is terminated. obj3 , obj4 , with distance 0 to q; bM2 for [8, 11], containing obj1 , obj5 , obj6 , with distance 2 to q; and bM3 for [52, 55], 6. EXPERIMENTAL EVALUATION containing obj1 , obj2 , with distance 3 to q. And the dis- We use two different datasets to evaluate our approaches. tances for obj1 ∼ obj6 to q are 3, 7, 1, 4, 5, 2. Initially, cn6 The first one is a synthetic dataset generated by GSTD[12] enqueues the three bMs, resulting hbM1 (0), bM2 (2), bM3 (3)i, in space domain [0, 1]2 and time domain [0, 1]. In particular, then bM1 is dequeued, because cn6 is the charge-node of cell 500,000 objects, each specified by a two-dimensional rectan- 30 to 32, it searches locally, and finds obj3 as the first result, gle, are initialized at t=0, with center coordinates in Gaus- and the mobj is obj4 , due to obj4 belonging to cn6 itself, so sian distribution (mean=0.5 and variance=0.1) and size uni- obj4 is directly added into P Q, now the queue is hbM2 (2), formly distributed in [0, 0.1]2 . As t evolves uniformly, the bM3 (3), obj4 (4)i. Next bM2 is dequeued, and cn6 sends a objects move from central part to the border of the space as bM-request with ctopdist=3 (bM3 ’s M IN DIST ) to the re- well as resizing themselves, which totally results in almost sponders cn4 , cn8 and cn1 for [8, 11], then the reply is obj6 , 100 million records. Figure 4 visualizes the synthetic dataset obj1 and mobj5 (obj5 is not a result for this iteration, be- (for the sake of legibility, only 5,000 objects are visualized). cause in5 ’s distance(=5) is larger than ctopdist(=3), it is The second dataset is a real one1 which records trajectories the immediate object following obj1 according to the dis- of taxis in Beijing from Nov. 1st to 3rd, 2013. In particular, tance to q), so cn6 adds obj6 and obj1 into result list, and each record in the dataset contains vehicle ID, geo-location, enqueues mobj5 , and increases c by 2, resulting 3. Similarly, recording time stamp, etc. The amount of the records in the cn6 sends bM-request to cn3 and cn5 which are responsi- dataset is about 60 million. ble for [52, 55], after searching, cn5 replies the result obj1 We implement VIHCO in Java and run it on a set of com- and mobj2 , because obj1 is a duplicate, it is discarded, and puting units. Each unit (Cloud node) is configured with mobj2 is enqueued. Then obj4 is inserted into result list, 1 c is set to 4. After that, cn6 processes mobj5 and sends a http://activity.datatang.com/20130830/description key=60 initial: cn5 < bM1(0), bM2(2), bM3(3) > cn4 key=8 result: null key=52 cn8 key=10 < bM2(2), bM3(3), obj4(4) > cn3 11 bM1 12 result: obj3 q obj3 key=48 key=14 < bM3(3), obj4(4), mobj5(5) > obj4 bM3 cn2 cn1 result: obj3, obj6, obj1 obj6 obj1 < obj4(4), mobj5(5), mobj2(7) > 18 obj5 result: obj3, obj6, obj1 42 bM2 obj2 cn6 key=41 < mobj5(5), mobj2(7) > cn7 result: obj3, obj6, obj1, obj4 key=24 26 < mobj2(7) > result: obj3, obj6, obj1, obj4, obj5 Figure 3: Sequential Processing Next, we can see the comparison results from Figure 5(b) under different system sizes. A small size gives better per- formance to sequential processing and a large size gives to parallel one. The reason is similar to the previous one. (a) t=0 (b) t=0.25 (c) t=0.5 (d) t=0.75 (e) t=1 7 0 0 0 6 5 0 0 Figure 4: Visualization of Synthetic Dataset P a r a lle l 6 0 0 0 P a r a lle l 6 0 0 0 S e q u e n tia l S e q u e n tia l R T -C A N R T -C A N 5 5 0 0 Q u e ry T h ro u g h p u t 5 0 0 0 Q u e ry T h ro u g h p u t 5 0 0 0 Intel(R) Core2 Quad 2.5GHz processor, 2GB memory and 4 0 0 0 4 5 0 0 500GB disk. The number of nodes varies from 4 to 16. 3 0 0 0 4 0 0 0 And they are connected via 1Gbps network links. We use 2 0 0 0 3 5 0 0 RT-CAN for comparison and query throughput (number of queries processed per second) as the metric in the exper- 1 0 0 0 3 0 0 0 iments. And we vary k and network size to measure the 4 8 1 6 3 2 6 4 4 8 1 2 1 6 k C lo u d N o d e C o u n t performance, which are 16 and 8 as default, respectively. (a) Effect of k (b) Effect of system size 6.1 Results on Synthetic Dataset Figure 5: Results on Synthetic Dataset Figure 5 shows the performances of three query processing methods on synthetic dataset. When we increase k from 4 to 64, both performances degrade, which can be explained that a larger k involves more Cloud nodes to inspect distances 6.2 Results on Real Dataset from objects to the query point, hence messages and com- Figure 6 shows the results on real dataset. The advantage parisons are raised. However, we can see from Figure 5(a), of VIHCO and our query processing methods are definitely RT-CAN degrades faster than our two processing methods, revealed in the results, where RT-CAN deteriorates more this is due to: first, VIHCO uses not only long-links (fin- seriously with k increased. A detail observation is that the ger table) to find charge-nodes, but also vicinity neighbors throughput of RT-CAN is only about 1,000 when k is equal to quickly locate nearby nodes, second, our parallel process- to 64, while our methods are above 4,000. We can see VI- ing method takes differential cells between two consecutive HCO is more suitable for the skewed dataset than RT-CAN range queries into consideration, which reduces communica- for kNN query. tion cost, on the other hand, our sequential method is able to elaborately forward query to the related Cloud nodes, while 8 0 0 0 7 5 0 0 RT-CAN does not take these details into consideration. 7 0 0 0 P a r a lle l S e q u e n tia l 7 0 0 0 P a r a lle l S e q u e n tia l For comparing parallel and sequential method, we can see 6 0 0 0 R T -C A N 6 5 0 0 R T -C A N from Figure 5(a), when k is small, sequential processing out- Q u e ry T h ro u g h p u t Q u e ry T h ro u g h p u t 6 0 0 0 5 0 0 0 performs parallel one, while for a large k, the result is oppo- 5 5 0 0 4 0 0 0 site. This can be explained that when k is small, there are 5 0 0 0 fewer objects to be scanned, and parallel processing does not 3 0 0 0 4 5 0 0 exploit parallelism of Cloud nodes sufficiently, on the other 2 0 0 0 4 0 0 0 hand, sequential method is able to deliver messages accu- 1 0 0 0 3 5 0 0 rately, any no-related node would not be contacted, hence 4 8 1 6 3 2 6 4 4 8 1 2 1 6 k C lo u d N o d e C o u n t the communication cost is reduced. While with a large k, the disadvantage of sequential method increases, i.e., Cloud (a) Effect of k (b) Effect of system size nodes receive query messages one by one, and for parallel Figure 6: Results on Real Dataset method, the query range is enlarged, more nodes are able to search results simultaneously, the performance is raised. 7. CONCLUSIONS infrastructure for location aware services. In Mobile With the increasing of DaaS (Database as a Service) re- Data Management (MDM), 2011 12th IEEE quirement, kNN query processing methods would be paid International Conference on, volume 1, pages 7–16. more attention by both academic community and industrial IEEE, 2011. circles. In this paper, we propose an interesting problem, [10] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and which is with better performance for Cloud computing, par- H. Balakrishnan. Chord: A scalable peer-to-peer allel method or sequential one. For answering such a ques- lookup service for internet applications. ACM tion, we first devise an distributed indexing structure VI- SIGCOMM Computer Communication Review, HCO for constructing overlay network, featured by vicinity- 31(4):149–160, 2001. connection which is capable to fast locate the destinations. [11] Y. Tao, J. Zhang, D. Papadias, and N. Mamoulis. An Upon VIHCO, we present the algorithms of parallel and se- efficient cost model for optimization of nearest quential processing, and explore the answers through exper- neighbor search in low and medium dimensional iments both on synthetic and real dataset, and the results spaces. Knowledge and Data Engineering, IEEE show that our VIHCO is better than RT-CAN, and the se- Transactions on, 16(10):1169–1184, 2004. quential method is more efficient under small k query condi- [12] Y. Theodoridis, J. R. Silva, and M. A. Nascimento. On tion and small system size, while parallel one suits for large the generation of spatiotemporal datasets. In Advances k and large scale of computing nodes. in Spatial Databases, pages 147–164. Springer, 1999. In the future, we plan to extend our work to in high di- [13] J. Wang, S. Wu, H. Gao, J. Li, and B. C. Ooi. mensional space and road network space. Indexing multi-dimensional data in a cloud system. In Proceedings of the 2010 ACM SIGMOD International 8. ACKNOWLEDGMENTS Conference on Management of data, pages 591–602. This work is supported by NSF of China grant 61303062 ACM, 2010. and 71331008. We would like to thank Prof. Dai and Dr. [14] C. Zhang, X. Chen, B. Ge, and W. Xiao. Indexing Hu for helping with the proof. historical spatio-temporal data in the cloud. In Big Data (Big Data), 2015 IEEE International Conference 9. REFERENCES on. IEEE, 2015. [1] S. Berchtold, C. Böhm, D. A. Keim, and H.-P. Kriegel. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 78–86. ACM, 1997. [2] G. Chen, H. T. Vo, S. Wu, B. C. Ooi, and M. T. Özsu. A framework for supporting dbms-like indexes in the cloud. Proceedings of the VLDB Endowment, 4(11):702–713, 2011. [3] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Transactions on Database Systems (TODS), 24(2):265–318, 1999. [4] M. Jelasity, S. Voulgaris, R. Guerraoui, A.-M. Kermarrec, and M. Van Steen. Gossip-based peer sampling. ACM Transactions on Computer Systems (TOCS), 25(3):8, 2007. [5] I. Kamel and C. Faloutsos. Hilbert r-tree: An improved r-tree using fractals. 1993. [6] R. Lange, F. Dürr, and K. Rothermel. Scalable processing of trajectory-based queries in space-partitioned moving objects databases. In Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems, page 31. ACM, 2008. [7] H. Lu, Z. Huang, C. S. Jensen, and L. Xu. Distributed, concurrent range monitoring of spatial-network constrained mobile objects. In Advances in Spatial and Temporal Databases, pages 403–422. Springer, 2007. [8] A. Meka and A. Singh. Dist: a distributed spatio-temporal index structure for sensor networks. In Proceedings of the 14th ACM international conference on Information and knowledge management, pages 139–146. ACM, 2005. [9] S. Nishimura, S. Das, D. Agrawal, and A. E. Abbadi. Md-hbase: a scalable multi-dimensional data