=Paper= {{Paper |id=Vol-1810/KARS_paper_06 |storemode=property |title=eSPAK: Top-K Spatial Keyword Query Processing in Directed Road Networks |pdfUrl=https://ceur-ws.org/Vol-1810/KARS_paper_06.pdf |volume=Vol-1810 |authors=Muhammad Attique,Awais Khan,Tae-Sun Chung |dblpUrl=https://dblp.org/rec/conf/edbt/AttiqueKC17 }} ==eSPAK: Top-K Spatial Keyword Query Processing in Directed Road Networks== https://ceur-ws.org/Vol-1810/KARS_paper_06.pdf

eSPAK: Top-k Spatial Keyword Query Processing in
Directed Road Networks

Muhammad Attique Awais Khan Tae-Sun Chung∗
Computer Engineering, Computer Engineering, Computer Engineering,
Ajou University, Ajou University, Ajou University,
Suwon, South Korea Suwon, South Korea Suwon, South Korea
attique@ajou.ac.kr awais@ajou.ac.kr tschung@ajou.ac.kr

ABSTRACT to the query location and relevance to the query keywords.
Given a query location and a set of query keywords, a top-k Therefore, several algorithms have been proposed for pro-
spatial keyword query rank objects based on the distance cessing top-k spatial keyword queries in Euclidean space [3,
to the query location and textual relevance to the query 8]. Although few algorithms exist that study the keyword
keywords. Several solutions have been proposed for top-k queries in a road network, however, they all focus on the
spatial keyword queries in Euclidean space. However, few undirected road network. In this paper, for the first time,
algorithms study top-k keyword queries in undirected road we are investigating a top-k spatial keyword queries in di-
networks where every road segment is undirected. Even rected road networks which are more closely related to the
worse, insufficient attention has been given to the process- real world scenario.
ing of keyword queries in directed road networks where each Top-k keyword queries can be used for a wide range of
road segment has a particular orientation. Therefore, in applications in recommendation systems and decision sup-
this paper, we present an algorithm called eSPAK that can port systems. For example, a tourist may want to retrieve
efficiently answer the top-k spatial keyword queries in di- a sorted list of restaurants that serve Italian steak based
rected road networks. Our experimental results demonstrate on shortest distance from her location and textual relevance
that eSPAK significantly outperforms conventional solution to the query keywords. Given a set of data objects D =
in terms of query processing cost. d1 , d2 , ..., d|D| , query location and set of keywords, the top-k
spatial keyword query returns the best k data objects from
D according to their combined textual and spatial relevance
CCS Concepts to query.
•Information systems → Data management systems; Figure 1 presents an example of a directed road network
Query processing; where rectangles represents the data objects with a tex-
tual description, and the triangle represent the query lo-
Keywords cation. The number label on each edge indicates the dis-
tance between two adjacent objects e.g., dist(n1 , d1 ) = 1
spatial keyword queries, directed road networks, location-
and dist(d1 , n2 ) = 2. Consider a scenario where a tourist
based services
is interested in finding an “Italian restaurant”. If an undi-
rected road network is considered, the top-1 Italian restau-
1. INTRODUCTION rant is d6 . However, in directed road network the shortest
With the popularization of geo-tagged data (e.g., geo- path from q to d6 is (q → n3 → n7 → d6 ). Therefore, for
tagged photos, videos, check-ins, and text messages), many directed road network, top-1 result is d3 because it is closer
online location-based services such as Google Maps, Yahoo to query location than d6 . Now consider tourist is looking
Maps, and Bing Maps have started providing useful infor- for “Cafe bakery”, the data object d7 may score better than
mation via location-based queries [9, 2]. At the same time,
a textual description of the point of interests, e.g., hotels, n6 d4 (Chinese Restaurant) n5
shopping malls and tourist attractions, are easily accessible
1 2
on the web. These developments call for techniques that 1
1
efficiently process the top-k spatial keyword queries that re- d3 (Italian Restaurant)
d5 (Pub and Bar)
turn a ranked list of k best facilities based on their proximity
∗ 3 3
Corresponding Author.
n1 1 2 2 1 1 2 n4
n2 q n3
d1 (Grand Hotel) d2 (Cafe)
1 2
2017, Copyright is with the authors. Published in the Workshop d6 (Italian Restaurant)
Proceedings of the EDBT/ICDT 2017 Joint Conference (March d7 (Cafe and Bakery)
21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). 2 1
Distribution of this paper is permitted under the terms of the
Creative Commons license CC-by-nc-nd 4.0 n7

Figure 1: An illustration of directed road network.
the data object d1 because d7 (“Cafe and bakery”) is more the node set, edge set and edge distance matrix, respec-
textually relevant to query keywords than d2 (“Cafe”), and tively. Each edge is also assigned an orientation which is
the dist(q, d7 ) is just slightly bigger than dist(q, d2 ). either undirected or directed. The undirected edge is rep-
Top-k spatial keywords in directed road networks are use- resented by e = ns ne where ns and ne are adjacent nodes,
ful for location-based applications. However, the query pro- whereas directed edge represented as e = − n− → ←−−
s ne or e = ne ns .
cessing is costly, because it requires computing several short- Naturally, the arrow above the edge indicates associated di-
est paths while considering the orientation of road segments. rection. We refer ns as starting node and ne as ending node
To the best of our knowledge, this is the first attempt to of edge. For example in Figure 1, n6 is starting node of edge
study top-k spatial keyword queries in directed road net- −
n− → ←−−
6 n2 , whereas it is ending node for edge n6 n5 . The partic-
works. In this paper, we propose a methodology to rank the ular edge where query object is located is called the active
data objects based on the spatial and textual relevance. edge.
Below, we summarize our contributions:
3.2 Problem Formulation
• We propose an efficient indexing technique that indexes
the data objects in inverted files for processing top-k spa- Similar to previous studies [3, 10] we assume each data
tial keyword queries in directed road networks. object d ∈ D has a point location d.l in road network and a
• We present an algorithm eSPAK that exploits the indexing text description d.t. Given a query location q.l, set of key-
framework to effectively retrieve top-k results. words q.t and k number of data objects to return, the top-
• Finally, we conduct extensive experiments on a real road k spatial keyword query Qk is defined as Qk = (q.l, q.t, k),
network and demonstrate the superiority of our proposed which takes three arguments and returns best k data objects
algorithm over baseline approach. from D according to score that takes into consideration spa-
tial proximity and text relevance. The score ψ(d) of a data
object d is defined by the following equation:
2. RELATED WORK
Several approaches have been proposed for ranking spa- µ(d.t, q.t)
tial data objects. Harihran et al. [7] proposed an indexing ψ(d) = (1)
1 + α · λ(d.l, q.l)
structure KR*-tree by capturing the joint distribution of
keywords in space. Ian de Felipe et al. [4] proposed a data where λ(d.l, q.l) is the spatial relevance between d.l and
structure which combines an R-tree with text signatures. q.l, µ(d.t, q.t) is the textual relevance between d.t and q.t and
Each node of the R-tree exploits a signature to indicate the α is a positive real number that determines the importance
presence of keywords in the sub-tree of the node. How- of one measure over the other. For example, if only spatial
ever, both of these approaches only handles boolean keyword relevance is considered then α = 0, if more importance is
queries in Euclidean space. given to textual relevance then α > 1.
Top-k spatial keyword queries where data objects are ranked Spatial relevance (λ) is defined as the shortest distance be-
according to their combined textual and spatial relevance to tween data object d and q: λ(d.l, q.l) = dist(d.l, q.l). Thus,
keyword queries was first studied by Cong et al. [3] and Li dist(di .l, q.l) < dist(dj .l, q.l) indicates that data object di is
et al. [8]. Both studies [8] integrates location indexing and more spatially relevant to q than data object dj . The textual
text indexing to generate IR-trees. These studies process relevance (µ) can be computed using any popular informa-
top-k spatial keyword queries only in Euclidean space and tion retrieval model, such as cosine similarity or language
not suitable for processing top-k spatial preference queries model. In this study, we use the cosine similarity between
in road networks, where the distance between objects is de- d.t and q.t. The textual relevance is defined as:
termined by the shortest path connecting them.
Top-k spatial keyword queries in road networks was in- P
t∈q.t wt (d.t).wt (q.t)
troduced by Rocha et al. [10]. In particular, they pro- µ(d.t, q.t) = qP P (2)
2 2
posed three different indexing techniques (Base Indexing, t∈p.t [wt (d.t)] . t∈q.t [wt (q.t)]
Enhanced Indexing and Overlay Indexing) for processing
spatial keyword queries in road networks. Recently, Guo Here, the weight wt (d.t) represents the frequency of term
et al. [6] studied continuous top-k spatial keyword queries t in d.t, and the weight wt (q.t) describes the ratio of total
on road networks. They presented two methods for monitor- number of data objects in D to the number of data objects
ing moving queries in an incremental manner which reduces that contains t in their description. Higher µ means, the
the traversing of network edges. Different from [10, 6] in this higher textual relevance to query keywords.
study we consider top-k spatial keyword queries in directed
road networks where each road segment has a particular ori- 4. QUERY PROCESSING SYSTEM
entation.
In this section, we present our proposed query processing
system that indexes the data objects and prunes the irrele-
3. PRELIMINARIES vant edges for efficient query processing. In Section 4.1, we
Section 3.1 defines the terms and notations that are used discuss indexing framework and in Section 4.2, we present
in this paper. Section 3.2 formulates the problem using an an efficient keyword query processing algorithm (eSPAK).
example that illustrates the general results of top-k spatial
keyword queries. 4.1 Indexing Framework
We implemented the inverted file for indexing data ob-
3.1 Definition of Terms and Notations jects. The inverted file contains vocabulary and inverted
Road Network: A road network is represented by a weighted lists. The vocabulary keeps general information about each
directed graph G = (N, E, W ) where N, E and W denote term such as frequency of term which is helpful in comput-
ing the textual relevance of the data objects. The inverted Algorithm 1: eSPAK: Query Processing Algorithm.
list stores the data objects located on the edge − n− →
s ne that
1 Input: Top-k spatial keyword query QN = (q.l, q.t, k)
have a term t in their description. An inverted list is iden- 2 Output: Top-k data objects with highest score
tified by a key composed of (eid , tid ), eid and tid represents 3 Dc ← ∅ /*set of candidate data objects
edge id and term id, respectively. Each inverted file is a 4 max-heap Dk ← ∅ /*current Top-k set
set of inverted lists. The separate inverted list is used for 5 sk ← 0 /*k-th score in Dk
each term in the object description. Inverted list stores two 6 min-heap ← ∅
attributes for each data object: first, the distance between 7 explored ← ∅
data object and starting node dist(ns , di ); second, the sig- 8 min-heap.insert(q.l, edgeactive )
9 while min-heap 6= ∅ or (equ) do
nificance factor θ(ti , di ) of the term ti in the description of 10 (pa , edge)← min-heap.pop()
the data object. Note that the network distance between 11 if (pa , edge) ∈
/ explored then
two points in directed road networks is not symmetrical (i.e, 12 explored ← explored ∪ (pa , edge)
dist(ns , di ) 6= dist(di , ns )). Recall that the starting node is 13 Dc ← candsearch((eid , tid ), sk )
chosen according to orientation of edge such that direction 14 update Dk and sk
of edge is from node towards data object. In Figure 1, n3 15 end
16 else
is starting node for d7 . For bi-directional edges any of the 17 min-heap.push(adjacent node, edge)
adjacent node can act as starting node. 18 end
Furthermore, we develop a pruning technique to prune the 19 end
irrelevant edges. To achieve this, we introduce a highest sig- 20 return Dk
nificance factor (θt ) of term t in the description of objects
lying on the edge. The θt on an edge is retrieved by a combi- a road network presented in Figure 1. Assume, a query q
nation of eid and tid . The θt is an upper-bound significance generated a top-1 keyword query with q.d “Italian restau-
for an object on the edge with t. Naturally, the edges with rant”. For the ease of presentation, we assume α = 1 and the
θt smaller than the score of the k-th object found so far are textual relevance µ is the number of occurrence of query key-
pruned. words in d.t divided by the number of keywords in the doc-
µ(d4 .t,q.t)
The proposed indexing scheme has three main advantages. ument. For example, the ψ(d4 ) = 1+λ(d 4 .l,q.l)
= 0.5
8
= 0.06.
First, the object search relevant to query keywords is very The algorithm starts the network expansion from active edge
−
n− →
efficient using the (eid , tid ) pair. Second, inverted files also 2 n3 where q is the anchor point. Note that the direction
store the network distance between starting node and data of edge − n− →
2 n3 is from n2 to n3 . Therefore, algorithm only
object which helps in accessing the data object in the di- explore qn3 . There is no data object found in −
−→ qn→. Then, n
3 3
rected road network. Finally, the pruning technique allows
− −→
becomes anchor point and edges n3 n4 , n3 n5 and − n− →
3 n7 are
faster query processing by exploring fewer edges. inserted in min-heap. Next, candsearch function retrieves
the candidate data objects on edges − n− →
3 n4 , n2 n3 and n3 n7
−−→
4.2 eSPAK: Query Processing Algorithm whose score is better than sk . On edge n3 n5 data object d3
is retrieved with ψ(d3 ) = 0.2. The data object d3 is inserted
eSPAK traverses the road network incrementally in a sim-
in Dk set and the value of sk is set to 0.2. For edge − n− →
3 n4
ilar fashion to Dijkstra’s algorithm [5]. Algorithm 1 returns
and − n− → there is no candidate object found because d .t
n
3 7 2
top-k data objects with highest scores according to their
(“Cafe”) and d7 .t (“Cafe and Bakery”) does not match with
joint textual and spatial relevances to the query. The algo-
q.t. The algorithm continues expanding the edges whose
rithm begins by exploring the active edge where query object
upper-bound score is greater than sk . The edge − n− →
7 n2 is ex-
q is located and expands the network in an increasing or- − − → 1
plored next, the upper-bound score of n7 n2 is 7 which is less
der of distance from q. Each entry in the min-heap takes
than sk . Similarly, for edge ←n5−n−6 the upper-bound score is
the form (pa , edge), where pa indicates the anchor point in 0.5
8
< sk . Therefore, algorithm terminates reporting d3 as
the edge. For an active edge, q becomes the anchor point.
top-1 result.
Otherwise, for directed edges starting node ne becomes the
anchor point or for bi-directional edges either of the adja-
cent node, i.e., ns or ne becomes the anchor point. Let Dk 5. PERFORMANCE EVALUATION
be the current set of top-k data objects and sk be the score In this section, we evaluate the performance of eSPAK
of k-th data object in Dk . candsearch((eid , tid ), sk ) function through simulation experiments. We describe experiment
retrieves the candidate data objects Dc located in an edge settings in Section 5.1 and present experimental results in
with better score ψ(d) than sk . Next, the Dk set is updated Section 5.2.
with the data objects in Dc and so does sk . The algorithm
continues its expansion and inserts the adjacent edges of the 5.1 Experimental Settings
boundary node until the heap is exhausted or the remaining All of our experiments are performed using a real road net-
data objects cannot have the better score than sk . work [1] that comprised the main roads of North America,
The candsearch((eid , tid ), sk ) procedure finds the candi- with 175,812 nodes and 179,178 edges. Both the direction
date data objects in two steps. In first step, the upper-bound of edges and data points on edges are generated randomly.
score of edges is computed using a significance factor (θt ) of
Table 1: Experimental Parameter Settings
a term t ∈ q.t and the shortest distance sdist(ei , q.l) between
edge and the query location. In next step, the inverted lists
Parameter Range
of term t are fetched, if their upper-bound score is higher
Number of results (k ) 5, 10, 15, 20, 25
than sk . In inverted lists, the objects whose score ψ(d) is
greater than sk are returned. Number of data objects |D| 10, 20, 30, 40, 50 (x1000)
In order to give a feel for our proposed algorithm, consider Query parameter (α) 0.01, 0.1, 1, 10, 100
125 80 75
Baseline Baseline Baseline
eSPAK eSPAK eSPAK
100 60 60
Time (Sec)

Time (Sec)

Time (Sec)
75 45
40
50 30
25 20 15
0 0 0
5 10 15 20 25 10K 20K 30K 40K 50K 0.01 0.1 1 10 100
# of Results (K) # of Data Objects |D| Query Parameter (α)

Figure 2: Effect of k on query processing Figure 3: Effect of |D| on query processing Figure 4: Effect of α on query processing
time. time. time.

The description of each data object is extracted from twit- 7. ACKNOWLEDGMENTS
ter1 messages, we assigned one tweet per data object. As a We thank anonymous reviewers for their valuable com-
benchmark for eSPAK, we use a baseline method that com- ments and suggestions. This research was supported by
putes the score of every data object using the incremental Basic Science Research Program through the National Re-
network expansion [9]. Both the algorithms are implemented search Foundation of Korea (NRF) funded by the Ministry
in Java and executed on a desktop PC 2.80 GHz, Intel Core of Education(2016R1D1A1B03934129). Finally, this work
i5 with 8GB memory. In the experiments, we compared the was partially supported by Leaders Industry-University Co-
query processing time of both algorithms. Table 1 summa- operation Project.
rizes the parameters used in the experiments. In each ex-
periment, we vary a single parameter within the range that
is shown in Table 1 while keeping the other parameters at
8. REFERENCES
the bolded default values. [1] Real datasets for spatial databases.
http://www.cs.fsu.edu/˜lifeifei/SpatialDataset.htm.
5.2 Experimental Results [2] H.-J. Cho, K. Ryu, and T.-S. Chung. An efficient
algorithm for computing safe exit points of moving
Figure 2 shows the query processing time for eSPAK and
range queries in directed road networks. Information
baseline as a function of the number k of requested data ob-
Systems, 41:1–19, 2014.
jects with the highest score. The query processing time of
[3] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval
the baseline is nearly stable regardless of the k value because
of the top-k most relevant spatial web objects.
it always computes the score of each data object. However,
Proceedings of the VLDB Endowment, 2(1):337–348,
the query processing time of eSPAK increases slightly with
2009.
the k value. In Figure 3, we evaluate the performance of
eSPAK and baseline by varying the cardinality of data ob- [4] I. De Felipe, V. Hristidis, and N. Rishe. Keyword
jects. The query processing time of both the algorithms is search on spatial databases. In 2008 IEEE 24th
sensitive towards an increase in |D|. However, eSPAK scales International Conference on Data Engineering, pages
much better than baseline. Figure 4, demonstrates the im- 656–665. IEEE, 2008.
pact of query parameter α on query processing time. A small [5] E. W. Dijkstra. A note on two problems in connexion
value of α indicates more importance of textual relevance, with graphs. Numerische mathematik, 1(1):269–271,
whereas a high value of α gives more preference to the spa- 1959.
tial relevance. Experimental results reveal that α does not [6] L. Guo, J. Shao, H. H. Aung, and K.-L. Tan. Efficient
indicate a significant impact on the query processing time continuous top-k spatial keyword queries on road
of eSPAK and baseline. It is interesting to note that, both networks. GeoInformatica, 19(1):29–60, 2015.
approaches performs better for higher values of α, which in- [7] R. Hariharan, B. Hore, C. Li, and S. Mehrotra.
dicates more importance to spatial relevance. This is mainly Processing spatial-keyword (sk) queries in geographic
because, when spatial relevance is higher, fewer edges are re- information retrieval (gir) systems. In Scientific and
quired to explore to find the top-k data objects. Statistical Database Management, 2007. SSBDM’07.
19th International Conference on, pages 16–16. IEEE,
2007.
6. CONCLUSIONS [8] Z. Li, K. C. Lee, B. Zheng, W.-C. Lee, D. Lee, and
In this paper, we investigate top-k spatial keyword queries X. Wang. Ir-tree: An efficient index for geographic
in directed road networks. We presented an efficient index- document search. IEEE Transactions on Knowledge
ing framework using inverted files, that indexes the data and Data Engineering, 23(4):585–599, 2011.
objects on edges which allows effective searching of data ob- [9] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao.
jects relevant to query in term of both textual and spatial Query processing in spatial network databases. In
relevance. Furthermore, we present an algorithm for eval- Proceedings of the 29th international conference on
uating top-k spatial keyword queries. Finally, the experi- Very large data bases-Volume 29, pages 802–813.
mental evaluation conducted on real road networks demon- VLDB Endowment, 2003.
strates that eSPAK drastically reduces the query processing [10] J. B. Rocha-Junior and K. Nørvåg. Top-k spatial
time compared to baseline algorithm. keyword queries on road networks. In Proceedings of
the 15th international conference on extending
1 database technology, pages 168–179. ACM, 2012.
http://twitter.com