=Paper= {{Paper |id=Vol-1810/KARS_paper_06 |storemode=property |title=eSPAK: Top-K Spatial Keyword Query Processing in Directed Road Networks |pdfUrl=https://ceur-ws.org/Vol-1810/KARS_paper_06.pdf |volume=Vol-1810 |authors=Muhammad Attique,Awais Khan,Tae-Sun Chung |dblpUrl=https://dblp.org/rec/conf/edbt/AttiqueKC17 }} ==eSPAK: Top-K Spatial Keyword Query Processing in Directed Road Networks== https://ceur-ws.org/Vol-1810/KARS_paper_06.pdf
          eSPAK: Top-k Spatial Keyword Query Processing in
                      Directed Road Networks

              Muhammad Attique                            Awais Khan                                        Tae-Sun Chung∗
               Computer Engineering,                 Computer Engineering,                          Computer Engineering,
                  Ajou University,                      Ajou University,                               Ajou University,
                Suwon, South Korea                    Suwon, South Korea                             Suwon, South Korea
               attique@ajou.ac.kr                     awais@ajou.ac.kr                              tschung@ajou.ac.kr

ABSTRACT                                                           to the query location and relevance to the query keywords.
Given a query location and a set of query keywords, a top-k        Therefore, several algorithms have been proposed for pro-
spatial keyword query rank objects based on the distance           cessing top-k spatial keyword queries in Euclidean space [3,
to the query location and textual relevance to the query           8]. Although few algorithms exist that study the keyword
keywords. Several solutions have been proposed for top-k           queries in a road network, however, they all focus on the
spatial keyword queries in Euclidean space. However, few           undirected road network. In this paper, for the first time,
algorithms study top-k keyword queries in undirected road          we are investigating a top-k spatial keyword queries in di-
networks where every road segment is undirected. Even              rected road networks which are more closely related to the
worse, insufficient attention has been given to the process-       real world scenario.
ing of keyword queries in directed road networks where each           Top-k keyword queries can be used for a wide range of
road segment has a particular orientation. Therefore, in           applications in recommendation systems and decision sup-
this paper, we present an algorithm called eSPAK that can          port systems. For example, a tourist may want to retrieve
efficiently answer the top-k spatial keyword queries in di-        a sorted list of restaurants that serve Italian steak based
rected road networks. Our experimental results demonstrate         on shortest distance from her location and textual relevance
that eSPAK significantly outperforms conventional solution         to the query keywords. Given a set of data objects D =
in terms of query processing cost.                                 d1 , d2 , ..., d|D| , query location and set of keywords, the top-k
                                                                   spatial keyword query returns the best k data objects from
                                                                   D according to their combined textual and spatial relevance
CCS Concepts                                                       to query.
•Information systems → Data management systems;                       Figure 1 presents an example of a directed road network
Query processing;                                                  where rectangles represents the data objects with a tex-
                                                                   tual description, and the triangle represent the query lo-
Keywords                                                           cation. The number label on each edge indicates the dis-
                                                                   tance between two adjacent objects e.g., dist(n1 , d1 ) = 1
spatial keyword queries, directed road networks, location-
                                                                   and dist(d1 , n2 ) = 2. Consider a scenario where a tourist
based services
                                                                   is interested in finding an “Italian restaurant”. If an undi-
                                                                   rected road network is considered, the top-1 Italian restau-
1.     INTRODUCTION                                                rant is d6 . However, in directed road network the shortest
   With the popularization of geo-tagged data (e.g., geo-          path from q to d6 is (q → n3 → n7 → d6 ). Therefore, for
tagged photos, videos, check-ins, and text messages), many         directed road network, top-1 result is d3 because it is closer
online location-based services such as Google Maps, Yahoo          to query location than d6 . Now consider tourist is looking
Maps, and Bing Maps have started providing useful infor-           for “Cafe bakery”, the data object d7 may score better than
mation via location-based queries [9, 2]. At the same time,
a textual description of the point of interests, e.g., hotels,                                     n6       d4 (Chinese Restaurant)             n5
shopping malls and tourist attractions, are easily accessible
                                                                                                             1                      2
on the web. These developments call for techniques that                                                                                      1
                                                                                                        1
efficiently process the top-k spatial keyword queries that re-                                                                                       d3 (Italian Restaurant)
                                                                                d5 (Pub and Bar)
turn a ranked list of k best facilities based on their proximity
∗                                                                                                       3                                   3
    Corresponding Author.
                                                                     n1     1            2                       2                  1                     1           2   n4
                                                                                              n2                            q                        n3
                                                                          d1 (Grand Hotel)                                                                d2 (Cafe)
                                                                                                    1                                       2
2017, Copyright is with the authors. Published in the Workshop                  d6 (Italian Restaurant)
Proceedings of the EDBT/ICDT 2017 Joint Conference (March                                                                               d7 (Cafe and Bakery)
21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073).                                                        2              1
Distribution of this paper is permitted under the terms of the
Creative Commons license CC-by-nc-nd 4.0                                                                               n7

                                                                          Figure 1: An illustration of directed road network.
the data object d1 because d7 (“Cafe and bakery”) is more         the node set, edge set and edge distance matrix, respec-
textually relevant to query keywords than d2 (“Cafe”), and        tively. Each edge is also assigned an orientation which is
the dist(q, d7 ) is just slightly bigger than dist(q, d2 ).       either undirected or directed. The undirected edge is rep-
  Top-k spatial keywords in directed road networks are use-       resented by e = ns ne where ns and ne are adjacent nodes,
ful for location-based applications. However, the query pro-      whereas directed edge represented as e = −  n− →         ←−−
                                                                                                               s ne or e = ne ns .
cessing is costly, because it requires computing several short-   Naturally, the arrow above the edge indicates associated di-
est paths while considering the orientation of road segments.     rection. We refer ns as starting node and ne as ending node
To the best of our knowledge, this is the first attempt to        of edge. For example in Figure 1, n6 is starting node of edge
study top-k spatial keyword queries in directed road net-         −
                                                                  n−  →                                       ←−−
                                                                    6 n2 , whereas it is ending node for edge n6 n5 . The partic-
works. In this paper, we propose a methodology to rank the        ular edge where query object is located is called the active
data objects based on the spatial and textual relevance.          edge.
  Below, we summarize our contributions:
                                                                  3.2     Problem Formulation
• We propose an efficient indexing technique that indexes
  the data objects in inverted files for processing top-k spa-       Similar to previous studies [3, 10] we assume each data
  tial keyword queries in directed road networks.                 object d ∈ D has a point location d.l in road network and a
• We present an algorithm eSPAK that exploits the indexing        text description d.t. Given a query location q.l, set of key-
  framework to effectively retrieve top-k results.                words q.t and k number of data objects to return, the top-
• Finally, we conduct extensive experiments on a real road        k spatial keyword query Qk is defined as Qk = (q.l, q.t, k),
  network and demonstrate the superiority of our proposed         which takes three arguments and returns best k data objects
  algorithm over baseline approach.                               from D according to score that takes into consideration spa-
                                                                  tial proximity and text relevance. The score ψ(d) of a data
                                                                  object d is defined by the following equation:
2.    RELATED WORK
   Several approaches have been proposed for ranking spa-                                         µ(d.t, q.t)
tial data objects. Harihran et al. [7] proposed an indexing                           ψ(d) =                                            (1)
                                                                                               1 + α · λ(d.l, q.l)
structure KR*-tree by capturing the joint distribution of
keywords in space. Ian de Felipe et al. [4] proposed a data          where λ(d.l, q.l) is the spatial relevance between d.l and
structure which combines an R-tree with text signatures.          q.l, µ(d.t, q.t) is the textual relevance between d.t and q.t and
Each node of the R-tree exploits a signature to indicate the      α is a positive real number that determines the importance
presence of keywords in the sub-tree of the node. How-            of one measure over the other. For example, if only spatial
ever, both of these approaches only handles boolean keyword       relevance is considered then α = 0, if more importance is
queries in Euclidean space.                                       given to textual relevance then α > 1.
   Top-k spatial keyword queries where data objects are ranked       Spatial relevance (λ) is defined as the shortest distance be-
according to their combined textual and spatial relevance to      tween data object d and q: λ(d.l, q.l) = dist(d.l, q.l). Thus,
keyword queries was first studied by Cong et al. [3] and Li       dist(di .l, q.l) < dist(dj .l, q.l) indicates that data object di is
et al. [8]. Both studies [8] integrates location indexing and     more spatially relevant to q than data object dj . The textual
text indexing to generate IR-trees. These studies process         relevance (µ) can be computed using any popular informa-
top-k spatial keyword queries only in Euclidean space and         tion retrieval model, such as cosine similarity or language
not suitable for processing top-k spatial preference queries      model. In this study, we use the cosine similarity between
in road networks, where the distance between objects is de-       d.t and q.t. The textual relevance is defined as:
termined by the shortest path connecting them.
   Top-k spatial keyword queries in road networks was in-                                    P
                                                                                                 t∈q.t wt (d.t).wt (q.t)
troduced by Rocha et al. [10]. In particular, they pro-                 µ(d.t, q.t) = qP                        P                       (2)
                                                                                                           2                        2
posed three different indexing techniques (Base Indexing,                                  t∈p.t [wt (d.t)] .       t∈q.t [wt (q.t)]
Enhanced Indexing and Overlay Indexing) for processing
spatial keyword queries in road networks. Recently, Guo              Here, the weight wt (d.t) represents the frequency of term
et al. [6] studied continuous top-k spatial keyword queries       t in d.t, and the weight wt (q.t) describes the ratio of total
on road networks. They presented two methods for monitor-         number of data objects in D to the number of data objects
ing moving queries in an incremental manner which reduces         that contains t in their description. Higher µ means, the
the traversing of network edges. Different from [10, 6] in this   higher textual relevance to query keywords.
study we consider top-k spatial keyword queries in directed
road networks where each road segment has a particular ori-       4.    QUERY PROCESSING SYSTEM
entation.
                                                                    In this section, we present our proposed query processing
                                                                  system that indexes the data objects and prunes the irrele-
3.    PRELIMINARIES                                               vant edges for efficient query processing. In Section 4.1, we
  Section 3.1 defines the terms and notations that are used       discuss indexing framework and in Section 4.2, we present
in this paper. Section 3.2 formulates the problem using an        an efficient keyword query processing algorithm (eSPAK).
example that illustrates the general results of top-k spatial
keyword queries.                                                  4.1     Indexing Framework
                                                                     We implemented the inverted file for indexing data ob-
3.1    Definition of Terms and Notations                          jects. The inverted file contains vocabulary and inverted
  Road Network: A road network is represented by a weighted       lists. The vocabulary keeps general information about each
directed graph G = (N, E, W ) where N, E and W denote             term such as frequency of term which is helpful in comput-
ing the textual relevance of the data objects. The inverted              Algorithm 1: eSPAK: Query Processing Algorithm.
list stores the data objects located on the edge −       n− →
                                                          s ne that
                                                                       1 Input: Top-k spatial keyword query QN = (q.l, q.t, k)
have a term t in their description. An inverted list is iden-          2 Output: Top-k data objects with highest score
tified by a key composed of (eid , tid ), eid and tid represents       3 Dc ← ∅ /*set of candidate data objects
edge id and term id, respectively. Each inverted file is a             4 max-heap Dk ← ∅ /*current Top-k set
set of inverted lists. The separate inverted list is used for          5 sk ← 0 /*k-th score in Dk
each term in the object description. Inverted list stores two          6 min-heap ← ∅
attributes for each data object: first, the distance between           7 explored ← ∅
data object and starting node dist(ns , di ); second, the sig-         8 min-heap.insert(q.l, edgeactive )
                                                                       9 while min-heap 6= ∅ or (equ) do
nificance factor θ(ti , di ) of the term ti in the description of     10     (pa , edge)← min-heap.pop()
the data object. Note that the network distance between               11     if (pa , edge) ∈
                                                                                            / explored then
two points in directed road networks is not symmetrical (i.e,         12          explored ← explored ∪ (pa , edge)
dist(ns , di ) 6= dist(di , ns )). Recall that the starting node is   13          Dc ← candsearch((eid , tid ), sk )
chosen according to orientation of edge such that direction           14          update Dk and sk
of edge is from node towards data object. In Figure 1, n3             15     end
                                                                      16     else
is starting node for d7 . For bi-directional edges any of the         17          min-heap.push(adjacent node, edge)
adjacent node can act as starting node.                               18     end
   Furthermore, we develop a pruning technique to prune the           19 end
irrelevant edges. To achieve this, we introduce a highest sig-        20 return Dk
nificance factor (θt ) of term t in the description of objects
lying on the edge. The θt on an edge is retrieved by a combi-          a road network presented in Figure 1. Assume, a query q
nation of eid and tid . The θt is an upper-bound significance          generated a top-1 keyword query with q.d “Italian restau-
for an object on the edge with t. Naturally, the edges with            rant”. For the ease of presentation, we assume α = 1 and the
θt smaller than the score of the k-th object found so far are          textual relevance µ is the number of occurrence of query key-
pruned.                                                                words in d.t divided by the number of keywords in the doc-
                                                                                                              µ(d4 .t,q.t)
   The proposed indexing scheme has three main advantages.             ument. For example, the ψ(d4 ) = 1+λ(d       4 .l,q.l)
                                                                                                                               = 0.5
                                                                                                                                   8
                                                                                                                                      = 0.06.
First, the object search relevant to query keywords is very            The algorithm starts the network expansion from active edge
                                                                       −
                                                                       n−  →
efficient using the (eid , tid ) pair. Second, inverted files also       2 n3 where q is the anchor point. Note that the direction
store the network distance between starting node and data              of edge − n−  →
                                                                                   2 n3 is from n2 to n3 . Therefore, algorithm only
object which helps in accessing the data object in the di-             explore qn3 . There is no data object found in −
                                                                                 −→                                           qn→. Then, n
                                                                                                                                 3            3
rected road network. Finally, the pruning technique allows
                                                                                                            − −→
                                                                       becomes anchor point and edges n3 n4 , n3 n5 and −            n− →
                                                                                                                                      3 n7 are
faster query processing by exploring fewer edges.                      inserted in min-heap. Next, candsearch function retrieves
                                                                       the candidate data objects on edges −    n−   →
                                                                                                                  3 n4 , n2 n3 and n3 n7
                                                                                                                                         −−→
4.2    eSPAK: Query Processing Algorithm                               whose score is better than sk . On edge n3 n5 data object d3
                                                                       is retrieved with ψ(d3 ) = 0.2. The data object d3 is inserted
   eSPAK traverses the road network incrementally in a sim-
                                                                       in Dk set and the value of sk is set to 0.2. For edge −           n−  →
                                                                                                                                           3 n4
ilar fashion to Dijkstra’s algorithm [5]. Algorithm 1 returns
                                                                       and − n− → there is no candidate object found because d .t
                                                                                n
                                                                              3 7                                                           2
top-k data objects with highest scores according to their
                                                                       (“Cafe”) and d7 .t (“Cafe and Bakery”) does not match with
joint textual and spatial relevances to the query. The algo-
                                                                       q.t. The algorithm continues expanding the edges whose
rithm begins by exploring the active edge where query object
                                                                       upper-bound score is greater than sk . The edge −          n− →
                                                                                                                                   7 n2 is ex-
q is located and expands the network in an increasing or-                                                       − −  →        1
                                                                       plored next, the upper-bound score of n7 n2 is 7 which is less
der of distance from q. Each entry in the min-heap takes
                                                                       than sk . Similarly, for edge ←n5−n−6 the upper-bound score is
the form (pa , edge), where pa indicates the anchor point in           0.5
                                                                        8
                                                                            < sk . Therefore, algorithm terminates reporting d3 as
the edge. For an active edge, q becomes the anchor point.
                                                                       top-1 result.
Otherwise, for directed edges starting node ne becomes the
anchor point or for bi-directional edges either of the adja-
cent node, i.e., ns or ne becomes the anchor point. Let Dk             5.    PERFORMANCE EVALUATION
be the current set of top-k data objects and sk be the score             In this section, we evaluate the performance of eSPAK
of k-th data object in Dk . candsearch((eid , tid ), sk ) function     through simulation experiments. We describe experiment
retrieves the candidate data objects Dc located in an edge             settings in Section 5.1 and present experimental results in
with better score ψ(d) than sk . Next, the Dk set is updated           Section 5.2.
with the data objects in Dc and so does sk . The algorithm
continues its expansion and inserts the adjacent edges of the          5.1     Experimental Settings
boundary node until the heap is exhausted or the remaining               All of our experiments are performed using a real road net-
data objects cannot have the better score than sk .                    work [1] that comprised the main roads of North America,
   The candsearch((eid , tid ), sk ) procedure finds the candi-        with 175,812 nodes and 179,178 edges. Both the direction
date data objects in two steps. In first step, the upper-bound         of edges and data points on edges are generated randomly.
score of edges is computed using a significance factor (θt ) of
                                                                                 Table 1: Experimental Parameter Settings
a term t ∈ q.t and the shortest distance sdist(ei , q.l) between
edge and the query location. In next step, the inverted lists
                                                                                Parameter                              Range
of term t are fetched, if their upper-bound score is higher
                                                                           Number of results (k )                 5, 10, 15, 20, 25
than sk . In inverted lists, the objects whose score ψ(d) is
greater than sk are returned.                                            Number of data objects |D|          10, 20, 30, 40, 50 (x1000)
   In order to give a feel for our proposed algorithm, consider            Query parameter (α)                  0.01, 0.1, 1, 10, 100
                 125                                                     80                                                     75
                           Baseline                                             Baseline                                                Baseline
                            eSPAK                                                eSPAK                                                   eSPAK
                 100                                                     60                                                     60
    Time (Sec)




                                                            Time (Sec)




                                                                                                                   Time (Sec)
                 75                                                                                                             45
                                                                         40
                 50                                                                                                             30
                 25                                                      20                                                     15
                  0                                                      0                                                      0
                       5         10     15        20   25                     10K     20K     30K     40K   50K                      0.01    0.1      1      10   100
                                 # of Results (K)                                   # of Data Objects |D|                                   Query Parameter (α)

    Figure 2: Effect of k on query processing               Figure 3: Effect of |D| on query processing           Figure 4: Effect of α on query processing
    time.                                                   time.                                                 time.

The description of each data object is extracted from twit-                                     7.     ACKNOWLEDGMENTS
ter1 messages, we assigned one tweet per data object. As a                                        We thank anonymous reviewers for their valuable com-
benchmark for eSPAK, we use a baseline method that com-                                         ments and suggestions. This research was supported by
putes the score of every data object using the incremental                                      Basic Science Research Program through the National Re-
network expansion [9]. Both the algorithms are implemented                                      search Foundation of Korea (NRF) funded by the Ministry
in Java and executed on a desktop PC 2.80 GHz, Intel Core                                       of Education(2016R1D1A1B03934129). Finally, this work
i5 with 8GB memory. In the experiments, we compared the                                         was partially supported by Leaders Industry-University Co-
query processing time of both algorithms. Table 1 summa-                                        operation Project.
rizes the parameters used in the experiments. In each ex-
periment, we vary a single parameter within the range that
is shown in Table 1 while keeping the other parameters at
                                                                                                8.     REFERENCES
the bolded default values.                                                                       [1] Real datasets for spatial databases.
                                                                                                     http://www.cs.fsu.edu/˜lifeifei/SpatialDataset.htm.
5.2                Experimental Results                                                          [2] H.-J. Cho, K. Ryu, and T.-S. Chung. An efficient
                                                                                                     algorithm for computing safe exit points of moving
   Figure 2 shows the query processing time for eSPAK and
                                                                                                     range queries in directed road networks. Information
baseline as a function of the number k of requested data ob-
                                                                                                     Systems, 41:1–19, 2014.
jects with the highest score. The query processing time of
                                                                                                 [3] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval
the baseline is nearly stable regardless of the k value because
                                                                                                     of the top-k most relevant spatial web objects.
it always computes the score of each data object. However,
                                                                                                     Proceedings of the VLDB Endowment, 2(1):337–348,
the query processing time of eSPAK increases slightly with
                                                                                                     2009.
the k value. In Figure 3, we evaluate the performance of
eSPAK and baseline by varying the cardinality of data ob-                                        [4] I. De Felipe, V. Hristidis, and N. Rishe. Keyword
jects. The query processing time of both the algorithms is                                           search on spatial databases. In 2008 IEEE 24th
sensitive towards an increase in |D|. However, eSPAK scales                                          International Conference on Data Engineering, pages
much better than baseline. Figure 4, demonstrates the im-                                            656–665. IEEE, 2008.
pact of query parameter α on query processing time. A small                                      [5] E. W. Dijkstra. A note on two problems in connexion
value of α indicates more importance of textual relevance,                                           with graphs. Numerische mathematik, 1(1):269–271,
whereas a high value of α gives more preference to the spa-                                          1959.
tial relevance. Experimental results reveal that α does not                                      [6] L. Guo, J. Shao, H. H. Aung, and K.-L. Tan. Efficient
indicate a significant impact on the query processing time                                           continuous top-k spatial keyword queries on road
of eSPAK and baseline. It is interesting to note that, both                                          networks. GeoInformatica, 19(1):29–60, 2015.
approaches performs better for higher values of α, which in-                                     [7] R. Hariharan, B. Hore, C. Li, and S. Mehrotra.
dicates more importance to spatial relevance. This is mainly                                         Processing spatial-keyword (sk) queries in geographic
because, when spatial relevance is higher, fewer edges are re-                                       information retrieval (gir) systems. In Scientific and
quired to explore to find the top-k data objects.                                                    Statistical Database Management, 2007. SSBDM’07.
                                                                                                     19th International Conference on, pages 16–16. IEEE,
                                                                                                     2007.
6.                CONCLUSIONS                                                                    [8] Z. Li, K. C. Lee, B. Zheng, W.-C. Lee, D. Lee, and
   In this paper, we investigate top-k spatial keyword queries                                       X. Wang. Ir-tree: An efficient index for geographic
in directed road networks. We presented an efficient index-                                          document search. IEEE Transactions on Knowledge
ing framework using inverted files, that indexes the data                                            and Data Engineering, 23(4):585–599, 2011.
objects on edges which allows effective searching of data ob-                                    [9] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao.
jects relevant to query in term of both textual and spatial                                          Query processing in spatial network databases. In
relevance. Furthermore, we present an algorithm for eval-                                            Proceedings of the 29th international conference on
uating top-k spatial keyword queries. Finally, the experi-                                           Very large data bases-Volume 29, pages 802–813.
mental evaluation conducted on real road networks demon-                                             VLDB Endowment, 2003.
strates that eSPAK drastically reduces the query processing                                     [10] J. B. Rocha-Junior and K. Nørvåg. Top-k spatial
time compared to baseline algorithm.                                                                 keyword queries on road networks. In Proceedings of
                                                                                                     the 15th international conference on extending
1                                                                                                    database technology, pages 168–179. ACM, 2012.
    http://twitter.com