=Paper= {{Paper |id=Vol-2578/BMDA7 |storemode=property |title=Exploration of Interesting Dense Regions on Spatial Data |pdfUrl=https://ceur-ws.org/Vol-2578/BMDA7.pdf |volume=Vol-2578 |authors=Behrooz Omidvar-Tehrani,Plácido A. Souza Neto,Francisco B. Silva Júnior,Felipe F. Pontes |dblpUrl=https://dblp.org/rec/conf/edbt/Omidvar-Tehrani20 }} ==Exploration of Interesting Dense Regions on Spatial Data== https://ceur-ws.org/Vol-2578/BMDA7.pdf
        Exploration of Interesting Dense Regions on Spatial Data
                        Behrooz Omidvar-Tehrani                                                        Plácido A. Souza Neto
                     NAVER LABS Europe (France)                                           Federal Institute of Rio Grande do Norte (Brazil)
                behrooz.omidvar-tehrani@naverlabs.com                                                 placido.neto@ifrn.edu.br

                          Francisco B. Silva Júnior                                                        Felipe F. Pontes
           Federal University of Rio Grande do Sul (Brazil)                              Federal Institute of Rio Grande do Norte (Brazil)
                       fbsjunior@inf.ufrgs.br                                                 freire.pontes@academico.ifrn.edu.br

ABSTRACT                                                                               However, there has been less attention to the “value” derived
Nowadays, spatial data are ubiquitous in various fields of science,                    from spatial data. Despite the decent progress on the efficiency
such as transportation and smart city management. A recent                             front, a user may easily get lost in the plethora of geographical
research direction in analyzing spatial data is to provide means                       points of interest (POIs), mainly due to two following reasons:
for “exploratory analysis” of such data where users are guided                         • In an exploratory context, the user doesn’t know a priori what
towards interesting options in consecutive analysis iterations.                          to investigate next.
Typically, the guidance component learns user’s preferences us-                        • Moreover, he/she may easily get distracted and miss interesting
ing his/her explicit feedback, e.g., picking a spatial point of in-                      POIs by visual clutter caused by huge overlaps among POIs.
terest (POI) or selecting a region of interest. However, it is often
the case that users forget or don’t feel necessary to explicitly ex-                      The main drawback of the traditional analysis models is that
press their feedback in what they find interesting. Our approach                       the user has a passive role in the process. In other words, the
captures implicit feedback on spatial data. The approach consists                      user’s feedback (i.e., his/her likes and dislikes) is ignored and
of observing mouse moves (as a means of user’s interaction) to                         only the input query (i.e., his/her explicit request) is served. In
discover interesting spatial regions with dense mouse hovers. In                       case feedback is incorporated, the process can be more directed
this paper, we define, formalize, and explore Interesting Dense                        towards user’s interests where his/her partial needs can be served
Regions (IDRs) which capture preferences of users, in order to                         earlier in the process [5]. In this paper, we advocate for a “guid-
automatically find interesting spatial highlights. Our approach                        ance layer” on top of the raw visualization of spatial data to
involves a polygon-based abstraction layer for capturing pref-                         enable users know “what to see next”. This guidance should be
erences. Using these IDRs, we highlight POIs to guide users in                         a function of user feedback: the system should return options
the analysis process. We discuss the efficiency and effectiveness                      similar to what the user has already appreciated.
of our approach through realistic examples and experiments on
                                                                                           Various approaches in the literature propose methodologies to
Airbnb and Yelp datasets.
                                                                                       incorporate user’s feedback in the exploration process of spatial
                                                                                       data [6–10]. Typically, feedback is considered as a function which
1     INTRODUCTION                                                                     is triggered by any user’s action on the map. The action can be
Nowadays, there has been a meteoric rise in the generation of                          “selecting a POI”, “moving to a region”, “asking for more details”,
spatial datasets in various fields of science, such as transporta-                     etc. The function then updates a “profile vector” which records
tion, lodging services, and smart city management [1, 2]. As each                      user’s interests in a transparent way. The updated content in the
record in spatial data represents an activity in a precise geograph-                   profile vector enables the guidance functionality. For instance, if
ical location, analyzing such data enables discoveries grounded                        the user shows interest in a POI which describes a house with
on facts. Typically, spatial data analysis begins with an impre-                       balcony, this choice of amenity will reflect his/her profile to
cise question in the mind of the user, i.e., exploratory analysis.                     prioritize other houses with balcony in future iterations.
The user requires to go through several trial-and-error iterations                        Feedback is often expressed explicitly, i.e., the user clicks on
to improve his/her understanding of the spatial data and gain                          a POI and mentions if he/she likes or dislikes the POI [11–13].
insights. Each iteration involves visualizing a subset of data on                      However, there are several cases that the feedback is expressed
geographical maps using an off-the-shelf product (e.g., Tableau1 ,                     implicitly, i.e., the user does not explicitly click on a POI, but
Exhibit2 , Spotfire3 ) where the user can investigate on different                     there exist correlations with other signals captured from the user
parts of the visualization by zooming in/out and panning.                              which provide hint on his/her interest. For instance, it is often
   Spatial data are often voluminous. Hence the focus in the liter-                    the case in spatial data analysis that users look at some regions
ature of spatial data analysis is on “efficiency”, i.e., enabling fluid                of interest but do not provide an explicit feedback. Another ex-
means of navigation in spatial data to facilitate the exploratory                      ample is frequent mouse moves around a region which is a good
analysis. The common approach is to design pre-computed in-                            indicator of the user’s potential interest in the POIs in that region.
dexes which enable efficient retrieval of spatial data (e.g., [3, 4]).                 Users often hover their mouse in a region of interest to collect
                                                                                       information on the background map (e.g., touristic places, parks,
1 http://www.tableau.com                                                               and nearby grocery stores, presented in the form of map layers
2 http://www.simile-widgets.org/exhibit/
3 http://spotfire.tibco.com
                                                                                       and tooltips) before landing on a decision about picking a POI
                                                                                       (such as a hotel) in that region. Implicit feedbacks are more chal-
© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed-   lenging to capture and hence less investigated in the literature.
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,       The following example describes a use case of implicit feedbacks.
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-              This will be our running example which we follow throughout
tribution 4.0 International (CC BY 4.0)
                                                                                       the paper.
Example. Benício is planning to live in Paris for a sabbatical leave.     2    DATA MODEL
He decides to rent a home-stay from Airbnb website4 . He likes to         We consider two different layers on a geographical map: “spatial
discover the city, hence he is open to any type of lodging in any         layer” and “interaction layer”. The spatial layer contains POIs
region with an interest to stay in the center of Paris. The website       from a spatial database P. The interaction layer contains mouse
returns 1500 different locations. As he has no other preferences,         move points M.
an exhaustive investigation needs scanning each location indepen-
                                                                          Spatial layer. Each POI p ∈ P is described using its coordinates,
dently, which is nearly infeasible. While he is scanning the very first
                                                                          latitude and longitude, i.e., p = ⟨lat, lon⟩. Note that in this work,
options, he shows interest in the region of Trocadero by hovering his
                                                                          we don’t consider “time” for POIs, as our contribution focuses on
mouse around the Eiffel tower and checking the amenities within
                                                                          their location. POIs are also associated to a set of domain-specific
that region. However, he forgets or doesn’t feel necessary to click
                                                                          attributes A. For instance, for a dataset of a real estate agency,
a POI (i.e., a home-stay) in that region. An ideal system should
                                                                          POIs are properties (houses and apartments) and A contains at-
capture this implicit feedback in order to short-list a small subset
                                                                          tributes such as “surface”, “number of rooms” and “price”. The set
of locations that Benício should consider as high priority.
                                                                          of all possible values for an attribute a ∈ A is denoted as dom(a).
   The above example shows in practice that implicit feedback             We also define user’s feedback F as a vector over all attribute val-
                                                                                                        −−−−−−−−−−−→
capturing is crucial in the context of spatial data analysis, as a POI    ues (i.e., facets), i.e., F = ∪a ∈A dom(a). The vector F is initialized
can easily remain out of sight and be missed. It is particularly the      by zeros and will be updated to express user’s preferences.
case in small-screen devices such as smart watches, smartphones,          Interaction layer. Whenever the user moves his/her mouse,
and tablets.                                                              a new point m is appended to the set M. Each mouse move
   In this paper, we present a simple yet effective approach whose        point is described using the pixel position that it touches and the
aim is to capture and analyze implicit feedback of users in spatial       clock time of the move. Hence each mouse move point is a tuple
data analysis. Without loss of generality, we focus on “mouse             m = ⟨x, y, t⟩, where x and y specifies the pixel location and t is a
moves” as the implicit feedback received from the user. Mouse             Unix Epoch time. To conform with geographical standards, we
moves are the most common way that users interact with geo-               assume m = ⟨0, 0⟩ sits at the middle of the interaction layer, both
graphical maps to collect information within a region of inter-           horizontally and vertically.
est [14], such as information provided in the background map                 The user is in contact with the interaction layer. To update
(e.g., parks, theaters, shopping centers) and the tooltip informa-        the feedback vector F , we need to translate pixel locations in the
tion which pop up by hovering (e.g., information about the POIs           interaction layer to latitudes and longitudes in the spatial layer.
such as price and reviews). It is shown in [15] that mouse gestures       We employ equirectangular projection to obtain the best possible
have a strong correlation with “user engagement”. Intuitively,            approximation of a point m = ⟨x, y, t⟩ ∈ M in the spatial layer,
a POI gets a higher weight in the user’s profile if the mouse             denoted as p(m).
cursor moves around it frequently. However, our approach is
independent from the type of feedback enabler and can be easily                                                            x
                                                                                    p(m = ⟨x, y, t⟩) = ⟨lat = y + γ , lon =    + θ⟩        (1)
extended to other types such as gaze tracking, leap motions, as                                                           cosγ
well as touch gestures in touch screens.                                     The inverse operation, i.e., transforming a point p = ⟨lat, lon⟩
Contributions. In this paper, we make the following contribu-             from the spatial layer to the interaction is done using Equation 2.
tions:
• We define and explore the notion of “implicit user feedback”                  m(p = ⟨lat, lon⟩) = ⟨x = (lon − θ ) × cosγ , y = lat − γ ⟩   (2)
  which enables a seamless navigation in spatial data.                       The reference point for the transformation is the center of
• We define the notion of “information highlighting”, a mecha-            both layers. In Equations 1 and 2, we assume that γ is the latitude
  nism to highlight out-of-sight important information for users.         and θ is the longitude of a point in the spatial layer corresponding
  A clear distinction of our proposal with the literature is that it      to the center of the interaction layer, i.e., m = ⟨0, 0⟩.
  doesn’t aim for pruning (such as top-k recommendation), but
  leveraging the actual data with potential interesting results           3    PROBLEM DEFINITION
  (i.e., highlights).
                                                                          When analyzing spatial data, users require to obtain only a few
• We define and formalize the concept of Interesting Dense Re-
                                                                          options (so-called “highlights”) to focus on. These options should
  gions (IDRs), a polygon-based approach to explore and high-
                                                                          be in-line with what they have already appreciated. In this paper,
  light spatial data.
                                                                          we formulate the problem of “information highlighting using
• We propose an efficient greedy approach to compute highlights
                                                                          implicit feedback”, i.e., highlight a few POIs based on implicit
  on-the-fly.
                                                                          interests of the user in order to guide him/her towards what
• We show the effectiveness of our approach through a set of
                                                                          he/she should concentrate on in consecutive iterations of the
  qualitative experiments.
                                                                          analysis process. We formally define our problem as follows.
   The outline of the paper is the following. Section 2 describes         Problem. Given a time tc and an integer constant k, update the
our data model. In Section 3, we formally define our problem.             feedback vector F using points m ∈ M where m.t ≤ tc , and choose
Then in Section 4, we present our solution and its algorithmic            k POIs Pk ⊆ P as “highlights” where Pk satisfies two following
details. Section 5 reports our experiments on the framework. We           constraints.
review the related work in Section 6. We present some limitations         (i) ∀p ∈ Pk , similarity(p, F ) is maximized.
of our work in Section 7. Last, we conclude in Section 8.
                                                                          (ii) diversity(Pk ) is maximized.
                                                                             The first constraint guarantees that returned highlights are
4 http://www.airbnb.com
                                                                          highly similar with user’s interests captured in F . The second
 Algorithm 1: Spatial Highlighting Algorithm                                Algorithm 2: Find Interesting Dense Regions (IDRs)
  Input: Current time tc , mouse move points M                               Input: Current time tc , mouse move points M
  Output: Highlights Pk                                                      Output: IDRs S
1 S ← find_interesting_dense_regions(tc , M) // Section 4.1                1 S ← ∅
2 Ps ← match_points(S, P) // Section 4.2                                   2 д ←number of time segments
3 F ← update_feedback_vector(F, Ps ) // Section 4.3                        3 for i ∈ [1, д] do

4 Pk ← get_highlights(P, F ) // Section 4.4                                4     Mi ← {m = ⟨x, y, t⟩|( tдc × i) ≤ t ≤ ( tдc × (i + 1))}
5 return Pk                                                                5     Ci ← mine_clusters(Mi )
                                                                           6     Oi ← find_ploygons(Ci )
                                                                           7 end
constraint ensures that k POIs cover different regions and they            8 for Oi , Oj where i, j ∈ [0, д] and i ̸= j do
don’t repeat themselves. While our approach is independent from
                                                                              S.append(intersect(Oi , Oj ))
the way that similarity and diversity functions are formulated,
                                                                           9 return S
we provide a formal definition of these functions in Section 4.
  The aforementioned problem is hard to solve due to the fol-
lowing challenges.
Challenge 1. First, it is not clear how mouse move points influ-
                                                                           Observation 2. It is possible that the user moves his/her mouse
ence the feedback vector. Mouse moves occur on a separate layer
                                                                           everywhere in the map. This should not signify that all those
and there should be some meaningful transformations to inter-
                                                                           locations have the same significance.
pret mouse moves as potential changes in the feedback vector.
                                                                              Following our observations, we propose Algorithm 2 for min-
Challenge 2. Analyzing all generated mouse moves is challeng-
                                                                           ing IDRs. We add points to M only every 200ms to prevent adding
ing and may introduce false positives. A typical mouse with 1600
                                                                           redundant points (i.e., Challenge 2). Following Observation 1 and
DPI (Dots Per Inch) touches 630 pixels for one centimeter of
                                                                           in order to mine the recurring behavior of the user, the algorithm
move. Hence a mouse move from the bottom to the top of a typi-
                                                                           begins by partitioning the set M into д fixed-length consecutive
cal 13-inch screen would yield 14,427 points which may not be
                                                                           segments M1 to Mд . The first segment starts at time zero (where
necessarily meaningful.
                                                                           the system started), and the last segment ends at tc , i.e., the cur-
Challenge 3. Beyond the two first challenges, finding the most             rent time. Following Observation 2, we then find dense clusters
similar and diverse POIs with F needs an exhaustive scan of                in each segment of M using a variant of DB-SCAN approach [17].
all POIs in P which is prohibitively expensive: in most spatial            Finally, we return intersections among those clusters as IDRs.
datasets, there exist millions of POIs. Moreover, we need to follow
                                                                              For clustering points in each time segment (i.e., line 5 of Al-
bi-objective optimization considerations as we aim to optimize
                                                                           gorithm 2), we use ST-DBSCAN [18], a space-aware variant of
both similarity and diversity at the same time [16].
                                                                           DB-SCAN for clustering points based on density. For each subset
   We recognize the complexity of our problem using the afore-             of mouse move points Mi , i ∈ [1, д], ST-DBSCAN begins with a
mentioned challenges. In Section 4, we discuss a solution for the          random point m 0 ∈ Mi and collects all density-reachable points
discussed problem and its associated challenges.                           from m 0 using a distance metric. As mouse move points are in the
                                                                           2-dimensional pixel space (i.e., the display), we choose euclidean
4     INTERESTING DENSE REGIONS                                            distance as the distance metric. If m 0 turns out to be a core object,
Our approach exploits user’s implicit feedback (i.e., mouse moves)         a cluster will be generated. Otherwise, if m 0 is a border object,
to highlight a few interesting POIs as future analysis directions.         no point is density-reachable from m 0 and the algorithm picks
Algorithm 1 summarizes the principled steps of our approach.               another random point in Mi . The process is repeated until all of
   The algorithm begins by mining the set of mouse move points M           the points have been processed.
in the interaction layer to discover one or several Interesting               Once clusters are obtained for all subsets of M, we find their
Dense Regions, abbr., IDRs, in which most user’s interactions              intersections to locate recurring regions (line 6). To obtain inter-
occur (line 1). Then it matches the POIs P with IDRs using Equa-           sections, we need to clearly define the spatial boundaries of each
tion 2 in order to find POIs inside each region (line 2). The at-          cluster. Hence for each cluster, we discover its corresponding
tributes of resulting POIs will be exploited to update the user’s          polygon that covers the points inside. For this aim, we employ
feedback vector F (line 3). The updated vector F will then be              Quickhull algorithm, a quicksort-style method which computes
used to find k highlights (line 4). These steps ensure that the final      the convex hull for a given set of points in a 2D plane [19].
highlights reflect user’s implicit interests. We detail each step as          We describe the process of finding IDRs in an example. Figure 1
follows (Sections 4.1 to 4.4).                                             shows the steps that Benício follows in our running example to
                                                                           explore home-stays in Paris. Figure 1.A shows mouse moves of
4.1    Discovering IDRs                                                    Benício in different time stages. In this example, we consider д = 3
The objective of this step is to obtain one or several regions in          and capture Benício’s feedback in three different time segments
which the user has expressed his/her implicit feedback. There              (progressing from Figures 1.B to 1.D). It shows that Benício started
are two observations for such regions.                                     his search around Eiffel Tower and Arc de Triomphe (Figure 1.B)
Observation 1. We conjecture that a region is more interesting             and gradually showed interest in south (Figure 1.C) and north
for the user if it is denser, i.e., the user moves his/her mouse in that   (Figure 1.D) as well. All intersections between those clusters are
region several times, to collect information from the background           discovered (hatching regions in Figure 1.E) which will constitute
map.                                                                       the set of IDRs (Figure 1.F), i.e., IDR1 to IDR4.
                                    3

                         2


                    1



   A                                                    B                                           C


                                                                                                                    IDR2
                                                                                                                                  IDR3

                                                                                                            IDR1




                                                                                                                           IDR4



   D                                                    E                                            F


                                              Figure 1: The process of finding IDRs on Airbnb dataset.


4.2     Matching Points
Being a function of mouse move points, IDRs are discovered
in the interaction layer. We then need to find out which POIs
in P fall into IDRs, hence forming the subset Ps . We employ
Equation 2 to transform those POIs from the spatial layer to the
interaction layer. Then a simple “spatial containment” function
can verify which POIs fit into the IDRs. Given a POI p and an
IDR r , a function contains(p, r ) returns “true” if p is inside r , oth-
erwise “false”. In our case, we simply use the implementation of
ST _Within(p, r ) module in PostGIS5 , i.e., our underlying spatial            IDR1                                IDR2

DBMS which hosts the data.
   In the vanilla version of our spatial containment function, all
POIs should be checked against all IDRs. Obviously, this depletes
the execution time. To prevent the exhaustive scan, we employ
Quadtrees [20] in a two-step approach.
(i) In an offline process, we build a Quadtree index for all POIs
in P. We record the membership relations of POIs and cells in
the index.
(ii) When IDRs are discovered, we record which cells in the
Quadtree index intersect with IDRs. As we often end up with                    IDR3                                IDR4

few IDRs, the intersection verification performs fast. Then for
matching POIs, we only check a subset which is inside the cells
                                                                                      Figure 2: Matching POIs for IDR1 to IDR4.
associated to IDRs and ignore the POIs outside. This leads to a
drastic pruning of POIs in P.
   We follow our running example and illustrate the matching
process in Figure 2. In the Airbnb dataset, POIs are home-stays             in a feedback vector F . The vector is initialized by zero, i.e., the
which are shown with their nightly price on the map. We observe             user has no preference at the beginning. We update F using the
that there exist many matching POIs with IDR3 and absolutely                attributes of the POIs in Ps . We enable transparency by choosing
no matching POI for IDR2. For IDR4, although there exist many               F ’s schema to be defined on the POI attributes. In exploratory
home-stays below the region, we never check their containment,              analysis scenarios, it is often the case that users do not trust in
as they belong to a Quadtree cell which doesn’t intersect with              what they get from the system (i.e., algorithmic anxiety [21])
the IDR.                                                                    and want to know what has been learned from them. Having
                                                                            a transparent user profile enables an easy examination of its
4.3     Updating user Feedback Vector                                       content by the user.
The set of matching POIs Ps (line 2 of Algorithm 1) depicts the                We consider an increment value δ to update F . If p ∈ Ps gets v 1
implicit preference of the user. We keep track of this preference           for attribute a 1 , we augment the value in the F ’s cell of ⟨a 1, v 1 ⟩
                                                                            by δ . Note that we only consider incremental feedback, i.e., we
5 https://postgis.net/docs/manual-dev/ST_Within.html                        never decrease a value in F .
              Table 1: Attributes of POIs in IDR1.                      diversity. First, the highlights should be in the same direction of
                                                                        the user’s implicit feedback, hence similar to the vector F . The
      ID    Price   #Beds    Balcony     Air-cond.      Rating          similarity between a POI p ∈ P and the vector F is defined as
       1    130e      1        Yes          Yes          5/5            follows.
       2     58e      1        Yes          No           5/5
       3     92e      2        Yes          No           5/5                          similarity(p, F ) = avg a ∈A (sim(p, F, a))              (3)
       4     67e      1        Yes          No           4/5
                                                                           The sim() function can be any function such as Jaccard or
                                                                        Cosine. Each attribute can have its own similarity function (as
             Table 2: Updating user Feedback Vector                     string and integer attributes are compared differently.) Then sim()
                                                                        works as an overriding-function which provides encapsulated
      Attribute-value       Applying IDR 1      Normalized              similarity computations for any type of attribute.
          ⟨#Beds,1⟩               +3δ              0.19                    Second, highlighted POIs should also represent distinct di-
          ⟨#Beds,2⟩               +δ               0.06                 rections so that the user can observe different aspects of data
         ⟨#Beds,+2⟩           (no update)          0.00                 and decide based on the big picture. Given a set of POIs Pk =
        ⟨Balcony,Yes⟩             +4δ              0.25                 {p1, p2 . . . pk } ⊆ P, we consider a pair-wise definition of diver-
        ⟨Balcony,No⟩          (no update)          0.00                 sity as follows.
       ⟨Air-cond.,Yes⟩            +δ               0.06
       ⟨Air-cond.,No⟩             +3δ              0.19                         diversity(Pk ) = avg {p,p ′ } ⊂ Pk |p̸=p ′ distance(p, p ′ )   (4)
          ⟨Rating,1⟩          (no update)          0.00
          ⟨Rating,2⟩          (no update)          0.00                    The function distance(p, p ′ ) operates on geographical coordi-
          ⟨Rating,3⟩          (no update)          0.00                 nates of p and p ′ and can be considered as any distance function
          ⟨Rating,4⟩              +δ               0.06                 of Minkowski distance family. However, as distance computa-
          ⟨Rating,5⟩              +3δ              0.19                 tions are done in the spherical space, a natural choice is to employ
                                                                        Haversine distance.
                                                                           To solve our bi-objective problem (i.e., maximizing similar-
   We explain the process of updating the feedback vector using a       ity and diversity), we employ the ϵ-constraint method [22], i.e.,
toy example. Given the four matched POIs in IDR1 (Figure 2) with        we maximize diversity while satisfying a linear constraint for
prices 130e, 58e, 92e and 67e, we want to update the vector F           similarity which only accepts values larger than or equal to ϵ.
given those POIs. A few attributes of these POIs are mentioned          Algorithm 3 describes our approach for highlighting k similar
in Table 1. In practice, there are often more than 50 attributes for    and diverse POIs. We propose a best-effort greedy approach to
POIs. The cells of F are illustrated in the first column of Table 2.    efficiently compute highlighted POIs. We consider an offline step
As three POIs get the value “1” for the attribute “#Beds”, then the     followed by the online execution of our algorithm.
value in cell ⟨#Beds,1⟩ is augmented three times by δ . The same            In order to speed up the similarity computation in the online
process is repeated for all attribute-values of POIs in Ps . Note       execution, we pre-compute an inverted index for each single POI
that all cells of F are not necessarily touched in the feedback         p ∈ P in the offline step (as is commonly done in the Web search).
update process. For instance, in the above example, 5 cells out of      Each index Lp for the POI p keeps all other POIs in P in decreas-
12 remain unchanged.                                                    ing order of their similarity with p. While index computation
   By specifying an increment value, we can materialize the up-         may be extremely time-consuming, we stop generating those
dates and normalize the vector using a Softmax function. Given          lists if the similarity value is lower than the ϵ threshold.
δ = 1.0, the normalized values of the F vector is illustrated in the       The first step of Algorithm 3 is to find the most similar POI to F ,
third column of Table 2. Higher values of δ increase the influence      so-called p ∗ . The POI p ∗ is the closest possible approximation
of feedbacks.                                                           of F in order to exploit pre-computed similarities. The algorithm
   The normalized content of the vector F captures the implicit         makes sequential accesses to Lp ∗ (i.e., the inverted index of the
preferences of the user. For instance, the content of F after ap-       POI p ∗ ) to greedily maximize diversity. Algorithm 3 does not
plying POIs in IDR1 shows that the user has a high interest in          sacrifice efficiency in price of value. We consider a time limit
having a balcony in his/her home-stay, as his/her score for the         parameter which determines when the algorithm should stop
cell ⟨Balcony,Yes⟩ is 0.25, i.e., the highest among other cells. This   seeking maximized diversity. Scanning inverted indexes guaran-
reflects the fact that all POIs in IDR1 has balcony. It is impor-       tees the similarity maximization even if time limit is chosen to be
tant to note that although we only consider positive feedback,          very restrictive. Our observations with several spatial datasets
the Softmax function lowers the values of untouched cells once          show that we achieve the diversity of more than 0.9 with time
other cells get rewarded. The updated feedback vector is fully          limit set to 200ms.
transparent and the user can easily comprehend what has been               In line 2 of Algorithm 3, Pk is initialized with the k highest
learned from his/her previous actions.                                  ranking POIs in Lp ∗ . Function get_next(Lp ∗ ) (line 3) returns the
                                                                        next POI pnext in Lp ∗ in sequential order. Lines 4 to 12 iterate
4.4        Generating Highlights                                        over the inverted indexes to determine if other POIs should be
The ultimate goal is to highlight k POIs to guide users in analyzing    considered to increase diversity while staying within the time
their spatial data. The updated feedback vector F is the input to       limit.
the highlighting phase. We assume that POIs in IDRs are already            The algorithm looks for a candidate POI pcurrent ∈ Pk to
investigated by the user. Hence our search space for highlighting       replace in order to increase diversity. The boolean function
is P \ Ps . We seek two properties in k highlights: similarity and      diversity_improved() (line 6) checks if by replacing pcurrent by
  Algorithm 3: Get k similar and diverse highlights                      Table 3: Interactions of “novice” and “expert” participants
  get_highlights()
                                                                                                            T1/I1   T2/I1    T1/I2    T2/I2
   Input: POIs P, Feedback vector F , k, time_limit
   Output: Pk                                                                 Novices using Airbnb          372s    421s     365s     430s
 1 p ← max_sim_to(P, F )
    ∗                                                                         Experts using Airbnb          253s    302s     249s     298s
 2 Pk ← top_k(Lp∗ , k)
                                                                             Novices using highlights        19s     24s      20s      25s
                                                                             Experts using highlights        14s     17s      13s      17s
 3 pnex t ← дet_next(Lp∗ )
 4 while time_limit not exceeded ∨ pnex t = ∅ do
 5     for pcur r ent ∈ Pk do                                            their goals in a reasonable time. Also expert participants need on
 6         if diversity_improved(Pk , pnex t , pcur r ent ) then         average 6.25 seconds less time than the novice participants when
 7             Pk ← replace(Pk , pnex t , pcur r ent )                   using the highlights. Interestingly, starting POIs, i.e., I1 and I2, do
 8             break                                                     not have a huge impact on the number of steps. It is potentially
 9         end                                                           due to the diversity component which provides distinct options
10     end                                                               and can quickly steer users towards their region of interest. We
11     pnex t ← дet_next(Lp∗ )                                           also observe that the task T1 is an easier task than T2, as on aver-
12 end                                                                   age it took less time to be accomplished. This is potentially due
13 return Pk                                                             to where the user can request options similar to what he/she has
                                                                         already observed and greedily move to his/her preferred regions.

pnext in Pk , the overall diversity of the new Pk increases. It is       5.2      Effectiveness
important to highlight that for each run of the algorithm, we            In the second part of our experiments, we analyze the effective-
only focus on one specific inverted list associated to the input         ness of our approach in generating IDRs for information high-
POI. Algorithm 3 verifies the similarity and diversity of each POI       lighting. We employ two different datasets, i.e., Airbnb and Yelp6 .
with all other POIs, and then processes the normalization.               We pick a similar subset from both datasets, i.e., home-stays and
                                                                         restaurants in Paris, respectively. We consider four different sizes
5     EXPERIMENTS                                                        of those datasets, i.e., 100, 1000, 2000 and 4000 POIs, respectively.
We discuss two sets of experiments. The first set is on the useful-      For each size of the datasets, we manually perform 20 sessions,
ness of our approach. Then we focus more on IDR discovery and            and then present some statistics as the average of the sessions.
present some statistics and insights on the functionality of our         We limit each session to 2 minutes where we seek for interesting
approach. The experiments are done on a 2.8 GHz Intel Core i5            POIs in the datasets. We capture the following information in
machine running a Mac OS operating system.                               each session:
                                                                         • The number of regions created from the mouse moves during
5.1    Usefulness                                                          the session;
First off, we validate the usefulness of our approach. For this aim,     • The number of generated IDRs (intersection of regions);
we design a user study with 30 participants who are all students         • The number of POIs from the dataset presented in each IDR;
of Computer Science. Given an exploratory task to fulfill, we            • The coverage of POIs (in the dataset) with IDRs collectively.
hypothesize that there are three prominent factors which influ-             Tables 4 and 5 show the results for Airbnb and Yelp, respec-
ence the usefulness of our system: user expertise, type of the task,     tively. In Table 4, we observe that the number of regions decreases
and starting point. To examine the user expertise, we recruited          when the number of POIs increases. On average, 7.7 regions are
15 “novice” participants who don’t know the location under in-           constructed per session. The average number of POIs presented
vestigation, and also 15 “experts.” We also define two different         in IDRs is 25.97, which shows that our approach highlights at
types of tasks on the Airbnb dataset of Paris with 1,000 POIs: T1:       least 8.05% of POIs from the dataset, on average. We notice an
“finding a POI in a requested location” (e.g., find a home-stay in the   outlier in the experiment with 2000 POIs in Table 4. This hap-
“Champ de Mars” area), and T2: “finding a POI with a requested           pened due the fact that the user concentrated in a very small
profile” (e.g., find a cheap home-stay.) Last, the participants may      area generating a smaller number of IDRs, and consequently a
have different starting points, i.e., they may begin their naviga-       smaller number of POIs.
tion from a POI which is either I1: “geographically close to the
                                                                            More uniform results are observed in Table 5, i.e., for Yelp
goal” or I2: “far from the goal”. For each participant, we report
                                                                         dataset vis-à-vis Airbnb. The average number of generated re-
a variant of time-to-insight measure, i.e., how long (in seconds)
                                                                         gions reaches 12.75 per session. Also, the number of regions
the participants interact with the tool before fulfilling the task.
                                                                         decreases by increasing the number of POIs. The number of POIs
Evidently, less number of interactions are preferred as it means
                                                                         presented in IDRs is on average 108.65 and it represents on aver-
that the participant can reach insights faster.
                                                                         age 13.11% of POIs highlighted from the dataset.
   The participants perform the tasks T 1 and T 2 on two differ-
ent tools: Airbnb website (without information highlighting and          6     RELATED WORK
without incorporation of implicit feedbacks, as our baseline) and
                                                                         To the best of our knowledge, the problem of spatial information
our highlighting tool. Table 3 shows the results. Our initial obser-
                                                                         highlighting using implicit feedback has not been addressed be-
vation is that the highlights enable users to fulfill tasks one order
                                                                         fore in the literature. However, our work relates to a few others
of magnitude faster than the baseline without the highlights.
                                                                         in their semantics.
This shows that the highlighting approach with implicit feedback
                                                                         6 https:// www.yelp.com/ dataset
capturing is an effective mechanism which helps users to reach
          Table 4: IDR statistics on Airbnb dataset                       a set of input POIs on a plane. In [36, 37], imprecise regions are
                                                                          delineated into a convex or concave polygon. In our approach, it
  # POIs      # regions     # IDRs     # POIs in IDRs       % POIs        is important to discover regions by capturing mouse move POIs.
    100          11.35       10.05         29.40            29.40%        In case a concave polygon is constructed, the “dents” of such
   1000          10.75        6.75         11.70             1.17%        a polygon may entail POIs which are not necessarily in M. In
   2000           7.37        3.63          5.63            0.003%        the IDR’s algorithm, however, we adapt Quickhull [19], due its
   4000           1.30        1.15         53.15             1.33%        simplicity, efficiency and its natural implementation of convex
 average         7.69        7.64          25.97            8.05%         polygons.


            Table 5: IDR statistics on Yelp dataset                       7    LIMITATIONS
                                                                          In this paper, we presented a solution for highlighting out-of-
  # POIs      # regions     # IDRs     # POIs in IDRs       % POIs        sight information using a polygon-based approach for capturing
    100          14.90        7.55          28.30           28.30%        implicit feedbacks. To the best of our knowledge, our work is the
   1000          13.90       10.00         149.55           14.96%        first effort towards formalizing and implementing information
   2000          11.05        9.80         111.05            5.55%        highlighting using implicit feedback. However, we consider our
   4000          10.45        8.55          145.7            3.64%        work as an on-going effort where we envision to address some
 average        12.57        8.97          108.65           13.11%        limitations in the future, such as “customizability”, “performance”,
                                                                          “cold start”, and “quantitative experiments”.
                                                                          Custom maps and feedbacks. One limitation is about the “cus-
Information Highlighting. The literature contains a few in-               tomizable” use of geographical maps as an interaction means. As
stances of information highlighting approaches [23–26]. How-              we only consider static maps, we plan to work on translations and
ever, all these methods are objective, i.e., they assume that user’s      rotations as a future work. We also consider adapting other types
preferences are given as a constant input and will never change           of feedback enablers such as gaze tracking and touch gestures as
in the future. This limits their functionality for serving scenarios      a direction of future work. Our research will investigate poten-
of exploratory analysis. The only way to fulfill “spatial guidance”       tial challenges which may arise when employing those feedback
is to consider the evolutionary and subjective nature of user’s           enablers.
feedback. In our approach, the feedback vector gets updated in            Mobility. Our current focus is only on POIs, and hence trajecto-
time based on the implicit feedback of the user.                          ries and areas of mobility are not incorporated directly. While
   Online recommendation approaches can also be considered as             trajectories and areas can be broken into a subset of POIs (and
an information highlighting approach where recommended items              hence be easily fed to our approach), we envision to enrich our
count as highlights. Most recommendation algorithms are space-            model in a way that those mobility elements are considered as
agnostic and do not take into account the spatial information.            first-class citizens.
While a few approaches focus on the spatial dimension [27–29],            Cold start. Our problem bears similarities with recommendation
they still lack the evolutionary feedback capturing. Moreover,            algorithms where the quality of the output may be influenced
most recommendation methods miss “result diversification”, i.e.,          by scarce availability of input. This problem is referred to as the
highlights may not be useful due to overlaps.                             cold start problem [38, 39]. While there is no guarantee for a
Feedback Capturing. Several approaches are proposed in the                meaningful highlight in case of the complete absence of implicit
state of the art for capturing different forms of feedback [8, 9, 11,     feedbacks, our approach can return a reasonable set of highlights
12, 30, 31]. The common approach is a top-k processing method-            even with one single iteration of mouse moves. In the future, we
ology in order to prune the search space based on the explicit            envision to tackle the no-input challenge by leveraging statis-
feedback of the user and return a small subset of interesting re-         tical properties of the spatial data to obtain a default view for
sults of size k. A clear distinction of our proposal is that it doesn’t   highlights.
aim for pruning, but leveraging the actual data with potential            Performance. Another limitation is the medium-size datasets
interesting results (i.e., highlights) that the user may miss due         to be processed. Our algorithm processes similarity and diversity
to the huge volume of spatial data. Moreover, in a typical top-           in an O(n 2 ) complexity. Also Quickhull [19] uses a divide and
k processing algorithm, user’s choices are limited to k. On the           conquer approach similar to that of Quicksort, and its worst com-
contrary, our IDR approach enables a freedom of choice where              plexity is O(n2 ). While processing a 10K-POI dataset is straight-
highlights get seamlessly updated with new user’s choices.                forward in our framework, we plan to experiment with larger
   Few works formulate fusing approaches of explicit and im-              datasets in the future by improving our algorithms towards better
plicit feedbacks to better capture user preferences [6, 7, 32]. Our       performance. Another direction for future work is to consider
approach functions purely on implicit feedback and does not               experiments which measure the quantitative and qualitative in-
require any sort of explicit signal from the user.                        fluence of each component separately.
Region Discovery. Our approach finds interesting dense re-
gions (IDRs) in order to derive user’s implicit preferences. There        8    CONCLUSION
exist several approaches to infer a spatial region for a given set        In this paper, we present an approach to explore Interesting
of POIs [19, 33–37]. The common approach is to cluster POIs in            Dense Regions (IDRs) using implicit feedback in order to detect
form of concave and convex polygons. In [33], an algorithm is             user latent preferences. The implicit feedbacks are captured from
proposed to verify if a given POI p on the surface of a sphere is lo-     mouse moves of users over the geographical map while analyzing
cated inside, outside, or along the border of an arbitrary spherical      spatial data. We formalize a novel polygon-based mining algo-
polygon. In [34, 35], a non-convex polygon is constructed from            rithm which returns a few highlights in-line with user’s implicit
preferences. The highlights enable users to focus on what matters                         [17] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-
the most and prevent information overload.                                                     based algorithm for discovering clusters a density-based algorithm for discov-
                                                                                               ering clusters in large spatial databases with noise. In Proceedings of the Second
   We consider various future directions for this work, including                              International Conference on Knowledge Discovery and Data Mining, KDD’96,
                                                                                               pages 226–231. AAAI Press, 1996.
the limitations of our current approach which we listed in Sec-                           [18] Derya Birant and Alp Kut. St-dbscan: An algorithm for clustering spatial-
tion 7. First, we are interested to incorporate an “explainability”                            temporal data. Data Knowl. Eng., 60(1):208–221, January 2007.
                                                                                          [19] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull
component which can describe causalities behind preferences.                                   algorithm for convex hulls. ACM Trans. Math. Softw., 22(4):469–483, December
For instance, we are interested to find seasonal patterns to see                               1996.
why the preferences of users change from place to place during                            [20] Raphael A. Finkel and Jon Louis Bentley. Quad trees a data structure for
                                                                                               retrieval on composite keys. Acta informatica, 4(1):1–9, 1974.
various seasons of the year. Another direction is to incorporate                          [21] Shagun Jhaver, Yoni Karpfen, and Judd Antin. Algorithmic anxiety and coping
“Query by Visualization” approaches, where users can specify                                   strategies of airbnb hosts. In Proceedings of the 2018 CHI Conference on Human
their intents alongside their implicit preferences, directly on the                            Factors in Computing Systems, page 421. ACM, 2018.
                                                                                          [22] R Timothy Marler and Jasbir S Arora. Survey of multi-objective optimiza-
map [40].                                                                                      tion methods for engineering. Structural and multidisciplinary optimization,
                                                                                               26(6):369–395, 2004.
REFERENCES                                                                                [23] J. Liang and M. L. Huang. Highlighting in information visualization: A survey.
                                                                                               In 2010 14th International Conference Information Visualisation, July 2010.
 [1] John F. Roddick, Max J. Egenhofer, Erik G. Hoel, Dimitris Papadias, and Betty        [24] Anthony C. Robinson. Highlighting in geovisualization. Cartography and
     Salzberg. Spatial, temporal and spatio-temporal databases - hot issues and                Geographic Information Science, 38(4):373–383, 2011.
     directions for phd research. SIGMOD Record, 33(2):126–131, 2004.                     [25] Kanit Wongsuphasawat, Dominik Moritz, Anushka Anand, Jock Mackinlay,
 [2] Aditya Telang, Deepak Padmanabhan, and Prasad Deshpande. Spatio-temporal                  Bill Howe, and Jeffrey Heer. Voyager: Exploratory analysis via faceted brows-
     indexing: Current scenario, challenges and approaches. In Proceedings of the              ing of visualization recommendations. TVCG, 22(1), 2016.
     18th International Conference on Management of Data, COMAD ’12, pages                [26] Wesley Willett, Jeffrey Heer, and Maneesh Agrawala. Scented widgets: Im-
     9–11, Mumbai, India, India, 2012. Computer Society of India.                              proving navigation cues with embedded visualizations. IEEE Transactions on
 [3] Lauro Lins, James T Klosowski, and Carlos Scheidegger. Nanocubes for real-                Visualization and Computer Graphics, 13(6):1129–1136, 2007.
     time exploration of spatiotemporal datasets. IEEE Transactions on Visualization      [27] Jie Bao, Yu Zheng, David Wilkie, and Mohamed Mokbel. Recommendations in
     and Computer Graphics, 19(12):2456–2465, 2013.                                            location-based social networks: a survey. GeoInformatica, 19(3):525–565, 2015.
 [4] Jia Yu, Zongsi Zhang, and Mohamed Sarwat. Spatial data management in                 [28] Justin J. Levandoski, Mohamed Sarwat, Ahmed Eldawy, and Mohamed F.
     apache spark: the geospark perspective and beyond. GeoInformatica, pages                  Mokbel. Lars: A location-aware recommender system. In ICDE, pages 450–
     1–42, 2018.                                                                               461, 2012.
 [5] W. Bruce Croft. The importance of interaction for information retrieval. In          [29] Marina Drosou and Evaggelia Pitoura. Disc diversity: result diversification
     Proceedings of the 42nd International ACM SIGIR Conference on Research and                based on dissimilarity and coverage. PVLDB, 6(1):13–24, 2012.
     Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019,   [30] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. Aide: an active
     pages 1–2, 2019.                                                                          learning-based approach for interactive data exploration. IEEE Transactions
 [6] Andrea Ballatore and Michela Bertolotto. Semantically enriching vgi in sup-               on Knowledge and Data Engineering, 28(11):2842–2856, 2016.
     port of implicit feedback analysis. In Katsumi Tanaka, Peter Fröhlich, and           [31] Mario Boley, Michael Mampaey, Bo Kang, Pavel Tokmakov, and Stefan Wro-
     Kyoung-Sook Kim, editors, Web and Wireless Geographical Information Systems,              bel. One click mining: Interactive local pattern discovery through implicit
     pages 78–93, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.                        preference and performance learning. In Proceedings of the ACM SIGKDD
 [7] Nathan N. Liu, Evan W. Xiang, Min Zhao, and Qiang Yang. Unifying explicit                 Workshop on Interactive Data Exploration and Analytics, pages 27–35. ACM,
     and implicit feedback for collaborative filtering. In Proceedings of the 19th             2013.
     ACM International Conference on Information and Knowledge Management,                [32] Eoin Mac Aoidh, Michela Bertolotto, and David C. Wilson. Analysis of implicit
     CIKM ’10, pages 1445–1448, New York, NY, USA, 2010. ACM.                                  interest indicators for spatial data. In 15th ACM International Symposium on
 [8] Dong Xin, Xuehua Shen, Qiaozhu Mei, and Jiawei Han. Discovering interesting               Geographic Information Systems, ACM-GIS 2007, November 7-9, 2007, Seattle,
     patterns through user’s interactive feedback. In Proceedings of the 12th ACM              Washington, USA, Proceedings, page 47, 2007.
     SIGKDD international conference on Knowledge discovery and data mining,              [33] Michael Bevis and Jean-Luc Chatelain. Locating a point on a spherical surface
     pages 773–778. ACM, 2006.                                                                 relative to a spherical polygon of arbitrary shape. Mathematical Geology,
 [9] Mansurul Bhuiyan, Snehasis Mukhopadhyay, and Mohammad Al Hasan. In-                       21(8):811–828, Oct 1989.
     teractive pattern mining on hidden data: a sampling-based solution. In Pro-          [34] Matt Duckham, Lars Kulik, Mike Worboys, and Antony Galton. Efficient
     ceedings of the 21st ACM international conference on Information and knowledge            generation of simple polygons for characterizing the shape of a set of points
     management, pages 95–104. ACM, 2012.                                                      in the plane. Pattern Recognition, 41(10):3224 – 3236, 2008.
[10] Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, Eric Simon, Fabian Colque                 [35] M.J. Fadili, M. Melkemi, and A. ElMoataz. Non-convex onion-peeling using a
     Zegarra, João LD Comba, and Viviane Moreira. Userdev: A mixed-initiative                  shape hull algorithm. Pattern Recognition Letters, 25(14):1577 – 1585, 2004.
     system for user group analytics. In Proceedings of the Workshop on Human-In-         [36] Avi Arampatzis, Marc van Kreveld, Iris Reinbacher, Christopher B. Jones,
     the-Loop Data Analytics, pages 1–8, 2019.                                                 Subodh Vaid, Paul Clough, Hideo Joho, and Mark Sanderson. Web-based
[11] Niranjan Kamat, Prasanth Jayachandran, Karthik Tunga, and Arnab Nandi.                    delineation of imprecise regions. Computers, Environment and Urban Systems,
     Distributed and interactive cube exploration. In Data Engineering (ICDE), 2014            30(4):436 – 459, 2006. Geographic Information Retrieval (GIR).
     IEEE 30th International Conference on, pages 472–483. IEEE, 2014.                    [37] Antony Galton and Matt Duckham. What is the region occupied by a set of
[12] Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, and Alexandre Termier. Inter-                  points? In Martin Raubal, Harvey J. Miller, Andrew U. Frank, and Michael F.
     active user group analysis. In CIKM, pages 403–412. ACM, 2015.                            Goodchild, editors, Geographic Information Science, pages 81–98, Berlin, Hei-
[13] Behrooz Omidvar-Tehrani, Plácido A Souza Neto, Felipe M Freire Pontes, and                delberg, 2006. Springer Berlin Heidelberg.
     Francisco Bento. Geoguide: An interactive guidance approach for spatial data.        [38] Vincent Leroy, Berkant Barla Cambazoglu, and Francesco Bonchi. Cold start
     In Internet of Things (iThings) and IEEE Green Computing and Communications               link prediction. In Proceedings of the 16th ACM SIGKDD International Con-
     (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and                     ference on Knowledge Discovery and Data Mining, Washington, DC, USA, July
     IEEE Smart Data (SmartData), 2017 IEEE International Conference on, pages                 25-28, 2010, pages 393–402, 2010.
     1112–1117. IEEE, 2017.                                                               [39] Behrooz Omidvar-Tehrani, Sruthi Viswanathan, Frederic Roulland,
[14] Mon Chu Chen, John R. Anderson, and Myeong Ho Sohn. What can a mouse                      and Jean-Michel Renders.            SAGE: Interactive state-aware point-of-
     cursor tell us more?: Correlation of eye/mouse movements on web browsing.                 interestrecommendation. In WSDM Workshop on State-based User Modelling,
     In CHI ’01 Extended Abstracts on Human Factors in Computing Systems, CHI                  2020.
     EA ’01, pages 281–282, New York, NY, USA, 2001. ACM.                                 [40] Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya
[15] Ioannis Arapakis, Mounia Lalmas, and George Valkanas. Understanding                       Parameswaran. Effortless data exploration with zenvisage: an expressive
     within-content engagement through pattern analysis of mouse gestures. In                  and interactive visual analytics system. Proceedings of the VLDB Endowment,
     Proceedings of the 23rd ACM International Conference on Conference on Infor-              10(4):457–468, 2016.
     mation and Knowledge Management, CIKM ’14, pages 1439–1448, New York,
     NY, USA, 2014. ACM.
[16] Jacek Malczewski and Claus Rinner. Multicriteria decision analysis in geo-
     graphic information science. Springer, 2015.