Exploration of Interesting Dense Regions on Spatial Data Behrooz Omidvar-Tehrani Plácido A. Souza Neto NAVER LABS Europe (France) Federal Institute of Rio Grande do Norte (Brazil) behrooz.omidvar-tehrani@naverlabs.com placido.neto@ifrn.edu.br Francisco B. Silva Júnior Felipe F. Pontes Federal University of Rio Grande do Sul (Brazil) Federal Institute of Rio Grande do Norte (Brazil) fbsjunior@inf.ufrgs.br freire.pontes@academico.ifrn.edu.br ABSTRACT However, there has been less attention to the “value” derived Nowadays, spatial data are ubiquitous in various fields of science, from spatial data. Despite the decent progress on the efficiency such as transportation and smart city management. A recent front, a user may easily get lost in the plethora of geographical research direction in analyzing spatial data is to provide means points of interest (POIs), mainly due to two following reasons: for “exploratory analysis” of such data where users are guided • In an exploratory context, the user doesn’t know a priori what towards interesting options in consecutive analysis iterations. to investigate next. Typically, the guidance component learns user’s preferences us- • Moreover, he/she may easily get distracted and miss interesting ing his/her explicit feedback, e.g., picking a spatial point of in- POIs by visual clutter caused by huge overlaps among POIs. terest (POI) or selecting a region of interest. However, it is often the case that users forget or don’t feel necessary to explicitly ex- The main drawback of the traditional analysis models is that press their feedback in what they find interesting. Our approach the user has a passive role in the process. In other words, the captures implicit feedback on spatial data. The approach consists user’s feedback (i.e., his/her likes and dislikes) is ignored and of observing mouse moves (as a means of user’s interaction) to only the input query (i.e., his/her explicit request) is served. In discover interesting spatial regions with dense mouse hovers. In case feedback is incorporated, the process can be more directed this paper, we define, formalize, and explore Interesting Dense towards user’s interests where his/her partial needs can be served Regions (IDRs) which capture preferences of users, in order to earlier in the process [5]. In this paper, we advocate for a “guid- automatically find interesting spatial highlights. Our approach ance layer” on top of the raw visualization of spatial data to involves a polygon-based abstraction layer for capturing pref- enable users know “what to see next”. This guidance should be erences. Using these IDRs, we highlight POIs to guide users in a function of user feedback: the system should return options the analysis process. We discuss the efficiency and effectiveness similar to what the user has already appreciated. of our approach through realistic examples and experiments on Various approaches in the literature propose methodologies to Airbnb and Yelp datasets. incorporate user’s feedback in the exploration process of spatial data [6–10]. Typically, feedback is considered as a function which 1 INTRODUCTION is triggered by any user’s action on the map. The action can be Nowadays, there has been a meteoric rise in the generation of “selecting a POI”, “moving to a region”, “asking for more details”, spatial datasets in various fields of science, such as transporta- etc. The function then updates a “profile vector” which records tion, lodging services, and smart city management [1, 2]. As each user’s interests in a transparent way. The updated content in the record in spatial data represents an activity in a precise geograph- profile vector enables the guidance functionality. For instance, if ical location, analyzing such data enables discoveries grounded the user shows interest in a POI which describes a house with on facts. Typically, spatial data analysis begins with an impre- balcony, this choice of amenity will reflect his/her profile to cise question in the mind of the user, i.e., exploratory analysis. prioritize other houses with balcony in future iterations. The user requires to go through several trial-and-error iterations Feedback is often expressed explicitly, i.e., the user clicks on to improve his/her understanding of the spatial data and gain a POI and mentions if he/she likes or dislikes the POI [11–13]. insights. Each iteration involves visualizing a subset of data on However, there are several cases that the feedback is expressed geographical maps using an off-the-shelf product (e.g., Tableau1 , implicitly, i.e., the user does not explicitly click on a POI, but Exhibit2 , Spotfire3 ) where the user can investigate on different there exist correlations with other signals captured from the user parts of the visualization by zooming in/out and panning. which provide hint on his/her interest. For instance, it is often Spatial data are often voluminous. Hence the focus in the liter- the case in spatial data analysis that users look at some regions ature of spatial data analysis is on “efficiency”, i.e., enabling fluid of interest but do not provide an explicit feedback. Another ex- means of navigation in spatial data to facilitate the exploratory ample is frequent mouse moves around a region which is a good analysis. The common approach is to design pre-computed in- indicator of the user’s potential interest in the POIs in that region. dexes which enable efficient retrieval of spatial data (e.g., [3, 4]). Users often hover their mouse in a region of interest to collect information on the background map (e.g., touristic places, parks, 1 http://www.tableau.com and nearby grocery stores, presented in the form of map layers 2 http://www.simile-widgets.org/exhibit/ 3 http://spotfire.tibco.com and tooltips) before landing on a decision about picking a POI (such as a hotel) in that region. Implicit feedbacks are more chal- © 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed- lenging to capture and hence less investigated in the literature. ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, The following example describes a use case of implicit feedbacks. Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- This will be our running example which we follow throughout tribution 4.0 International (CC BY 4.0) the paper. Example. Benício is planning to live in Paris for a sabbatical leave. 2 DATA MODEL He decides to rent a home-stay from Airbnb website4 . He likes to We consider two different layers on a geographical map: “spatial discover the city, hence he is open to any type of lodging in any layer” and “interaction layer”. The spatial layer contains POIs region with an interest to stay in the center of Paris. The website from a spatial database P. The interaction layer contains mouse returns 1500 different locations. As he has no other preferences, move points M. an exhaustive investigation needs scanning each location indepen- Spatial layer. Each POI p ∈ P is described using its coordinates, dently, which is nearly infeasible. While he is scanning the very first latitude and longitude, i.e., p = ⟨lat, lon⟩. Note that in this work, options, he shows interest in the region of Trocadero by hovering his we don’t consider “time” for POIs, as our contribution focuses on mouse around the Eiffel tower and checking the amenities within their location. POIs are also associated to a set of domain-specific that region. However, he forgets or doesn’t feel necessary to click attributes A. For instance, for a dataset of a real estate agency, a POI (i.e., a home-stay) in that region. An ideal system should POIs are properties (houses and apartments) and A contains at- capture this implicit feedback in order to short-list a small subset tributes such as “surface”, “number of rooms” and “price”. The set of locations that Benício should consider as high priority. of all possible values for an attribute a ∈ A is denoted as dom(a). The above example shows in practice that implicit feedback We also define user’s feedback F as a vector over all attribute val- −−−−−−−−−−−→ capturing is crucial in the context of spatial data analysis, as a POI ues (i.e., facets), i.e., F = ∪a ∈A dom(a). The vector F is initialized can easily remain out of sight and be missed. It is particularly the by zeros and will be updated to express user’s preferences. case in small-screen devices such as smart watches, smartphones, Interaction layer. Whenever the user moves his/her mouse, and tablets. a new point m is appended to the set M. Each mouse move In this paper, we present a simple yet effective approach whose point is described using the pixel position that it touches and the aim is to capture and analyze implicit feedback of users in spatial clock time of the move. Hence each mouse move point is a tuple data analysis. Without loss of generality, we focus on “mouse m = ⟨x, y, t⟩, where x and y specifies the pixel location and t is a moves” as the implicit feedback received from the user. Mouse Unix Epoch time. To conform with geographical standards, we moves are the most common way that users interact with geo- assume m = ⟨0, 0⟩ sits at the middle of the interaction layer, both graphical maps to collect information within a region of inter- horizontally and vertically. est [14], such as information provided in the background map The user is in contact with the interaction layer. To update (e.g., parks, theaters, shopping centers) and the tooltip informa- the feedback vector F , we need to translate pixel locations in the tion which pop up by hovering (e.g., information about the POIs interaction layer to latitudes and longitudes in the spatial layer. such as price and reviews). It is shown in [15] that mouse gestures We employ equirectangular projection to obtain the best possible have a strong correlation with “user engagement”. Intuitively, approximation of a point m = ⟨x, y, t⟩ ∈ M in the spatial layer, a POI gets a higher weight in the user’s profile if the mouse denoted as p(m). cursor moves around it frequently. However, our approach is independent from the type of feedback enabler and can be easily x p(m = ⟨x, y, t⟩) = ⟨lat = y + γ , lon = + θ⟩ (1) extended to other types such as gaze tracking, leap motions, as cosγ well as touch gestures in touch screens. The inverse operation, i.e., transforming a point p = ⟨lat, lon⟩ Contributions. In this paper, we make the following contribu- from the spatial layer to the interaction is done using Equation 2. tions: • We define and explore the notion of “implicit user feedback” m(p = ⟨lat, lon⟩) = ⟨x = (lon − θ ) × cosγ , y = lat − γ ⟩ (2) which enables a seamless navigation in spatial data. The reference point for the transformation is the center of • We define the notion of “information highlighting”, a mecha- both layers. In Equations 1 and 2, we assume that γ is the latitude nism to highlight out-of-sight important information for users. and θ is the longitude of a point in the spatial layer corresponding A clear distinction of our proposal with the literature is that it to the center of the interaction layer, i.e., m = ⟨0, 0⟩. doesn’t aim for pruning (such as top-k recommendation), but leveraging the actual data with potential interesting results 3 PROBLEM DEFINITION (i.e., highlights). When analyzing spatial data, users require to obtain only a few • We define and formalize the concept of Interesting Dense Re- options (so-called “highlights”) to focus on. These options should gions (IDRs), a polygon-based approach to explore and high- be in-line with what they have already appreciated. In this paper, light spatial data. we formulate the problem of “information highlighting using • We propose an efficient greedy approach to compute highlights implicit feedback”, i.e., highlight a few POIs based on implicit on-the-fly. interests of the user in order to guide him/her towards what • We show the effectiveness of our approach through a set of he/she should concentrate on in consecutive iterations of the qualitative experiments. analysis process. We formally define our problem as follows. The outline of the paper is the following. Section 2 describes Problem. Given a time tc and an integer constant k, update the our data model. In Section 3, we formally define our problem. feedback vector F using points m ∈ M where m.t ≤ tc , and choose Then in Section 4, we present our solution and its algorithmic k POIs Pk ⊆ P as “highlights” where Pk satisfies two following details. Section 5 reports our experiments on the framework. We constraints. review the related work in Section 6. We present some limitations (i) ∀p ∈ Pk , similarity(p, F ) is maximized. of our work in Section 7. Last, we conclude in Section 8. (ii) diversity(Pk ) is maximized. The first constraint guarantees that returned highlights are 4 http://www.airbnb.com highly similar with user’s interests captured in F . The second Algorithm 1: Spatial Highlighting Algorithm Algorithm 2: Find Interesting Dense Regions (IDRs) Input: Current time tc , mouse move points M Input: Current time tc , mouse move points M Output: Highlights Pk Output: IDRs S 1 S ← find_interesting_dense_regions(tc , M) // Section 4.1 1 S ← ∅ 2 Ps ← match_points(S, P) // Section 4.2 2 д ←number of time segments 3 F ← update_feedback_vector(F, Ps ) // Section 4.3 3 for i ∈ [1, д] do 4 Pk ← get_highlights(P, F ) // Section 4.4 4 Mi ← {m = ⟨x, y, t⟩|( tдc × i) ≤ t ≤ ( tдc × (i + 1))} 5 return Pk 5 Ci ← mine_clusters(Mi ) 6 Oi ← find_ploygons(Ci ) 7 end constraint ensures that k POIs cover different regions and they 8 for Oi , Oj where i, j ∈ [0, д] and i ̸= j do don’t repeat themselves. While our approach is independent from S.append(intersect(Oi , Oj )) the way that similarity and diversity functions are formulated, 9 return S we provide a formal definition of these functions in Section 4. The aforementioned problem is hard to solve due to the fol- lowing challenges. Challenge 1. First, it is not clear how mouse move points influ- Observation 2. It is possible that the user moves his/her mouse ence the feedback vector. Mouse moves occur on a separate layer everywhere in the map. This should not signify that all those and there should be some meaningful transformations to inter- locations have the same significance. pret mouse moves as potential changes in the feedback vector. Following our observations, we propose Algorithm 2 for min- Challenge 2. Analyzing all generated mouse moves is challeng- ing IDRs. We add points to M only every 200ms to prevent adding ing and may introduce false positives. A typical mouse with 1600 redundant points (i.e., Challenge 2). Following Observation 1 and DPI (Dots Per Inch) touches 630 pixels for one centimeter of in order to mine the recurring behavior of the user, the algorithm move. Hence a mouse move from the bottom to the top of a typi- begins by partitioning the set M into д fixed-length consecutive cal 13-inch screen would yield 14,427 points which may not be segments M1 to Mд . The first segment starts at time zero (where necessarily meaningful. the system started), and the last segment ends at tc , i.e., the cur- Challenge 3. Beyond the two first challenges, finding the most rent time. Following Observation 2, we then find dense clusters similar and diverse POIs with F needs an exhaustive scan of in each segment of M using a variant of DB-SCAN approach [17]. all POIs in P which is prohibitively expensive: in most spatial Finally, we return intersections among those clusters as IDRs. datasets, there exist millions of POIs. Moreover, we need to follow For clustering points in each time segment (i.e., line 5 of Al- bi-objective optimization considerations as we aim to optimize gorithm 2), we use ST-DBSCAN [18], a space-aware variant of both similarity and diversity at the same time [16]. DB-SCAN for clustering points based on density. For each subset We recognize the complexity of our problem using the afore- of mouse move points Mi , i ∈ [1, д], ST-DBSCAN begins with a mentioned challenges. In Section 4, we discuss a solution for the random point m 0 ∈ Mi and collects all density-reachable points discussed problem and its associated challenges. from m 0 using a distance metric. As mouse move points are in the 2-dimensional pixel space (i.e., the display), we choose euclidean 4 INTERESTING DENSE REGIONS distance as the distance metric. If m 0 turns out to be a core object, Our approach exploits user’s implicit feedback (i.e., mouse moves) a cluster will be generated. Otherwise, if m 0 is a border object, to highlight a few interesting POIs as future analysis directions. no point is density-reachable from m 0 and the algorithm picks Algorithm 1 summarizes the principled steps of our approach. another random point in Mi . The process is repeated until all of The algorithm begins by mining the set of mouse move points M the points have been processed. in the interaction layer to discover one or several Interesting Once clusters are obtained for all subsets of M, we find their Dense Regions, abbr., IDRs, in which most user’s interactions intersections to locate recurring regions (line 6). To obtain inter- occur (line 1). Then it matches the POIs P with IDRs using Equa- sections, we need to clearly define the spatial boundaries of each tion 2 in order to find POIs inside each region (line 2). The at- cluster. Hence for each cluster, we discover its corresponding tributes of resulting POIs will be exploited to update the user’s polygon that covers the points inside. For this aim, we employ feedback vector F (line 3). The updated vector F will then be Quickhull algorithm, a quicksort-style method which computes used to find k highlights (line 4). These steps ensure that the final the convex hull for a given set of points in a 2D plane [19]. highlights reflect user’s implicit interests. We detail each step as We describe the process of finding IDRs in an example. Figure 1 follows (Sections 4.1 to 4.4). shows the steps that Benício follows in our running example to explore home-stays in Paris. Figure 1.A shows mouse moves of 4.1 Discovering IDRs Benício in different time stages. In this example, we consider д = 3 The objective of this step is to obtain one or several regions in and capture Benício’s feedback in three different time segments which the user has expressed his/her implicit feedback. There (progressing from Figures 1.B to 1.D). It shows that Benício started are two observations for such regions. his search around Eiffel Tower and Arc de Triomphe (Figure 1.B) Observation 1. We conjecture that a region is more interesting and gradually showed interest in south (Figure 1.C) and north for the user if it is denser, i.e., the user moves his/her mouse in that (Figure 1.D) as well. All intersections between those clusters are region several times, to collect information from the background discovered (hatching regions in Figure 1.E) which will constitute map. the set of IDRs (Figure 1.F), i.e., IDR1 to IDR4. 3 2 1 A B C IDR2 IDR3 IDR1 IDR4 D E F Figure 1: The process of finding IDRs on Airbnb dataset. 4.2 Matching Points Being a function of mouse move points, IDRs are discovered in the interaction layer. We then need to find out which POIs in P fall into IDRs, hence forming the subset Ps . We employ Equation 2 to transform those POIs from the spatial layer to the interaction layer. Then a simple “spatial containment” function can verify which POIs fit into the IDRs. Given a POI p and an IDR r , a function contains(p, r ) returns “true” if p is inside r , oth- erwise “false”. In our case, we simply use the implementation of ST _Within(p, r ) module in PostGIS5 , i.e., our underlying spatial IDR1 IDR2 DBMS which hosts the data. In the vanilla version of our spatial containment function, all POIs should be checked against all IDRs. Obviously, this depletes the execution time. To prevent the exhaustive scan, we employ Quadtrees [20] in a two-step approach. (i) In an offline process, we build a Quadtree index for all POIs in P. We record the membership relations of POIs and cells in the index. (ii) When IDRs are discovered, we record which cells in the Quadtree index intersect with IDRs. As we often end up with IDR3 IDR4 few IDRs, the intersection verification performs fast. Then for matching POIs, we only check a subset which is inside the cells Figure 2: Matching POIs for IDR1 to IDR4. associated to IDRs and ignore the POIs outside. This leads to a drastic pruning of POIs in P. We follow our running example and illustrate the matching process in Figure 2. In the Airbnb dataset, POIs are home-stays in a feedback vector F . The vector is initialized by zero, i.e., the which are shown with their nightly price on the map. We observe user has no preference at the beginning. We update F using the that there exist many matching POIs with IDR3 and absolutely attributes of the POIs in Ps . We enable transparency by choosing no matching POI for IDR2. For IDR4, although there exist many F ’s schema to be defined on the POI attributes. In exploratory home-stays below the region, we never check their containment, analysis scenarios, it is often the case that users do not trust in as they belong to a Quadtree cell which doesn’t intersect with what they get from the system (i.e., algorithmic anxiety [21]) the IDR. and want to know what has been learned from them. Having a transparent user profile enables an easy examination of its 4.3 Updating user Feedback Vector content by the user. The set of matching POIs Ps (line 2 of Algorithm 1) depicts the We consider an increment value δ to update F . If p ∈ Ps gets v 1 implicit preference of the user. We keep track of this preference for attribute a 1 , we augment the value in the F ’s cell of ⟨a 1, v 1 ⟩ by δ . Note that we only consider incremental feedback, i.e., we 5 https://postgis.net/docs/manual-dev/ST_Within.html never decrease a value in F . Table 1: Attributes of POIs in IDR1. diversity. First, the highlights should be in the same direction of the user’s implicit feedback, hence similar to the vector F . The ID Price #Beds Balcony Air-cond. Rating similarity between a POI p ∈ P and the vector F is defined as 1 130e 1 Yes Yes 5/5 follows. 2 58e 1 Yes No 5/5 3 92e 2 Yes No 5/5 similarity(p, F ) = avg a ∈A (sim(p, F, a)) (3) 4 67e 1 Yes No 4/5 The sim() function can be any function such as Jaccard or Cosine. Each attribute can have its own similarity function (as Table 2: Updating user Feedback Vector string and integer attributes are compared differently.) Then sim() works as an overriding-function which provides encapsulated Attribute-value Applying IDR 1 Normalized similarity computations for any type of attribute. ⟨#Beds,1⟩ +3δ 0.19 Second, highlighted POIs should also represent distinct di- ⟨#Beds,2⟩ +δ 0.06 rections so that the user can observe different aspects of data ⟨#Beds,+2⟩ (no update) 0.00 and decide based on the big picture. Given a set of POIs Pk = ⟨Balcony,Yes⟩ +4δ 0.25 {p1, p2 . . . pk } ⊆ P, we consider a pair-wise definition of diver- ⟨Balcony,No⟩ (no update) 0.00 sity as follows. ⟨Air-cond.,Yes⟩ +δ 0.06 ⟨Air-cond.,No⟩ +3δ 0.19 diversity(Pk ) = avg {p,p ′ } ⊂ Pk |p̸=p ′ distance(p, p ′ ) (4) ⟨Rating,1⟩ (no update) 0.00 ⟨Rating,2⟩ (no update) 0.00 The function distance(p, p ′ ) operates on geographical coordi- ⟨Rating,3⟩ (no update) 0.00 nates of p and p ′ and can be considered as any distance function ⟨Rating,4⟩ +δ 0.06 of Minkowski distance family. However, as distance computa- ⟨Rating,5⟩ +3δ 0.19 tions are done in the spherical space, a natural choice is to employ Haversine distance. To solve our bi-objective problem (i.e., maximizing similar- We explain the process of updating the feedback vector using a ity and diversity), we employ the ϵ-constraint method [22], i.e., toy example. Given the four matched POIs in IDR1 (Figure 2) with we maximize diversity while satisfying a linear constraint for prices 130e, 58e, 92e and 67e, we want to update the vector F similarity which only accepts values larger than or equal to ϵ. given those POIs. A few attributes of these POIs are mentioned Algorithm 3 describes our approach for highlighting k similar in Table 1. In practice, there are often more than 50 attributes for and diverse POIs. We propose a best-effort greedy approach to POIs. The cells of F are illustrated in the first column of Table 2. efficiently compute highlighted POIs. We consider an offline step As three POIs get the value “1” for the attribute “#Beds”, then the followed by the online execution of our algorithm. value in cell ⟨#Beds,1⟩ is augmented three times by δ . The same In order to speed up the similarity computation in the online process is repeated for all attribute-values of POIs in Ps . Note execution, we pre-compute an inverted index for each single POI that all cells of F are not necessarily touched in the feedback p ∈ P in the offline step (as is commonly done in the Web search). update process. For instance, in the above example, 5 cells out of Each index Lp for the POI p keeps all other POIs in P in decreas- 12 remain unchanged. ing order of their similarity with p. While index computation By specifying an increment value, we can materialize the up- may be extremely time-consuming, we stop generating those dates and normalize the vector using a Softmax function. Given lists if the similarity value is lower than the ϵ threshold. δ = 1.0, the normalized values of the F vector is illustrated in the The first step of Algorithm 3 is to find the most similar POI to F , third column of Table 2. Higher values of δ increase the influence so-called p ∗ . The POI p ∗ is the closest possible approximation of feedbacks. of F in order to exploit pre-computed similarities. The algorithm The normalized content of the vector F captures the implicit makes sequential accesses to Lp ∗ (i.e., the inverted index of the preferences of the user. For instance, the content of F after ap- POI p ∗ ) to greedily maximize diversity. Algorithm 3 does not plying POIs in IDR1 shows that the user has a high interest in sacrifice efficiency in price of value. We consider a time limit having a balcony in his/her home-stay, as his/her score for the parameter which determines when the algorithm should stop cell ⟨Balcony,Yes⟩ is 0.25, i.e., the highest among other cells. This seeking maximized diversity. Scanning inverted indexes guaran- reflects the fact that all POIs in IDR1 has balcony. It is impor- tees the similarity maximization even if time limit is chosen to be tant to note that although we only consider positive feedback, very restrictive. Our observations with several spatial datasets the Softmax function lowers the values of untouched cells once show that we achieve the diversity of more than 0.9 with time other cells get rewarded. The updated feedback vector is fully limit set to 200ms. transparent and the user can easily comprehend what has been In line 2 of Algorithm 3, Pk is initialized with the k highest learned from his/her previous actions. ranking POIs in Lp ∗ . Function get_next(Lp ∗ ) (line 3) returns the next POI pnext in Lp ∗ in sequential order. Lines 4 to 12 iterate 4.4 Generating Highlights over the inverted indexes to determine if other POIs should be The ultimate goal is to highlight k POIs to guide users in analyzing considered to increase diversity while staying within the time their spatial data. The updated feedback vector F is the input to limit. the highlighting phase. We assume that POIs in IDRs are already The algorithm looks for a candidate POI pcurrent ∈ Pk to investigated by the user. Hence our search space for highlighting replace in order to increase diversity. The boolean function is P \ Ps . We seek two properties in k highlights: similarity and diversity_improved() (line 6) checks if by replacing pcurrent by Algorithm 3: Get k similar and diverse highlights Table 3: Interactions of “novice” and “expert” participants get_highlights() T1/I1 T2/I1 T1/I2 T2/I2 Input: POIs P, Feedback vector F , k, time_limit Output: Pk Novices using Airbnb 372s 421s 365s 430s 1 p ← max_sim_to(P, F ) ∗ Experts using Airbnb 253s 302s 249s 298s 2 Pk ← top_k(Lp∗ , k) Novices using highlights 19s 24s 20s 25s Experts using highlights 14s 17s 13s 17s 3 pnex t ← дet_next(Lp∗ ) 4 while time_limit not exceeded ∨ pnex t = ∅ do 5 for pcur r ent ∈ Pk do their goals in a reasonable time. Also expert participants need on 6 if diversity_improved(Pk , pnex t , pcur r ent ) then average 6.25 seconds less time than the novice participants when 7 Pk ← replace(Pk , pnex t , pcur r ent ) using the highlights. Interestingly, starting POIs, i.e., I1 and I2, do 8 break not have a huge impact on the number of steps. It is potentially 9 end due to the diversity component which provides distinct options 10 end and can quickly steer users towards their region of interest. We 11 pnex t ← дet_next(Lp∗ ) also observe that the task T1 is an easier task than T2, as on aver- 12 end age it took less time to be accomplished. This is potentially due 13 return Pk to where the user can request options similar to what he/she has already observed and greedily move to his/her preferred regions. pnext in Pk , the overall diversity of the new Pk increases. It is 5.2 Effectiveness important to highlight that for each run of the algorithm, we In the second part of our experiments, we analyze the effective- only focus on one specific inverted list associated to the input ness of our approach in generating IDRs for information high- POI. Algorithm 3 verifies the similarity and diversity of each POI lighting. We employ two different datasets, i.e., Airbnb and Yelp6 . with all other POIs, and then processes the normalization. We pick a similar subset from both datasets, i.e., home-stays and restaurants in Paris, respectively. We consider four different sizes 5 EXPERIMENTS of those datasets, i.e., 100, 1000, 2000 and 4000 POIs, respectively. We discuss two sets of experiments. The first set is on the useful- For each size of the datasets, we manually perform 20 sessions, ness of our approach. Then we focus more on IDR discovery and and then present some statistics as the average of the sessions. present some statistics and insights on the functionality of our We limit each session to 2 minutes where we seek for interesting approach. The experiments are done on a 2.8 GHz Intel Core i5 POIs in the datasets. We capture the following information in machine running a Mac OS operating system. each session: • The number of regions created from the mouse moves during 5.1 Usefulness the session; First off, we validate the usefulness of our approach. For this aim, • The number of generated IDRs (intersection of regions); we design a user study with 30 participants who are all students • The number of POIs from the dataset presented in each IDR; of Computer Science. Given an exploratory task to fulfill, we • The coverage of POIs (in the dataset) with IDRs collectively. hypothesize that there are three prominent factors which influ- Tables 4 and 5 show the results for Airbnb and Yelp, respec- ence the usefulness of our system: user expertise, type of the task, tively. In Table 4, we observe that the number of regions decreases and starting point. To examine the user expertise, we recruited when the number of POIs increases. On average, 7.7 regions are 15 “novice” participants who don’t know the location under in- constructed per session. The average number of POIs presented vestigation, and also 15 “experts.” We also define two different in IDRs is 25.97, which shows that our approach highlights at types of tasks on the Airbnb dataset of Paris with 1,000 POIs: T1: least 8.05% of POIs from the dataset, on average. We notice an “finding a POI in a requested location” (e.g., find a home-stay in the outlier in the experiment with 2000 POIs in Table 4. This hap- “Champ de Mars” area), and T2: “finding a POI with a requested pened due the fact that the user concentrated in a very small profile” (e.g., find a cheap home-stay.) Last, the participants may area generating a smaller number of IDRs, and consequently a have different starting points, i.e., they may begin their naviga- smaller number of POIs. tion from a POI which is either I1: “geographically close to the More uniform results are observed in Table 5, i.e., for Yelp goal” or I2: “far from the goal”. For each participant, we report dataset vis-à-vis Airbnb. The average number of generated re- a variant of time-to-insight measure, i.e., how long (in seconds) gions reaches 12.75 per session. Also, the number of regions the participants interact with the tool before fulfilling the task. decreases by increasing the number of POIs. The number of POIs Evidently, less number of interactions are preferred as it means presented in IDRs is on average 108.65 and it represents on aver- that the participant can reach insights faster. age 13.11% of POIs highlighted from the dataset. The participants perform the tasks T 1 and T 2 on two differ- ent tools: Airbnb website (without information highlighting and 6 RELATED WORK without incorporation of implicit feedbacks, as our baseline) and To the best of our knowledge, the problem of spatial information our highlighting tool. Table 3 shows the results. Our initial obser- highlighting using implicit feedback has not been addressed be- vation is that the highlights enable users to fulfill tasks one order fore in the literature. However, our work relates to a few others of magnitude faster than the baseline without the highlights. in their semantics. This shows that the highlighting approach with implicit feedback 6 https:// www.yelp.com/ dataset capturing is an effective mechanism which helps users to reach Table 4: IDR statistics on Airbnb dataset a set of input POIs on a plane. In [36, 37], imprecise regions are delineated into a convex or concave polygon. In our approach, it # POIs # regions # IDRs # POIs in IDRs % POIs is important to discover regions by capturing mouse move POIs. 100 11.35 10.05 29.40 29.40% In case a concave polygon is constructed, the “dents” of such 1000 10.75 6.75 11.70 1.17% a polygon may entail POIs which are not necessarily in M. In 2000 7.37 3.63 5.63 0.003% the IDR’s algorithm, however, we adapt Quickhull [19], due its 4000 1.30 1.15 53.15 1.33% simplicity, efficiency and its natural implementation of convex average 7.69 7.64 25.97 8.05% polygons. Table 5: IDR statistics on Yelp dataset 7 LIMITATIONS In this paper, we presented a solution for highlighting out-of- # POIs # regions # IDRs # POIs in IDRs % POIs sight information using a polygon-based approach for capturing 100 14.90 7.55 28.30 28.30% implicit feedbacks. To the best of our knowledge, our work is the 1000 13.90 10.00 149.55 14.96% first effort towards formalizing and implementing information 2000 11.05 9.80 111.05 5.55% highlighting using implicit feedback. However, we consider our 4000 10.45 8.55 145.7 3.64% work as an on-going effort where we envision to address some average 12.57 8.97 108.65 13.11% limitations in the future, such as “customizability”, “performance”, “cold start”, and “quantitative experiments”. Custom maps and feedbacks. One limitation is about the “cus- Information Highlighting. The literature contains a few in- tomizable” use of geographical maps as an interaction means. As stances of information highlighting approaches [23–26]. How- we only consider static maps, we plan to work on translations and ever, all these methods are objective, i.e., they assume that user’s rotations as a future work. We also consider adapting other types preferences are given as a constant input and will never change of feedback enablers such as gaze tracking and touch gestures as in the future. This limits their functionality for serving scenarios a direction of future work. Our research will investigate poten- of exploratory analysis. The only way to fulfill “spatial guidance” tial challenges which may arise when employing those feedback is to consider the evolutionary and subjective nature of user’s enablers. feedback. In our approach, the feedback vector gets updated in Mobility. Our current focus is only on POIs, and hence trajecto- time based on the implicit feedback of the user. ries and areas of mobility are not incorporated directly. While Online recommendation approaches can also be considered as trajectories and areas can be broken into a subset of POIs (and an information highlighting approach where recommended items hence be easily fed to our approach), we envision to enrich our count as highlights. Most recommendation algorithms are space- model in a way that those mobility elements are considered as agnostic and do not take into account the spatial information. first-class citizens. While a few approaches focus on the spatial dimension [27–29], Cold start. Our problem bears similarities with recommendation they still lack the evolutionary feedback capturing. Moreover, algorithms where the quality of the output may be influenced most recommendation methods miss “result diversification”, i.e., by scarce availability of input. This problem is referred to as the highlights may not be useful due to overlaps. cold start problem [38, 39]. While there is no guarantee for a Feedback Capturing. Several approaches are proposed in the meaningful highlight in case of the complete absence of implicit state of the art for capturing different forms of feedback [8, 9, 11, feedbacks, our approach can return a reasonable set of highlights 12, 30, 31]. The common approach is a top-k processing method- even with one single iteration of mouse moves. In the future, we ology in order to prune the search space based on the explicit envision to tackle the no-input challenge by leveraging statis- feedback of the user and return a small subset of interesting re- tical properties of the spatial data to obtain a default view for sults of size k. A clear distinction of our proposal is that it doesn’t highlights. aim for pruning, but leveraging the actual data with potential Performance. Another limitation is the medium-size datasets interesting results (i.e., highlights) that the user may miss due to be processed. Our algorithm processes similarity and diversity to the huge volume of spatial data. Moreover, in a typical top- in an O(n 2 ) complexity. Also Quickhull [19] uses a divide and k processing algorithm, user’s choices are limited to k. On the conquer approach similar to that of Quicksort, and its worst com- contrary, our IDR approach enables a freedom of choice where plexity is O(n2 ). While processing a 10K-POI dataset is straight- highlights get seamlessly updated with new user’s choices. forward in our framework, we plan to experiment with larger Few works formulate fusing approaches of explicit and im- datasets in the future by improving our algorithms towards better plicit feedbacks to better capture user preferences [6, 7, 32]. Our performance. Another direction for future work is to consider approach functions purely on implicit feedback and does not experiments which measure the quantitative and qualitative in- require any sort of explicit signal from the user. fluence of each component separately. Region Discovery. Our approach finds interesting dense re- gions (IDRs) in order to derive user’s implicit preferences. There 8 CONCLUSION exist several approaches to infer a spatial region for a given set In this paper, we present an approach to explore Interesting of POIs [19, 33–37]. The common approach is to cluster POIs in Dense Regions (IDRs) using implicit feedback in order to detect form of concave and convex polygons. In [33], an algorithm is user latent preferences. The implicit feedbacks are captured from proposed to verify if a given POI p on the surface of a sphere is lo- mouse moves of users over the geographical map while analyzing cated inside, outside, or along the border of an arbitrary spherical spatial data. We formalize a novel polygon-based mining algo- polygon. In [34, 35], a non-convex polygon is constructed from rithm which returns a few highlights in-line with user’s implicit preferences. The highlights enable users to focus on what matters [17] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density- the most and prevent information overload. based algorithm for discovering clusters a density-based algorithm for discov- ering clusters in large spatial databases with noise. In Proceedings of the Second We consider various future directions for this work, including International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 226–231. AAAI Press, 1996. the limitations of our current approach which we listed in Sec- [18] Derya Birant and Alp Kut. St-dbscan: An algorithm for clustering spatial- tion 7. First, we are interested to incorporate an “explainability” temporal data. Data Knowl. Eng., 60(1):208–221, January 2007. [19] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull component which can describe causalities behind preferences. algorithm for convex hulls. ACM Trans. Math. Softw., 22(4):469–483, December For instance, we are interested to find seasonal patterns to see 1996. why the preferences of users change from place to place during [20] Raphael A. Finkel and Jon Louis Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 4(1):1–9, 1974. various seasons of the year. Another direction is to incorporate [21] Shagun Jhaver, Yoni Karpfen, and Judd Antin. Algorithmic anxiety and coping “Query by Visualization” approaches, where users can specify strategies of airbnb hosts. In Proceedings of the 2018 CHI Conference on Human their intents alongside their implicit preferences, directly on the Factors in Computing Systems, page 421. ACM, 2018. [22] R Timothy Marler and Jasbir S Arora. Survey of multi-objective optimiza- map [40]. tion methods for engineering. Structural and multidisciplinary optimization, 26(6):369–395, 2004. REFERENCES [23] J. Liang and M. L. Huang. Highlighting in information visualization: A survey. In 2010 14th International Conference Information Visualisation, July 2010. [1] John F. Roddick, Max J. Egenhofer, Erik G. Hoel, Dimitris Papadias, and Betty [24] Anthony C. Robinson. Highlighting in geovisualization. Cartography and Salzberg. Spatial, temporal and spatio-temporal databases - hot issues and Geographic Information Science, 38(4):373–383, 2011. directions for phd research. SIGMOD Record, 33(2):126–131, 2004. [25] Kanit Wongsuphasawat, Dominik Moritz, Anushka Anand, Jock Mackinlay, [2] Aditya Telang, Deepak Padmanabhan, and Prasad Deshpande. Spatio-temporal Bill Howe, and Jeffrey Heer. Voyager: Exploratory analysis via faceted brows- indexing: Current scenario, challenges and approaches. In Proceedings of the ing of visualization recommendations. TVCG, 22(1), 2016. 18th International Conference on Management of Data, COMAD ’12, pages [26] Wesley Willett, Jeffrey Heer, and Maneesh Agrawala. Scented widgets: Im- 9–11, Mumbai, India, India, 2012. Computer Society of India. proving navigation cues with embedded visualizations. IEEE Transactions on [3] Lauro Lins, James T Klosowski, and Carlos Scheidegger. Nanocubes for real- Visualization and Computer Graphics, 13(6):1129–1136, 2007. time exploration of spatiotemporal datasets. IEEE Transactions on Visualization [27] Jie Bao, Yu Zheng, David Wilkie, and Mohamed Mokbel. Recommendations in and Computer Graphics, 19(12):2456–2465, 2013. location-based social networks: a survey. GeoInformatica, 19(3):525–565, 2015. [4] Jia Yu, Zongsi Zhang, and Mohamed Sarwat. Spatial data management in [28] Justin J. Levandoski, Mohamed Sarwat, Ahmed Eldawy, and Mohamed F. apache spark: the geospark perspective and beyond. GeoInformatica, pages Mokbel. Lars: A location-aware recommender system. In ICDE, pages 450– 1–42, 2018. 461, 2012. [5] W. Bruce Croft. The importance of interaction for information retrieval. In [29] Marina Drosou and Evaggelia Pitoura. Disc diversity: result diversification Proceedings of the 42nd International ACM SIGIR Conference on Research and based on dissimilarity and coverage. PVLDB, 6(1):13–24, 2012. Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, [30] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. Aide: an active pages 1–2, 2019. learning-based approach for interactive data exploration. IEEE Transactions [6] Andrea Ballatore and Michela Bertolotto. Semantically enriching vgi in sup- on Knowledge and Data Engineering, 28(11):2842–2856, 2016. port of implicit feedback analysis. In Katsumi Tanaka, Peter Fröhlich, and [31] Mario Boley, Michael Mampaey, Bo Kang, Pavel Tokmakov, and Stefan Wro- Kyoung-Sook Kim, editors, Web and Wireless Geographical Information Systems, bel. One click mining: Interactive local pattern discovery through implicit pages 78–93, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. preference and performance learning. In Proceedings of the ACM SIGKDD [7] Nathan N. Liu, Evan W. Xiang, Min Zhao, and Qiang Yang. Unifying explicit Workshop on Interactive Data Exploration and Analytics, pages 27–35. ACM, and implicit feedback for collaborative filtering. In Proceedings of the 19th 2013. ACM International Conference on Information and Knowledge Management, [32] Eoin Mac Aoidh, Michela Bertolotto, and David C. Wilson. Analysis of implicit CIKM ’10, pages 1445–1448, New York, NY, USA, 2010. ACM. interest indicators for spatial data. In 15th ACM International Symposium on [8] Dong Xin, Xuehua Shen, Qiaozhu Mei, and Jiawei Han. Discovering interesting Geographic Information Systems, ACM-GIS 2007, November 7-9, 2007, Seattle, patterns through user’s interactive feedback. In Proceedings of the 12th ACM Washington, USA, Proceedings, page 47, 2007. SIGKDD international conference on Knowledge discovery and data mining, [33] Michael Bevis and Jean-Luc Chatelain. Locating a point on a spherical surface pages 773–778. ACM, 2006. relative to a spherical polygon of arbitrary shape. Mathematical Geology, [9] Mansurul Bhuiyan, Snehasis Mukhopadhyay, and Mohammad Al Hasan. In- 21(8):811–828, Oct 1989. teractive pattern mining on hidden data: a sampling-based solution. In Pro- [34] Matt Duckham, Lars Kulik, Mike Worboys, and Antony Galton. Efficient ceedings of the 21st ACM international conference on Information and knowledge generation of simple polygons for characterizing the shape of a set of points management, pages 95–104. ACM, 2012. in the plane. Pattern Recognition, 41(10):3224 – 3236, 2008. [10] Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, Eric Simon, Fabian Colque [35] M.J. Fadili, M. Melkemi, and A. ElMoataz. Non-convex onion-peeling using a Zegarra, João LD Comba, and Viviane Moreira. Userdev: A mixed-initiative shape hull algorithm. Pattern Recognition Letters, 25(14):1577 – 1585, 2004. system for user group analytics. In Proceedings of the Workshop on Human-In- [36] Avi Arampatzis, Marc van Kreveld, Iris Reinbacher, Christopher B. Jones, the-Loop Data Analytics, pages 1–8, 2019. Subodh Vaid, Paul Clough, Hideo Joho, and Mark Sanderson. Web-based [11] Niranjan Kamat, Prasanth Jayachandran, Karthik Tunga, and Arnab Nandi. delineation of imprecise regions. Computers, Environment and Urban Systems, Distributed and interactive cube exploration. In Data Engineering (ICDE), 2014 30(4):436 – 459, 2006. Geographic Information Retrieval (GIR). IEEE 30th International Conference on, pages 472–483. IEEE, 2014. [37] Antony Galton and Matt Duckham. What is the region occupied by a set of [12] Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, and Alexandre Termier. Inter- points? In Martin Raubal, Harvey J. Miller, Andrew U. Frank, and Michael F. active user group analysis. In CIKM, pages 403–412. ACM, 2015. Goodchild, editors, Geographic Information Science, pages 81–98, Berlin, Hei- [13] Behrooz Omidvar-Tehrani, Plácido A Souza Neto, Felipe M Freire Pontes, and delberg, 2006. Springer Berlin Heidelberg. Francisco Bento. Geoguide: An interactive guidance approach for spatial data. [38] Vincent Leroy, Berkant Barla Cambazoglu, and Francesco Bonchi. Cold start In Internet of Things (iThings) and IEEE Green Computing and Communications link prediction. In Proceedings of the 16th ACM SIGKDD International Con- (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and ference on Knowledge Discovery and Data Mining, Washington, DC, USA, July IEEE Smart Data (SmartData), 2017 IEEE International Conference on, pages 25-28, 2010, pages 393–402, 2010. 1112–1117. IEEE, 2017. [39] Behrooz Omidvar-Tehrani, Sruthi Viswanathan, Frederic Roulland, [14] Mon Chu Chen, John R. Anderson, and Myeong Ho Sohn. What can a mouse and Jean-Michel Renders. SAGE: Interactive state-aware point-of- cursor tell us more?: Correlation of eye/mouse movements on web browsing. interestrecommendation. In WSDM Workshop on State-based User Modelling, In CHI ’01 Extended Abstracts on Human Factors in Computing Systems, CHI 2020. EA ’01, pages 281–282, New York, NY, USA, 2001. ACM. [40] Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya [15] Ioannis Arapakis, Mounia Lalmas, and George Valkanas. Understanding Parameswaran. Effortless data exploration with zenvisage: an expressive within-content engagement through pattern analysis of mouse gestures. In and interactive visual analytics system. Proceedings of the VLDB Endowment, Proceedings of the 23rd ACM International Conference on Conference on Infor- 10(4):457–468, 2016. mation and Knowledge Management, CIKM ’14, pages 1439–1448, New York, NY, USA, 2014. ACM. [16] Jacek Malczewski and Claus Rinner. Multicriteria decision analysis in geo- graphic information science. Springer, 2015.