Flexible Querying in Geo-Finder Gloria Bordogna1 , Giuseppe Psaila2 1 CNR-IDPA - via Pasubio 5, I-24044 Dalmine (BG) (Italy) gloria.bordogna@idpa.cnr.it 2 Università di Bergamo - viale Marconi 5, I-24044 Dalmine (BG) (Italy) psaila@unibg.it Abstract. The evaluation of queries specifying both content based con- ditions and spatial conditions on documents contents in Geographic In- formation Retrieval requires representing the vagueness and context de- pendency of spatial conditions and the personal user’s preferences. The Geo-Finder system [1] implements a Geo-Retrieval model that eval- uates flexible spatial queries combined with content queries. The spa- tial condition is interpreted as the soft constraint “close” on the user’s perceived distance. Two distinct semantics can be used to combine the spatial and the content conditions: and possibly or average; in both cases it is possible to modify the relative weight (preference) of conditions. Keywords: Geographic Information Retrieval, Fuzzy aggregation oper- ators, context dependent spatial query, soft constraint. 1 Introduction An important issue in GIR is the problem of spatial querying [2, 5, 3], intended as supporting the distinct information needs of users that may access the same collection for different purposes. To address it, GIRs must be developed to take user’s preferences into account, to rank query results in terms of relevance [4]. In the Geo-Finder system [1], we devised a Geo-Retrieval model for flexible querying a GIR, such that: the user expresses the spatial condition based on the “close” soft constraint, adapting the spatial scope to the perceived meaning of spatial conditions; the user expresses preferences on how to combine the content conditions with the spatial conditions. In the spatial condition, the user’s context is modeled as user’s perceived distance measure, that modifies the spatial scope of the query. Two distinct semantics are provided for flexibly combining the content condi- tion and the spatial condition: the asymmetric and possibly aggregation combines the mandatory content condition with the optional spatial condition; the com- pensative average aggregation linearly combines the two conditions. The relative weight between the conditions can be specified to achieve personalization. 2 The Geo-Retrieval model In this paper, we present the Geo-Retrieval model devised in Geo-Finder. It is based on the concept of Fuzzy Footprint, that represents the degree with which a geographic reference is relevant for a document: for each indexed document, the Geo-Indexer [1] generates a set of fuzzy footprints. A fuzzy footprint of a document d, denoted as Foot(d), is a fuzzy set of geo- graphic coordinates gc= (lat,lon), where lat=latitude lon=longitude (expressed in degrees), with a membership degree µF oot(d) (gc) ∈ [0, 1] representing the sig- nificance by which the geographic location gc belongs to the geographic focus of document d: Foot(d) = {h gc1 , µF oot(d) (gc1 )i, . . . , hgcn , µF oot(d) (gcn )i} where each gci = (lat i ,lon i ) and its membership degree µF oot(d) (gci ) are deter- mined by the Geo-Indexing module [1]. A user query q consists of two conditions: a content-based condition, ex- pressed by a list of content keywords, and a spatial condition, expressed by a list of geographic names. The spatial condition is interpreted as the requirement for documents with geographic reference “close” to the specified place names. These two conditions are evaluated by specific partial matching functions that compute two distinct scores in [0,1]: the Retrieval Status Value w.r.t. the con- tent, denoted as RSV content (d), and the Geographic Retrieval Value, denoted as GRV closeness (d). In Geo-Finder, RSV content (d) is a classical cosine similarity measure, computed by means of the Lucene library. These two scores are finally combined to compute the global Retrieval Status Value w.r.t. the whole query q, indicated by RSV q (d), by applying a suitable aggregation function. We defined two aggregation functions, since we considered two distinct aggregation semantics, i.e., the and possibly asymmetric aggregation and the average compensative aggregation. Evaluation of the spatial condition. Given the fuzzy footprint Foot(q) of the ge- ographic names in the query q, the fuzzy footprints of the documents d, Foot(d), that are likely to satisfy the query are retrieved by accessing the footprint spa- tial index. The semantics of the spatial condition is that of evaluating a user’s context dependent “closeness” of the documents’ footprints Foot(d) to the query footprint Foot(q). This is done by a matching function close which models the concept of “close” as a user’s context dependent soft constraint. The matching function close computes a Geographic Retrieval Value, GRV closeness (d) ∈ [0, 1], depending on the closeness of the document footprint to the query footprint as follows: GRV closeness (d) = µclose (Foot(d),Foot(q)) = =max i∈Foot (d),j∈Foot (q) qscope(dist(i, j) × min(µF oot(d) (i), µF oot(q) (j))) Where µF oot(d) (i) and µF oot(q) (j) are the membership degrees of the i-th and j-th fuzzy spatial references gci ∈Foot(d) and gcj ∈Foot(q), i.e., the extent to which a spatial reference represents the geographic focus of the document and of the query, respectively. The dist(i, j) function is a great circle approximation of the actual distance between the two spherical coordinates gci and gcj . The qscope function modifies the geographic distance so as to model the user perceived distance as follows:  δ/(x + δ) if x ≤ δ + k × MaxDist(Foot(d)) with δ ≥ 0, k > 0 qscope(x) = 0 otherwise MaxDist(X) =max i,j∈X (dist(i, j)) is the maximum geographic distance between any two geographic places i and j in the footprint X, and can be considered as the maximum dispersion of the fuzzy footprint X. It is zero in the case X contains just one single place. Thus MaxDist(Foot(d)) is the query dispersion. Its value depends on the number of geographic names specified in the query and on the maximum distance between their geographic coordinates. The parameters δ and k permit to change the spatial scope of the query. The parameter δ is the query range, and is useful in the case of a query footprint consisting of a single geographic coordinate pair gc in order to retrieve also documents with footprint in the surrounding places. Distinct δ can adapt the evaluation of the spatial condition “close” to the user perception, thus, modeling strict or relaxed interpretations of the “closeness” surroundings of a point. The higher the δ, the greater is the surrounding. The parameter k makes it possible to model a tolerance on the geographic distance between a document fuzzy footprint and the query footprint, so that one can consider close places within a distance of k times MaxDist(Foot(d)), i.e., k times the query maximum dispersion. We consider four main query scopes that can be related to the user’s context, and that are defined in the Geo-Finder system by the following default values of k and δ. (1) The small scope is defined with k = 5, δ = 3 km; it is useful when Foot(q) is a street address within a city or a small city and we are interested in its very near surroundings (in this case, Foot(q) could vary approximately between 0 and about 10 km): with this setting, one can retrieve documents within a distance from the query of 3 km to about 50 km. (2) The meso scope is defined with k = 4, δ = 50 km; in this case, MaxDist(Foot(d)) covers the area of either a region or a small nation like Belgium. (3) The large scope is defined with k = 3, δ = 1000 km, in this case MaxDist(Foot(d)) covers the area of a medium nation such as France (in this case Foot(q) could vary approximately between 0 and a few thousand kilometers). (4) The full scope is defined with k = 3, δ = 10000 km; in thsi case, MaxDist(Foot(d)) covers the area of a big nation such as Russia or of a continent. For example, if one specifies a spatial condition with the two geographic names Bergamo, Como (Como being at about 40 km from Bergamo), and the query scope is meso (i.e. k = 4 and δ = 50 km) the documents with footprints at a maximum distance of 210 km from the query footprint are retrieved: for instance, both documents in Milano and Lugano are retrieved while a document with a footprint in Rome is not. The Global RSV. Geo-Finder implements two distinct semantics to combine RSV content (d) and GRV closeness (d). The asymmetric and possibly semantics is defined as follows: RSV q (d) =RSV content (d) and possibly α GRV closeness (d) = =RSV content (d)× max ((1 − α),GRV closeness (d)) Parameter α specifies the user’s preference of the spatial condition w.r.t. the content condition. When α = 0, it means that the spatial condition can be dis- regarded to rank the documents, and in this case the global Retrieval Status Values is determined solely based on the content relevance score RSV content (d). When α = 1, the two conditions are both mandatory: this means that the Ge- ographic Retrieval Value GRV closeness (d) has the same relevance of the content Retrieval Status Value RSV content (d). In this case, the aggregation reduces to the product, i.e., the “fuzzy Anding” of the two relevance scores. Intermediate values of α in (0, 1) demands for an asymmetric combination. The value (1 − α) guarantees a minimum satisfaction level for GRV closenss (d), so that the spatial condition becomes optional and the global RSV q (d) is not too much penalized in the case in which the spatial condition is not satisfied. With the symmetric Average semantics, the Global RSV is defined as follows: RSV q (d) =RSV content (d) average α GRV closeness (d) = = (1 − α)× RSV content (d) + α×GRV closeness (d) When the preference degree α = 0, the result is determined solely by the satisfaction of the content condition; conversely, when α = 1, the global RSV is determined solely by the satisfaction of the spatial condition, and the content based condition is irrelevant. Intermediate values of α permit to vary the trade- off between the influences of the two conditions; in this case, the two conditions compensate each other, while with the and possibly semantics it is mandatory to satisfy the content condition to retrieve a document. 3 Conclusions The Geo-Retrieval model described in this paper is implemented in the Geo- Finder system. In [1], we extensively presented its features. Furthermore, in [1], some evaluation results are also discussed showing the improvement of Geo- Finder ranking over Google ranking. The evaluations also showed that the preci- sion of Geo-Finder improves when restricting the geographic domain of interest, thus outlining the positive role of modeling the user’s context which determines the perceived distance when evaluating the spatial query condition. References 1. G. Bordogna, G. Ghisalberti, and G. Psaila. Geographic information retrieval: Mod- eling uncertainty of user’s context. Fuzzy Sets and Systems. 2. G. Cai. GeoVSM: An integrated retrieval model for geographic information. In M.J. Egenhofer and D.M. Marks (Eds), GIScience 2002, LNCS 2478, pages 65–79. ‘Springer Verlag, 2002. 3. Z. Li, C. Wang, X. Xie, X. Wang, and W.Y. Ma. Indexing implicit locations for geographical information retrieval. In n Proceedings of GIR-2006, Int. Conf. on Geographical Inf. Retrieval, Seattle, USA, August 2006. 4. G. Mountrakis and A. Stefanidis. Moving towards personalized geospatial queries. Journal of Geographic Information System, 3:‘334–344, 2011. 5. R.S. Purves, P. Clough, C.B. Jones, A. Arampatzis, B. Bucher, D. Finch, G. Fu, H. Joho, A.K. Syed, S. Vaid, and B. Yang. The design and implementation of SPIRIT: a spatially aware search engine for information retrieval on the internet. International Journal of Geographical Information Science, 21(7):‘717–745, 2007.