Geographic Information Retrieval

Alexander Markowetz ∗   Thomas Brinkhoff †   Bernhard Seeger ‡

∗ Fachbereich Mathematik und Informatik, Philipps-Universität Marburg, markow@Mathematik.Uni-Marburg.de
† Institute for Applied Photogrammetry and Geoinformatics (IAPG), FH Oldenburg/Ostfriesland/Wilhelmshaven, Thomas.Brinkhoff@fh-oldenburg.de
‡ Fachbereich Mathematik und Informatik, Philipps-Universität Marburg, seeger@Mathematik.Uni-Marburg.de

Abstract

Due to the abundance of information, querying the World Wide Web has become increasingly difficult. Geographic properties of Internet resources can be used directly to make Internet search more personalized and to increase the quality of results. However, little work has been done in this area so far, and the general direction of research and development has been uncertain. In this paper, essential questions in this field are addressed. First, we outline a three-stage architecture for an efficient and effective mapping of Internet resources to geographic locations. Geospatial search engines are one important application of this mapping. Such search engines differ fundamentally from their traditional counterparts, particularly with respect to selecting and ranking search results. Finally, we propose geospatial analyses using localized web crawls. Such analyses support new types of queries and reduce cost compared to conventional data-collection techniques. The paper concludes with an overview of challenging research questions.

1 Introduction

The state of the Internet is characterized by an abundance of information. Two-term queries posted by search-engine users lead to extreme iceberg queries. There are two major strategies to avoid burying interesting pages under a large quantity of less interesting results. The first approach is to let the user browse dynamically through search results: he trims parameters and thereby quickly drills into the iceberg. The second strategy is personalized search. On a very abstract level, it can be described as the addition of query attributes inferred from personal information about the user and/or his peers. This allows reducing and reordering the result set to match the user's interests.

In many web searches, the user's geographic position is the most relevant personal property. It is exploited by geospatial search engines, a new and important field of research. Geospatial search engines are based on two underlying assumptions. First, almost every web page has a local context: Where was this information created? Which locations does this information apply to? Where does the targeted audience reside? We will later show how to infer this context. Second, geographically close information is the most interesting. This idea is best captured by the proverb "All business is local". Of course, this is not limited to business. Of all pizza restaurants in the world, the closest is the most interesting. Of all hotels, those at my next holiday destination are the most interesting. Of all yoga courses, those within driving distance of my house are the most interesting. Given the importance of the "local" factor, geospatial search engines could very well turn out to be the killer application for location-based services.

Despite the obvious advantages of geospatial search engines, the interest of commercial search companies is only slowly increasing. The first early prototypes appeared in 2003.
Overall, the topic has received little attention from the scientific community. There is some work on localizing servers, but almost none on localizing content.

The first issue in this paper is how to infer geographic locations for Internet resources. If these are not already stored explicitly as metadata, there are still numerous ways of deducing them from various aspects of page content and link structure. We present a three-stage architecture that allows combining different techniques and produces a satisfying mapping from web pages to geographic locations. Once this is achieved, the groundwork has been laid for two major applications: a location-aware search engine targeting individual users and geospatial analysis for corporate users.

A geospatial search engine will allow users to specify a location in addition to the keywords they are searching for. The search engine will then return results that are not only relevant to these keywords, but are also located near the specified location. It will also need to incorporate the second strategy mentioned above and allow for dynamic balancing between distance and topic importance. The search engine therefore requires not only new interfaces but also efficient implementations that allow for this flexibility.

The second application is the geospatial analysis of Internet resources. The World Wide Web seems predestined for any sort of social analysis, because it mirrors society to an exceptional degree. This observation does not only hold for the explicit information stored as page content. It also applies to less obvious properties, such as relationships between pages. By taking this implicit information into account, we can derive insight into social systems, such as business or science. Locality is a key factor in all aspects of human interaction. Enhancing the Web with geospatial properties will take web analysis to a new level and allow a new class of queries.

The rest of the paper is organized as follows. In Section 2, we demonstrate how to map web pages to geographic locations. In the next two sections, we outline geospatial search engines and geospatial analysis. Related work that has not already been discussed in the previous sections is treated in Section 5. Finally, we provide conclusions and an outlook on a broad field of future work.

2 Computing Geospatial Properties of Internet Resources

In this section, we introduce two essential geospatial properties of web pages: The location of a page will later be used to compute its distance to the position a user is searching for. The locality of a page allows distinguishing between pages that are globally important and those that are only of local interest.

2.1 Geographic Locations of Internet Resources

Computing the geographic location of a web page is not an easy task and, like most things on the Internet, is performed in a "best effort" manner. There is a multitude of mapping techniques; however, none of them works very well by itself. Therefore, we propose a three-stage architecture. In the first stage, a broad range of techniques is used, each of them assigning a set of locations to a page. In the second stage, we fuse the different sets of locations. In the final stage, we consider link structures and user behavior to validate and refine the mapping. These last techniques produce high-quality results, but require the initial mapping from the first two stages. Note that multiple locations may be associated with a single page; e.g., a page of a retailer might refer to the locations of multiple outlets.
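To illustrate how the three stages fit together, the following minimal Python sketch outlines such a pipeline. All identifiers (Extractor, fuse, refine, etc.) are our own placeholders for the techniques described in Sections 2.1.1 to 2.1.3 and are not part of any existing system.

```python
from typing import Callable, Iterable, List, Tuple

# A location is modelled here simply as a (latitude, longitude) pair.
Location = Tuple[float, float]
# A stage-one extractor maps a page (URL and HTML text) to candidate locations.
Extractor = Callable[[str, str], List[Location]]


def map_page_to_locations(url: str, html: str,
                          extractors: Iterable[Extractor],
                          fuse: Callable[[List[Location]], List[Location]],
                          refine: Callable[[str, List[Location]], List[Location]]
                          ) -> List[Location]:
    """Three-stage mapping: extract candidate locations, fuse them, refine them.

    The concrete extractors (whois, geospatial tags, parsing of place names and
    codes), the fusion step and the refinement step are described in Sections
    2.1.1 to 2.1.3; here they are passed in as opaque callables.
    """
    # Stage 1: every technique contributes its own set of candidate locations.
    candidates: List[Location] = []
    for extract in extractors:
        candidates.extend(extract(url, html))
    # Stage 2: fuse the candidate sets into one representative set per page.
    fused = fuse(candidates)
    # Stage 3: validate and refine using link structure and user behaviour.
    return refine(url, fused)
```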
2.1.1 Initial Mappings

A whole range of techniques can be applied to assign initial locations to web resources. For a broad overview, we refer to [MCu01]. One of the most basic, yet powerful approaches simply processes the admin-c section of the whois entry of a URL. In most cases, this section directly points to the company or individual who registered the domain. For most companies, this corresponds exactly to the location for which their information is relevant. Our evaluations have demonstrated the very high relevance of the admin-c section. The evaluation of other parts of the whois entry often fails because they are concerned with the location of web servers. However, most small companies or individuals do not host their own server, but may co-host at a server farm hundreds of miles away from their home.

Many authors propose adding geospatial meta information to web pages, denoting that the content of a page is relevant to a certain location. The location may be described using the proposals of the Dublin Core Metadata Initiative [DCMI00] or according to the ISO/TC 211 standard 19115. The use of geospatial tags, however, is quite problematic. As long as no search engine relies on geospatial tags, there is no need for administrators to implement them, and vice versa. Even worse, webmasters cannot be trusted: they may maliciously include tags for regions for which their site is of no relevance. For this reason, no commercial search engine takes such HTML tags into account. Geospatial tags can therefore serve only as a hint of the location of a web resource.

Another range of techniques requires parsing URLs as well as entire web pages in order to extract names of geographic features like cities and landmarks, which can be mapped to locations. There are several problematic issues regarding the use of parsing techniques that prohibit their exclusive use. First of all, parsing is quite expensive and might not be applicable to large amounts of web pages. Second, homonyms and synonyms cause tremendous problems. For example, widespread names such as "Springfield" are impossible to map unambiguously. Analogously, geospatial codes like zip or dialing codes can be extracted. However, the same problems hold for such an approach. Therefore, several such hints need to be combined for an acceptable guess.

2.1.2 Fusion and Integration of Multiple Mappings

The previous discussion demonstrated that multiple sets of locations might be assigned to a single web page. In the following, we outline how to integrate these results in such a way that a unique set of locations is computed for each page. First, we detect and remove outliers. The general assumption is that outliers are produced by faulty data such as misleading geographic tags. Since spatial data is generally imprecise due to different underlying resolutions, we require a second step that condenses clusters of locations referring to the same place. We try to identify these clusters and represent each by a single location. In the case of point locations, there are two common representative locations for a cluster: the centroid and the medoid. For areas, the common intersection might be taken into account.
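As an illustration of this fusion step, the following Python sketch clusters candidate point locations greedily, drops weakly supported clusters as outliers, and represents each remaining cluster by its centroid. The cluster radius and the minimum support are hypothetical parameters; a production system would adapt proper data-mining techniques rather than this naive single pass.

```python
import math
from typing import List, Tuple

Location = Tuple[float, float]  # (latitude, longitude)


def _dist_km(a: Location, b: Location) -> float:
    """Great-circle distance in kilometres (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def fuse_locations(candidates: List[Location],
                   cluster_radius_km: float = 25.0,
                   min_support: int = 2) -> List[Location]:
    """Greedy fusion: group candidates lying within cluster_radius_km of a
    cluster seed, discard sparsely supported clusters as outliers, and
    represent each surviving cluster by its centroid."""
    clusters: List[List[Location]] = []
    for loc in candidates:
        for cluster in clusters:
            if _dist_km(loc, cluster[0]) <= cluster_radius_km:
                cluster.append(loc)
                break
        else:
            clusters.append([loc])  # no nearby cluster: start a new one
    fused: List[Location] = []
    for cluster in clusters:
        if len(cluster) < min_support:   # outlier: too little supporting evidence
            continue
        lat = sum(p[0] for p in cluster) / len(cluster)
        lon = sum(p[1] for p in cluster) / len(cluster)
        fused.append((lat, lon))         # centroid as cluster representative
    return fused
```

Averaging latitudes and longitudes is only a rough centroid; the medoid mentioned above would avoid placing the representative outside the cluster, at the price of a quadratic distance computation.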
2.1.3 Further Refinement

The final stage of our architecture validates and refines the previous mapping. The following techniques increase the quality of the results, but would not work without the initial mapping from the first two stages.

As a simple, yet again very powerful approach, we propose using the web's link structure. If a cluster of pages from NY points to a web site that is so far assumed to be in LA without any links from that area, we might conclude that the site is more relevant to NY than to LA. Finding such clusters and detecting outliers is a task for which data mining techniques [KH00] need to be adapted.

Additionally, the locations of users accessing a web resource can be used for verifying its location. In the near future, the widespread use of mobile web clients can be expected, which know their position from GPS, Galileo or the mobile phone network. It is then reasonable to assume a strong relation between the locations of web resources and those of their users. This relation can be evaluated by analyzing the corresponding clickstreams.

2.2 Locality of Internet Resources

The notion of locality enables us to distinguish between sites that are globally important on a subject and those that are not as significant on a global level but are of highest local importance. On the one hand, there are web sites that have high global importance, but are locally rather irrelevant. Examples can be found among the web sites of magazines, mail-order stores, etc. On the other hand, there are web sites that have high local importance, but are outperformed on the overall subject by a multitude of other web sites. Examples can be found among the sites of local stores and institutions. The idea of locality is equally important in the context of geospatial search engines and geospatial analysis.

Let us consider the web sites of a pizza restaurant and an international magazine for pizza lovers, both based in Marburg. The magazine for pizza lovers will have thousands of links from all over; it is globally important. The local pizza parlor might be referenced by only twenty or thirty links, but all from within Marburg; it is locally important. When searching for pizza in Marburg, it is really the locally important site that is desired, not the global one. Equally, in geospatial analysis, one might want to distinguish between sites with a global audience and those that target a more local one.

Computing and storing locality does not come for free. Therefore, the granularity in which it is computed and the way it is stored highly depend on the application. One needs to take into account how flexible the modelling of locality has to be, how long its computation is allowed to take, how much storage is required and how long retrieval will take. Depending on these factors, one has to select the appropriate level of detail and flexibility. The highest level of detail can be achieved by storing the precise distribution of links as a function of their distance. For efficient storage, this distribution might be smoothed and compressed. For many applications, however, this will prove to be overkill. The average length of links to other sites can serve as a measure of locality that is much easier to compute, store and retrieve. Alternatively, one could simply count the links coming from within a distance of ε. Taking the total number of links into account, one can compute the relative locality. Together with the variance of link lengths, this can serve as an appropriate measure of locality. In particular, in the context of geospatial analysis, one might want to distinguish between inbound locality and outbound locality, taking only incoming or only outgoing links into account.
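The simpler measures mentioned above can be sketched as follows; the sketch assumes that the distances of a page's links have already been computed (e.g. with a haversine helper as in the fusion sketch), and the threshold ε = 50 km is an arbitrary example value, not a recommendation.

```python
from typing import Dict, List, Optional


def locality_measures(link_distances_km: List[float],
                      epsilon_km: float = 50.0) -> Optional[Dict[str, float]]:
    """Locality measures from Section 2.2, computed from the distances between
    a page and the pages it is linked with: average link length, variance of
    link lengths, and the share of links from within epsilon."""
    if not link_distances_km:
        return None                       # no links: locality is undefined
    n = len(link_distances_km)
    mean = sum(link_distances_km) / n     # average link length
    variance = sum((d - mean) ** 2 for d in link_distances_km) / n
    within_eps = sum(1 for d in link_distances_km if d <= epsilon_km)
    return {
        "average_link_length_km": mean,
        "link_length_variance": variance,
        "relative_locality": within_eps / n,
    }


# Illustrative numbers only: a locally linked page vs. a globally linked one.
print(locality_measures([1.2, 3.5, 0.8, 2.9]))          # high relative locality
print(locality_measures([400.0, 2500.0, 30.0, 800.0]))  # low relative locality
```

Restricting the input to incoming or outgoing link distances yields the inbound and outbound variants discussed above.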
3 Search Criteria of Geospatial Search Engines

Geospatial search engines will be the first commercially available geospatial web applications. First prototypes are already available [NL03, Ov03, Go03]. In addition to the search terms, geospatial search engines require a specification of the location a user is interested in. The simplest solution for specifying such a search area is a text field for entering a place or an address. In the case of mobile clients, the current position can be passed to the search engine automatically. The search engine will return first those results that are not only relevant to the search terms, but also within close distance of the specified location.

Localized queries are poorly supported by traditional search engines for various reasons. There is no support for continuous space: when searching for "Marburg AND Cycling", the user will typically receive pages for all interesting cycling activities in Marburg, but some of the really interesting results just outside the city boundaries will be missing. The available granularity is often too coarse: when searching for a pizza, a web site from the same city might not be "close enough" if the city is L.A. The name of the search area is a poor indication: a web resource might not contain the name of the location exactly as the user spelled it, and synonyms might be used. In consequence, many interesting pages will fail to show up in the results.

The order in which search results are presented by geospatial search engines differs fundamentally from the ranking of their traditional counterparts. The ordering does not depend on only one criterion (the relevance of the subject) but also on a second one (the geographic proximity). The balance between these two criteria is crucial for delivering useful results. Depending on the search terms, one criterion can be of higher importance than the other. For example, when looking for a restaurant, its proximity is of much higher importance than when looking for a car dealership. Depending on the first batch of results delivered to the user, he or she might even want to re-adjust the balance. In the following, we compare different solutions that allow an adjustment of the balance between the two search criteria.

3.1 A Post-Processing Solution

A simple solution is to use a traditional search engine that allows specifying the search terms by some keywords k. The search engine offers r results, from which n results are retrieved and reordered according to their proximity to the search area l. Among these parameters, n is extremely important for a useful result. If n is chosen too small, only very important pages for the keyword search are retrieved. It may happen that all of them are located far away from l, and they therefore prove useless in a geographic context. If, on the other hand, n is chosen too large, there may be a multitude of pages that are located very close to l, but are only remotely interesting in the context of k. The interesting web sites will be buried under these useless results. A relevant page may show up so late that the user gets tired of searching through the results and aborts the search too early.

Modifying n allows changing the balance between the two search criteria. By making n smaller, the overall importance on the subject receives more weight.
By making n larger, geographic proximity shifts into the center of focus. Setting n to a fixed number is useless, since for some k the search engine will return hundreds of results, while for another it may return millions. Setting n to a fixed percentage of r seems a better approach. Still, the percentage of locally interesting sites very much depends on the topic, so a fixed percentage that works for searches regarding mountain bikes might not work for searches regarding computers or knitting patterns.

Ideally, we want the user to change the balance between the two criteria dynamically while he is browsing the results. Using any standard search engine, results are presented in chunks of ten or twenty. If none of the results from the first batch looks interesting, the user finds a button at the bottom of the page that shows the next batch. This is the point at which the user can be allowed to change his preferences. Say the user has just browsed through a batch of results. By the time he has reached the end, there are four possibilities:
Done: If one of the results proved of sufficient quality, we consider the search done.
More: If the user thinks he is on the right track, but the results just seen were not quite what he wanted, he can continue browsing through the results using the same balance.
More important: If the results seem only remotely important in the context of k, the user might want to trade geographic proximity for importance.
Closer: If the results just seen were important on the subject, but too far away, the user can choose to consider results that are not as important on the subject, but closer to l.

Geospatial search engines will be judged by how efficiently they support this dynamic balancing. How many intermediate results have to be materialized before the first batch is returned? How many can be reused if the user re-adjusts the balance between the two factors? The simple approach described at the beginning of this subsection does not perform well on any of these questions. It necessarily materializes all results before returning the first batch to the user. The algorithm does not allow any re-adjustment between importance and proximity; if such a re-adjustment should take place, the entire query has to be recomputed. Therefore, none of the already materialized results can be reused. This method is not suitable for production use.
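For illustration, the post-processing approach can be sketched in a few lines of Python; keyword_search and distance are placeholders for a traditional search-engine API and a distance function, not references to any concrete system. The sketch also makes the drawback visible: all n results are materialized and sorted before the first one can be shown, and changing the balance means running it again with a different n.

```python
from typing import Callable, List, Tuple

Location = Tuple[float, float]
# A keyword result: (url, page location); how the location is obtained is the
# subject of Section 2 and not modelled here.
Result = Tuple[str, Location]


def post_process(keyword_search: Callable[[str, int], List[Result]],
                 distance: Callable[[Location, Location], float],
                 k: str, l: Location, n: int) -> List[Result]:
    """Naive post-processing (Section 3.1): take the top-n keyword results and
    reorder them by distance to the search location l.  All n results must be
    materialized before the first one can be returned; adjusting the balance
    (changing n) forces a complete re-computation."""
    top_n = keyword_search(k, n)                      # best n pages for the keywords
    return sorted(top_n, key=lambda r: distance(r[1], l))
```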
3.2 Zones

So far, we have not given much thought to the properties of distance. Intuitively, we assumed it to be smooth and strictly monotonic. In our everyday lives, however, our perception of distance is quite different. We do not care whether the nearest supermarket is 6.8 or 7.2 km from our home; in fact, we probably do not even know. Instead, we tend to think in terms of: Can I walk there, or do I need to take the car? Do I have to cross an international border? Therefore, we end up conceptualizing distances in zones, such as: within walking distance of l; a short or medium drive; a trip within the same political entity as l.

Applying this observation to web sites, we have developed a second technique that is much more flexible than the first, yet just as simple. It offers significant and meaningful re-adjustments while browsing results and does not require any re-computation after a re-adjustment. In this approach, sorting and browsing are two entirely independent steps. In the first step, we sort the important pages, or any significant subset, into fixed categories such as those presented above. We name the zones z0 through zmax, the first being the innermost, and ∀ 0 ≤ i
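Since the description of the zone technique is cut off above, the following Python sketch only illustrates the first step as far as it is described: bucketing importance-ranked pages into fixed, concentric distance zones. The zone boundaries are purely hypothetical examples.

```python
from typing import Dict, List, Tuple

# Hypothetical outer boundaries (in km) of zones z0..zmax, z0 being the innermost:
# roughly walking distance, short drive, medium drive, and everything beyond.
ZONE_BOUNDARIES_KM = [2.0, 25.0, 200.0, float("inf")]


def assign_zone(dist_km: float) -> int:
    """Index of the first zone whose outer boundary contains the distance."""
    for i, boundary in enumerate(ZONE_BOUNDARIES_KM):
        if dist_km <= boundary:
            return i
    return len(ZONE_BOUNDARIES_KM) - 1


def sort_into_zones(pages: List[Tuple[str, float]]) -> Dict[int, List[str]]:
    """First step of the zone approach: bucket pages (given as (url, distance to
    l in km), already ordered by descending importance) into fixed distance
    zones; within each zone the importance order is preserved.  Browsing can
    then move between zones without re-computing anything when the user shifts
    the balance between importance and proximity."""
    zones: Dict[int, List[str]] = {i: [] for i in range(len(ZONE_BOUNDARIES_KM))}
    for url, dist_km in pages:
        zones[assign_zone(dist_km)].append(url)
    return zones
```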