GIR experiments with Forostar at GeoCLEF 2007

Simon Overell(1), João Magalhães(1) and Stefan Rüger(2,1)
(1) Multimedia & Information Systems, Department of Computing, Imperial College London, SW7 2AZ, UK
(2) Knowledge Media Institute, The Open University, Milton Keynes, MK7 6AA, UK
{simon.overell01, j.magalhaes}@imperial.ac.uk and s.rueger@open.ac.uk

Abstract

In this paper we describe our Geographic Information Retrieval experiments on the GeoCLEF 2007 corpus and query set with Forostar, our GIR application. We compare two orthogonal query methods, text with geographic entities removed and geographic entities only, with standard text retrieval and with a combined text and geographic relevance method. The text and named entity analysis and retrieval methods of Forostar are described in detail. We also detail our placename disambiguation and geographic relevance ranking methods. The paper concludes with an analysis of our results, including significance testing, in which we show our baseline method, in fact, to be best. Finally we identify weaknesses in our approach and ways in which the system could be optimised and improved.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Measurement, Performance, Experimentation

Keywords

Geographic Information Retrieval, Relevance Ranking, Disambiguation

1 Introduction

This paper describes the experiments performed by the Multimedia and Information Systems group at GeoCLEF 2007 with our GIR application, Forostar. We compare two orthogonal query methods, text with geographic entities removed and geographic entities only, with standard text retrieval and with a combined text and geographic relevance method.

In Section 2 we outline how we index the GeoCLEF corpus and the three field types: Text, Named Entity and Geographic. We then describe how the manually constructed queries are expanded and submitted to the query engine. Section 3 describes and justifies the placename disambiguation and geographic relevance ranking methods in more detail. In Section 4 we describe our experiments, followed by the results in Section 5. Finally, Section 6 analyses the weaknesses of our system and identifies areas for improvement.

Figure 1: Building the Lucene Index

2 System

Forostar is our ad-hoc Geographic Information Retrieval system. At indexing time, documents are analysed and named entities extracted. Named entities tagged as locations are then disambiguated using our co-occurrence model. The free-text fields, named entities and disambiguated locations are then indexed by Lucene. In the querying stage we combine the relevance scores assigned to the Geographic fields and Textual fields using the vector space model. Fields designated as containing more information (i.e. the headline) have a boost value assigned to them.

2.1 Indexing

The indexing stage of Forostar begins by extracting named entities from text using ANNIE, the information extraction engine bundled with GATE, Sheffield University's General Architecture for Text Engineering [2]. Of the tasks ANNIE is able to perform, the only one we use is named entity recognition. We treat ANNIE as a "black box" into which text goes and from which categorised named entities are returned; because of this, we do not discuss the workings of ANNIE further here but rather refer the reader to the GATE manual [2].
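As a rough illustration of the indexing flow in Figure 1, the Python sketch below shows one plausible way a document could be turned into a multi-field record before being handed to Lucene. It is only a sketch of the description above: the extract_named_entities stub stands in for ANNIE, and the field layout and helper names are our own assumptions rather than Forostar's actual code.

# Illustrative sketch of the indexing flow in Figure 1. The NER call and the
# field layout are assumptions based on the surrounding description.

def extract_named_entities(text):
    """Stand-in for ANNIE: yields (entity_string, category) pairs."""
    raise NotImplementedError("ANNIE/GATE is treated as a black box")

def build_index_record(headline, body, disambiguate, make_geo_fields):
    record = {
        "headline": headline,        # text field (boosted at query time)
        "text": body,                # text field
        "named_entity": [],          # all categorised entities (Section 2.1.1)
        "ne_location": [],           # entities tagged as locations
        "geo_unique_string": [],     # geographic fields (Section 2.1.3)
        "geo_coordinates": [],
    }
    for entity, category in extract_named_entities(body):
        record["named_entity"].append(entity)
        if category == "Location":
            record["ne_location"].append(entity)
            location = disambiguate(entity)              # Sections 2.1.3 and 3.1
            if location is not None:
                unique_string, latlon = make_geo_fields(location)
                record["geo_unique_string"].append(unique_string)
                record["geo_coordinates"].append(latlon)
    return record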
2.1.1 Named Entity fields

We index all the named entities categorised by GATE in a "Named Entity" field in Lucene (e.g. "Police," "City Council," or "President Clinton"). The named entities tagged as Locations by ANNIE we index as "Named Entity – Location" (e.g. "Los Angeles," "Scotland" or "California") and as a Geographic Location (described in Section 2.1.3). The body of the GeoCLEF articles and the article titles are indexed as text fields; this process is described in the next section.

2.1.2 Text fields

Text fields are pre-processed by a customised analyser similar to Lucene's default analyser [1]. Text is split at white space into tokens; the tokens are then converted to lower case, stop words are discarded, and the remaining tokens are stemmed with the Snowball stemmer. The processed tokens are held in Lucene's inverted index.

2.1.3 Geographic fields

The locations tagged by the named entity recogniser are passed to the disambiguation system. We have implemented a simple disambiguation method based on heuristic rules. For each placename being classified we build a list of candidate locations. If the placename is followed by a referent location, this can often cut down the candidate locations enough to make the placename unambiguous. If the placename is not followed by a referent location, or is still ambiguous, we disambiguate it as the most commonly occurring location with that name.

Topological relationships between locations are looked up in the Getty Thesaurus of Geographical Names (TGN) [4]. Statistics on how commonly different placenames refer to different locations, and a set of synonyms for each location, are harvested from our geographic co-occurrence model, which in turn is built by crawling Wikipedia [8].

Once placenames have been mapped to unique locations in the TGN, they need to be converted into Geographic fields to be stored in Lucene. We store locations in two fields:

• Coordinates. The coordinate field is simply the latitude and longitude as read from the TGN.

• Unique strings. The unique string is the unique id of this location, preceded by the unique ids of all its parent locations, separated by slashes. Thus the unique string for the location "London, UK" is the unique id for London (7011781), preceded by its parent, Greater London (7008136), preceded by its parent, Britain (7002445), and so on until the root location, the World (1000000), is reached. This gives the unique string for London as 1000000\1000003\7008591\7002445\7008136\7011781.

Note that the text, named entity and geographic fields are not orthogonal. This has the effect of multiplying the impact of terms occurring in multiple fields. For example, if the term "London" appears in text, the token "london" will be indexed in the text field; "London" will be recognised by ANNIE as a named entity and tagged as a location (and indexed as the Location Entity "London"); the Location Entity will then be disambiguated as location 7011781 and the corresponding geographic fields will be added. Previous experiments conducted on the GeoCLEF data set in [7] showed improved results from having overlapping fields. We concluded from those experiments that the increased weighting given to locations caused the improvement.
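As a minimal sketch of how the two geographic fields could be derived, the code below walks a parent table to build the unique string and looks up the coordinate pair. The parent and coordinate dictionaries are toy stand-ins for the Getty TGN; only the London ids above are real, and the coordinates are approximate and illustrative.

# Minimal sketch of building the geographic fields of Section 2.1.3.
# parent_of and coordinates are toy stand-ins for the TGN hierarchy.

def unique_string(tgn_id, parent_of):
    r"""Prefix a location id with the ids of all its ancestors, slash-separated,
    e.g. 1000000\1000003\7008591\7002445\7008136\7011781 for London."""
    chain = [tgn_id]
    while tgn_id in parent_of:           # walk up until the root (the World)
        tgn_id = parent_of[tgn_id]
        chain.append(tgn_id)
    return "\\".join(str(i) for i in reversed(chain))

def geo_fields(tgn_id, parent_of, coordinates):
    return unique_string(tgn_id, parent_of), coordinates[tgn_id]

# Toy slice of the hierarchy using the ids from the London example above.
parents = {7011781: 7008136, 7008136: 7002445, 7002445: 7008591,
           7008591: 1000003, 1000003: 1000000}
coords = {7011781: (51.51, -0.13)}
print(geo_fields(7011781, parents, coords))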
2.2 Querying

The querying stage of Forostar is a two-step process. First, manually constructed queries are expanded and converted into Lucene's bespoke querying language; then we query the Lucene index with these expanded queries and perform blind relevance feedback on the result.

2.2.1 Manually constructed query

The queries are manually constructed in a similar structure to the Lucene index. Queries have the following parts: a text field, a named entity field and a location field. The text field contains the query with no alteration. The named entity field contains a list of named entities referred to in the query (manually extracted). The location field contains a list of location-relationship pairs: the locations contained in the query and their relationship to the location being searched for.

Figure 2: Expanding the geographic queries

A location can be specified either with a placename (optionally disambiguated with a referent placename), a bounding box, a bounding circle (centre and radius), or a geographic feature type (such as "lake" or "city"). A relationship can be "exact match," "contained in (vertical topology)," "contained in (geographic area)," or "same parent (vertical topology)". The negation of relationships can also be expressed, i.e. "excluding," "outside," etc. We believe such a manually constructed query could be automated with relative ease, in a similar fashion to the processing that documents go through when indexed; this was not implemented due to time constraints.

2.2.2 Expanding the geographic query

The geographic queries are expanded in a pipeline, with the location-relation pairs expanded in turn. The relation governs at which stage the location enters the pipeline, and at each stage the geographic query is added to. At the first stage an exact match for this location's unique string is added: for "London" this is 1000000\1000003\7008591\7002445\7008136\7011781. Then places within the location are added using Lucene's wild-card character notation: for locations in "London" this becomes 1000000\1000003\7008591\7002445\7008136\7011781\*. Then places sharing the same parent location are added, again using Lucene's wild-card character notation; for "London" this becomes all places within "Greater London," 1000000\1000003\7008591\7002445\7008136\*. Finally, the coordinates of all locations falling close to this location are added. A closeness value can be set manually in the location field; otherwise default values based on feature type are used (the default values were chosen by the authors). The feature listed in the Getty TGN for "London" is "Administrative Capital," for which the default closeness value is 100 km.

2.2.3 Combining using the VSM

A Lucene query is built using the text fields, named entity fields and expanded geographic fields. The text field is processed by the same analyser as at indexing time and compared to both the text and headline fields in the Lucene index. We define a separate boost factor for each field. These boost values were set by the authors during initial iterative tests, and are comparable to similar weightings in past GeoCLEF papers [6, 9]: the headline has a boost of 10, the text a boost of 7, named entities a boost of 5, the geographic unique string a boost of 5 and geographic coordinates a boost of 3. The geographic, text and named entity relevance are then combined using Lucene's vector space model.

We perform blind relevance feedback on the text fields only. To do this, the whole expanded query is submitted to the Lucene query engine and the top 10 documents are considered relevant. The top occurring terms in these documents with more than 5 occurrences are added to the text parts of the query, with a maximum of 10 terms added. The final expanded query is re-submitted to the query engine and our final results are returned.
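A compressed sketch of the expansion pipeline of Section 2.2.2 is given below, rendered as boosted, Lucene-style clauses. The field names, the clause syntax and the helper function are assumptions for illustration (escaping of the unique-string separators is glossed over); the boost values are those quoted above.

# Sketch of the geographic query expansion of Section 2.2.2 as boosted,
# Lucene-style clauses. Field names and clause syntax are illustrative only;
# the boosts are the values quoted in Section 2.2.3.

BOOSTS = {"headline": 10, "text": 7, "named_entity": 5,
          "geo_unique": 5, "geo_coords": 3}

def expand_location(unique_str, parent_unique_str):
    clauses = []
    # 1. exact match on the unique string
    clauses.append('geo_unique:"%s"^%d' % (unique_str, BOOSTS["geo_unique"]))
    # 2. contained in (vertical topology): anything below this location
    clauses.append('geo_unique:%s\\*^%d' % (unique_str, BOOSTS["geo_unique"]))
    # 3. same parent (vertical topology): siblings of this location
    clauses.append('geo_unique:%s\\*^%d' % (parent_unique_str, BOOSTS["geo_unique"]))
    # 4. a coordinate clause (all locations within the closeness radius,
    #    e.g. 100 km for an administrative capital) would be appended here
    return " OR ".join(clauses)

london = "1000000\\1000003\\7008591\\7002445\\7008136\\7011781"
greater_london = "1000000\\1000003\\7008591\\7002445\\7008136"
print(expand_location(london, greater_london))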
3 Geographic retrieval

Forostar allows us to perform experiments on placename disambiguation and geographic relevance ranking. Our geographic model represents locations as points. We choose a point representation over a more accurate polygon representation for several reasons: it makes minimal appreciable difference for queries at the small (city or county) scale; Egenhofer and Mark's "topology matters, metrics refine" premise [3] suggests that for queries at a larger scale than city or county, topology is of greater importance than distance; and far more point data is available. We represent each location referred to in a document with a single point rather than constructing an encompassing footprint because, we argue, the fact that several locations are referred to in a document does not imply that locations lying between the referenced locations are relevant.

3.1 Placename disambiguation

As discussed in Section 2.1.3, our placename disambiguation is performed using simple heuristic rules. A key part of the disambiguation is our default gazetteer, the generation of which is explained in this section. The default gazetteer is used to disambiguate placenames that are not immediately followed by a referent placename. It is a many-to-one mapping of placenames to locations (i.e. for every placename there is a single location).

We extract our default gazetteer from a co-occurrence model built from Wikipedia. Our geographic co-occurrence model contains a mapping of Wikipedia articles to locations in the TGN. It also contains the placenames used to refer to every article describing a location (extracted from anchor texts). In total we crawled 2.3 million links from Wikipedia to articles describing locations. This gave us a mapping of 75,322 placenames to 53,643 locations. The default gazetteer contains these 75,322 placenames mapped to a subset of the TGN. A full description and analysis of the co-occurrence model can be found in [8].

The motivation for this disambiguation method is to provide a baseline of placename disambiguation achievable with our co-occurrence model. Analysis of the co-occurrence model suggests its application should recognise ∼75% of locations with an accuracy of between ∼80% and ∼90%. The unrecognised ∼25% of locations will only be indexed as "Named Entity – Location."

3.2 Geographic relevance

Our geographic relevance strategy is described in Section 2.2.2; in this section we provide a justification for the methods used. We have four types of geographic relations, each expanded differently:

• 'Exact match.' The motivation here is that the documents most relevant to a query will mention the location being searched for.

• 'Contained in (vertical topology)' assumes that locations within the location being searched for are relevant; for example, 'London' will be relevant to queries which search for 'England'.

• Locations that share the same parent. These locations are topologically close; for example, a query for 'Wales' would have 'Scotland', 'England' and 'Northern Ireland' added.

• The final method of geographic relevance defines a viewing area: all locations within a certain radius are considered relevant.

Table 1: Mean Average Precision of our four methods

Method      MAP
Text        0.185
TextNoGeo   0.099
Geo         0.011
Text+Geo    0.107

Each geographic relation is considered of greater importance than the one that follows it. This follows Egenhofer and Mark's 'topology matters, metrics refine' premise. The methods of greater importance are expanded first in the pipeline illustrated in Figure 2. The expanded query is finally combined in Lucene using the vector space model.
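The default gazetteer and the referent heuristic can be pictured with the short sketch below. The data structures (occurrence counters, candidate lists and a parent table) are assumptions standing in for the Wikipedia co-occurrence model and the TGN; the logic follows the rules described in Sections 2.1.3 and 3.1, and the ids and counts in the usage line are toy placeholders.

from collections import Counter

# Sketch of Section 3.1: a many-to-one default gazetteer built from
# co-occurrence counts, and the referent heuristic of Section 2.1.3.

def build_default_gazetteer(occurrence_counts):
    """occurrence_counts: {placename: Counter({tgn_id: times referenced})}.
    Maps every placename to its most commonly referenced location."""
    return {name: counts.most_common(1)[0][0]
            for name, counts in occurrence_counts.items()}

def ancestors(tgn_id, parent_of):
    while tgn_id in parent_of:
        tgn_id = parent_of[tgn_id]
        yield tgn_id

def disambiguate(placename, referent, candidates, parent_of, default_gazetteer):
    """candidates: {placename: [candidate tgn_ids]}; referent may be None."""
    options = candidates.get(placename, [])
    if referent is not None:
        # keep only candidates that lie inside one reading of the referent
        referent_ids = set(candidates.get(referent, []))
        options = [c for c in options
                   if referent_ids & set(ancestors(c, parent_of))]
    if len(options) == 1:
        return options[0]                     # the referent made it unambiguous
    return default_gazetteer.get(placename)   # fall back to the most common reading

# Toy usage; counts and the second id are placeholders, not TGN data.
gazetteer = build_default_gazetteer({"London": Counter({7011781: 532, 7013759: 21})})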
4 Experiments

We compared four methods of query construction. All methods query the same index.

• Standard Text (Text). This method uses only the standard text retrieval part of the system. The motivation for this method is to evaluate our text retrieval engine and provide a baseline.

• Text with geographic entities removed (TextNoGeo). For this method we manually removed the geographic entities from the text queries to quantify the importance of ambiguous geographic entities. The results produced by this method should be orthogonal to the results produced by the Geo method.

• Geographic entities (Geo). The Geo method uses only the geographic entities contained in a query; these are matched ambiguously against the named entity index and unambiguously against the geographic index. Ranking is performed using the geographic relevance methods described in Section 3.2.

• Text and geographic entities (Text+Geo). Our combined method combines elements of textual relevance with geographic relevance using the vector space model. It is a combination of the Text and Geo methods.

Our hypothesis is that the combination of text and geographic relevance will give the best results, as it uses the most information to discover documents relevant to the query. The Standard Text method should provide a good baseline against which to compare this hypothesis, and the orthogonal Geo and TextNoGeo entries should help us interpret where the majority of the information is held.

5 Results

The experimental results are displayed in Table 1. Surprisingly, the Text result is the best, with a confidence greater than 99.95% using the Wilcoxon signed rank test [5]. The Text+Geo method is better than the TextNoGeo method with a confidence greater than 95%. The Geo results are the worst, with a confidence greater than 99.5%.

74.9% of the named entities tagged by ANNIE as locations were mapped to locations in the default gazetteer. This is consistent with the prediction of ∼75% made in Section 3.1.

Brief observation of the per-query results shows that the Text+Geo results are better than Geo in all except one case, while the Text results are better in all except two cases. The largest variation in results (and the smallest significant difference) is between the Text+Geo and TextNoGeo results.
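The significance figures above come from paired per-query comparisons; a minimal sketch of such a test, using SciPy's Wilcoxon signed rank implementation over per-query average precision, is shown below. The AP values are invented placeholders, not the actual GeoCLEF 2007 per-query scores.

# Minimal sketch of the per-query significance test of Section 5: a Wilcoxon
# signed rank test over paired per-query average precision values.
# The numbers below are invented placeholders, not our actual per-query scores.
from scipy.stats import wilcoxon

ap_text     = [0.31, 0.12, 0.08, 0.44, 0.02, 0.27, 0.19, 0.05]
ap_text_geo = [0.22, 0.10, 0.09, 0.30, 0.01, 0.15, 0.18, 0.03]

statistic, p_value = wilcoxon(ap_text, ap_text_geo)
print("W = %s, p = %.4f" % (statistic, p_value))   # small p => significant difference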
6 Conclusions

Surprisingly, the Text method achieved significantly better results than the combination of textual and geographic relevance. We attribute the relatively poor results of the Text+Geo method to the way the textual and geographic relevance were combined: the separate types of geographic relevance and the textual relevance were all combined within Lucene's vector space model with no normalisation. The motivation behind this was that, using Lucene's term boosting, we should be able to give greater weighting to text terms.

The difference in information between the Text+Geo method and the Text method is captured in the Geo method. Observation of the per-query results shows that in cases where the Geo method performed poorly and the Text method performed well, the Text+Geo method performed poorly. The intention of combining the two methods was to produce synergy; in reality, however, the Geo method undermined the Text results.

The Geo method alone performed poorly compared to the other methods. However, considering that the only information provided in these queries is geographic (generally a list of placenames), the results are very promising. The highest per-query result achieved by the Geo method had an average precision of 0.097.

Further work is needed to evaluate the accuracy of the placename disambiguation. Currently we have only quantified that 74.9% of locations recognised by ANNIE are disambiguated. We have not yet evaluated the disambiguation accuracy or the proportion of locations that are missed by ANNIE.

In future work we would like to repeat the combination experiment detailed in this paper, but separating the geographic relevance and textual relevance into two separate indexes. Similarity values with respect to a query could then be calculated for both indexes, normalised and combined in a weighted sum. A similar approach was taken at GeoCLEF 2006 by Martins et al. [6].

References

[1] Apache Lucene Project. http://lucene.apache.org/java/docs/. Accessed 1 August 2007.

[2] H. Cunningham, D. Maynard, V. Tablan, C. Ursu, and K. Bontcheva. Developing language processing components with GATE. Technical report, University of Sheffield, 2001.

[3] M. Egenhofer and D. Mark. Naive geography. In Conference on Spatial Information Theory (COSIT), 1995.

[4] P. Harping. User's Guide to the TGN Data Releases. The Getty Vocabulary Program, 2.0 edition, 2000.

[5] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In Annual International ACM SIGIR Conference, pages 329–338, 1993.

[6] B. Martins, N. Cardoso, M. Chaves, L. Andrade, and M. Silva. The University of Lisbon at GeoCLEF 2006. In Working Notes for the CLEF Workshop, 2006.

[7] S. Overell, J. Magalhães, and S. Rüger. Forostar: A system for GIR. In Lecture Notes from the Cross Language Evaluation Forum 2006 (to appear), 2007.

[8] S. Overell and S. Rüger. Geographic co-occurrence as a tool for GIR. In CIKM Workshop on Geographic Information Retrieval (to appear), 2007.

[9] M. Ruiz, S. Shapiro, J. Abbas, S. Southwick, and D. Mark. UB at GeoCLEF 2006. In Working Notes for the CLEF Workshop, 2006.