MSRA Columbus at GeoCLEF2007

Zhisheng Li1, Chong Wang2, Xing Xie2, Wei-Ying Ma2
1 Department of Computer Science, University of Sci. & Tech. of China, Hefei, Anhui, 230026, P.R. China
zsli@mail.ustc.edu.cn
2 Microsoft Research Asia, 4F, Sigma Center, No.49, Zhichun Road, Beijing, 100080, P.R. China
{chwang, xingx, wyma}@microsoft.com

Abstract
This paper describes the participation of the Columbus Project of Microsoft Research Asia (MSRA) in GeoCLEF2007 (a cross-language geographical retrieval track that is part of the Cross Language Evaluation Forum). This is the second time we have participated in this event. Since the queries in GeoCLEF2007 are similar to those in GeoCLEF2006, we leverage most of the methods we used in GeoCLEF2006, including the MSRAWhitelist, MSRAExpansion, MSRALocation and MSRAText approaches. The difference is that the MSRAManual approach is not included in GeoCLEF2007; we use MSRALDA instead. In MSRALDA, we combine the Latent Dirichlet Allocation (LDA) model with the text retrieval model. The results show that the application of the LDA model to the GeoCLEF monolingual English task needs to be further explored.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.2.3 [Database Management]: Languages – Query Languages

General Terms
Measurement, Performance, Experimentation

Keywords
Geographic information retrieval, System design, Latent Dirichlet Allocation, Evaluation

1. Introduction
In general web search and mining, location information is usually discarded. However, people deal with locations all the time, for example when dining, traveling and shopping. GeoCLEF [7] aims at providing the necessary platform for evaluating geographic information retrieval. We also participated in GeoCLEF2006, so this is the second time we have taken part in this event. As in GeoCLEF2006, we only participated in the Monolingual GeoCLEF evaluation (EN-EN) and submitted five runs based on different methods.

2. Geographic Information Retrieval System
Figure 1. Architecture of our GIR system (offline phase: text indexing and geographic indexing of the collection with the GKB, producing a hybrid text/geo index; online phase: query processing, searching and ranking of results).

Figure 1 shows the system flow of our GIR system used in GeoCLEF2007. Our geographic information retrieval system is mainly composed of a geo-knowledge base, a location extraction module, a geo-focus detection module, a query processing module, a geo-indexing module and a geo-ranking module. We describe these modules in the following sections.

2.1 Geographic Knowledge Base
The Geographic Knowledge Base (GKB) we use is the same as last year's. We use an internal geographic database as our basic gazetteer. This gazetteer contains basic information about locations all over the world, including location name, location type, location importance and the hierarchical relationships between locations. We utilize this gazetteer to extract locations, to disambiguate them and to detect the focuses of documents. Besides this gazetteer, we also use some other resources to improve the performance, including a stop word list, a person name list, a white list and a location indicator list. The method used to generate the stop word list can be found in our report from last year [12]. The white list and location indicator list are maintained manually, while the person name list is downloaded from the Internet. Finally, we integrated all these resources into a GKB.
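The exact schema of the internal gazetteer is not published; the following is only a minimal sketch of the kind of record implied by the fields named above (location name, type, importance, hierarchical relationship). All field and variable names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GazetteerEntry:
    """Hypothetical GKB record, assuming the fields named in Section 2.1."""
    location_id: int
    name: str                        # e.g. "Redmond"
    loc_type: str                    # e.g. "city", "state", "country"
    importance: float                # relative importance, usable for disambiguation
    parent_id: Optional[int] = None  # hierarchical relationship (e.g. city -> state)
    latitude: float = 0.0
    longitude: float = 0.0

# A tiny in-memory gazetteer keyed by name; a real system would also handle
# alternate names and ambiguous entries (several candidate locations per name).
gazetteer = {
    "Redmond": [GazetteerEntry(1, "Redmond", "city", 0.4, parent_id=2,
                               latitude=47.67, longitude=-122.12)],
    "Washington": [GazetteerEntry(2, "Washington", "state", 0.7)],
}
```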
2.2 Location Extraction Module
The location extraction module aims to extract locations from unstructured text and to disambiguate them. It is used in the query processing module and the geo-indexing module. We manually composed rules to address this task. The module consists of several parts: text parsing, geo-parsing, geo-disambiguation and geo-coding. For more details, please see our report from last year [12].

2.3 Geographic Focus Detection Module
Once the exact positions of the locations are determined, we want to find the focus of each document. We adopted the algorithm described in [5]. Its main idea is to accumulate the score of each node in the hierarchical tree from the bottom up, and then to sort all nodes whose scores are non-zero. This yields a list of candidate focuses for the document, and the location with the highest score is the most likely focus. For example, a document that mainly discusses the Redmond economy but also mentions the Seattle economy has Redmond as its focus.

2.4 Query Processing Module
GeoCLEF2007 topics are structured: they contain topic numbers, titles, descriptions and narratives. They do not provide explicit locations and relationships, so we need to parse the queries first and identify the geographic references, i.e. the textual terms, spatial relationships and locations, from the different parts of the topics. However, some topics are hard to parse. For example, "Lakes with monsters" and "Rivers with floods" do not contain explicit locations. As further examples, "Sport events in the French speaking part of Switzerland" and "F1 circuits where Ayrton Senna competed in 1994" are human-language style queries that are too difficult for machines to understand. Therefore, we designed three schemes to process the topics: automatic extraction, pseudo feedback and manual whitelist expansion.
1. Automatic extraction. We use the location extraction module to extract locations and get their coordinates. To identify the relationships, e.g. "in", "near", we designed a simple relationship matching program that adopts a rule-based approach. Apart from the locations and relationships, we regard the remaining parts of the topic title as the textual keywords. In this way we can handle topics containing explicit locations, e.g. "Damage from acid rain in northern Europe" and "OSCE meetings in Eastern Europe".
2. Pseudo feedback. For topics that do not contain explicit locations, we use a pseudo-feedback technique to expand the queries (a sketch follows at the end of this section). We do this in the following steps. First, we search for the topic title in our search engine to get the top-N documents (here we set N = 100). Then we use the location extraction module to extract the locations from these documents and select the most frequent ones (the top 10 in our experiments). Finally, we use the selected locations as the locations for the queries.
3. Manual expansion. For topics like "Whisky making in the Scottish Islands" and "Water quality along coastlines of the Mediterranean Sea", although they contain location names in the titles, it is still difficult to identify the precise locations from these imprecise names, e.g. the coastlines of the Mediterranean Sea or the Scottish Islands. We expand them manually into exact locations by looking them up in our geographic knowledge base and use the result as a location whitelist.
After processing the topics, we obtain the textual terms, spatial relationships and locations of the topics and send them to our GIR system.
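The pseudo-feedback step above (scheme 2) can be summarized in a few lines. The following is a minimal sketch under the stated settings (N = 100 feedback documents, top 10 locations kept); `search_engine` and `location_extractor` are hypothetical interfaces standing in for IREngine and the location extraction module, not the actual implementation.

```python
from collections import Counter

N_FEEDBACK_DOCS = 100   # top-N documents retrieved for the topic title (Section 2.4)
TOP_K_LOCATIONS = 10    # most frequent locations kept as expansion terms

def expand_locations_by_pseudo_feedback(topic_title, search_engine, location_extractor):
    """Expand a topic that has no explicit location, as described in scheme 2.

    `search_engine.search(query, n)` returns the top-n documents for the query;
    `location_extractor.extract(text)` returns the location names found in text.
    Both are hypothetical placeholders.
    """
    # 1. Retrieve the top-N documents for the bare topic title.
    feedback_docs = search_engine.search(topic_title, N_FEEDBACK_DOCS)

    # 2. Extract locations from the feedback documents and count their occurrences.
    counts = Counter()
    for doc in feedback_docs:
        counts.update(location_extractor.extract(doc.text))

    # 3. Keep the most frequent locations as the geographic part of the query.
    return [loc for loc, _ in counts.most_common(TOP_K_LOCATIONS)]
```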
2.5 Geo-Indexing Module
We use a hybrid indexing scheme in our GIR system, which contains two parts: a text index and a geo-index. In our system, explicit locations and implicit locations [9] are indexed together and different geo-confidence scores are assigned to them. The advantage of this mechanism is that no query expansion is necessary and implicit location information can be computed offline for fast retrieval. We adopt two types of geo-indexes: one is called the focus-index, which uses an inverted index to store all the explicit and implicit locations of documents; the other is called the grid-index, which divides the surface of the Earth into 1000 × 2000 grids. Documents are indexed by these grids according to their focuses. For more details, please see our report from last year [12].

2.6 Geo-Ranking Module
For the ranking module, we adopt IREngine, developed by MSRA, as our basic search engine, and we integrated the geo-ranking module into it. To test the effectiveness of different methods, we designed three kinds of ranking algorithms: 1) pure textual ranking, whose basic ranking function is BM25; 2) linearly combining the text relevance and the geo-relevance; 3) linearly combining the text relevance and the LDA relevance.

2.6.1 Geo-based Model
In the second scheme, we retrieve a document list with geo-relevance from the geo-index by looking up the geographic terms. That is, for the focus-index, the matched docID list is retrieved by looking up the locationID in the inverted index; for the grid-index, we get the docID list by looking up the grids that the query location covers. We first retrieve two lists of documents relevant to the textual terms and the geographical terms respectively, and then merge them to get the final results. For re-ranking, we used a combined ranking function of the form score = α · score_text + (1 − α) · score_geo, where score_text is the textual relevance score and score_geo is the geo-relevance score. Experiments show that the textual relevance score should be weighted higher than the geo-relevance score (i.e., α > 0.5 in our experiments).
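As a concrete illustration of this geo-based re-ranking, the following is a minimal sketch assuming normalized score dictionaries from the text and geo retrieval steps; the function name and the particular weight value are illustrative only, not the system's actual implementation.

```python
def merge_text_and_geo(text_scores, geo_scores, alpha=0.7):
    """Linearly combine textual and geographic relevance (Section 2.6.1).

    `text_scores` and `geo_scores` map docID -> normalized relevance score.
    alpha > 0.5 reflects the observation that textual relevance should be
    weighted higher than geo-relevance; 0.7 is an illustrative value only.
    """
    combined = {}
    for doc_id in set(text_scores) | set(geo_scores):
        s_text = text_scores.get(doc_id, 0.0)
        s_geo = geo_scores.get(doc_id, 0.0)
        combined[doc_id] = alpha * s_text + (1.0 - alpha) * s_geo
    # Return docIDs ranked by the combined score, best first.
    return sorted(combined, key=combined.get, reverse=True)
```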
2.6.2 LDA-based Model
For the third scheme, we explored the Latent Dirichlet Allocation model in our GeoCLEF2007 experiments. The Latent Dirichlet Allocation (LDA) model [10] is a semantically consistent topic model. In LDA, the topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all documents.

Figure 2. Graphical representation of the LDA model (hyperparameters α and β, per-document topic mixture θ, topic assignments z and words w, replicated over the N words of each of the D documents).

The generative process for LDA can be stated as follows. For each text document d:
1. Choose θ_d ~ Dir(α).
2. For each word w_n in document d:
   a. Choose a topic z_n ~ Multinomial(θ_d).
   b. Choose a word w_n from p(w_n | z_n, β), a topic-specific multinomial probability distribution.
Thus, the likelihood of generating a corpus D is:
p(D | α, β) = ∏_d ∫ p(θ_d | α) [ ∏_n Σ_{z_n} p(z_n | θ_d) p(w_n | z_n, β) ] dθ_d
The LDA model is represented as a probabilistic graphical model. Compared to the probabilistic Latent Semantic Indexing model (pLSI) [11], LDA possesses fully consistent generative semantics by treating the topic mixture distribution as a k-parameter hidden random variable rather than a large set of individual parameters explicitly linked to the training set. Thus LDA overcomes the overfitting problem and provides a fully generative process for new documents.

In [13], Wei and Croft discussed the application of LDA in ad hoc retrieval. We use a similar approach for our geographic information retrieval task in GeoCLEF2007, which allows us to compute the probability of a query given a document using the LDA model. That is, each document is scored by the likelihood of its document model D generating the query Q, P(Q|D) = ∏_{q∈Q} P(q|D), where D is the document model, Q is the query and q is a query term in Q. P(Q|D) is the likelihood of the document model generating the query terms under the "bag-of-words" assumption that terms are independent given the documents. In our experiments, we use the LDA model as the document model. After computing P(Q|D), we selected the top 1000 documents with the highest P(Q|D) for each query. We also used our text search engine to retrieve the top 1000 documents for each query, and then merged the two document lists. If a document appears in both lists, we used a combined score function score = λ · score_text + (1 − λ) · P(Q|D), where score_text is the textual relevance score and P(Q|D) is the LDA model probability (here we set λ = 0.5); both scores are normalized. Otherwise, we computed a new score for the document by multiplying its single score by a decay factor of 0.5. Finally, we re-ranked all these documents by the new scores and selected the top 1000 as the result.

3. Monolingual GeoCLEF Experiments (English - English)
Table 1 shows the five runs submitted to GeoCLEF. When the topic field is "Title", we use only the title element of the topic to generate the query of the run. When the topic field is "Title + Description", the title and desc elements are both used. When the topic field is "Title + Description + Narrative", the title, desc and narr elements are all used. The "Description" column of Table 1 gives a brief explanation of the method used in each run. Priorities are assigned by us, where priority 1 is the highest and 5 the lowest.

Table 1. Run information
Run-ID         | Topic Fields                    | Description                                             | Priority
MSRALDA        | Title                           | combining the LDA model                                 | 1
MSRAWhiteList  | Title + Description             | using geo knowledge base and manual query construction  | 2
MSRAExpansion  | Title + Description             | using query expansion                                   | 3
MSRALocation   | Title                           | without geo knowledge base and query expansion          | 4
MSRAText       | Title + Description + Narrative | using pure text                                         | 5

In MSRALDA, we used the title elements to generate the queries, and then used the LDA-based model described in Section 2.6.2 to select 1000 documents for each query (a sketch of this pipeline appears at the end of this section). In MSRAWhiteList, we used the title and desc elements of the topics to generate the queries. For some special queries, e.g. "Scottish Islands" and "coastlines of the Mediterranean Sea", we cannot get the exact locations directly from our gazetteer, so we utilized the GKB to obtain the corresponding geo-entities and then manually made a whitelist of the geo-terms for these queries. In MSRAExpansion, we generated the queries with the title and desc elements of the topics. Unlike MSRAWhiteList, the queries were expanded automatically based on the pseudo-feedback technique. First we used the original queries to search the corpus. Then we extracted the locations from the returned documents and counted how many times each location appeared. Finally, we took the top 10 most frequent location names and combined them with the original geo-terms in the queries. In MSRALocation, we used the title elements of the topics to generate the queries. We did not use the geo knowledge base or query expansion to expand the query locations; we just used our location extraction module to extract the locations automatically from the queries. In MSRAText, we generated the queries with the title, desc and narr elements of the topics and simply used our pure text search engine, IREngine, to process them.
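Since MSRALDA is the only new run this year, the following sketch recaps its pipeline (Section 2.6.2) before we turn to the results: query-likelihood scoring under a per-document LDA model, followed by merging with the text-retrieval list. The `doc_model.term_prob` interface, the smoothing constant and the assumption that both score lists are already normalized are simplifications for illustration, not the exact implementation.

```python
import math

TOP_K = 1000      # documents kept from each list and in the final ranking
LAMBDA = 0.5      # weight on the textual score in the combined function
DECAY = 0.5       # decay factor for documents found in only one list

def lda_query_log_likelihood(query_terms, doc_model):
    """log P(Q|D) under a per-document LDA model (bag-of-words assumption).

    `doc_model.term_prob(q)` is a hypothetical interface returning
    P(q|D) = sum_z P(q|z) * P(z|theta_D) from the document's inferred topics.
    A tiny constant avoids log(0) for unseen terms.
    """
    return sum(math.log(doc_model.term_prob(q) + 1e-12) for q in query_terms)

def merge_lists(text_scores, lda_scores):
    """Merge the top-1000 text list and the top-1000 LDA list.

    Both dictionaries map docID -> normalized score. Documents in both lists
    get the linear combination; documents in only one list keep their single
    score scaled by the decay factor, as described in Section 2.6.2.
    """
    merged = {}
    for doc_id in set(text_scores) | set(lda_scores):
        if doc_id in text_scores and doc_id in lda_scores:
            merged[doc_id] = LAMBDA * text_scores[doc_id] + (1 - LAMBDA) * lda_scores[doc_id]
        else:
            merged[doc_id] = DECAY * text_scores.get(doc_id, lda_scores.get(doc_id, 0.0))
    ranked = sorted(merged, key=merged.get, reverse=True)
    return ranked[:TOP_K]
```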
4. Results and Discussion
Figure 3. Standard recall levels vs. mean interpolated precision for all five runs.

Figure 3 and Table 2 show the results of MSRA Columbus on the GeoCLEF monolingual English task. The MSRAText run achieves the best precision among our results. The precision of the MSRALDA run decreases after combining the LDA model with the pure text model. As in GeoCLEF2006, the performance of MSRAExpansion is the lowest among the five runs, because for some topics many unrelated locations are added after pseudo feedback.

From Table 2, we can see that MSRALDA drops the performance significantly compared with MSRAText, by about 7.6% in MAP. This indicates that linearly combining the LDA model with the text model does not work well. The reason may be that we have not tuned the parameter to its best value, or that a linear combination is simply not a good choice. Although the MAP of MSRALDA is lower than that of MSRAText, it still outperforms the latter in some cases. For example, for topic 10.2452/53-GC "Scientific research at east coast Scottish Universities", MSRAText retrieves only 39 relevant documents while MSRALDA retrieves 43 (the total number of relevant documents is 64). For topic 10.2452/65-GC "Free elections in Africa", MSRAText retrieves 59 relevant documents and MSRALDA retrieves 74 (the total number of relevant documents is 93). We can also see that the standard deviation of MSRALDA is only 0.09, lower than that of MSRAText. This indicates that MSRAText performs badly in some cases, while MSRALDA performs more stably. MSRAWhiteList and MSRALocation achieve the same MAP, about 8.6%. Their MAPs are much lower than MSRAText's, by about 6.5%, and only a little better than MSRAExpansion's. Unlike the results of GeoCLEF2006, automatic location extraction and manual expansion do not bring improvements.

Table 2. MAP & standard deviation for the five runs
Run-ID        | MAP    | Standard Deviation
MSRALDA       | 7.51%  | 0.090
MSRAWhiteList | 8.61%  | 0.145
MSRAExpansion | 7.01%  | 0.134
MSRALocation  | 8.61%  | 0.145
MSRAText      | 15.19% | 0.197

5. Conclusions
We conclude that the application of the LDA model to the GeoCLEF monolingual English task needs to be further explored. Another conclusion is that automatic location extraction from the topics does not improve the retrieval performance, and sometimes even decreases it. The third conclusion is the same as last year's: automatic query expansion by pseudo feedback weakens the performance, because the topics are too hard to handle and many unrelated locations are added to the topics. Obviously, we still need to improve the system in many aspects, such as query processing, geo-indexing and geo-ranking.

6. References
[1] E. Amitay, N. Har'El, R. Sivan and A. Soffer. Web-a-where: Geotagging Web Content. SIGIR 2004.
[2] Y.Y. Chen, T. Suel and A. Markowitz. Efficient Query Processing in Geographical Web Search Engines. SIGMOD'06, Chicago, IL, USA.
[3] B. Martins, M.J. Silva and L. Andrade. Indexing and Ranking in Geo-IR Systems. GIR'05, Bremen, Germany.
[4] A.T. Chen. Cross-Language Retrieval Experiments at CLEF 2002. Lecture Notes in Computer Science 2785, Springer 2003.
[5] C. Wang, X. Xie, L. Wang, Y.S. Lu and W.Y. Ma. Detecting Geographical Locations from Web Resources. GIR'05, Bremen, Germany.
[6] M. Sanderson and J. Kohler. Analyzing Geographical Queries. GIR'04, Sheffield, UK.
[7] GeoCLEF2007. http://ir.shef.ac.uk/geoclef/
[8] C.B. Jones, A.I. Abdelmoty, D. Finch, G. Fu and S. Vaid. The SPIRIT Spatial Search Engine: Architecture, Ontologies and Spatial Indexing. Lecture Notes in Computer Science 3234, 2004.
[9] Z.S. Li, C. Wang, X. Xie, X.F. Wang and W.Y. Ma. Indexing Implicit Locations for Geographic Information Retrieval. GIR'06, Seattle, USA.
[10] D.M. Blei, A.Y. Ng and M.I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[11] T. Hofmann. Probabilistic Latent Semantic Indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
[12] Z.S. Li, C. Wang, X. Xie and W.Y. Ma. MSRA Columbus at GeoCLEF 2006. Working notes, GeoCLEF 2006, Alicante, Spain, Sep. 2006.
[13] X. Wei and W.B. Croft. LDA-based Document Models for Ad-hoc Retrieval. Proceedings of SIGIR '06, 178-185, 2006.