=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-GeoCLEF-PereaOrtegaEt2007
|storemode=property
|title=GEOUJA System. University of Jaén at GeoCLEF 2007
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-GeoCLEF-PereaOrtegaEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/Perea-OrtegaGGM07
}}
==GEOUJA System. University of Jaén at GeoCLEF 2007==
José M. Perea-Ortega, Miguel A. García-Cumbreras, Manuel García-Vega, Arturo Montejo-Ráez
SINAI Group. Department of Computer Science. University of Jaén
Campus Las Lagunillas, Ed. A3, E-23071, Jaén, Spain
{jmperea,magc,mgarcia,amontejo}@ujaen.es

Abstract

This paper describes the second participation of the SINAI group of the University of Jaén in GeoCLEF 2007. We have developed a system different from the one presented at GeoCLEF 2006. Our architecture is made up of five main modules. The first one is the Information Retrieval Subsystem, which works with collections and queries in English and returns the relevant documents for a query. Queries that are not in English are translated by the Translation Subsystem. All queries are processed by the Geo-Relation Finder Subsystem, which finds the spatial relations in the topic, and by the NER (Named Entity Recognition) Subsystem, which looks for the locations in the topic. The most important module is the Geo-Relation Validator Subsystem, which applies several heuristics to filter the documents retrieved by the IR Subsystem. We have made several runs, combining these modules to solve the monolingual and bilingual tasks. The results obtained show that the heuristics applied are quite restrictive, so new heuristics must be generated and the rules used to filter the retrieved documents must be refined.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

General Terms

Algorithms, Languages, Performance, Experimentation

Keywords

Information Retrieval, Geographic Information Retrieval, Named Entity Recognition, GeoCLEF

1 Introduction

The objective of GeoCLEF is to evaluate Geographical Information Retrieval (GIR) systems in tasks that involve both spatial and multilingual aspects. Given a multilingual statement describing a spatial user need (topic), the challenge is to find relevant documents from target collections in English, with topics expressed in English, Spanish, German or Portuguese [3].

This is our second participation in GeoCLEF, after last year's edition [2]. In the last edition we studied the behavior of query expansion. The results obtained showed us that filtering improves precision and recall. For this reason, our system consists of five subsystems: Translation, Geo-Relation Finder, NER, Geo-Relation Validator and Information Retrieval. The most important one is the Validator module: the list of relevant documents is filtered with it, and if a document does not pass the validation test it is removed from the list.

The next section describes the whole system. Then, in Section 3, each module of the system is explained. Later on, the results are described and, finally, the conclusions about our participation in GeoCLEF 2007 are presented.

2 System overview

We propose a Geographical Information Retrieval system that is made up of five related subsystems. These modules are explained in detail in the next section.

In our architecture we only work with the English collection (the Los Angeles Times 94 (LA94) and Glasgow Herald 95 (GH95) document sets), and we have applied an off-line preprocessing step to it. This preprocessing consists of applying an English stop-words list, a named entity recognizer (NER) and the Porter stemmer [4]. The preprocessed data set is then indexed by the IR Subsystem, and each translated query goes through the same preprocessing before being submitted to it.
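As an illustration, the following minimal sketch reproduces the stop-word removal and Porter stemming stages of this preprocessing (the NER stage is omitted). NLTK is used here as a stand-in toolkit and is an assumption; the paper does not name the stop-word list or stemmer implementation actually employed.

```python
# A minimal sketch of the off-line preprocessing described above: stop-word
# removal plus Porter stemming (the NER stage is omitted). NLTK is an
# assumption here; the paper does not name the implementation it used.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Lower-case a text, drop English stop words, apply the Porter stemmer."""
    tokens = text.lower().split()
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("oil and gas production between the UK and the European continent"))
# ['oil', 'ga', 'product', 'uk', 'european', 'contin']
```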
If the language of the topic is different from English, it is translated by means of the Translation Subsystem. Each translated query is then labeled with NER and geo-relation information: the Geo-Relation Finder Subsystem (GR Finder Subsystem) extracts the spatial relations from the geographic query, and the NER Subsystem recognizes the named entities.

From the original English query, the documents are retrieved by the IR Subsystem, which has previously indexed the data collection. The NER information and the geo-relation components in the relevant documents, together with the NER locations from the geographic query, are the input for the Geo-Relation Validator Subsystem (GR Validator Subsystem), the most important module in our architecture. In the GR Validator Subsystem we eliminate those previously retrieved documents that do not satisfy several rules. These rules are related to all the information handled by this module (locations and spatial relations from the documents and the geographic queries) and are explained in Section 3.4. Figure 1 shows the proposed system architecture.

Figure 1: GEOUJA System architecture

3 Subsystems description

3.1 Translation Subsystem

As translation module we have used SINTRAM (SINai TRAnslation Module) [1]. This subsystem translates the queries from several languages into English. SINTRAM uses several on-line machine translators for each language pair and implements some heuristics to combine the different translations. After a thorough evaluation, the best translators were found to be:

• Systran for French, Italian and Portuguese, available at http://www.systransoft.com
• Prompt for Spanish, available at http://translation2.paralink.com

3.2 NER Subsystem

The main goal of the NER Subsystem is to detect and recognize the entities appearing in the queries. We are only interested in geographical information, so we use only the locations detected by this module. We have used the NER module of the GATE toolkit (http://gate.ac.uk/). Location terms include towns, cities, capitals, countries and even continents. The NER module adds entity labels to the topics with the locations found. An example of an entity label recognized in the title of a topic follows:

<en_title position="15" type="LOC">USA</en_title>

where position is the position of the entity in the phrase. This value is greater than or equal to zero, and we use it to know at any moment which locations and geo-relations are related to each other by proximity.

The basic operation of the NER Subsystem is the following:

• The first step is the preprocessing phase. Each query is preprocessed using a tokenizer, a sentence splitter and a POS tagger. The NER Subsystem we have used needs this information in order to improve named entity detection and recognition.
• The second step is the detection of the geographical places. For this purpose we use a gazetteer, also included in GATE.

The NER Subsystem generates new topic labels, based on the original ones, adding the locations. These topic labels will be used later by the GR Validator Subsystem.

3.3 GR Finder Subsystem

The Geo-Relation Finder Subsystem is used to find the spatial relations in the geographic queries. This module makes use of four text files to store the geo-relations to be identified. Four files exist because our system detects spatial relations of four words at the most. Some examples of geo-relations are: in, near, north of, next to, in or around, in the west of.
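The matching step can be pictured with the following hedged sketch. The word lists stand in for the four text files mentioned above (the real lists are not published), token positions replace the character offsets used in the labels, and the special handling of "between" described next is omitted.

```python
# A hypothetical sketch of the GR Finder matching step. The word lists stand
# in for the four text files mentioned above (the real lists are not
# published), and token positions replace the character offsets of the labels.
GEO_RELATIONS = {
    4: {"in the west of", "in the north of"},
    3: {"in or around"},
    2: {"north of", "south of", "east of", "west of", "next to"},
    1: {"in", "near", "at", "along"},
}

def find_geo_relations(tokens, entity_positions):
    """Scan a tokenized topic field, trying the longest relations first.

    A candidate is kept only when the word that follows it was tagged as a
    named entity, which is why the NER Subsystem must run beforehand.
    Returns (token position, relation) pairs.
    """
    found = []
    i = 0
    while i < len(tokens):
        for length in (4, 3, 2, 1):
            phrase = " ".join(tokens[i:i + length]).lower()
            if phrase in GEO_RELATIONS[length] and (i + length) in entity_positions:
                found.append((i, phrase))
                i += length
                break
        else:
            i += 1  # no relation starts here; move on
    return found

# "oil production near Glasgow", with "Glasgow" tagged as an entity at index 3:
print(find_geo_relations(["oil", "production", "near", "Glasgow"], {3}))
# [(2, 'near')]
```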
Some geo-relations examples are: in, near, north of, next to, in or around, in the west of... The GR Finder module adds geo-relation labels to the topics with the found spatial relations. A geo-relation label example recognized in the title of the topic would be the following one: < gr title position = ”43” > near < /gr title > where position is the position of the spatial relation in the phrase. In this module we controlled a special geo-relation named between. For this case, the GR Finder Subsystem adds the two entities that this preposition relates. An example of the label that this module adds for description label ”To be relevant documents describing oil or gas production between the UK and the European continent will be relevant” is: < gr desc position = ”9” > between the; U K; European < /gr desc > where we can see how both entities (UK and European) are added after the preposition, separated by a semicolon. The basic operation of GR Finder Subsystem is the following: • For each topic label (title, desc or narr ) the subsystem looks for some spatial relation. It makes use of the text files that store the geo-relations that can detect. • For each found spatial relation we verified that the word that comes next is an entity. For that reason it is necessary that the NER Subsystem is executed before. Like NER Subsystem, the GR Finder Subsystem also generates topic labels, based on the original topic, adding the spatial relations. These topic labels (entities and geo-relations) will be used later by the GR Validator Subsystem. In the Figure 2 we can see an example of the text that generates this subsystem. Figure 2: Text example generated by GR Finder Subsystem 3.4 GR Validator Subsystem This it is the most important module of our system. Its main goal is to discriminate what documents among the recovered ones by the IR Subsystem are valid. In order to apply different heuristics, this module makes use of geographical data. This geo- information has been obtained from Geonames Gazetteer 3 . This module solves evaluations like: 3 http://www.geonames.org/. Geonames geographical database contains over eight million geographical names and consists of 6.3 million unique features whereof 2.2 million populated places and 1.8 million alternate names • Find the country name of a city. • Find the latitude and longitude for a given location. • Check if a city belongs to a certain country. • Check if a location is to the north of another one. • Calculate the distance from a location to another one. Many heuristics can be applied with the former check points to make the validation of a docu- ment recovered by the IR Subsystem. The GR Validator Subsystem receives external information from IR Subsystem (entities from each document recovered) and from GR Finder and NER Sub- systems (entities and spatial relations from each topic). This year we have used the following heuristics in our experiments: 1. For every entity appearing in query without an associated geo-relation, the system checks if this entity is present in documents recovered by IR Subsystem. The module discards a document whenever the number of entities found in topic with no associated geo-relation and not appearing in that document exceeds the fifty percent of the total of topic entities. 2. If the entity appearing in the topic has associated some spatial relation, the module checks if location is a continent, a country or a city. 
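The scoring scheme can be sketched as follows. The gazetteer entries and the +1/-1 point values are assumptions: the paper queries the Geonames gazetteer at run time and does not publish its exact scores, so this is a minimal illustration of heuristics 1 and 2(b), not the authors' implementation.

```python
# A hedged sketch of the validation scoring. The gazetteer entries and the
# +1/-1 point values are assumptions; the paper does not publish its scores.

# name -> (latitude, longitude) -- illustrative entries only
GAZETTEER = {
    "Glasgow": (55.86, -4.25),
    "London": (51.51, -0.13),
}

# Directional tests on (lat, lon) pairs, e.g. "a north of b" <=> higher latitude.
DIRECTION_TESTS = {
    "north of": lambda a, b: a[0] > b[0],
    "south of": lambda a, b: a[0] < b[0],
    "east of":  lambda a, b: a[1] > b[1],
    "west of":  lambda a, b: a[1] < b[1],
}

def validate(doc_locations, topic_entities):
    """Return True when a retrieved document passes the validation test.

    doc_locations:  set of locations the NER step found in the document
    topic_entities: (location, geo_relation or None) pairs from the topic
    """
    score = 0

    # Heuristic 1: discard the document when more than half of the topic
    # entities carrying no geo-relation are absent from it.
    plain = [loc for loc, rel in topic_entities if rel is None]
    missing = sum(1 for loc in plain if loc not in doc_locations)
    if topic_entities and missing > len(topic_entities) / 2:
        return False
    score += len(plain) - missing  # assumed reward for entities that do appear

    # Heuristic 2(b): for relations such as "north of London", compare the
    # coordinates of every document location against the topic location.
    for loc, rel in topic_entities:
        if rel in DIRECTION_TESTS and loc in GAZETTEER:
            ref = GAZETTEER[loc]
            for other in doc_locations:
                if other in GAZETTEER and other != loc:
                    score += 1 if DIRECTION_TESTS[rel](GAZETTEER[other], ref) else -1

    # The document is kept when the final score is positive.
    return score > 0

# A document mentioning Glasgow, for a topic asking for places north of London:
print(validate({"Glasgow"}, [("London", "north of")]))  # True
```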
3.5 IR Subsystem

The information retrieval system that we have employed is Lemur (http://www.lemurproject.org/), a toolkit developed as part of the Lemur Project, a collaboration between the Computer Science Department at the University of Massachusetts and the School of Computer Science at Carnegie Mellon University. It supports the indexing of large-scale text databases, the construction of simple language models for documents, queries or sub-collections, and the implementation of retrieval systems based on language models as well as a variety of other retrieval models.

Prior to the indexing step, the English collection provided for GeoCLEF was preprocessed, using the English stop-words list to remove meaningless terms and the Porter stemmer [4] for suffix stripping. A NER has also been used to recognize the possible entities in each document. Next, the English collection was indexed using Lemur.

After indexing the collection, each topic, already translated, is sent to Lemur. The relevant documents retrieved and their NER information are then used by the GR Validator Subsystem. One parameter of each experiment is the weighting function, such as Okapi [5] or TF.IDF; another is whether or not Pseudo-Relevance Feedback (PRF) [6] is used.

4 Experiments and Results

SINAI (http://sinai.ujaen.es) has participated in the monolingual and bilingual tasks of GeoCLEF 2007 with a total of 26 experiments. In all the experiments we have considered all the tags of the topics (title, description and narrative) as the source for the information retrieval process.

Our baseline experiment consists of the Lemur retrieval on the preprocessed collections (stop-word removal and stemming) without applying any heuristics to the relevant documents retrieved. This experiment has been run for both the monolingual and bilingual tasks. Our second experiment consists of applying the heuristics explained in Section 3.4 to the relevant documents retrieved by the IR Subsystem. This experiment has also been run for both the monolingual and bilingual tasks.
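For reference, the two measures reported in Tables 1 and 2 follow the standard TREC-style definitions, sketched below in Python; this is generic evaluation code, not the official CLEF assessment scripts.

```python
# Standard definitions of the two measures reported in Tables 1 and 2
# (generic TREC-style evaluation code, not the official CLEF scripts).
def average_precision(ranked, relevant):
    """Average of the precision values at each rank holding a relevant doc."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """Precision over the first R results, where R = number of relevant docs."""
    r = len(relevant)
    return sum(1 for doc in ranked[:r] if doc in relevant) / r if r else 0.0

# Mean Average Precision (MAP) is average_precision averaged over all topics:
runs = {"topic_01": (["d1", "d3", "d2"], {"d1", "d2"})}
print(sum(average_precision(rk, rel) for rk, rel in runs.values()) / len(runs))
# 0.8333...: precision 1/1 at rank 1 plus 2/3 at rank 3, divided by 2 relevant
```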
4.1 Monolingual task

In the monolingual task we have participated with a total of 8 experiments: four for the baseline experiment (Exp1) and four for the second experiment, which applies the heuristics introduced previously in this paper (Exp2). Some results are shown in Table 1.

In the experiments whose names end in "fb_okapi" we used Okapi with feedback as the weighting function in the information retrieval process. Those ending in "fb_tfidf" indicate that we applied TF.IDF with feedback. We have also run experiments with Okapi but without feedback ("simple_okapi") and with TF.IDF but without feedback ("simple_tfidf").

Experiment                      Mean Average Precision   R-Precision
Sinai_ENEN_Exp1_fb_okapi        0.2605                   0.2636
Sinai_ENEN_Exp1_fb_tfidf        0.1803                   0.1858
Sinai_ENEN_Exp1_simple_okapi    0.2486                   0.2624
Sinai_ENEN_Exp1_simple_tfidf    0.1777                   0.1745
Sinai_ENEN_Exp2_fb_tfidf        0.1343                   0.1656

Table 1: Summary of results for the monolingual task

4.2 Bilingual task

In the bilingual task we have participated with a total of 18 experiments: twelve for the baseline case (Exp1) and six applying our heuristics (Exp2). Some results are shown in Table 2. For naming the experiments we have followed the same convention described in Section 4.1. For the German-English task we submitted six experiments in total, identified by the string "GEEN". For the Portuguese-English task we also submitted six experiments, identified by the string "PTEN". For the Spanish-English task we submitted six experiments in total (string "SPEN").

Experiment                      Mean Average Precision   R-Precision
Sinai_GEEN_Exp1_fb_okapi        0.0686                   0.0704
Sinai_PTEN_Exp1_fb_okapi        0.1568                   0.1519
Sinai_SPEN_Exp1_fb_okapi        0.2362                   0.2238
Sinai_GEEN_Exp1_fb_tfidf        0.0572                   0.0606
Sinai_PTEN_Exp1_fb_tfidf        0.1080                   0.1133
Sinai_SPEN_Exp1_fb_tfidf        0.1511                   0.1533
Sinai_GEEN_Exp1_simple_okapi    0.0484                   0.0569
Sinai_PTEN_Exp1_simple_okapi    0.1544                   0.1525
Sinai_SPEN_Exp1_simple_okapi    0.2310                   0.2476
Sinai_GEEN_Exp1_simple_tfidf    0.0435                   0.0420
Sinai_PTEN_Exp1_simple_tfidf    0.1053                   0.1117
Sinai_SPEN_Exp1_simple_tfidf    0.1447                   0.1513
Sinai_PTEN_Exp2_fb_tfidf        0.0695                   0.1074

Table 2: Summary of results for the bilingual task

5 Conclusions and Future work

In this paper we have presented the experiments carried out in our second participation in the GeoCLEF campaign. The philosophy followed in this second experimental study has changed with respect to the approach presented last year. This year we have introduced a very restrictive system: we have tried to eliminate those documents retrieved by the IR Subsystem that do not satisfy certain validation rules. The previous year, in contrast, we focused on enriching the queries, expanding them with entities and thesaurus information in order to improve retrieval effectiveness.

The results obtained the previous year showed that query expansion does not, in general, improve the quality of the information retrieval process. The results of this year show that the documents retrieved are valid, but that the GR Validator Subsystem has filtered out some documents that should not have been eliminated.

In the future we will try to add more heuristics to the GR Validator Subsystem, making use of the Geonames gazetteer. We will also define more precise rules so that the system is less restrictive in the selection of retrieved documents. Finally, we will also explore a larger number of documents retrieved by the IR Subsystem, with the aim of providing a larger variety of documents to be checked by the GR Validator Subsystem.

6 Acknowledgments

This work has been supported by the Spanish Government (MCYT) with grant TIN2006-15265-C06-03.

References

[1] Miguel A. García-Cumbreras, L. Alfonso Ureña-López, Fernando Martínez-Santiago, and José M. Perea-Ortega. BRUJA system. The University of Jaén at the Spanish task of QA@CLEF 2006. In Proceedings of the Cross Language Evaluation Forum (CLEF 2006), 2006.
[2] Manuel García-Vega, Miguel A. García-Cumbreras, L. Alfonso Ureña-López, and José M. Perea-Ortega. GEOUJA system. The first participation of the University of Jaén at GeoCLEF 2006. In Proceedings of the Cross Language Evaluation Forum (CLEF 2006), 2006.

[3] Fredric Gey, Ray Larson, Mark Sanderson, Kerstin Bischoff, Thomas Mandl, Christa Womser-Hacker, Diana Santos, and Paulo Rocha. GeoCLEF 2006: the CLEF 2006 cross-language geographic information retrieval track overview. In Proceedings of the Cross Language Evaluation Forum (CLEF 2006), 2006.

[4] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[5] S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In Proceedings of the 8th Text Retrieval Conference (TREC-8), NIST Special Publication 500-246, pages 151–162, 1999.

[6] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41:288–297, 1990.