TALP at GeoQuery 2007: Linguistic and Geographical Analysis for Query Parsing

Daniel Ferrés and Horacio Rodríguez
TALP Research Center, Software Department
Universitat Politècnica de Catalunya
{dferres,horacio}@lsi.upc.edu

Abstract

This paper describes our experiments on the Geographical Query Parsing pilot task for English at GeoCLEF 2007. Our system reuses some modules of a Geographical Information Retrieval system presented at GeoCLEF 2006 [3], modified for GeoCLEF 2007. The system uses deep linguistic analysis and geographical knowledge to perform the task.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Design, Performance, Experimentation

Keywords

Information Retrieval, Passage Retrieval, Geographical Thesaurus, Gazetteers, Feature Type Thesaurus, Named Entity Recognition and Classification

1 Introduction

This paper describes our experiments on the Geographical Query Parsing pilot task for English at GeoCLEF 2007. The Query Parsing task (GeoQuery) is a pilot task proposed at GeoCLEF 2007. It consists of five subtasks:

• Detect whether the query is geographic or not.
• Extract the WHERE component of the query.
• Extract the GEO-RELATION (from a set of predefined types), if present.
• Extract the WHAT component of the query and classify it as one of the types MAP, YELLOW PAGE, or INFORMATION.
• Extract the coordinates (LAT-LONG) of the WHERE component.

This process sometimes involves a disambiguation step. As an example, Table 1 shows the information that has to be extracted from the query "Discount Airline Tickets to Brazil".

Table 1: Information to be extracted from the query "Discount Airline Tickets to Brazil".

    Field          Content
    LOCAL          YES
    WHAT           Discount Airline Tickets
    WHAT-TYPE      INFORMATION
    WHERE          Brazil
    GEO-RELATION   TO
    LAT-LONG       -10.0 -55.0

In this paper we present the overall architecture of our Geographical Query Parsing system and briefly describe its main components. We also present the experiments, results, and conclusions in the context of the GeoCLEF 2007 GeoQuery pilot task.

2 System Description

2.1 Overview

The system architecture has two main phases that are performed sequentially: Topic Analysis and Question Classification.

2.2 Topic Analysis

The Topic Analysis phase has two main components: a Linguistic Analysis and a Geographical Analysis.

2.2.1 Linguistic Analysis

This process extracts lexico-semantic and syntactic information using the following set of Natural Language Processing tools: i) TnT, a statistical POS tagger [1]; ii) the WordNet lemmatizer (version 2.0); iii) a Maximum Entropy based NERC trained on the CoNLL-2003 shared task English data set; and iv) Spear (http://www.lsi.upc.edu/~surdeanu/spear.html), a modified version of the Collins parser, which performs full parsing and robust detection of verbal predicate arguments [2] (limited to three predicate arguments: agent, direct object (or theme), and indirect object (benefactive or instrument)).

We pre-processed the data set of 800,000 English queries from a web search engine with these linguistic tools to obtain the following data structures:

• Sent, which provides lexical information for each word: form, lemma, POS tag (Penn Treebank (PTB) tag set for English), semantic class of Named Entity, list of EuroWordNet (EWN) synsets and, whenever possible, the verbs associated with the actor and the relations between some locations (especially countries) and their demonyms (e.g. nationalities).

• Sint, composed of two lists, one recording the syntactic constituent structure of the question (basically nominal, prepositional, and verbal phrases) and the other collecting the dependencies and other relations between these constituents.

• Environment, which represents the semantic relations that hold between the different components identified in the question text. These relations are organized into an ontology of about 100 semantic classes and 25 relations (mostly binary) between them. Both classes and relations are related by taxonomic links. The ontology tries to reflect what is needed for an appropriate representation of the semantic environment of the question (and the expected answer). The environment of the question is obtained from Sint and Sent; a set of about 150 rules was built to perform this task. Refer to [4] for details.
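To make these three structures concrete, the following minimal sketch shows how the analysis of the example query "Discount Airline Tickets to Brazil" could be encoded as Prolog facts. The predicate names and argument layouts (sent/6, chunk/3, environment/1) are our own illustrative assumptions, not the actual TALP encoding; the POS tags are illustrative, the synset lists are omitted, and only the semantic relation list is taken from Figure 1 in Section 2.3.

    % Illustrative encoding (assumed; not the actual TALP format).
    % sent(Position, Form, Lemma, POS, NEClass, Synsets): one fact per token.
    sent(1, 'Discount', discount, 'NN',  none,     []).
    sent(2, 'Airline',  airline,  'NN',  none,     []).
    sent(3, 'Tickets',  ticket,   'NNS', none,     []).
    sent(4, to,         to,       'TO',  none,     []).
    sent(5, 'Brazil',   brazil,   'NNP', location, []).

    % Sint constituents: chunk(Category, FirstToken, LastToken).
    chunk(np, 1, 3).   % nominal chunk "Discount Airline Tickets"
    chunk(pp, 4, 5).   % prepositional chunk "to Brazil"

    % Environment: the semantic relations shown in Figure 1.
    environment([entity(3), mod(3,1), quality(1), mod(3,2), entity(2),
                 i_en_proper_country(5)]).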
2.2.2 Geographical Analysis

The Geographical Analysis is applied to the Named Entities of the queries that the NERC module has classified as LOCATION or ORGANIZATION. A Geographical Thesaurus is used to extract geographical information about these Named Entities. This component was built by joining four gazetteers whose entries contain places together with their geographical class, coordinates, and other information:

1. GEOnet Names Server (GNS, http://gnswww.nima.mil/geonames/GNS/index.jsp): a gazetteer with worldwide coverage, excluding the United States and Antarctica, containing 5.3 million entries.

2. Geographic Names Information System (GNIS, http://geonames.usgs.gov/geonames/stategaz): contains 2.0 million entries about geographic features of the United States and its territories. We used a subset of 39,906 entries with the most important geographical names.

3. GeoWorldMap World Gazetteer (Geobytes Inc., http://www.geobytes.com/): a gazetteer with approximately 40,594 entries covering the most important countries, regions, and cities of the world, with geographical coordinates.

4. World Gazetteer (http://www.world-gazetteer.com): a gazetteer with approximately 171,021 entries of towns, administrative divisions, and agglomerations, with their features and current population. From this gazetteer we added only the 29,924 cities with more than 5,000 inhabitants.

A subset of the most important features of this thesaurus was selected manually, comprising 46,132 places (including all kinds of geographical features: countries, cities, rivers, states, etc.). This subset of important features is used to decide whether a query is geographical or not.
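As an illustration of how this resource can be used, the following minimal sketch decides whether a query is geographical and retrieves the information needed for the WHERE and LAT-LONG fields. All predicate names here are assumptions made for illustration: gazetteer/4 stands in for the merged thesaurus, and important_place/1 for the manually selected subset of 46,132 places.

    :- use_module(library(lists)).   % for member/2

    % gazetteer(Name, FeatureType, Latitude, Longitude): merged thesaurus entry.
    gazetteer(brazil,
              'administrative areas@@political areas@@countries',
              -10.0, -55.0).

    % Member of the manually selected subset of important places (one shown).
    important_place(brazil).

    % A query is geographical if one of its LOCATION/ORGANIZATION
    % named entities is an important place.
    geographical_query(NamedEntities) :-
        member(NE, NamedEntities),
        important_place(NE), !.

    % Feature type and coordinates for a WHERE candidate.
    where_info(NE, FeatureType, Lat, Long) :-
        gazetteer(NE, FeatureType, Lat, Long).

For the example query, geographical_query([brazil]) succeeds, and where_info(brazil, T, Lat, Long) returns the feature type together with the coordinates -10.0 -55.0 used for the LAT-LONG field.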
2.3 Question Classification

The query classification task is performed through the following steps:

• The query is linguistically preprocessed (as described in the previous subsection) to obtain its lexical, syntactic, and semantic content. Figure 1 shows the result of this process for the example above. What is relevant in the example is the fine-grained classification of 'Brazil' as a country; the existence of taxonomic information, both for the location type (administrative areas@@political areas@@countries) and the location content (America@@South America@@Brazil); the coordinates (-10.0 -55.0, useful for disambiguating the location and for restricting the search area); and the existence of a shallow syntactic tree consisting of simple tokens and chunks, in this case built from the composition of two chunks, a nominal chunk ('Discount Airline Tickets') and a prepositional one ('to Brazil').

    Query: "Discount Airline Tickets to Brazil"
    Semantic: [entity(3),mod(3,1),quality(1),mod(3,2),entity(2),i_en_proper_country(5)]
    Linguistic: Brazil Brazil NNP LOCATION
    Geographical: America@@South America@@Brazil@@-10.0 -55.0
    Feature type: administrative areas@@political areas@@countries

Figure 1. Semantic and Geographical Content of GQ-38.

• Over the Sint structure, a DCG-like grammar of about 30 rules, developed manually from the GeoQuery sample data and the GeoCLEF 2006 query set, is applied to obtain the list of topics, each topic represented by a triple that includes its initial and final positions. A set of features (consultation operations over chunks or tokens, and predicates on the corresponding Sent structures) is used by the grammar. The following features are available:

– chunk features: category, inferior, superior, descendants.
– token features: num, POS, word form, lemma, NE 1 (general), NE 2 (specific).
– token semantics: synsets, and concrete and generic Named Entity type predicates (Named Entity types include: location, person, organization, date, entity, property, magnitude, unit, cardinal point, and geographical relation).
– head-of-chunk features: num, POS, word, lemma, first NE, second NE.
– head-of-chunk semantic features.
– left corner of the chunk: num, POS, word form, lemma, NE 1 (general), NE 2 (specific).
– left corner of the chunk semantics: WordNet synsets.

See Figure 2 for a sample rule. The rule can be paraphrased as follows: a sentence is composed of two chunks followed by a gap. The first chunk is of type 'npb' or 'np', i.e. it is a nominal phrase, basic or complex; its head cannot be a Named Entity; and the limits of the chunk provide the limits of the topic. The second chunk is a 'pp' and it provides the list of locations.

    parse_sentence(1,DS,CT,CNES) -->
        cc(DS,[(cc,[npb,np]),(hne1,[nil]),(ci,[LI]),(cs,[LS])],[],(1,1)),
        {CT=[(LI,LS)]},
        cc(DS,[(cc,[pp]),(cd,[CD])],[],(1,1)),
        {parse_pp(_,DS,CNES,CD,[])},
        parse_gap(_,DS).

Figure 2. Example of DCG rule.

• Finally, from the result of the previous step, several rule sets are in charge of extracting: i) LOCAL, ii) WHAT and WHAT-TYPE, iii) WHERE and GEO-RELATION, and iv) LAT-LONG data. In total there are four rule sets with 25 rules. Figure 3 presents an example of a WHAT rule. The rule selects from the list of topics one containing a generic location (e.g. the noun 'city'); the selected topic is assigned to WHAT and the WHAT-TYPE is set to 'Map'.

    classify_question_topic(X,WHAT,'Map') :-
        sentence_2(X,(_,CT,_,_)),
        CT \== [],
        sentence_1(X,S),
        member((LC1,LC2),CT),
        range(LC1,LC2,R),
        member(LC,R),
        nth(LC,S,Tk1),
        is_generic_location(Tk1,_),
        concatenate_words_pos(X,R,WHAT), !.

Figure 3. Example of WHAT classification rule.

3 Experiments and Results

We performed a single experiment on the GeoQuery 2007 data set. The experiment consisted of extracting the requested GeoQuery data from the set of 800,000 queries. The results of the TALP system at the GeoCLEF 2007 GeoQuery Geographical Query Parsing task for English are summarized in Table 2. The table reports the following IR measures for the run: Precision, Recall, and F1.
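For reference, F1 is the harmonic mean of precision (P) and recall (R):

    F1 = 2 · P · R / (P + R)

With P = 0.222 and R = 0.249, this gives F1 = 0.110556 / 0.471 ≈ 0.235, matching the reported value.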
For the evaluation, a set of 500 queries, chosen to be representative of the whole query set (800,000), was manually labeled. The submitted results were evaluated manually using a strict criterion under which a result counts as correct only if all of its evaluated fields are correct (one field was ignored in the evaluation). Our run achieved a Precision of 0.222, a Recall of 0.249, and an F1 of 0.235.

Table 2: TALPGeoIR results at GeoQuery 2007.

    Team Name   Precision   Recall   F1
    TALP        0.222       0.249    0.235

4 Conclusions

This is our first approach to a geographical query parsing task. Our system for the GeoCLEF 2007 GeoQuery pilot task is based on deep linguistic analysis and geographical knowledge. Although further evaluations are needed to compare the system with others, our approach seems feasible for the task. As future work we propose the following improvements to the system: i) a more detailed evaluation of each subtask, and ii) the application of more sophisticated geographical disambiguation algorithms.

Acknowledgments

This work has been supported by the Spanish Research Department (TEXT-MESS, TIN2006-15265-C06-05). Daniel Ferrés is supported by a UPC-Recerca grant from the Universitat Politècnica de Catalunya (UPC). The TALP Research Center is recognized as a Quality Research Group (2001 SGR 00254) by DURSI, the Research Department of the Catalan Government.

References

[1] T. Brants. TnT – A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied NLP Conference (ANLP-2000), Seattle, WA, United States, 2000.

[2] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.

[3] D. Ferrés, A. Ageno, and H. Rodríguez. The GeoTALP-IR System at GeoCLEF-2005: Experiments Using a QA-based IR System, Linguistic Analysis, and a Geographical Thesaurus. In C. Peters, F. C. Gey, J. Gonzalo, G. J. F. Jones, M. Kluck, B. Magnini, H. Müller, and M. de Rijke, editors, CLEF, volume 4022 of Lecture Notes in Computer Science. Springer, 2005.

[4] D. Ferrés, S. Kanaan, A. Ageno, E. González, H. Rodríguez, M. Surdeanu, and J. Turmo. The TALP-QA System for Spanish at CLEF 2004: Structural and Hierarchical Relaxing of Semantic Constraints. In C. Peters, P. Clough, J. Gonzalo, G. J. F. Jones, M. Kluck, and B. Magnini, editors, CLEF, volume 3491 of Lecture Notes in Computer Science, pages 557–568. Springer, 2004.