UAIC: Participation in LAGI Task
Adrian Iftene
UAIC: Faculty of Computer Science, “Alexandru Ioan Cuza” University, Romania
adiftene@info.uaic.ro
Abstract. The LogCLEF track was launched in 2009 with aim to analyse and
classify the user’s queries and had two tasks LAGI (Log Analysis and
Geographic Query Identification) and LADS (Log Analysis for Digital
Societies). In this edition from 2009, we built a system in order to participate in
the LAGI task. The system uses GATE or Wikipedia like external resources in
order to identify geographical entities in user’s queries. Because, the results
obtained using these resource are comparable, the main advantage of using
GATE resources comes from the short duration of execution in comparison
with using of Wikipedia resources. A brief description of our system is given in
this paper.
1 Introduction
LogCLEF1 deals with the analysis of queries as expression of user behavior. The goal
is the analysis and classification of queries in order to improve search systems.
LogCLEF had two tasks:
• Log Analysis and Geographic Query Identification (LAGI): The recognition
of the geographic component within a query stream is a key problem for
geographic information retrieval (GIR).
• Log Analysis for Digital Societies (LADS): This task used logs from The
European Library (TEL) and had intention to analyze user behavior with a
focus on multilingual search.
Our group sends runs only for LAGI task. The aim of the LAGI task is to identify
geographic elements in search log queries. For that competitors’ received two sets of
logs:
1. Tumba! - a Portuguese web search engine
2. The European Library (TEL) - on line search for materials in various
libraries in Europe.
The way in which we built the system for LAGI track is presented in Section 2,
while Section 3 presents the runs submitted. Last Section presents conclusions
regarding our participation in LAGI 2009.
1 LogCLEF: http://www.uni-hildesheim.de/logclef/
2 UAIC System for LAGI
Our system searches only in the subset in English logs (which are the majority of the
logs). Organizers force participating groups to respect some rules. The rules respected
by our system were:
1. A query is a geographical query if and only if it is bounded geographically.
2. A place term can be any country, a city or town, mountain, province or region
from GATE (Cunningham et al., 2001) or if it is described as a place in
Wikipedia (Portuguese Wikipedia for Tumba! and English Wikipedia for our
English subset of TEL).
3. A candidate place term can map to more than one possible meaning in
Wikipedia.
4. A place term can occur in a title (of a book, movie, team, etc.), but the title
itself (if a different text span from the place) is not to be tagged.
5. Capitalization (upper and lower case) in the query is ignored, as it is used
inconsistently in the queries.
6. Wildcards ('*') are ignored.
7. If some words of a query can be interpreted as forming a phrase, this will be
preferred over interpreting those words as isolated words put in the same
query.
In order to participate in LAGI, additional to using Wikipedia resources offered by
organizers, we built another resource with geographical name entities starting from
GATE resources. This resource was loaded by our program in cache and it is used
after that in identification of geographical resources. The Figure 1 presents the system
architecture.
GATE Portuguese English
resources Wikipedia Wikipedia
Main Results
Test Data
Module
Figure 1: UAIC system used in LAGI
The used resources and the main module are presented below.
2.1 Test Data
The test data consists from two files, one from Tumba! with 152 entries and one from
TEL with 108 entries. In comparison the training files had 7 entries for Tumba! file
and 9 entries for TEL file. The format for input data and for requested output data are
presented below in Table 1 and in Table 2 and are taken from TEL training data.
Table 1: TEL input data for LAGI task
875336 & 5431 & ("central europe")
828587 & 12840 & ("sicilia")
902980 & 482 & (creator all "casanova")
196270 & 5365 & ("casanova")
528968 & 190 & ("iceland*")
470448 & 8435 & ("iceland")
712725 & 5409 & ("cavan county ireland 1870")
Table 2: TEL output data for LAGI task
875336 & 5431 & ("central europe")
828587 & 12840 & ("sicilia")
902980 & 482 & (creator all "casanova")
196270 & 5365 & ("casanova")
528968 & 190 & ("iceland*")
470448 & 8435 & ("iceland")
712725 & 5409 & ("cavan county ireland 1870")
How we can see in above tables the aim is to add tags to user
queries for geographical elements.
2.2 Resources
In order to build our resource with geographical entities we start from GATE
resources and additional, we search on the web in order to add new similar entities. In
separated runs, we load our resources or resources provide by organizers: page titles
from Portuguese and English Wikipedia.
GATE
From GATE (Cunningham et al., 2001) we use the following sets of named
entities: cities, countries, small regions, regions, mountains and provinces. In the end
the total number of entities used from GATE resources was 146.581 and the size on
disk was around 4.47 Mb. In Table 3 are examples of GATE resources.
Table 3: Examples of GATE resources
Type Examples Entities Number
City Aaccra, Aalborg, Aarhus, Ababa, 143.487
Abadan, Abakan, Aberdeen, Abha, …
Country Afghanistan, Afrique, Albania, Albanie, 465
Alderney, Algeria, …
Mountain Alps, Andes, Himalaya, Pyrenees, 5
Snowdonia
Province Aana, Aargau, Abaco, Abruzzi, Abyan, 2.411
Aceh, Acores, Acquaviva, …
Region Africa, Algarve, Antarctica, Ashmore 213
and Cartier Islands, Asia, …
Wikipedia
Organizers offered to the participants’ two files with titles from Portuguese and
English Wikipedia. In comparison with GATE, the number of entries from file with
Portuguese titles from Wikipedia was 934.395 and the size on disk was 35.3 Mb, and
the number of entries from file with English titles from Wikipedia was 6.996.744 and
the size on disk was 282 Mb. In Table 4 are few lines from these files.
Table 4: Examples from Wikipedia Resources
Language Examples
English
AmericanSamoa
AppliedEthics
AccessibleComputing
Anarchism…
Portuguese Astronomia
Astronomia e astrofísica
América Latina
Albino Forjaz de Sampaio…
2.3 Main Module
The main module load successively the GATE or Wikipedia resources in cache using
a hash map in which the key is named entity itself, and the value is the number of
words from initial named entity. After that using this cache we will try to identify in
TEL and Tumba test data the geographical entities.
Resources Loading
From beginning when we load our geographical resources in cache, we transform
all characters from these entities in lower case. Additional, we split every name entity
in components words and load also these separated words in our cache, but we specify
the number of words from initial entity (in this way we will know if this key from our
hash map come from a simple name entity or from a composed name entity). We will
see how we will use this value when we try to identify geographical entities in user
query. Even if we lose much time at resource loading, after that the operations of
identification are very fast regardless of the number of user’s queries that we want to
process.
Test Data Pre-Processing
In order to identify in test data geographical entities from our cache, we perform
pre-processing of test data. The most important steps are:
1) In first step we parse the current line from test data and extract only the
relevant text (actually this is the initial user query).
2) Second step has the aim to ignore special characters like +, (, ), *, “, ” or
white spaces (tab, space, return) in initial user query.
3) Third step transform all characters extracted in previous step in lower case
characters. The result is a new form of user query, called from now new
query.
Geographical Entities Identification
The most important operation of main module is identification of the geographical
entities in this new form of user query. From now the main question is: How we
identify the geographical entities in this new query?
Initial we try to see if we have in our cache the new query itself. If YES, then we
finished the process of identification of geographical entities and we skip to the next
line in test data file. This is the case of the following line from Tel test data:
4752 & 11759 & ("portugal)"
for which we have in our cache the all user query which is “Portugal” from GATE
countries file.
If NO, then we try to split new user query in separated words if this is possible.
When we have only one word in the new query we automatically skip to the next line
in test data file. When we have more than one word, we apply the following steps:
1. At step 1 every individual word is searched in hash map. If current word comes
from a simple named entity we simply add “place” tags to it. This is the case of
below line:
4892 & 5670 & ("climbing on the Himalaya and other
mountain ranges)"
for which we have in our cache the separated word “Himalaya” from GATE
mountains file.
2. At step 2 for every word searched in hash that comes from a composed named
entity we look successively in it left and it right in order to combine more words
with the same value in hash map.
2.1. If we have like neighbors these types of hash keys then we create a common
tag. This is the case of the following line:
13128 & 11516 & ("peter woods)"
for which we have in our cache separated keys “Peter” and “Woods” from
GATE cities file (first one from “Peter Tavy” and second one from “Harper
Woods”). Because both have the same value in hash (2 which represents the
number of words from initial entity) we create a common tag for both words.
2.2. If we haven’t like neighbors these types of hash keys, then we eliminate the
all these tags.
During all previous steps the stop words are ignored.
3 Submitted Runs
We submitted two pairs of runs: one in which the main module loaded GATE like
external resource, and one in which Portuguese or English Wikipedia are loaded like
external resources.
Table 5: UAIC Runs
Test Data Resource Rcount Hcount Match Prec Rec Fmeas
TEL GATE 21 29 7 24.14 33.33 28.00
TEL Wikipedia 21 99 16 16.16 76.19 26.67
Tumba GATE 35 69 18 26.09 51.43 34.62
Tumba Wikipedia 35 147 13 8.84 37.14 14.29
Differences between results obtained with GATE and results obtained with
Wikipedia are the following:
• With GATE we mark like geographical entities a lower number in
comparison with Wikipedia (29 in comparison with 99 for TEL test data, and
69 in comparison with 147 for Tumba test data) while the correct marches
are comparable and this is the reason for higher precisions in GATE case
(24.14% in comparison with 16.16% for TEL test data, and 26.09% in
comparison with 8.84% for Tumba test data);
• Regarding correct matches, in case of TEL test data, Wikipedia offered more
correct matches and in this case the recall is higher (76.19% in comparison
with 33.33%). Amazingly, in case of Tumba test data the number of correct
matches is higher again for GATE case (18 instead of 13) and obvious the
recall is higher also in this case (51.43% in comparison with 37.14%).
Because in this case precision and recall are both higher, in the end we have
the greater F-measure: 34.62%.
4 Conclusions
This paper presents the UAIC system which took part in the LogCLEF 2009
competition in LAGI task. The system uses like external resources GATE files and
two files offered by organizers with titles from Portuguese and English Wikipedia.
Initial we load external resources in cache using a hash map. After that, at pre-
processing part we obtain a new query from the initial user query. This new query is
used by main module in order to identify geographical entities. In this process we
searches in cache the initial query, and after that words components, with aim to have
in the end the most comprehensive succession of geographical entities.
The results show how the results obtained with GATE or with Wikipedia like
external resources are comparable like quality.
Acknowledgements
The author would like to thank to the students Victor Chircu and Alexandru Cristea
and their colleagues from group 4A, second year, for their help and support at
different stages of system development.
References
1. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: an architecture for
development of robust HLT applications. In ACL '02: Proceedings of the 40th Annual
Meeting on Association for Computational Linguistics, 168--175, Association for
Computational Linguistics, Morristown, NJ, USA (2001)