=Paper= {{Paper |id=Vol-1171/CLEF2005wn-GeoCLEF-Kornai2005 |storemode=property |title=MetaCarta at GeoCLEF 2005 |pdfUrl=https://ceur-ws.org/Vol-1171/CLEF2005wn-GeoCLEF-Kornai2005.pdf |volume=Vol-1171 |dblpUrl=https://dblp.org/rec/conf/clef/Kornai05a }} ==MetaCarta at GeoCLEF 2005== https://ceur-ws.org/Vol-1171/CLEF2005wn-GeoCLEF-Kornai2005.pdf
                   MetaCarta at GeoCLEF 2005
                                         András Kornai
                                         MetaCarta Inc.
                                     kornai@metacarta.com


                                                                         In Memoriam Erik Rauch
                                             Abstract
     In this paper we divide the processing steps required for the GeoCLEF task into two
     parts: those that are likely common to all participants and those that are specific to the
     MetaCarta system. After analyzing the 2005 task we conclude that it has surprisingly
     little geographic content.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2 [Database
management]: H.2.4 Textual databases; H.2.8 Spatial databases and GIS; I.5 [Pattern Recog-
nition]: I.5.4 Text processing

General Terms
Measurement, Performance, Experimentation

Keywords
Information Extraction, Information Retrieval, Question Answering


0    Introduction
MetaCarta participated in the GeoCLEF 2005 conference on a limited basis: only the English data
(Glasgow Herald and LA Times) was used; the German material was only considered in passing.
We came away from the evaluation with the impression that this was a keyword search task with
little geography-specific ability required from the participating systems.
     In Section 1 we describe the processing steps we used, with special emphasis on whether we
consider any given step manual or automatic. In Section 2 we describe the MetaCarta results,
and in Section 3 we consider the larger issue of whether the query texts require true geographical
capabilities or are answerable by generic systems as well.


1    System description
The MetaCarta system was not configured or prepared for this evaluation in any way that differs
from the standard setup: no changes were made to the underlying codebase, and no parameters
were tuned. The GeoCLEF topics as they stand are not suitable for input to the MetaCarta
system; we performed several mechanical conversion steps from one to the other.
   First, we ran the 25 topics through the MetaCarta tagger. On the 124 geographic entities we
had a precision of 100% (we had no false positives) and a recall of 96.8%: we missed Scottish
Islands (twice), Douglas, and Campeltown. This suggests two evaluation paths: on the discard
path missed entities are treated as plain (nongeographic) text, and on the pretend path we
pretend the system actually found these.
    Second, we removed meta-guidance such as find information about or relevant documents will
describe since the relevant documents will not have the word “relevant” or “document” in them.
This step is performed by the defluff.sed script (included with our submission) which, arguably,
is closer to manual fluff removal than automated conversion. However, MetaCarta does not en-
counter the fluffy question style in any context outside GeoCLEF, and it makes no sense for
MetaCarta to develop a fully automated module for the defluffing task.
    Third, we removed stopwords (defined as everything that has more than 1% of the frequency of
the word “the” in a terabyte corpus we used for frequency analysis). While the script defunc.sed
may look ad hoc, it was itself generated by a script based on the “1% of the” criterion and as such
we consider this step fully automated.
    Fourth, we removed geographic metawords in a manner similar to defluffing: when the task
description asks for countries involved in the fur trade the word “country” will not be in the docs.
The degeo.sed script is also included with our submission.
    We believe that these steps (though not necessarily in this order) are generic in the sense
that every geographic IR system must perform them one way or another. After the generic steps,
the topics (only title and desc fields kept) look as follows (autodetected geographic entities in
boldface):

     GC001 Shark Attacks Australia California shark attacks humans
     GC002 Vegetable Exporters Europe exporters fresh dried frozen vegetables
     GC003 AI Latin America Amnesty International human rights Latin America
     GC004 Actions against fur industry Europe USA protests violent acts against fur industry
     GC005 Japanese Rice Imports reasons consequences first imported rice Japan
     GC006 Oil Accidents Birds Europe damage injury birds caused accidental oil spills pollution
     GC007 Trade Unions Europe differences role importance trade unions European
     GC008 Milk Consumption Europe milk consumption European
     GC009 Child Labor Asia child labor Asia proposals eliminate improve working conditions
     children
     GC010 Flooding Holland Germany flood disasters Holland Germany 1995
     GC011 Roman UK Germany Roman UK Germany
     GC012 Cathedrals Europe particular cathedrals Europe United Kingdom Russia
     GC013 Visits American president Germany visits President Clinton Germany
     GC014 Environmentally hazardous Incidents North Sea environmental accidents hazards
     North Sea
     GC015 Consequences genocide Rwanda genocide Rwanda impacts
     GC016 Oil prospecting ecological problems Siberia and Caspian Sea Oil petroleum devel-
     opment related ecological problems Siberia Caspian Sea
     GC017 American Troops Sarajevo Bosnia Herzegovina American troop deployment
     Bosnia Herzegovina Sarajevo
     GC018 Walking holidays Scotland walking holidays Scotland
      GC019 Golf tournaments Europe golf tournaments held European

      GC020 Wind power Scottish Islands electrical power generation using wind power islands
      Scotland
      GC021 Sea rescue North Sea rescues North Sea
      GC022 Restored buildings Southern Scotland restoration historic buildings southern Scot-
      land
      GC023 Murders violence South-West Scotland violent acts murders South West part Scot-
      land
      GC024 Factors influencing tourist industry Scottish Highlands tourism industry Highlands
      Scotland factors affecting
      GC025 Environmental concerns around Scottish Trossachs environmental issues concerns
      Trossachs Scotland

                                 Table 1: Preprocessed Queries

    Note how well the results of stopword removal from the desc section approximate the title
section: aside from the last three topics, (where the desc section is really narrative) the two are
practically identical. Therefore, our first run is based on titles only (see below).

1.1    MetaCarta-specific steps
The natural mode of operation for the MetaCarta system is to use the map as a filter: select a
region of interest, such as the Trossachs, and type in some keywords such as environmental or
pollution and see what documents are displayed as a result.
    To emulate this operation, we created bounding boxes for each of the regions in the topics.
While we didn’t get around to fully automating the process of querying the database for polygons
and creating the bounding boxes automatically, there is nothing in this step that requires human
intervention and for the purposes of submission we consider this automated. The following table
was used:

Asia 25.0 179.9 6.0
Australia 112.9 159.1 -9.1 -54.7
Bosnia Herzegovina 15.7 19.6 45.2 42.5
California -124.4 -114.1 42.0 32.5
Caspian Sea 47.0 54.0 47.0 36.0
Europe -11.0 60.0 72.00 32.00
Germany 5.8 15.0 55.0 47.2
Holland 3.3 7.2 53.5 50.7
Japan 122.9 153.9 45.5 24.0
Latin America -118.0 -35.0 32.0 -55.0
North Sea -4.0 8.0 65.0 51.0
Russia 26.0 60.0 72.0 41.1
Rwanda 28.8 30.8 -1.0 -2.8
Scotland -8.0 0.0 61.0 55.0
Scottish Highlands -8.0 -2.0 59.3 56.0
* Scottish Islands -8.0 0.0 61.0 56.0
Siberia 60.0 179.9 82.0 48.0
* Trossachs -4.5 -4.25 56.5 56.0
United Kingdom -8.6 2.0 60.8 49.0
United States -125.0 -66.0 49.0 26.0
                                    Table 2: Bounding Boxes

    Items marked by * did not have a bounding box in the database and reflect manual assignment,
a fact that is reflected in our notion of discard versus pretend evaluation.
    Given a fixed collection of documents, such as the English dataset provided for GeoCLEF, a
MetaCarta query has three parameters: maxdocs is the maximum number of document IDs we
wish to see, typically 10 for “first page” results, bbleft bbright bbtop bbbottom are longitudes
and latitudes for the bounding box, and an arbitrary number of keywords, implicitly ANDed
together.


2    Results
In run 0 we only took the title words, the automatically detected regions, created a query as
described in 1.1 with maxdocs set at 200 (since the system returns results in rank order, to create
a first page one can just apply head to the result set). When the query implied logical OR rather
than AND, we run the queries separately and sorted the results together by relevance.
    Run 0 mimicked a true geographic search where the geographic portion of the query is
input through the map interface. Run 1 is a true keyword search where everything (including
geographic words) is treated just as a keyword (so the discard and the pretend paths coincide).
Running trec eval produces the summaries:

                                def           Run 0 Run 1
                                num q              22       15
                                num ret          1494     1002
                                num rel           895      765
                                num rel ret       289      132
                                map            0.1700   0.1105
                                R-prec         0.2155   0.1501
                                bpref          0.1708   0.1148
                                recip rank     0.6748   0.6522
                                ircl prn.0.00  0.6837   0.6633
                                ircl prn.0.10  0.4178   0.2904
                                ircl prn.0.20  0.3443   0.2188
                                ircl prn.0.30  0.2977   0.1700
                                ircl prn.0.40  0.1928   0.1103
                                ircl prn.0.50  0.0971   0.0676
                                ircl prn.0.60  0.0435   0.0365
                                ircl prn.0.70  0.0261   0.0109
                                ircl prn.0.80  0.0130   0.0109
                                ircl prn.0.90  0.0000   0.0109
                                ircl prn.1.00  0.0000   0.0089
                                P5             0.4455   0.3467
                                P10            0.3182   0.2333
                                P15            0.2667   0.1867
                                P20            0.2500   0.1867
                                P30            0.2182   0.1644
                                P100           0.1141   0.0740
                                P200           0.0636   0.0410
                                P500           0.0263   0.0176
                                P1000          0.0131   0.0088
                               Table 3: Results on Runs 0 and 1
3     Conclusions
As can be seen from comparing the two runs, the non-geographic and the geographic results are
remarkably close, which supports the conclusion we arrived at from an informal, manual assessment
that very little, if any, geographic specialization is required on these tasks.
    First, the selection of geographic entities is limited, and most of them fit in what MetaCarta
calls “Tier 1”, a small set (2350 entries) of core place names whose approximate locations are
known to everyone with a high school education. With the possible exception of the Scottish
Islands (a class better defined by listing than by coherent geography) and the Trossachs (whose
boundaries are clearly explained in the narrative task) a system with a small post-hoc gazetteer
table could handle most of the questions: the only entries missing from the Tier 1 gazetteer are
Argyll, Ayr, Callander, Loch Achray, Loch Katrine, Loch Lomond, Perthshire, Scottish Islands
and Trossachs, and these do not even appear in the non-narrative sections.
    Given that the problem of avoiding false positives is increasingly hard as we add more and
more entities to the gazetteer, a task that encourages the use of trivial gazetteers will not serve
the overall evaluation goals well. As it is, MetaCarta has an F-measure of 98.36% which would be
quite impressive, were it produced on a more realistic test set.
    Second, even within this limited set, one has the feeling (perhaps unsubstantiated, the guide-
lines didn’t address the issue) that many of the toponyms are used metonymically. In particular,
Europe seems to refer to the EU as a political entity rather than to the continent (see in particular
items 4 and 8).
    Because the bar is set so low, it is expected that almost all systems will pass the geographic
hurdle in GeoCLEF, and their comparison will amount to a comparison of their non-geographic
capabilities. To be sure, the issues that come to the fore are classical, and fascinating issues in
Information Retrieval:

    1. stopword filtering
    2. stemming (morphological analysis)
    3. selecting keywords for a concept (vocabulary enrichment)
    4. good handling of disjuncts and negation (Booleans)
    5. fluff removal

    There is little doubt that all of these issues are worthy of formal evaluation, though not
necessarily as part of an evaluation focusing on geographic IR.
    1. We believe that our stopword filtering may be overly generous (there are 75 words that
meet the “more than 1% of the” criterion) inasmuch as it includes content words like people, part,
two, or time, but the task is clearly not geared to test this further.
    2. We have purposely refrained from stemming, which decreased our overall scores signifi-
cantly (e.g. on GC001 we missed LA041994-0146 “WOMEN BITTEN BY SHARK HAD WON
LEUKEMIA BATTLE” since it has the word attacked rather than attack) but did not affect our
main point, which is to compare geographically oriented search to true keyword search.
    3. Where conjunction provides too few results, a reasonable approach would be to decrease
the number of keywords: e.g. topic GC006 finds nothing with keywords oil accident birds
so reducing oil accident to oil would be helpful. The queries display a clear preference for
semantic IR, asking e.g. for “consequences”, “concerns”, and other highly abstract concepts
generally considered beyond the ken of mainstream IR techniques. Ideally, one’s system should
understand the underlying concepts, but in reality even the human judges are on slippery ground
(for example, we do not believe that documents such as LA083194-0133 are actually responsive
to GC015) and of course MetaCarta has no software to make a semantic decision. While this is
clearly a weakness in the state of the art, again we feel the issue has little to do with geography.
    4. To automate booleans a reasonable blind approach would be to take a random word from
each conjunct and try the combinations e.g. given oil prospecting and ecological problems form
“oil ecological”, “oil problems”, “prospecting ecological”, and “prospecting problems” – we have
not implemented this and handle the cases of conjunction in toponyms North and South America
by a mechanism that was not exercised by the test. The descriptive queries in particular offer
a fascinating glimpse into other problems that are viewed as important research topics such as
negation: “Reports regarding canned vegetables, vegetable juices or otherwise processed vegetables
are not relevant”. We do not deny the importance of these problems, but we doubt the wisdom
of burdening this task with these.
    5. As we have stated earlier, our fluff removal (including geographic fluff) was post hoc, and we
do not believe that our scripts would generalize particularly well for future tests of a similar nature.
However, again we would suggest reengineering the task rather than building infrastructure for
the way it stands now.
    However fascinating these problems might be, tacking “in Rwanda” on a question does not make
it truly geographic, in fact there is reason to believe that the easy part of geography (continents
and countries) is not any different from any other topic hierarchies. GeoCLEF 2005 did not go
beyond the easy parts; let’s hope next time will be better.


Acknowledgements
Special thanks to Bradley Thompson and the MetaCarta team, very much including Erik Rauch,
whose sudden death fills us all with sorrow.