<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">MetaCarta at GeoCLEF 2005</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">András</forename><surname>Kornai</surname></persName>
							<email>kornai@metacarta.com</email>
							<affiliation><orgName type="institution">MetaCarta Inc.</orgName></affiliation>
						</author>
						<note type="dedication">In memoriam Erik Rauch</note>
						<title level="a" type="main">MetaCarta at GeoCLEF 2005</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">86D75C4DEB21728E7418E394C7EC1D5B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T00:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing</term>
					<term>H.3.3 Information Search and Retrieval</term>
					<term>H.3.4 Systems and Software</term>
					<term>H.3.7 Digital Libraries</term>
					<term>H.2 [Database management]: H.2.4 Textual databases</term>
					<term>H.2.8 Spatial databases and GIS</term>
					<term>I.5 [Pattern Recognition]: I.5.4 Text processing</term>
					<term>Measurement, Performance, Experimentation</term>
					<term>Information Extraction, Information Retrieval, Question Answering</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we divide the processing steps required for the GeoCLEF task into two parts: those that are likely common to all participants and those that are specific to the MetaCarta system. After analyzing the 2005 task we conclude that it has surprisingly little geographic content.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="0">Introduction</head><p>MetaCarta participated in the GeoCLEF 2005 evaluation on a limited basis: only the English data (Glasgow Herald and LA Times) was used; the German material was considered only in passing. We came away from the evaluation with the impression that this was a keyword search task with little geography-specific ability required from the participating systems.</p><p>In Section 1 we describe the processing steps we used, with special emphasis on whether we consider any given step manual or automatic. In Section 2 we describe the MetaCarta results, and in Section 3 we consider the larger issue of whether the query texts require true geographical capabilities or are answerable by generic systems as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">System description</head><p>The MetaCarta system was not configured or prepared for this evaluation in any way that differs from the standard setup: no changes were made to the underlying codebase, and no parameters were tuned. The GeoCLEF topics as they stand are not suitable for input to the MetaCarta system, so we performed several mechanical steps to convert them into MetaCarta queries.</p><p>First, we ran the 25 topics through the MetaCarta tagger. On the 124 geographic entities we had a precision of 100% (no false positives) and a recall of 96.8% (120 of 124): we missed Scottish Islands (twice), Douglas, and Campeltown. This suggests two evaluation paths: on the discard path missed entities are treated as plain (nongeographic) text, and on the pretend path we pretend the system actually found them.</p><p>Second, we removed meta-guidance such as "find information about" or "relevant documents will describe", since the relevant documents will not have the word "relevant" or "document" in them. This step is performed by the defluff.sed script (included with our submission) which, arguably, is closer to manual fluff removal than to automated conversion. However, MetaCarta does not encounter the fluffy question style in any context outside GeoCLEF, and it makes no sense for MetaCarta to develop a fully automated module for the defluffing task.</p><p>Third, we removed stopwords (defined as every word whose frequency is more than 1% of the frequency of the word "the" in a terabyte corpus we used for frequency analysis). While the script defunc.sed may look ad hoc, it was itself generated by a script based on the "1% of the" criterion, and as such we consider this step fully automated.</p><p>Fourth, we removed geographic metawords in a manner similar to defluffing: when the task description asks for countries involved in the fur trade, the word "country" will not be in the documents. The degeo.sed script is also included with our submission.</p><p>We believe that these steps (though not necessarily in this order) are generic in the sense that every geographic IR system must perform them one way or another. After the generic steps, the topics (only the title and desc fields kept) look as shown in Table 1 (autodetected geographic entities in boldface). Note how well the results of stopword removal from the desc section approximate the title section: aside from the last three topics (where the desc section is really narrative), the two are practically identical. Therefore, our first run is based on titles only (see below).</p></div>
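<div xmlns="http://www.tei-c.org/ns/1.0"><p>The "1% of the" stopword criterion of the third step can be illustrated with a minimal Python sketch. This is not the script that generated defunc.sed: the function names, the toy counts standing in for the terabyte corpus, and the whitespace tokenization are all assumptions made for illustration.</p>
<p>
```python
from collections import Counter


def stopwords_by_relative_frequency(counts, anchor="the", ratio=0.01):
    """Return every word whose count is more than `ratio` times the count of `anchor`."""
    threshold = ratio * counts[anchor]
    return {word for word, count in counts.items() if count > threshold}


def remove_stopwords(text, stopwords):
    """Drop stopwords from a case-folded, whitespace-tokenized query."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


# Toy counts standing in for the corpus statistics (illustrative numbers only).
toy_counts = Counter(
    {"the": 1000, "of": 520, "in": 310, "people": 14, "shark": 2, "attacks": 3}
)

# With "the" at 1000 occurrences, the threshold is 10 occurrences.
stopwords = stopwords_by_relative_frequency(toy_counts)

query = remove_stopwords(
    "Documents will report the attacks of shark in California", stopwords
)
```
</p><p>In this toy setting a content word such as "people" ends up in the stopword set, which is the kind of over-generous filtering the criterion can produce on real frequency data.</p></div>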
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">MetaCarta-specific steps</head><p>The natural mode of operation for the MetaCarta system is to use the map as a filter: select a region of interest, such as the Trossachs, type in some keywords such as "environmental" or "pollution", and see which documents are displayed as a result.</p><p>To emulate this operation, we created bounding boxes for each of the regions in the topics. While we did not get around to fully automating the process of querying the database for polygons and creating the bounding boxes, nothing in this step requires human intervention, and for the purposes of submission we consider it automated. The following table was used (Table 2): Asia 25.0 179.9 6.0; Australia 112.9 159.1 -9.1 -54. Items marked by * did not have a bounding box in the database and were assigned manually, a fact that is reflected in our distinction between the discard and the pretend evaluation paths.</p><p>Given a fixed collection of documents, such as the English dataset provided for GeoCLEF, a MetaCarta query has three parameters: maxdocs, the maximum number of document IDs we wish to see (typically 10 for "first page" results); bbleft bbright bbtop bbbottom, the longitudes and latitudes of the bounding box; and an arbitrary number of keywords, implicitly ANDed together.</p></div>
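<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three query parameters can be made concrete with a small Python sketch of "map as a filter". It is only an illustration under strong assumptions and says nothing about how the MetaCarta system is implemented: each document is reduced to a single hypothetical point location and a precomputed relevance score, and the Doc class, the bbox_query function, the document IDs and the scores are all invented for the example. The bounding box is the Trossachs entry from the bounding-box table.</p>
<p>
```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    lon: float
    lat: float
    text: str
    score: float  # hypothetical precomputed relevance


def bbox_query(docs, maxdocs, bbleft, bbright, bbtop, bbbottom, keywords):
    """Return up to maxdocs IDs of documents inside the box that contain all keywords."""
    hits = []
    for doc in docs:
        inside = (
            doc.lon >= bbleft
            and bbright >= doc.lon
            and doc.lat >= bbbottom
            and bbtop >= doc.lat
        )
        words = set(doc.text.lower().split())
        # Keywords are implicitly ANDed together.
        if inside and all(k.lower() in words for k in keywords):
            hits.append(doc)
    # Results come back in rank order, best first.
    hits.sort(key=lambda d: d.score, reverse=True)
    return [d.doc_id for d in hits[:maxdocs]]


toy_docs = [
    Doc("d1", -4.40, 56.20, "environmental pollution near the loch", 0.9),
    Doc("d2", -3.20, 55.95, "environmental pollution in the city", 0.95),
    Doc("d3", -4.45, 56.10, "walking holidays", 0.5),
    Doc("d4", -4.30, 56.30, "environmental issues in the trossachs", 0.6),
]

# Trossachs bounding box: bbleft -4.5, bbright -4.25, bbtop 56.5, bbbottom 56.0.
hits = bbox_query(toy_docs, 10, -4.5, -4.25, 56.5, 56.0, ["environmental"])
```
</p><p>Here d2 matches the keyword but lies outside the box, and d3 lies inside the box but lacks the keyword, so only d1 and d4 are returned, in score order.</p></div>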
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Results</head><p>In run 0 we took only the title words and the automatically detected regions, and created a query as described in 1.1 with maxdocs set at 200 (since the system returns results in rank order, a first page can be created by simply applying head to the result set). When the query implied logical OR rather than AND, we ran the queries separately and sorted the results together by relevance.</p><p>Run 0 mimicked a true geographic search where the geographic portion of the query is input through the map interface. Run 1 is a true keyword search where everything (including geographic words) is treated just as a keyword (so the discard and the pretend paths coincide). Running trec_eval produces the summaries shown in Table 3.</p></div>
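<div xmlns="http://www.tei-c.org/ns/1.0"><p>The handling of queries that imply logical OR (run the disjuncts separately, then sort the results together by relevance) can be sketched as follows. The sketch assumes that relevance scores are comparable across the separate queries and that a document returned by several disjuncts keeps its highest score; neither detail is specified above. The function name, document IDs and scores are hypothetical.</p>
<p>
```python
def merge_disjuncts(result_lists, maxdocs=200):
    """Merge ranked (doc_id, relevance) lists from separate queries into one ranking."""
    best = {}
    for results in result_lists:
        for doc_id, relevance in results:
            # Assumption: a document found by several disjuncts keeps its best score.
            if doc_id not in best or relevance > best[doc_id]:
                best[doc_id] = relevance
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:maxdocs]


# Toy result lists for two disjuncts of one topic, e.g. one per region.
first_region = [("a1", 0.9), ("a2", 0.4)]
second_region = [("c1", 0.7), ("a2", 0.6)]

merged = merge_disjuncts([first_region, second_region])

# The equivalent of applying head to the rank-ordered result set.
first_page = merged[:10]
```
</p></div>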
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Conclusions</head><p>As can be seen from comparing the two runs, the non-geographic and the geographic results are remarkably close, which supports the conclusion we arrived at from an informal, manual assessment: very little, if any, geographic specialization is required on these tasks.</p><p>First, the selection of geographic entities is limited, and most of them fit in what MetaCarta calls "Tier 1", a small set (2350 entries) of core place names whose approximate locations are known to everyone with a high school education. With the possible exception of the Scottish Islands (a class better defined by listing than by coherent geography) and the Trossachs (whose boundaries are clearly explained in the task narrative), a system with a small post-hoc gazetteer table could handle most of the questions: the only entries missing from the Tier 1 gazetteer are Argyll, Ayr, Callander, Loch Achray, Loch Katrine, Loch Lomond, Perthshire, Scottish Islands and Trossachs, and these do not even appear in the non-narrative sections.</p><p>Given that the problem of avoiding false positives becomes increasingly hard as we add more and more entities to the gazetteer, a task that encourages the use of trivial gazetteers will not serve the overall evaluation goals well. As it is, MetaCarta has an F-measure of 98.36% (from 100% precision and 96.8% recall), which would be quite impressive were it produced on a more realistic test set.</p><p>Second, even within this limited set, one has the feeling (perhaps unsubstantiated, since the guidelines did not address the issue) that many of the toponyms are used metonymically. In particular, Europe seems to refer to the EU as a political entity rather than to the continent (see especially items 4 and 8).</p><p>Because the bar is set so low, it is expected that almost all systems will pass the geographic hurdle in GeoCLEF, and their comparison will amount to a comparison of their non-geographic capabilities. To be sure, the issues that come to the fore are classical and fascinating issues in Information Retrieval:</p><p>1. stopword filtering 2. stemming (morphological analysis) 3. selecting keywords for a concept (vocabulary enrichment) 4. good handling of disjuncts and negation (Booleans) 5. fluff removal</p><p>There is little doubt that all of these issues are worthy of formal evaluation, though not necessarily as part of an evaluation focusing on geographic IR.</p><p>1. We believe that our stopword filtering may be overly generous (there are 75 words that meet the "more than 1% of the" criterion) inasmuch as it includes content words like "people", "part", "two", or "time", but the task is clearly not geared to test this further.</p><p>2. We have purposely refrained from stemming, which decreased our overall scores significantly (e.g. on GC001 we missed LA041994-0146 "WOMEN BITTEN BY SHARK HAD WON LEUKEMIA BATTLE" since it has the word "attacked" rather than "attack") but did not affect our main point, which is to compare geographically oriented search to true keyword search.</p><p>3. Where conjunction provides too few results, a reasonable approach would be to decrease the number of keywords: e.g. topic GC006 finds nothing with the keywords "oil accident birds", so reducing "oil accident" to "oil" would be helpful. The queries display a clear preference for semantic IR, asking e.g. for "consequences", "concerns", and other highly abstract concepts generally considered beyond the ken of mainstream IR techniques. Ideally, one's system should understand the underlying concepts, but in reality even the human judges are on slippery ground (for example, we do not believe that documents such as LA083194-0133 are actually responsive to GC015), and of course MetaCarta has no software to make a semantic decision. While this is clearly a weakness in the state of the art, again we feel the issue has little to do with geography.</p><p>4. To automate Booleans, a reasonable blind approach would be to take a random word from each conjunct and try the combinations: e.g. given "oil prospecting" and "ecological problems", form "oil ecological", "oil problems", "prospecting ecological", and "prospecting problems". We have not implemented this, and we handle conjunction in toponyms such as North and South America by a mechanism that was not exercised by the test. The descriptive queries in particular offer a fascinating glimpse into other problems that are viewed as important research topics, such as negation: "Reports regarding canned vegetables, vegetable juices or otherwise processed vegetables are not relevant". We do not deny the importance of these problems, but we doubt the wisdom of burdening this task with them.</p><p>5. As we have stated earlier, our fluff removal (including geographic fluff) was post hoc, and we do not believe that our scripts would generalize particularly well to future tests of a similar nature. However, again we would suggest reengineering the task rather than building infrastructure for the way it stands now.</p><p>However fascinating these problems might be, tacking "in Rwanda" on a question does not make it truly geographic; in fact, there is reason to believe that the easy part of geography (continents and countries) is no different from any other topic hierarchy. GeoCLEF 2005 did not go beyond the easy parts; let's hope next time will be better.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="fig_0"><head></head><label></label><figDesc>Bounding boxes; items marked by * were assigned manually</figDesc><table><row><cell>Region</cell><cell>bbleft</cell><cell>bbright</cell><cell>bbtop</cell><cell>bbbottom</cell></row>
<row><cell></cell><cell>-8.0</cell><cell>0.0</cell><cell>61.0</cell><cell>55.0</cell></row>
<row><cell>Scottish Highlands</cell><cell>-8.0</cell><cell>-2.0</cell><cell>59.3</cell><cell>56.0</cell></row>
<row><cell>* Scottish Islands</cell><cell>-8.0</cell><cell>0.0</cell><cell>61.0</cell><cell>56.0</cell></row>
<row><cell>Siberia</cell><cell>60.0</cell><cell>179.9</cell><cell>82.0</cell><cell>48.0</cell></row>
<row><cell>* Trossachs</cell><cell>-4.5</cell><cell>-4.25</cell><cell>56.5</cell><cell>56.0</cell></row>
<row><cell>United Kingdom</cell><cell>-8.6</cell><cell>2.0</cell><cell>60.8</cell><cell>49.0</cell></row>
<row><cell>United States</cell><cell>-125.0</cell><cell>-66.0</cell><cell>49.0</cell><cell>26.0</cell></row></table></figure>
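<div xmlns="http://www.tei-c.org/ns/1.0"><p>The blind approach to Booleans described in item 4 of the Conclusions was not implemented in the system; the following Python sketch only illustrates the idea. Instead of drawing one random word per conjunct, it enumerates every way of picking one word from each conjunct, which yields exactly the four combinations listed in the text. The function name is hypothetical.</p>
<p>
```python
from itertools import product


def keyword_combinations(conjuncts):
    """Return one query string per way of picking a single word from each conjunct."""
    word_lists = [phrase.split() for phrase in conjuncts]
    return [" ".join(choice) for choice in product(*word_lists)]


queries = keyword_combinations(["oil prospecting", "ecological problems"])
# queries == ["oil ecological", "oil problems",
#             "prospecting ecological", "prospecting problems"]
```
</p><p>Each resulting string could then be submitted as a separate, implicitly ANDed keyword query. The number of queries is the product of the conjunct lengths, so sampling a random subset may be preferable for long conjuncts.</p></div>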
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Preprocessed Queries</figDesc><table><row><cell>Topic</cell><cell>Title and desc fields after preprocessing</cell></row>
<row><cell>GC001</cell><cell>Shark Attacks Australia California shark attacks humans</cell></row>
<row><cell>GC002</cell><cell>Vegetable Exporters Europe exporters fresh dried frozen vegetables</cell></row>
<row><cell>GC003</cell><cell>AI Latin America Amnesty International human rights Latin America</cell></row>
<row><cell>GC004</cell><cell>Actions against fur industry Europe USA protests violent acts against fur industry</cell></row>
<row><cell>GC005</cell><cell>Japanese Rice Imports reasons consequences first imported rice Japan</cell></row>
<row><cell>GC006</cell><cell>Oil Accidents Birds Europe damage injury birds caused accidental oil spills pollution</cell></row>
<row><cell>GC007</cell><cell>Trade Unions Europe differences role importance trade unions European</cell></row>
<row><cell>GC008</cell><cell>Milk Consumption Europe milk consumption European</cell></row>
<row><cell>GC009</cell><cell>Child Labor Asia child labor Asia proposals eliminate improve working conditions children</cell></row>
<row><cell>GC010</cell><cell>Flooding Holland Germany flood disasters Holland Germany 1995</cell></row>
<row><cell>GC011</cell><cell>Roman UK Germany Roman UK Germany</cell></row>
<row><cell>GC012</cell><cell>Cathedrals Europe particular cathedrals Europe United Kingdom Russia</cell></row>
<row><cell>GC013</cell><cell>Visits American president Germany visits President Clinton Germany</cell></row>
<row><cell>GC014</cell><cell>Environmentally hazardous Incidents North Sea environmental accidents hazards North Sea</cell></row>
<row><cell>GC015</cell><cell>Consequences genocide Rwanda genocide Rwanda impacts</cell></row>
<row><cell>GC016</cell><cell>Oil prospecting ecological problems Siberia and Caspian Sea Oil petroleum development related ecological problems Siberia Caspian Sea</cell></row>
<row><cell>GC017</cell><cell>American Troops Sarajevo Bosnia Herzegovina American troop deployment Bosnia Herzegovina Sarajevo</cell></row>
<row><cell>GC018</cell><cell>Walking holidays Scotland walking holidays Scotland</cell></row>
<row><cell>GC019</cell><cell>Golf tournaments Europe golf tournaments held European</cell></row>
<row><cell>GC020</cell><cell>Wind power Scottish Islands electrical power generation using wind power islands Scotland</cell></row>
<row><cell>GC021</cell><cell>Sea rescue North Sea rescues North Sea</cell></row>
<row><cell>GC022</cell><cell>Restored buildings Southern Scotland restoration historic buildings southern Scotland</cell></row>
<row><cell>GC023</cell><cell>Murders violence South-West Scotland violent acts murders South West part Scotland</cell></row>
<row><cell>GC024</cell><cell>Factors influencing tourist industry Scottish Highlands tourism industry Highlands Scotland factors affecting</cell></row>
<row><cell>GC025</cell><cell>Environmental concerns around Scottish Trossachs environmental issues concerns Trossachs Scotland</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Bounding Boxes</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Results on Runs 0 and 1</figDesc><table><row><cell></cell><cell>Run 0</cell><cell>Run 1</cell></row><row><cell>num_q</cell><cell>22</cell><cell>15</cell></row><row><cell>num_ret</cell><cell>1494</cell><cell>1002</cell></row><row><cell>num_rel</cell><cell>895</cell><cell>765</cell></row><row><cell>num_rel_ret</cell><cell>289</cell><cell>132</cell></row><row><cell>map</cell><cell>0.1700</cell><cell>0.1105</cell></row><row><cell>R-prec</cell><cell>0.2155</cell><cell>0.1501</cell></row><row><cell>bpref</cell><cell>0.1708</cell><cell>0.1148</cell></row><row><cell>recip_rank</cell><cell>0.6748</cell><cell>0.6522</cell></row><row><cell>ircl_prn.0.00</cell><cell>0.6837</cell><cell>0.6633</cell></row><row><cell>ircl_prn.0.10</cell><cell>0.4178</cell><cell>0.2904</cell></row><row><cell>ircl_prn.0.20</cell><cell>0.3443</cell><cell>0.2188</cell></row><row><cell>ircl_prn.0.30</cell><cell>0.2977</cell><cell>0.1700</cell></row><row><cell>ircl_prn.0.40</cell><cell>0.1928</cell><cell>0.1103</cell></row><row><cell>ircl_prn.0.50</cell><cell>0.0971</cell><cell>0.0676</cell></row><row><cell>ircl_prn.0.60</cell><cell>0.0435</cell><cell>0.0365</cell></row><row><cell>ircl_prn.0.70</cell><cell>0.0261</cell><cell>0.0109</cell></row><row><cell>ircl_prn.0.80</cell><cell>0.0130</cell><cell>0.0109</cell></row><row><cell>ircl_prn.0.90</cell><cell>0.0000</cell><cell>0.0109</cell></row><row><cell>ircl_prn.1.00</cell><cell>0.0000</cell><cell>0.0089</cell></row><row><cell>P5</cell><cell>0.4455</cell><cell>0.3467</cell></row><row><cell>P10</cell><cell>0.3182</cell><cell>0.2333</cell></row><row><cell>P15</cell><cell>0.2667</cell><cell>0.1867</cell></row><row><cell>P20</cell><cell>0.2500</cell><cell>0.1867</cell></row><row><cell>P30</cell><cell>0.2182</cell><cell>0.1644</cell></row><row><cell>P100</cell><cell>0.1141</cell><cell>0.0740</cell></row><row><cell>P200</cell><cell>0.0636</cell><cell>0.0410</cell></row><row><cell>P500</cell><cell>0.0263</cell><cell>0.0176</cell></row><row><cell>P1000</cell><cell>0.0131</cell><cell>0.0088</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>Special thanks to Bradley Thompson and the MetaCarta team, very much including Erik Rauch, whose sudden death fills us all with sorrow.</p></div>
			</div>

			<div type="references">

				<listBibl/>
			</div>
		</back>
	</text>
</TEI>
