<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards ontology-based disambiguation of geographical identifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raphael Volz</string-name>
          <email>volz@fzi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Kleb</string-name>
          <email>kleb@fzi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang Mueller</string-name>
          <email>wmueller@fzi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FZI Research Center for Information Technologies</institution>
          ,
          <addr-line>Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <fpage>8</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Geographic names have always been important identifiers. People typically use names and not coordinates to identify geographic features. Therefore to establish identity beyond coordinates, name disambiguation is required to identify the exact geographic feature that is denoted by a name. This paper introduces an ontology-based approach to disambiguate geographical names in texts. The ontology defines the central conceptual basis of our approach and is used to rank geographic features based on disambiguation rules that take into account structural information contained in the ontology (e.g. population of a town), as well as textual indicators contained in the text at hand. Our first evaluation on a subset of the well-known Reuters 21578 corpus indicates promising results both in terms of precision and recall.</p>
      </abstract>
      <kwd-group>
        <kwd>ontology</kwd>
        <kwd>semantics</kwd>
        <kwd>geographic references</kwd>
        <kwd>disambiguation</kwd>
        <kwd>candidate identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.1.m [Miscellaneous]: ontology; H.3.3 [INFORMATION
STORAGE AND RETRIEVAL]: Information Search
and Retrieval: selection process; I.7.m [DOCUMENT AND
TEXT PROCESSING]: Miscellaneous: Document
Assignment</p>
    </sec>
    <sec id="sec-2">
      <title>General Terms</title>
    </sec>
    <sec id="sec-3">
      <title>INTRODUCTION</title>
      <p>Geography has become a popular theme on the Web.
Applications like Google Maps are easy to use and incentivise
people to publish data tagged with geographical identifiers,
i.e. coordinates in the form of longitude and latitude, to
visualize their data in maps and other representations of
geography.</p>
      <p>When talking about geographic identifiers, however,
people use geographic names to denote a certain coordinate
and look up information pertaining to this coordinate using
names. Geographic names, like all names, are often highly
ambiguous. For example, the name San Jose refers to 1724
different coordinates in the two largest publicly available
databases of geographic names, GeoNet and GNIS (cf.
Figure 1).</p>
      <p>Figure 1 shows that the disambiguation of geographic names is
a concrete and relevant task that is needed (I) to attach the
appropriate identifier to data containing a geographic name
and (II) to resolve the appropriate identifier from a geographic
name when users retrieve information.</p>
      <p>Our paper presents a novel method to disambiguate
geographical names based on an ontology. The ontology
incorporates data from publicly available geographic gazetteers
- the Geonames databases GNS and GNIS (cf. section 2.1)
- as well as common world / linguistic knowledge that has
been obtained from WordNet. The ontology is used as a
gazetteer to map references to (multiple ambiguous)
geographic feature candidates from text.</p>
      <p>Our disambiguation approach chooses among those
feature candidates by dealing with three types of ambiguity:
(I) multi-referent ambiguity, when two different geographic
locations share the same name; (II) name variant
ambiguity, when the same location has different names; and (III)
geoname vs. non-geoname ambiguity, where a location name
could also stand for some other word such as a person name
or a common noun, e.g. Metro as the city in Indonesia vs. Metro as
the subway system.</p>
      <p>Our approach establishes a ranking among feature
candidates recognized in a text. To this end, we leverage
further information about the geographical name in the
given document as well as scoring rules that prefer certain
concepts in the ontology over others, e.g. geographic names
for cities dominate geographic names for forests or lakes.
In addition, data available from the ontology, particularly the
population of cities, is used to prefer large towns over
small towns. These rules
reflect natural heuristics as employed by humans. For
example, “Paris”, without further information, more likely
refers to the capital of France than to the small town
Paris in Texas.</p>
      <p>The paper is organized as follows: Section 2 describes the
ontology as well as the data sets used to build the
ontology. Section 3 describes our approach of disambiguating
geographic names using the ontology. Section 4 describes
our evaluation corpus and discusses the results of our
evaluation. We conclude with a discussion of related work in
Section 5, followed by a summary of our contribution and an
illustration of next steps in Section 6.</p>
    </sec>
    <sec id="sec-4">
      <title>GEONAMES ONTOLOGY</title>
    </sec>
    <sec id="sec-5">
      <title>Geographic data sources</title>
      <p>
        Our ontology is derived from two publicly available data
sources for geographic names:
• The GEOnet Names Server (GNS) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provides a
database of non-US geographic feature names and is
maintained by two US government authorities (US National
Geospatial-Intelligence Agency (US NGA) and US
Board on Geographic Names (US BGN)). GNS provides
approximately 5.5 million names for roughly 4.0
million geographic features outside the US.
• The Geographic Names Information System (GNIS),
maintained by the US Board on Geographic Names,
complements GNS with US-specific names. The database
holds the officially recognized name of each feature
and defines the feature location by state, county, US
Geological Survey (USGS) topographic map and
geographic coordinates. Other attributes include
alternate names and alternative spellings for the official
name, feature designations, historical and descriptive
information and - for some features - geometric
boundaries.
      </p>
      <p>These data sets have been frequently used as gazetteers
by the NLP community. Besides providing a list of names,
the data contain several additional attributes that
can be used for disambiguation. All names are classified into
feature types (e.g. city, park, forest) and information
about the containing (political) administrative regions (e.g.
county, state and country) is provided. For unique
identification, coordinate information stored with the name can be
used.</p>
      <p>Figure 1 visualizes the distribution of place name
ambiguity in the GeoNet and GNIS databases.
While a geographic name has on average 4.4 different
meanings, some names derived from Spanish saints refer to more
than 1000 different locations. Table 1 summarizes the size
of both data sets in terms of populated places (without
location name variants) and the full names of locations in both
data sets.</p>
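      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th></th><th>absolute</th><th>relative</th></tr>
          </thead>
          <tbody>
            <tr><td>WordNet nouns</td><td>114,648</td><td>100%</td></tr>
            <tr><td>ambiguous with geo names</td><td>13,169</td><td>11.5%</td></tr>
            <tr><td>not ambiguous with geo names</td><td>101,479</td><td>88.5%</td></tr>
          </tbody>
        </table>
      </table-wrap>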
      <p>WordNet is a well-known lexical resource for English
words, which is often used in natural-language processing
and information retrieval applications. The core concept in
WordNet is the synset, which groups words with
synonymous meaning. Ambiguity of words is represented by
mapping a word to multiple synsets. We use WordNet
2.0, which contains 114,648 nouns<sup>1</sup>. 11.5% of those nouns
intersect with the geographic names in GNIS and GeoNet (cf.
Table 2) and constitute Geo-Non Geo ambiguities.</p>
    </sec>
    <sec id="sec-6">
      <title>Ontology model</title>
      <p>Our ontology is created from the above data sources and
is represented in the well-known OWL format. We currently
only use a subset of the axioms and class descriptions
possible in OWL. For our purpose the features already present
in RDF Schema suffice. Our ontology uses the following
structures:</p>
      <sec id="sec-6-1">
        <title>Definition 1. An ontology is a structure</title>
        <p>O := (C, ≤<sub>C</sub>, R, σ) consisting of
• two disjoint sets C and R whose elements are called
classes and properties, resp.,
• a partial order ≤<sub>C</sub> on C, called class hierarchy or
taxonomy,
• a function σ : R → C × C called signature of a property.
All classes are serialized as owl:Class, while all properties
are serialized as owl:ObjectProperty or
owl:DatatypeProperty.</p>
        <sec id="sec-6-1-1">
          <title>Definition 2. For a property r ∈ R, we define its domain</title>
          <p>and its range by dom(r) := π<sub>1</sub>(σ(r)) and range(r) :=
π<sub>2</sub>(σ(r)).</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>If c1 ≤<sub>C</sub> c2, for c1, c2 ∈ C, then c1 is a subclass of c2 and</title>
          <p>c2 is a superclass of c1.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>If c1 &lt;<sub>C</sub> c2 and there is no c3 ∈ C with c1 &lt;<sub>C</sub> c3 &lt;<sub>C</sub> c2,</title>
          <p>then c1 is a direct subclass of c2 and c2 is a direct superclass
of c1. We denote this by c1 ≺ c2.</p>
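          <p>The subclass relations defined above can be illustrated with a minimal Python sketch. It assumes that the hierarchy is given as a small set of direct (subclass, superclass) pairs; the class name City and all function names are hypothetical and serve only as an example.</p>
          <preformat><![CDATA[
```python
# Hypothetical class hierarchy given as direct (subclass, superclass) pairs.
# "City" is an illustrative name; the other two classes appear in the text.
CLASSES = {"City", "PopulatedPlace", "GeographicFeature"}
DIRECT_EDGES = {
    ("City", "PopulatedPlace"),
    ("PopulatedPlace", "GeographicFeature"),
}


def partial_order(classes, edges):
    """Reflexive-transitive closure of the edges, i.e. the relation <=_C."""
    leq = {(c, c) for c in classes} | set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(leq):
            for c, d in list(leq):
                if b == c and (a, d) not in leq:
                    leq.add((a, d))
                    changed = True
    return leq


def is_direct_subclass(leq, classes, c1, c2):
    """True if c1 is strictly below c2 and no c3 lies strictly between them."""
    if c1 == c2 or (c1, c2) not in leq:
        return False
    return not any(
        c3 not in (c1, c2) and (c1, c3) in leq and (c3, c2) in leq
        for c3 in classes
    )


LEQ = partial_order(CLASSES, DIRECT_EDGES)
```
]]></preformat>
          <p>In this sketch City is a subclass of GeographicFeature, but only a direct subclass of PopulatedPlace.</p>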
          <p>Definition 3. A knowledge base is a structure</p>
          <p>
            KB := (C<sub>KB</sub>, R<sub>KB</sub>, I, ι<sub>C</sub>, ι<sub>R</sub>)
• a set I whose elements are called instances (or
identifiers, typically denoted by URIs).
          </p>
          <p>
            A lexicon Lex for an ontology O consists of
• two sets S<sub>C</sub> and S<sub>R</sub> whose elements are called names
for classes and properties, resp.,
• a property Ref<sub>C</sub> ⊆ S<sub>C</sub> × C called lexical reference for
classes, where (c, c) ∈ Ref<sub>C</sub> holds for all c ∈ C ∩ S<sub>C</sub>,
• a property Ref<sub>R</sub> ⊆ S<sub>R</sub> × R called lexical reference for
properties, where (r, r) ∈ Ref<sub>R</sub> holds for all r ∈ R ∩
S<sub>R</sub>.
          </p>
          <p>Footnote 1: [<xref ref-type="bibr" rid="ref11">11</xref>] presents further WordNet 2.0 statistics.</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>Based on Ref<sub>C</sub>, we define, for s ∈ S<sub>C</sub> and for c ∈ C,</title>
          <p>Ref<sub>C</sub>(s) := {c ∈ C | (s, c) ∈ Ref<sub>C</sub>}
Ref<sub>C</sub><sup>−1</sup>(c) := {s ∈ S<sub>C</sub> | (s, c) ∈ Ref<sub>C</sub>}.</p>
        </sec>
        <sec id="sec-6-1-5">
          <title>Ref<sub>R</sub> and Ref<sub>R</sub><sup>−1</sup> are defined analogously.</title>
          <p>An ontology with lexicon is a pair</p>
          <p>(O, Lex)</p>
          <p>where O is an ontology and Lex is a lexicon for O.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>Definition 5. An instance lexicon for a knowledge base</title>
        <p>KB := (C<sub>KB</sub>, R<sub>KB</sub>, I, ι<sub>C</sub>, ι<sub>R</sub>) is a pair IL := (S<sub>I</sub>, R<sub>I</sub>) consisting of
• a set S<sub>I</sub> whose elements are called names for instances,
• a property R<sub>I</sub> ⊆ S<sub>I</sub> × I called lexical reference for
instances.</p>
        <p>A knowledge base with lexicon is a pair</p>
        <p>(KB , IL)
where KB is a knowledge base and IL is an instance lexicon
for KB .</p>
        <p>The lexicon is serialized as RDF triples using rdfs:label as
the predicate in the triple.</p>
        <p>Footnote 2: Which itself has further subclasses such as Sea, Stream or
River (not visible in Figure 2).</p>
        <p>Footnote 3: http://www.w3.org/TR/2006/WD-WordNet-rdf-20060619/</p>
        <p>The knowledge base of the ontology is created by
iterating over all WordNet noun word senses and populating
the NounWordSense and NounSynset classes, with the
exception of word senses of hyponyms of the synsets “city” and
“country”<sup>4</sup>. Additionally, we iterate over all GeoNet and
GNIS entries, instantiating the appropriate subclass of
GeographicFeature that is indicated by the feature
classification of the entry at hand. At the same time, properties
linking each instance to instances of related classes such as Country or
Coordinate are established.</p>
        <p>During the iterations to establish the knowledge base, the
lexicons of the ontology, IL and Lex, are created from
all names (and name variants) in the geographic databases
as well as all nouns in WordNet. Additionally, we have
populated the lexicon with names that reference classes, e.g.
the word “Sea” establishes a reference to the class “Sea”, as
well as names that reference instances, e.g. the ISO codes
for countries reference the respective instance that
represents the country.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>GEO-NAME DISAMBIGUATION</title>
      <p>Our approach for disambiguation establishes a ranking
among feature candidates recognized in text. We are
leveraging information about the geographical name in the given
text and use the ontology as a gazetteer for instance
identification. Additionally we utilize scoring rules that prefer
certain concepts in the ontology over others, e.g. geographic
names for cities dominate geographic names for forests or
lakes. The scoring rules reflect natural heuristics as
employed by humans and are based on definitions available in
the ontology.</p>
      <p>The disambiguation process involves several steps:
1. Spotting candidates for geographical identifiers in text
utilizing the ontology as a gazetteer
2. Narrowing candidates using surrounding textual
information (textual disambiguation)
3. Ranking candidates using ontological information
(ontology-based disambiguation)</p>
    </sec>
    <sec id="sec-8">
      <title>Spotting candidates</title>
      <p>To spot candidates, we first transform a given text into
a bag-of-words representation. To this end, a document is
represented as a vector D consisting of the terms (t<sub>1</sub>, . . . , t<sub>n</sub>)
of the document. The original ordering of the terms within
the document is preserved. In the formation of D we rely on
the built-in gazetteers of the NLP tool, which offer means
to spot person names, organization names and stop words<sup>5</sup>.
Such elements spotted by the NLP tool are not part of D.
Additionally, we have defined several grammars that
suppress the recognition of consecutive terms. For example,
when the NLP tool recognizes a person’s first name, the
following term is automatically considered a last
name and neither term is added to D.</p>
      <p>The ontology is used as a gazetteer utilizing the instance
lexicon IL and obtaining references to candidate instances
i ∈ I using the relation R<sub>I</sub>. We obtain the set of candidate
geographic identifiers in the document D using a function
cand(t<sub>i</sub>) := {i ∈ I | (t<sub>i</sub>, i) ∈ R<sub>I</sub>}<sup>6</sup>. In a similar fashion we
identify a set of concepts by utilizing the signs for concepts
S<sub>C</sub> in the lexicon Lex: con(t<sub>i</sub>) := {c ∈ C | (t<sub>i</sub>, c) ∈ Ref<sub>C</sub>}.
Note that cand(t<sub>i</sub>) will also include references to the
non-geographic names which have been incorporated from
WordNet.</p>
      <p>Footnote 4: Otherwise we would duplicate this information in the
ontology.</p>
      <p>Footnote 5: We utilize the well-known stop word list of the SMART
project, ftp://ftp.cs.cornell.edu/pub/smart/english.stop</p>
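      <p>The two lookup functions cand and con can be illustrated with a minimal Python sketch that models the instance lexicon and the class lexicon as dictionaries. All identifiers, the sample entries and the case-insensitive matching are assumptions made for illustration and are not part of the described system.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical instance lexicon R_I: name -> set of instance identifiers.
R_I = defaultdict(set)
# Hypothetical class lexicon Ref_C: name -> set of class names.
REF_C = defaultdict(set)


def add_instance_name(name, instance_id):
    # Case-insensitive matching is an assumption of this sketch.
    R_I[name.lower()].add(instance_id)


def add_class_name(name, class_name):
    REF_C[name.lower()].add(class_name)


def cand(term):
    """All instances referenced by the term (empty set if unknown)."""
    return set(R_I.get(term.lower(), set()))


def con(term):
    """All classes referenced by the term (empty set if unknown)."""
    return set(REF_C.get(term.lower(), set()))


# Sample entries: two places called Paris, and Metro both as a place
# and as a WordNet noun sense.
add_instance_name("Paris", "geo:Paris_FR")
add_instance_name("Paris", "geo:Paris_TX")
add_instance_name("Metro", "geo:Metro_ID")
add_instance_name("Metro", "wn:metro_noun_1")
add_class_name("forest", "Forest")

document = ["Metro", "opened", "near", "Paris", "forest"]
candidates = {t: cand(t) for t in document if cand(t)}
concepts = {t: con(t) for t in document if con(t)}
```
]]></preformat>
      <p>As in the definition of cand, the candidates for Metro include the non-geographic WordNet sense, which is only ruled out later by the ranking.</p>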
      <p>The second step is textual disambiguation, where textual
patterns allow us to narrow the candidates of ambiguous geographic
names. We first traverse the term vector D with a window
of two consecutive terms (t<sub>i</sub>, t<sub>i+1</sub>) (cf. Figure 3). If
candidates for t<sub>i</sub> refer to instances of geographic features and
candidates for t<sub>i+1</sub> refer to instances of administrative
regions, or vice versa, the combination of t<sub>i</sub> and t<sub>i+1</sub> is used
to narrow the set of candidates for t<sub>i</sub> to those geographic
features that are located in the given administrative region
t<sub>i+1</sub>. For example, the text “Paris, France” will thereby lead
to the selection of the French capital.</p>
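      <p>The two-term window can be sketched in Python as follows. The candidate records, their fields (id, kind, region) and the function name are hypothetical; for simplicity the sketch keys candidates by term rather than by term position.</p>
      <preformat><![CDATA[
```python
# Hypothetical candidate records per term: an identifier, whether the
# instance is a geographic feature or an administrative region, and the
# administrative region that contains it.
CANDIDATES = {
    "Paris": [
        {"id": "geo:Paris_FR", "kind": "feature", "region": "geo:France"},
        {"id": "geo:Paris_TX", "kind": "feature", "region": "geo:Texas"},
    ],
    "France": [{"id": "geo:France", "kind": "region", "region": None}],
    "Texas": [{"id": "geo:Texas", "kind": "region", "region": "geo:USA"}],
}


def narrow_by_region(terms, candidates):
    """Narrow feature candidates of a term using an adjacent region term."""
    narrowed = {term: list(cands) for term, cands in candidates.items()}
    for left, right in zip(terms, terms[1:]):
        # Try both orders: feature followed by region, and vice versa.
        for feature_term, region_term in ((left, right), (right, left)):
            regions = {
                c["id"]
                for c in candidates.get(region_term, [])
                if c["kind"] == "region"
            }
            features = [
                c for c in narrowed.get(feature_term, [])
                if c["kind"] == "feature"
            ]
            kept = [c for c in features if c["region"] in regions]
            if kept:
                narrowed[feature_term] = kept
    return narrowed


result = narrow_by_region(["Paris", "France"], CANDIDATES)
```
]]></preformat>
      <p>With the terms Paris and France adjacent, only the candidate located in France remains; without a neighbouring region term, both candidates are kept for the later ranking step.</p>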
      <p>In a next step we traverse the term vector D for a second
time with a larger window of 11 consecutive terms (t<sub>i−5</sub>, . . . ,
t<sub>i</sub>, . . . , t<sub>i+5</sub>) (cf. Figure 4) and narrow candidates based on
geographic feature classes. If candidates for a given term
t<sub>i</sub> are instances of any class referenced by con(t<sub>j</sub>), where
i − 5 ≤ j ≤ i + 5 and j ≠ i, the set of candidates is narrowed to
the instances of these classes. For example, the text
“Nottingham forest” will lead to the selection of the forest in
the UK over the city of Nottingham.</p>
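      <p>The class-based narrowing over the 11-term window can be sketched in Python as follows. The lexicon entries, identifiers and the class name Forest are hypothetical, and the sketch only checks the direct class of an instance rather than the full class hierarchy.</p>
      <preformat><![CDATA[
```python
WINDOW = 5  # five terms on each side, i.e. 11 terms in total

# Hypothetical lexicon entries and instance classifications.
NAMES = {"nottingham": {"geo:Nottingham_city", "geo:Nottingham_forest"}}
CLASS_NAMES = {"forest": {"Forest"}, "city": {"PopulatedPlace"}}
INSTANCE_CLASS = {
    "geo:Nottingham_city": "PopulatedPlace",
    "geo:Nottingham_forest": "Forest",
}


def cand(term):
    return set(NAMES.get(term.lower(), set()))


def con(term):
    return set(CLASS_NAMES.get(term.lower(), set()))


def narrow_by_class(terms):
    """For each term, keep candidates whose class is named in the window."""
    result = []
    for i, term in enumerate(terms):
        candidates = cand(term)
        start = max(0, i - WINDOW)
        stop = min(len(terms), i + WINDOW + 1)
        classes = set()
        for j in range(start, stop):
            if j != i:
                classes |= con(terms[j])
        kept = {c for c in candidates if INSTANCE_CLASS[c] in classes}
        # If no class in the window matches, leave the candidates unchanged.
        result.append(kept or candidates)
    return result


narrowed = narrow_by_class(["Nottingham", "forest"])
```
]]></preformat>
      <p>Here the neighbouring term forest narrows the candidates for Nottingham to the forest instance; a class name more than five terms away has no effect.</p>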
    </sec>
    <sec id="sec-9">
      <title>Ranking candidates</title>
      <p>The remaining candidate sets are then ranked utilizing
weights that are attached to the classes of the ontology.
These weights are propagated to the instances in the
candidate set and used to rank those instances. The instance
with the highest rank is chosen. If the weight of an instance
is negative, we disregard the term as a geographic name.</p>
      <p>Table 3 shows the weights attached to concepts as used in
our evaluation.</p>
      <p>Footnote 6: This is a simplified definition. References to instances
constituted by multiple consecutive nouns in the text are
supported by combining those nouns.</p>
      <p>Class weights from Table 3 are transitively propagated to
their subclasses via ≤<sub>C</sub>. Class weights are then propagated
to class instances via the instantiation function ι<sub>C</sub>. For all
instances of the class PopulatedPlace, the number of
inhabitants (divided by one thousand) is added to the weight of the
instance.</p>
      <p>Note that we use negative weights for the WordNet classes.
This ranks names from WordNet down and at the same time
disregards all terms that have only a WordNet meaning.
Finally, we select the instance i ∈ cand(t<sub>i</sub>) with the maximum
score (or randomly choose an instance if several
instances share the maximum score).</p>
      <p>For example, the term Lancaster (without any further
information such as county or country) would yield a score of
3129 for Lancaster (CA) (a populated place with 129 thousand
inhabitants in California) and a score of 3047 for Lancaster
(UK) (the well-known city in the UK with 47 thousand
inhabitants), while all Lancaster counties in the US would only get
a score of 1000, and the Lancaster trough in the Lancaster
sound in the north of Canada receives -10 points, being an
instance of a subclass of Undersea. Hence, the algorithm chooses
Lancaster, CA as the instance that provides the most likely
geo-reference.</p>
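      <p>The Lancaster example can be reproduced with a minimal Python sketch of the scoring. The class weights are those of the first test run in Table 3; the subclass names County and Trough, the instance identifiers and the rounded population figures are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
# Class weights as in the first test run of Table 3.
CLASS_WEIGHTS = {
    "AdministrativeRegion": 1000,
    "PopulatedPlace": 3000,
    "Undersea": -10,
}
# Hypothetical subclasses that inherit the weight of their superclass.
DIRECT_SUPERCLASS = {"County": "AdministrativeRegion", "Trough": "Undersea"}


def class_weight(cls):
    """Walk up the hierarchy until a class with an attached weight is found."""
    while cls not in CLASS_WEIGHTS:
        cls = DIRECT_SUPERCLASS[cls]
    return CLASS_WEIGHTS[cls]


def score(instance):
    weight = class_weight(instance["cls"])
    if instance["cls"] == "PopulatedPlace":
        # Add the number of inhabitants divided by one thousand.
        weight += instance["population"] // 1000
    return weight


def choose(candidates):
    """Best-scoring candidate, or None if even the best score is negative."""
    best = max(candidates, key=score)  # ties: first maximum, not random
    return best if score(best) >= 0 else None


LANCASTER = [
    {"id": "Lancaster_CA", "cls": "PopulatedPlace", "population": 129000},
    {"id": "Lancaster_UK", "cls": "PopulatedPlace", "population": 47000},
    {"id": "Lancaster_County", "cls": "County", "population": 0},
    {"id": "Lancaster_Trough", "cls": "Trough", "population": 0},
]

scores = [score(c) for c in LANCASTER]
best = choose(LANCASTER)
```
]]></preformat>
      <p>The sketch yields the scores 3129, 3047, 1000 and -10 and selects Lancaster (CA). Unlike the described approach, it resolves ties by taking the first maximum instead of choosing randomly.</p>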
      <p>To additionally determine a page focus, i.e. to choose
one geographic reference among all geographic references in
the text to describe the main geographic focus of the text,
we simply count term occurrences and multiply the weight of
each instance by the number of its occurrences in the text. We
then choose the instance with the maximum weight among
all candidate sets.</p>
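      <p>The page focus computation can be sketched in Python as follows, assuming that every term occurrence has already been resolved to one instance. The function name, the instance identifiers and the sample document are hypothetical; the weight of 3000 is the Country weight from Table 3.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def page_focus(resolved_instances, weights):
    """Instance with the highest weight multiplied by its occurrence count."""
    counts = Counter(resolved_instances)
    return max(counts, key=lambda inst: weights[inst] * counts[inst])


# Hypothetical document: the UK is mentioned twice, France once.
weights = {"geo:UK": 3000, "geo:France": 3000}
resolved = ["geo:UK", "geo:France", "geo:UK"]
focus = page_focus(resolved, weights)
```
]]></preformat>
      <p>Both countries carry the same class weight, so the two occurrences of the UK decide the page focus.</p>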
    </sec>
    <sec id="sec-10">
      <title>EVALUATION</title>
      <p>
        While Leidner [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] describes a reference corpus of toponym
resolution, the data set is not publicly available. We
therefore compiled two corpora from the well-known Reuters 21578
data set to evaluate our approach to disambiguation of
geographic names:
• Corpus 1: 250 documents hand-annotated with all
geo-references for geo-name disambiguation
• Corpus 2: 100 documents hand-annotated with one
country reference identifying page focus
      </p>
      <p>We evaluate our approach on both corpora and consider
a result as correct if it corresponds exactly to the annotation
provided by the human annotator.</p>
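      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Test Run</th><th>I</th><th>II</th><th>III</th></tr>
          </thead>
          <tbody>
            <tr><td>Precision</td><td></td><td></td><td></td></tr>
            <tr><td>Recall</td><td></td><td></td><td></td></tr>
            <tr><th colspan="4">Weights</th></tr>
            <tr><td>WordNet senses</td><td>0</td><td>-20000</td><td>-10000</td></tr>
            <tr><td>Administrative Region</td><td>1000</td><td>1000</td><td>1000</td></tr>
            <tr><td>Country</td><td>3000</td><td>3000</td><td>3000</td></tr>
            <tr><td>Hypsographic</td><td>10</td><td>1</td><td>10</td></tr>
            <tr><td>Locality</td><td>1000</td><td>500</td><td>1000</td></tr>
            <tr><td>PopulatedPlace</td><td>3000</td><td>10000</td><td>3000</td></tr>
            <tr><td>Road</td><td>5</td><td>10</td><td>5</td></tr>
            <tr><td>Spotplace</td><td>1000</td><td>2500</td><td>1000</td></tr>
            <tr><td>Hydrographic</td><td>10</td><td>10</td><td>10</td></tr>
            <tr><td>Undersea</td><td>-10</td><td>1</td><td>-10</td></tr>
            <tr><td>Vegetation</td><td>10</td><td>20</td><td>10</td></tr>
          </tbody>
        </table>
      </table-wrap>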
    </sec>
    <sec id="sec-11">
      <title>Evaluation 1 - geo-name disambiguation</title>
      <p>We evaluate the success of our approach using the classical
precision and recall measures.</p>
      <p>• precision = |G<sub>res</sub> ∩ G<sub>rel</sub>| / |G<sub>res</sub>|
• recall = |G<sub>res</sub> ∩ G<sub>rel</sub>| / |G<sub>rel</sub>|</p>
      <p>Here, G<sub>res</sub> is the set of geographic references identified by
the algorithm and G<sub>rel</sub> is the set of relevant geographic
references of the corpus as identified by the annotator. We
have carried out three test runs with different weights.</p>
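      <p>Both measures can be computed directly from the two sets, as the following Python sketch shows. The function name and the sample references are hypothetical; returning 0.0 for an empty set is a convention of this sketch.</p>
      <preformat><![CDATA[
```python
def precision_recall(found, relevant):
    """Precision and recall of found references against the annotation."""
    hits = len(found & relevant)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: four references found, three annotated, two shared.
found = {"geo:Paris_FR", "geo:France", "geo:Metro_ID", "geo:UK"}
relevant = {"geo:Paris_FR", "geo:France", "geo:Germany"}
precision, recall = precision_recall(found, relevant)
```
]]></preformat>
      <p>In this example two of the four found references are correct (precision 0.5) and two of the three annotated references are found (recall 2/3).</p>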
      <p>The first run did not weight WordNet terms negatively
and therefore leads to a low precision of 40%. This is due to
the fact that many terms are incorrectly considered as
geographic names. These false positives are constituted by
common words, such as the aforementioned example “Metro”.
The recall is at an acceptable level of 86.7%. It is not higher,
despite the large gazetteer constituted by our
ontology, because annotators (semantically correctly) tagged
several organizations such as the “Bank of England” and
adjectives such as “French” with a geo-reference
to the country, while this information is not utilized in our
approach.</p>
      <p>In the second run, we weight all WordNet terms with a
high negative score. Similarly, we choose to increase the
weight of populated places, as cities are very common in the
corpus. This improves the precision tremendously to almost
70%.</p>
      <p>The third run tests the sensitivity of precision to the
weights of populated places and WordNet word senses, and
indeed the precision drops slightly when decreasing the
weights of those two classes. Unfortunately, we were not
able to improve the precision above the level of test run II
in several iterations with other weights.</p>
      <p>Looking at the results, we could observe that the
identification of country names has much greater precision due to
the absence of ambiguities. Here, only misspellings and other
noise in the data prevent 100% precision.</p>
    </sec>
    <sec id="sec-12">
      <title>Evaluation 2 - Page focus</title>
      <p>Inspired by the good results for country name assignment,
we were curious about the results for page focus assignment
limited to country names. While we achieve 100%
precision, our recall is limited to 90.9%. Hence, our simple
approach of counting the occurrences of country names already
leads to a very high recall; only in rare cases did the human
annotators choose countries with lower frequency as
the page focus of the annotated texts.</p>
    </sec>
    <sec id="sec-13">
      <title>RELATED WORK</title>
      <p>
        Ontologies have been used in geographic information
systems to merge data from different sources [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ]. However,
when looking at geographic information retrieval, the use
of ontologies is novel. The geographic information retrieval
community has addressed the disambiguation of geographic
names in several papers where tailored algorithms
incorporating heuristics to choose among candidates are described:
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] presents the Web-A-Where system, which uses a
custom gazetteer to identify geographic names which are likely
to have a non-geographic meaning. Geographic names are
identified with a gazetteer that is compiled from cities with
more than 50,000 inhabitants in the GNIS and GeoNet
databases and is therefore far less extensive than our data set,
which includes all cities and also other geographic features.
Simple heuristics which assign confidence values to
candidates are used. For example, a default heuristic is that a
higher population increases confidence values for candidates.
Confidence values are also increased if other text references
qualify administrative regions (such as countries or states).
If multiple place names occur in one text, the confidence of
candidate locations that share the same administrative
region is increased. The proposed page focus strategy selects
up to four place names that cover most of the place names in a
page.
      </p>
      <p>
        The Perseus project [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] disambiguates ambiguous
location names by a series of heuristics based on the qualifiers
in the vicinity (e.g. state name immediately following the
city name), nearby disambiguated place names and general
world knowledge.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] describes the NewsExplorer application, which
automatically builds up knowledge from newspaper articles in
13 languages and identifies places and other named entities.
They employ a disambiguation process that distinguishes
place names from known person names by consideration of
place importance and by estimation of main countries based
on the minimum kilometric distance to other places
mentioned in the text.
      </p>
      <p>
        The disambiguation method described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduces
several steps for disambiguating wiki-pages. It results in a
so called Disambiguation Pipeline, which consists of
disambiguation steps based on templates, categories, referents and
text heuristics. The templates in Wikipedia are initially used
for disambiguating documents, relating them to classes like
Country or Bibliography etc. The disambiguation by
category offers the possibility to denote associations between
documents. Category tags can identify the country or
continent of an article or indicate an article not referring to a
place, while the referent disambiguation offers the possibility
to express relations between named entities (e.g. between
places and parent places: describing a town, mentioning
county or country). Disambiguation with Text Heuristics
offers options to apply concrete rules between entities e.g.
describing an important related place. Only places of equal
or greater importance are used as referrers.
      </p>
      <p>
        MetaCarta [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a commercial system that uses NLP
patterns, capitalization conventions, place names found in the
vicinity, human population and other heuristics to disambiguate
place names. When web pages are queried using a place name, they
are scored by a function combining confidence values,
positions and prominence of the place name in the web pages.
      </p>
      <p>
        The disambiguation approach of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is rule-based and
makes use of both contextual information extracted from the
web pages and spatial distances between place names.
Several rules analyze contextual information to choose the
appropriate location candidate, e.g. if the administrative
region is found in the context. If no exact place name can
be assigned, the spatial distances to other recognized place names
in the text are used to rank those candidates higher that
have the smallest distance. Similar place name disambiguation
methods have been adopted in other research efforts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>We have captured many of the ideas for disambiguation
heuristics presented in the above-mentioned papers in our
disambiguation rules. Our use of ontologies generally
increases the flexibility and makes it easy to incorporate
further knowledge into the disambiguation process. Similarly,
it allows us to flexibly change disambiguation rules that are
defined on top of the ontology, e.g. to provide the recognition
of person names.</p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION</title>
    </sec>
    <sec id="sec-15">
      <title>Summary</title>
      <p>We have presented a novel approach for disambiguation
of geographical names based on ontologies, which allow us
to formulate ranking rules based on concepts and utilize the
lexicalization of the ontology as a gazetteer. The geonames
ontology constructed for our purpose of disambiguation is
used as a gazetteer to map textual references to instances
and classes in the ontology. Disambiguation rules in the form
of rankings are based on attaching weights to concepts and
propagating weights to the instances of concepts. Data from
the ontology can be utilized for the ranking as well (such as
the population of towns). Finally, the ontology
establishes unique references for geographic identifiers in the
form of URIs, which can be reused in other applications. This
provides a basis for data integration as well as natural
extension points of the ontology.</p>
    </sec>
    <sec id="sec-16">
      <title>Outlook</title>
      <p>Going forward, we see several next steps. First, we want
to evaluate our approach at large scale by disambiguating
geographic references in arbitrary RSS feeds and utilizing this
information to filter news articles by the regions referenced.
Technically, RSS feeds will be tagged with geographic identifiers.</p>
      <p>Second, as part of this application, we will incorporate
means for users to provide feedback on the correctness of the
disambiguation. This information will be used to improve
our algorithms using statistical machine learning techniques.
We will also use the ontology beyond its current gazetteer
function and display information items related to ontology
instances (such as country and state located in, neighboring
geographic features, facts such as population, altitude,
alternate spellings in different languages, etc.). We also plan
to utilize the extensibility of the ontology by incorporating
further information sources such as the DBpedia<sup>7</sup> project,
which has created an ontology for facts from Wikipedia. The
expressive power of the ontology that is currently used will be
extended by considering more than the currently used object
property relations between concept nodes for determination
of the correct meaning of the geographic identifiers.</p>
      <p>Footnote 7: http://dbpedia.org/</p>
      <p>Third, in the process of spotting the meaningful
candidate terms for geographic references, the sliding window
algorithm will be further improved. In addition to the
estimation of an accurate window size, we also consider the
use of further text patterns to improve the introduced
algorithm.</p>
      <p>Concerning the suggested ranking algorithm, we plan a
dynamic adjustment of the ontology weight values. In
a next step, we will iteratively refine the
initial measure by taking already processed
documents of the corpus into account. We also plan to improve our
page focus algorithm by considering spatial distances
between the identified geographic entities.</p>
      <p>Beyond this application, we will look for further
disambiguation objectives. Many cases of incorrect assignment
of geographic names are due to organization names such as
“Texas Instruments”. We will again try to tackle this
issue by extending the ontology, using it as a gazetteer and
defining appropriate heuristics. We assume this will also
allow us to improve the page focus task in many cases, as our
annotators have assigned page focus based on organization
information, e.g. “Bank of England” → “UK” or “OECD”
→ “Europe”.</p>
      <p>Acknowledgments. We thank our colleague Frank Kleiner
for his technical support and our students for the annotation
support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amitay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Har'el</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sivan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Soffer</surname>
          </string-name>
          .
          <article-title>Web-a-where: geotagging web content</article-title>
          .
          <source>In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>273</fpage>
          -
          <lpage>280</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Joho</surname>
          </string-name>
          .
          <article-title>Extraction of semantic annotations from textual web pages</article-title>
          .
          <source>Deliverable D15 6201, EU Project: SPIRIT IST-2001-35047</source>
          ,
          <month>April</month>
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>GEOnet</collab>
          .
          <source>GEOnet Names Server</source>
          . http://earth-info.nga.mil/gns/html/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Leidner</surname>
          </string-name>
          .
          <article-title>Towards a reference corpus for automatic toponym resolution evaluation</article-title>
          .
          <source>In Proceedings of the Workshop on Geographic Information Retrieval held at the 27th Annual International ACM SIGIR Conference (SIGIR 2004)</source>
          , Sheffield, UK,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Srihari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Niu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Location normalization for information extraction</article-title>
          .
          <source>In Proceedings of the 19th international conference on Computational linguistics</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          , Morristown, NJ, USA,
          <year>2002</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Manov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kiryakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          .
          <article-title>Experiments with geographic knowledge for information extraction</article-title>
          .
          <source>In Workshop on Analysis of Geographic References</source>
          , HLT/NAACL'03, Edmonton, Canada,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Overell</surname>
          </string-name>
          ,
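          <string-name>
            <given-names>J.</given-names>
            <surname>Magalhães</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rüger</surname>
          </string-name>
          .
          <article-title>Place disambiguation with co-occurrence models</article-title>
          . In
          <string-name>
            <given-names>A.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Vicedo</surname>
          </string-name>
          , editors,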
          <source>CLEF 2006 Workshop</source>
          , Working notes,
          <month>September</month>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pouliquen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kimler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Steinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ignat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oellinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Blackler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fuart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Widiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Forslund</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Best</surname>
          </string-name>
          .
          <article-title>Geocoding multilingual texts: Recognition, disambiguation and visualisation</article-title>
          .
          <source>In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006)</source>
          ,
          <month>May</month>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bukatin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Baker</surname>
          </string-name>
          .
          <article-title>A confidence-based framework for disambiguating geographic terms</article-title>
          .
          <source>In Proceedings of the HLT-NAACL 2003 workshop on Analysis of geographic references</source>
          , pages
          <fpage>50</fpage>
          -
          <lpage>54</lpage>
          , Morristown, NJ, USA,
          <year>2003</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Crane</surname>
          </string-name>
          .
          <article-title>Disambiguating geographic names in a historical digital library</article-title>
          .
          <source>In ECDL '01: Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries</source>
          , pages
          <fpage>127</fpage>
          -
          <lpage>136</lpage>
          , London, UK,
          <year>2001</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <article-title>Statistical Overview of WordNet from 1.6 to 2.0</article-title>
          .
          <source>In 2nd Global WordNet conference</source>
          , pages
          <fpage>352</fpage>
          -
          <lpage>357</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. H.-L.</given-names>
            <surname>Goh</surname>
          </string-name>
          .
          <article-title>On assigning place names to geography related web pages</article-title>
          .
          <source>In JCDL '05: Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries</source>
          , pages
          <fpage>354</fpage>
          -
          <lpage>362</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>