                                Towards a Scalable Geoparsing Approach for the Web
                                Sheikh Mastura Farzana1,† , Tobias Hecking1,†
                                1
                                    German Aerospace Center (DLR), Institute for Software Technology, Linder Höhe, 51147 Cologne, Germany


                                                                         Abstract
                                                                         The ongoing surge in web data generation and storage, coupled with embedded geographic information,
                                                                         holds immense potential for enhancing search applications across diverse domains. However, extracting
                                                                         geographic information for further enhancement of web search remains inadequately explored. This paper
                                                                         addresses a critical gap in the realm of geographic information extraction from web data, emphasizing
                                                                         the absence of unified pipelines for processing such information. In response to this void, we present a
                                                                         pipeline specifically tailored for web data. Furthermore, our contribution extends beyond the development
                                                                         of the pipeline itself to include a comparative analysis of various gazetteer-based geotagging methods in
                                                                         terms of accuracy and scalability, along with a sizable corpus of location-annotated web documents.

                                                                         Keywords
                                                                         Geoparsing, Geographic Information Extraction, Geotagging, Geocoding




                                1. Introduction
                                Accurate geographic information extraction from web resources is a building block of a geo-
                                enriched search index, which enables a wide range of location aware search applications. One
                                way of achieving this is to apply geoparsing on web data. Geoparsing is a standard way of
                                extracting locations from text (geotagging) and mapping them to their respective coordinates
                                (geocoding) [1].
                                   However, apart from the challenges of the geoparsing process itself [2], scaling up geoparsers
                                to the web comes with additional issues regarding scalability and efficient preprocessing of
                                websites [3].
                                   Currently there is a lack of large scale location annotated web data to test the scalability of
                                available geoparsers. Existing public datasets are heavily domain specific and contain particular
                                content types (e.g., social media data), making them unsuitable for evaluating web geoparsers.
                                End-to-end geoparsing pipelines that can efficiently extract geo-coordinates from websites in a
                                single pass are still under development.
                                   This paper outlines a pipeline that can streamline the processing of web data for geographic infor-
                                mation extraction. Additionally, it presents a large corpus of location-annotated web content.


                                GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24,
                                2024, Glasgow, Scotland
                                ∗
                                    Corresponding author.
                                †
                                    These authors contributed equally.
                                Envelope-Open Sheikh.Farzana@dlr.de (S. M. Farzana); Tobias.Hecking@dlr.de (T. Hecking)
                                Orcid 0000-0003-4242-2458 (S. M. Farzana); 0000-0003-0833-7989 (T. Hecking)
                                                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




Furthermore, the paper delves into the advantages and limitations of various gazetteer-based
geoparsers, providing insights for future advancements in the field.


2. Related Work
There are many off-the-shelf geoparsers available with different capabilities. However, these
parsers were not intended for web-scale geoparsing and come with individual limitations. The
Edinburgh Geoparser, designed for standalone use, poses challenges for seamless integration
with diverse data processing frameworks, and its computational efficiency per document remains
unclear 1 . CLAVIN, an open-source geotagging and geoparsing tool, excels in context-based
resolution but demands specialized expertise in Java, which makes it difficult to integrate into
an existing web data processing pipeline 2 . Geoparsepy, utilizing OpenStreetMap, requires a
dedicated PostgreSQL database and returns coordinate polygons instead of individual
coordinates, which is helpful when projecting locations on a map but less so when precise
point coordinates are expected 3 . The recently proposed spatial clustering-based voting approach that
combines different parsing approaches shows remarkable results for location disambiguation;
however, its runtime is far too high for web-scale geoparsing [4].
   There are different types of geoparsers that can be explored to find appropriate solutions.
Hu et al. describe several types of geoparsers in their paper [1]. Our previous experiments have
shown that learning-based geoparsers are significantly slower than simpler parsers [3]. Moreover,
since crawling and indexing web data require complex frameworks of their own, a geoparsing
component must be capable of seamlessly integrating into these frameworks, which is difficult
to achieve with existing geoparsers as they were not intended for such use cases.
   Our studies have shown that, among the many approaches to geoparsing, learning-based
and hybrid solutions perform better, but gazetteer-based solutions are the easiest to integrate
into existing pipelines, have acceptable performance, can be scaled, and extend to multiple
languages. Therefore, in this paper we have implemented several possible gazetteer-based
geographic information extractors, all of which can be integrated into existing web data
processing frameworks. Their performance has been analyzed with different metrics on a large
location-annotated web data corpus created by us as well as on a pre-existing location-annotated
corpus.


3. Geographic information extraction pipeline
Web content analysis serves as a valuable tool for extracting various types of information,
ranging from content category and language to various kinds of meta-information. The extraction of
these components relies on different modules, many of which are pre-existing. As a result,
any module for extracting geographic information must be capable of merging with the rest of
the analysis pipeline without added complexity. Figure 1 shows the different parts of our proposed web
content analysis pipeline.
1 https://www.ltg.ed.ac.uk/software/geoparser/
2 https://github.com/bigconnect/clavin
3 https://github.com/stuartemiddleton/geoparsepy
Figure 1: Web data processing pipeline.


    • WARC Repository: Web crawlers4 are used to crawl the web and the crawled data is
      stored as WARC (Web ARChive) files, a specialized format for archiving web content5 .
      Crawling can be seen as a process that accesses webpages and downloads their content.
      Therefore, WARC files contain metadata such as the URL, timestamp, and content type,
      along with the actual content.
    • WARC Preprocessing: Each WARC file can contain information on several webpages.
      These files have to be processed in order to retrieve the individual data of each web
      page, mainly the plain text and various forms of metadata including microformats, language,
      HTML headers etc. Robust libraries, such as Resiliparse,6 can be employed for efficient
      processing.
    • Geoparsing Component: This is the central element of the web data processing pipeline
      in our case; the data extracted in the previous step are fed into specialized modules for
      location information extraction. This component comprises two key processes:
          – Location Extraction from Microformat: Specific locations and relevant infor-
            mation related to a web article are often embedded within HTML code using mi-
            croformats. This special formatting, supported by conventions within HTML, is
            easily extractable using existing libraries such as extruct7 . We focus on popular
            microformats like microdata 8 and json-ld 9 to extract locations without the need
            for additional disambiguation, as the locations extracted here are added by the
            webpage publishers and definitely link to the content of the webpage.

4 https://stormcrawler.net/
5 https://iipc.github.io/warc-specifications/
6 https://resiliparse.chatnoir.eu/en/stable/
7 https://pypi.org/project/extruct/
8 https://developer.mozilla.org/en-US/docs/Web/HTML/Microdata
9 https://schema.org/docs/schemas.html
          – Geoparsing Plain Text: The second approach involves first running the extracted plain
            text through a geotagging module, in our case a gazetteer-based model. From
            the geotagging step we extract possible place names from the text and associate them
            with both locations and geographic coordinates using a gazetteer (Section 4.1). Section 4.3
            describes at length the geoparsing models we have employed in this paper.
     • Metadata Table: Extracted geographic information, including locations and coordinates,
       as well as other information relevant to the particular web resource (not relevant for this
       paper), is incorporated into a large table. One possible way of storing such large tables
       is the Apache Parquet file format10 . These files play a crucial role in enriching web
       indexes and enhancing web search results. A minimal sketch of this processing flow is
       shown after this list.
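
The following minimal sketch (Python) illustrates how these steps can be chained for a single
WARC file: reading response records, extracting plain text with Resiliparse and JSON-LD
microformats with extruct, and writing the results to a Parquet metadata table. File names,
column names, and the use of FastWARC and pandas are illustrative assumptions, not the exact
pipeline code.

    # Sketch: WARC -> plain text + microformats -> Parquet metadata table.
    import json
    import extruct
    import pandas as pd
    from fastwarc.warc import ArchiveIterator, WarcRecordType
    from resiliparse.parse.encoding import detect_encoding, bytes_to_str
    from resiliparse.extract.html2text import extract_plain_text

    rows = []
    with open('crawl.warc.gz', 'rb') as warc:
        for record in ArchiveIterator(warc, record_types=WarcRecordType.response):
            raw = record.reader.read()                       # HTTP payload bytes
            html = bytes_to_str(raw, detect_encoding(raw))   # robust decoding
            rows.append({
                'url': record.headers.get('WARC-Target-URI'),
                'plain_text': extract_plain_text(html),      # input for geotagging
                'json_ld': json.dumps(extruct.extract(html, syntaxes=['json-ld'])['json-ld']),
            })

    # Persist the per-page metadata for later enrichment of the web index.
    pd.DataFrame(rows).to_parquet('web_metadata.parquet')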


4. Geoparsing Module
In order to correctly identify locations, inference on each word in a text is important, followed
by precise disambiguation of words that can denote either a location or something else; e.g., Paris
can be a person's name or a location in France. This step is followed by mapping each extracted
location to its geographic coordinates.
   For the sake of simplicity and to reduce the processing time of huge web corpora, in this
work we do not apply any location disambiguation. Instead, each identified location is
annotated with all candidate coordinates along with country codes. Disambiguation can then be performed
on demand in a post-processing step. However, since place name disambiguation and one-to-one
mapping of places to specific coordinates is an essential step of geoparsing, we will henceforth
refer to our proposed geoparsers as geographic information extractors instead.
   For testing and refining the extraction approaches, we have created a dataset of annotated
web data. The subsequent sections provide detailed explanations of the location gazetteer,
the process of location extraction from microformats, various gazetteer-based geoparsers, and
performance evaluation.

4.1. Location Gazetteer:
A pre-existing database of locations is considered a location gazetteer11,12 . We have analyzed
GeoNames and OpenStreetMap (OSM) data, both publicly available. However, the OSM database
structure is unsuitable for our task as it returns a polygon per area instead of a single coordinate
per location. On the other hand, GeoNames is comprehensive, easy to transform into different
formats, and contains the necessary coordinate information. Hence, we have chosen GeoNames
data for crafting our location gazetteer; we use the ‘allCountries.txt’ file provided for free public
use by GeoNames.
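
As an illustration, a simple in-memory gazetteer can be built from the tab-separated
‘allCountries.txt’ dump as sketched below; the column positions follow the published GeoNames
format, and indexing only the primary name (not the alternate names) is a simplification of this
sketch:

    # Sketch: build a name -> [(lat, lon, country_code), ...] lookup from GeoNames.
    from collections import defaultdict

    def load_gazetteer(path='allCountries.txt'):
        gazetteer = defaultdict(list)
        with open(path, encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('\t')
                name, lat, lon, country = cols[1], cols[4], cols[5], cols[8]
                # keep all candidate coordinates; disambiguation is deferred (Section 4)
                gazetteer[name.lower()].append((float(lat), float(lon), country))
        return gazetteer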




10 https://parquet.apache.org/
11 https://www.geonames.org/
12 https://www.openstreetmap.org/
4.2. Location Extraction from Microformats
Microformats are formatting conventions intricately integrated into the HTML code of web pages,
commonly applied for structuring data such as contact information, events, and geographic
details, enhancing the accessibility and understanding of web content. In this paper we have only
included the JSON-LD microformat; we plan to integrate the microdata format in the future. JSON-LD is
embedded as a dictionary which can contain many nested dictionaries. During the extraction process,
we look for entities that have the key ‘address’. If one is found, we process it to gather all the locations
included inside the tag and search for a match in the GeoNames gazetteer (Section 4.1).
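
A minimal sketch of this lookup is given below; it assumes the nested dictionaries returned by
extruct and treats every value stored under an ‘address’ key as a location candidate:

    # Sketch: recursively collect all values stored under an 'address' key
    # in the (possibly nested) JSON-LD dictionaries extracted by extruct.
    def find_addresses(node):
        found = []
        if isinstance(node, dict):
            if 'address' in node:
                found.append(node['address'])
            for value in node.values():
                found.extend(find_addresses(value))
        elif isinstance(node, list):
            for item in node:
                found.extend(find_addresses(item))
        return found

    # Each collected address is then matched against the GeoNames gazetteer (Section 4.1).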

4.3. Gazetteer based Place Name Extraction Models
The gazetteer matching mechanism entails cross-referencing location information in text data
with an established location database, in our case the gazetteer described in Section 4.1. While this
method is generally successful in precisely identifying location names, its efficacy highly depends
on the quality of the gazetteer and the parsing approaches used. Based on the limitations of
existing gazetteer-based geoparsers (e.g., Geoparsepy), we chose instead to implement different
types of place name extractors, namely string matching, GateNLP 13 , a grammar-based approach,
and Named Entity Recognition (NER).

4.3.1. String matching
The easiest solution for finding a location in a document is to match each word in that
document against a location gazetteer. Due to the simplicity of the approach, it can be incredibly
fast. In our implementation, we first match pairs of consecutive words against the gazetteer to be
able to identify locations such as San Francisco; if found, we remove them from the main text
and proceed to look for matches of single-word locations. It should be mentioned that
we also implemented this solution with a look-up of four consecutive words (e.g., San José del Cabo)
followed by three consecutive words (e.g., Andorra La Vella) and then word pairs and single-word
cities. However, this solution decreases the recall and precision values while increasing
time consumption.
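
The sketch below illustrates this n-gram look-up (four- down to single-word candidates) against
the gazetteer from Section 4.1; the whitespace tokenization and the way matched spans are removed
are simplified assumptions:

    # Sketch: match 4-, 3-, 2- and 1-word candidates against the gazetteer,
    # removing matched spans before the shorter look-ups.
    def string_match(text, gazetteer):
        tokens = text.split()
        matches = []
        for n in (4, 3, 2, 1):
            i = 0
            while i <= len(tokens) - n:
                candidate = ' '.join(tokens[i:i + n]).lower()
                if candidate in gazetteer:
                    matches.append((candidate, gazetteer[candidate]))
                    del tokens[i:i + n]   # remove the matched span from the text
                else:
                    i += 1
        return matches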

4.3.2. GateNLP
GateNLP is an open-source framework for natural language processing and text mining. We
take advantage of this library to extract locations from plain text. GateNLP accepts a specially
formatted gazetteer and has a look-up function that can match exact parts of strings against the
provided gazetteer. In our case the gazetteer is a modified version of the location gazetteer from Section 4.1.
GateNLP is able to retrieve locations that comprise multiple words. Looking up every location
from the gazetteer in the text without this library would be very time-consuming and redundant.




13 https://gatenlp.github.io/python-gatenlp/
4.3.3. Grammar based approach
We use language-specific grammar rules to identify locations in a text. Often, proper nouns in a
text are locations. We use the NLTK14 library to construct a tagged tree from all words of a text
and retrieve the ones tagged as proper nouns (NNP). It is possible to add grammars for different
languages for similar extractions. The chunk grammar we have used for English is:
                                      NP: {<NNP>+}
The extracted proper nouns are then matched with the gazetteer to identify locations, corre-
sponding coordinates and country codes.
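
A minimal sketch of this chunking step with NLTK is shown below; the grammar is the one above,
and the tokenizer and tagger resources (‘punkt’, ‘averaged_perceptron_tagger’) are standard NLTK
downloads assumed by this illustration:

    # Sketch: extract NNP chunks with NLTK's RegexpParser using the grammar NP: {<NNP>+}.
    import nltk

    def extract_proper_nouns(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(text))          # POS-tag all words
        tree = nltk.RegexpParser('NP: {<NNP>+}').parse(tagged)   # chunk consecutive NNPs
        return [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == 'NP')]

    # Each extracted proper noun is then looked up in the gazetteer (Section 4.1).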

4.3.4. NER
Named entity recognition is a popular method for identifying the entity types in a text. We use
SpaCy15 to identify location entities (LOC, GPE) and match them against the gazetteer. A multilingual
SpaCy model (xx-ent-wiki-sm) is used to be able to process documents in several languages.
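
The following sketch shows how such an extractor could look with spaCy; the entity labels kept
(LOC, GPE) follow the description above, and matching against the gazetteer from Section 4.1 is
simplified to exact lower-cased names:

    # Sketch: spaCy-based location candidate extraction with the multilingual model.
    import spacy

    nlp = spacy.load('xx_ent_wiki_sm')  # multilingual NER model

    def extract_locations(text, gazetteer):
        doc = nlp(text)
        candidates = {ent.text.lower() for ent in doc.ents if ent.label_ in ('LOC', 'GPE')}
        # keep only candidates that exist in the gazetteer, with their coordinate entries
        return {name: gazetteer[name] for name in candidates if name in gazetteer}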


5. Evaluation
5.1. Dataset
In order to create a large location-annotated corpus we required web data. Instead of crawling
random web pages, we applied a reverse parsing technique. We first created a set of queries
using the ‘cities500’ table of the GeoNames database. We further filtered the data by removing all
locations outside Germany in order to keep the number of queries at a manageable
level. For each location of the table we formulate a query in the format city, state, country
(e.g., Bonn, North-Rhine-Westphalia, Germany), resulting in approximately 11,500 queries. The
assumption here is that the inclusion of state and country information along with city names
will provide a certain level of disambiguation to the search engine, for example, to distinguish
between Frankfurt, Hesse, Germany and Frankfurt, Brandenburg, Germany.
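
A hedged sketch of this query construction is shown below; it assumes that the state names are
resolved by joining the admin1 codes in ‘cities500’ with GeoNames’ ‘admin1CodesASCII.txt’ file,
which is not spelled out above and is an assumption of this illustration:

    # Sketch: build "city, state, country" queries for all German entries of cities500.
    def build_queries(cities_path='cities500.txt', admin1_path='admin1CodesASCII.txt'):
        admin1 = {}
        with open(admin1_path, encoding='utf-8') as f:
            for line in f:
                code, name, _ascii, _gid = line.rstrip('\n').split('\t')
                admin1[code] = name                   # e.g. 'DE.07' -> the state name
        queries = []
        with open(cities_path, encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('\t')
                name, country, admin1_code = cols[1], cols[8], cols[10]
                if country == 'DE':                   # keep only locations in Germany
                    state = admin1.get(country + '.' + admin1_code, '')
                    queries.append(name + ', ' + state + ', Germany')
        return queries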
   Next, we executed the queries using the ChatNoir search engine [5] on the Common-
Crawl201716 and ClueWeb202217 data indices, downloading at most the top 10 search re-
sults. From each resulting URI we then downloaded the plain text using the Resiliparse library and
the corresponding microformat data using the Extruct library.
   Afterwards, we annotated each plain text only with the queried location it is associated with.
This annotation process resulted in a test data corpus comprising 21,834 web documents and a
total of 60,070 annotated locations; the average text length is 26,482 characters.
   Furthermore, we have used the ‘LGL corpus’ as referenced by Gritta et al. in their paper
[6]. The dataset was created in 2010 by Lieberman et al. [7]. This dataset comprises 557 news
articles, each meticulously annotated with location information, including geo-coordinates,

14 https://www.nltk.org/
15 https://spacy.io/
16 https://commoncrawl.org/search?query=2017
17 https://lemurproject.org/clueweb22.php/
by experts. Evaluating the performance of the geoparsers on this dataset alongside our larger
corpus allows us to assess the consistency of our results across both datasets.
  The LGL corpus does not include microformat information; consequently, we did not attempt
microformat location extraction during the analysis on this dataset.
  It should be mentioned that both datasets contain only English text.
  Considering the substantial size of our web corpus, we sampled 10,000 instances to efficiently
analyze time consumption. Our 10k sample contained a total of 4,140 JSON-LD microformats, and 1,580
instances of the web corpus returned locations extracted from microformats. For cases where
no location is found from microformats, we proceed to apply the different geoparsing approaches,
the configurations of which are explained in Section 4.3, covering a spectrum from naive to
sophisticated implementations.

5.2. Evaluation Metrics
In our case, the most important evaluation metrics are recall and time consumption.
Recall shows the percentage of locations that are successfully retrieved compared to the actual
annotations. Since our location gazetteer covers all countries while the Web Corpora 10k
exclusively features German city annotations, precision and F1-score, although important
indicators of performance, would yield very inconsistent and incomparable results; hence we
refrain from applying these metrics to the Web Corpora 10k dataset.
   However, all four metrics, precision, recall, F1-score, and runtime, are applied to the LGL corpus
for analyzing performance. We have computed precision, recall, and F1-score (denoted as 𝑃, 𝑅
and 𝐹) over the total number of annotated locations across all documents against the total number of
retrieved locations across all documents. Runtime, measured in seconds, is the total amount
of time required to process all texts of each dataset; it excludes the time needed to
load datasets or models, which is a one-time requirement for the entire process. Our tasks
were executed on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz machine.
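
For clarity, the corpus-level computation of the scores can be sketched as follows; totals are
accumulated over all documents before 𝑃, 𝑅 and 𝐹 are derived, and representing annotations and
retrieved locations as sets per document is a simplification of this sketch:

    # Sketch: corpus-level precision, recall and F1 over document-wise location sets.
    def corpus_metrics(docs):
        # docs: iterable of (annotated_locations, retrieved_locations) per document
        tp = fp = fn = 0
        for annotated, retrieved in docs:
            annotated, retrieved = set(annotated), set(retrieved)
            tp += len(annotated & retrieved)
            fp += len(retrieved - annotated)
            fn += len(annotated - retrieved)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f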

5.3. Results

Table 1
Performance Metrics for Different Geoparsing Approaches
                                             LGL Corpora                          Web Corpora 10k
                                 P_total  R_total  F_total  Runtime_total (s)   R_total  Runtime_total (s)
              StringMatching      0.07     0.57     0.12          1.47           0.84        231.32
              StringMatching_sr   0.09     0.57     0.15         16.03           0.84       3218.34
              GateNLP             0.07     0.77     0.12          5.33           0.98       1349.28
              GateNLP_sr          0.08     0.77     0.14         18.81           0.96       4083.00
              GrammarRules        0.27     0.62     0.38          8.06           0.84       2651.78
              NER                 0.57     0.40     0.47         20.01           0.77       1624.31
              NER_sr              0.57     0.56     0.56          7.93           0.67       4239.55
   Table 1 shows interesting results for each approach. Here, sr denotes instances where
stop words were removed. This was an attempt on our part to reduce the amount of text to be
analyzed, hoping it would result in reduced time consumption. However, as we can determine
from the table, the inclusion of a stop word removal function increases the time consumption
substantially but fails to improve any of the metrics significantly.
   Our fastest model for both datasets is, as expected, string matching; its runtime requirement is
far lower than that of any of the other models. Regardless, this naive approach of extracting locations
from text fails to achieve acceptable precision values, resulting in low F1-scores as well (LGL
Corpora). We can observe similar performance from GateNLP: it has incredibly high recall
compared to precision in retrieving locations. The precision score is low because the
parser finds matches in the gazetteer for many regular words such as work, home, road etc.
   Noticing this behavior, we delved into analyzing whether our location gazetteer is up to the mark.
What we found is that there are many locations across the world whose names are also regular words
such as work, tuesday, move, home, road and many more. We searched for such odd
locations on a publicly available map and were surprised to see that these locations exist, bringing
us to the conclusion that the location gazetteer is quite extensive. At the same time, very
naive approaches to extracting geographic locations that do not use any other information, such
as sentence structure or context, are bound to underperform. This also reflects the
necessity of inference and disambiguation techniques when extracting place names from plain
text.
   For the Web Corpora 10k, the runtime and recall values of the different approaches show values
comparable to the LGL dataset, indicating consistent behavior. We see that GateNLP has
a recall of 0.98; however, looking at the precision and F1-score of this parser on the LGL dataset,
we can conclude that GateNLP extracts a high number of non-location words as locations.
   We had expected the grammar rule based approach to be faster than it is in reality. This
approach also suffers from a lack of additional context information on the extracted proper nouns,
as many proper nouns can be place names as well as other things. The NER approach has much more
acceptable recall values (LGL Corpora) because SpaCy models are trained to
distinguish between different types of entities, providing automatic disambiguation between
place names and other entities. Since the NER approach shows acceptable recall values together
with a precision that is not very low, it is overall the most promising
of the four models we have tried based on the metrics.


6. Discussion
The importance of geographic information extraction for information linking is undeniable, and
the challenge of geoparsing web content is far from resolved. In this paper we have provided
valuable insights into the scalability and capabilities of gazetteer-based parsing solutions for web
content, with one solution having acceptable performance. However, our current approach has
limitations in terms of place name disambiguation and one-to-one mapping of places to specific
coordinates. Regardless, more exploratory work is needed in this area in order to achieve a
well-performing end-to-end web-scale geoparser that has similar accuracy to existing geoparsers
for small texts as well as high robustness when integrated into web data processing pipelines.
Acknowledgment
This work has received funding from the European Union’s Horizon Europe research and
innovation programme under grant agreement No 101070014 (OpenWebSearch.EU,
https://doi.org/10.3030/101070014).


References
[1] X. Hu, Z. Zhou, Y. Sun, J. Kersten, F. Klan, H. Fan, M. Wiegmann, Gazpne2: A general place
    name extractor for microblogs fusing gazetteers and pretrained transformer models, IEEE
    Internet of Things Journal 9 (2022) 16259–16271.
[2] M. Gritta, M. T. Pilehvar, N. Collier, Which Melbourne? Augmenting Geocoding with
    Maps, in: Proceedings of the 56th Annual Meeting of the Association for Computational
    Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne,
    Australia, 2018, pp. 1285–1296. URL: http://aclweb.org/anthology/P18-1119. doi:10.18653/v1/P18-1119.
[3] S. M. Farzana, T. Hecking, Geoparsing at web-scale-challenges and opportunities (2023).
[4] X. Hu, Y. Sun, J. Kersten, Z. Zhou, F. Klan, H. Fan, How can voting mechanisms improve
    the robustness and generalizability of toponym disambiguation?, International Journal of
    Applied Earth Observation and Geoinformation 117 (2023) 103191.
[5] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Elastic ChatNoir: Search Engine for the
    ClueWeb and the Common Crawl, in: L. Azzopardi, A. Hanbury, G. Pasi, B. Piwowarski
    (Eds.), Advances in Information Retrieval. 40th European Conference on IR Research (ECIR
    2018), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2018.
[6] M. Gritta, M. T. Pilehvar, N. Collier, A pragmatic guide to geoparsing evaluation, arXiv
    preprint arXiv:1810.12368 (2018).
[7] M. D. Lieberman, H. Samet, J. Sankaranarayanan, Geotagging with local lexicons to build
    indexes for textually-specified spatial data, in: 2010 IEEE 26th international conference on
    data engineering (ICDE 2010), IEEE, 2010, pp. 201–212.