<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Perdido: Python Library for Geoparsing and Geocoding French Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ludovic Moncla</string-name>
          <email>ludovic.moncla@insa-lyon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Gaio</string-name>
          <email>mauro.gaio@univ-pau.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Dublin, Ireland</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ Lyon</institution>
          ,
          <addr-line>INSA Lyon, CNRS, UCBL, LIRIS, UMR 5205, F-69621</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Pau et des Pays de l'Adour</institution>
          ,
          <addr-line>LMAP, UMR 5142, Pau</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper introduces the Perdido Python library for geoparsing and geocoding French texts. We outline the architecture of the Perdido Geoparser, which comprises three layers: a back-office, an API, and a Python library. We also detail the methods used in the processing chain and the tasks it covers, such as named entity recognition and classification (NERC) and toponym resolution. Lastly, we showcase the different features of the Python library and explain how to use it. The library is built as an overlay on the API services, enabling users to manipulate, visualize, and export the results of geoparsing and geocoding. A Jupyter notebook is also provided to demonstrate all the functionalities implemented in the library.</p>
      </abstract>
      <kwd-group>
        <kwd>Geoparsing</kwd>
        <kwd>geocoding</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>toponym disambiguation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>https://ludovicmoncla.github.io (L. Moncla)</p>
      <p>
        Geocoding methods in the literature are divided into two categories: those that rely on
external resources such as knowledge bases and gazetteers, and those that rely on trained
models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The former generally yield more accurate results, as the coordinates retrieved from
a gazetteer typically correspond to a real location. However, they also require a disambiguation
step. The latter, on the other hand, require a large amount of labeled data but do not need to
query gazetteers or deal with ambiguities. Ambiguities such as metonymy, homonymy,
and name changes over time can also arise in geocoding [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The architecture presented in this article was developed and enriched over several
projects, such as itinerary reconstruction from hike descriptions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the mapping of Paris street
names cited in a corpus of 19th-century novels [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and the retrieval and classification of named
entities in encyclopedic articles [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The architecture</title>
      <p>The Perdido Geoparser is implemented in three layers: a back-office hosted on a server, a
REST API that exposes the back-office functionalities as web services, and a Python
library that offers an extra layer to query these services and to manipulate, visualize, and export the
results.</p>
      <sec id="sec-2-1">
        <title>2.1. Back-office</title>
        <p>
          The back-office implements a processing chain for geoparsing: pre-processing (tokenization,
lemmatization, morpho-syntactic annotation), named entity recognition and classification, and toponym
resolution. The pre-processing steps are performed using TreeTagger<sup>1</sup>. Named entity
recognition and spatial information annotation rely on a dual cascade of transducers that use lexical
resources and pattern descriptions (local context-free grammars, morpho-syntactic patterns, etc.).
The transducers are implemented within the Unitex<sup>2</sup> platform and act by insertion to tag named
entities and spatial information in the text. The processing chain produces two output files:
an XML-TEI<sup>3</sup> file [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and a GeoJSON file. Figure 1 shows an excerpt of the markup used
to annotate the named entity la rivière d’Arques. The GeoJSON file contains only the geospatial
aspects of each named entity, such as its spatial footprint, associated with its name or its nature.
        </p>
        <p>1. https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/</p>
        <p>2. https://unitexgramlab.org</p>
        <p>3. https://tei-c.org</p>
      </sec>
      <sec id="sec-2-api">
        <title>2.2. API</title>
        <p>
          A web service has been developed for each subtask of the processing chain, so that the subtasks can be
executed independently or combined through service composition [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Users are thus free to call
all or only some of the services. In addition to these services, we have also
developed two stand-alone services for geoparsing and geocoding [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The API is deployed
using the FastAPI framework<sup>4</sup> and the Uvicorn ASGI server<sup>5</sup>.
        </p>
        <p>4. https://fastapi.tiangolo.com</p>
        <p>5. https://www.uvicorn.org</p>
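        <p>The idea of per-subtask services that can run alone or be chained by composition can be pictured with a minimal sketch. Everything below is a hypothetical stand-in written for illustration: the three functions, the document dictionary, and the one-entry toy gazetteer are assumptions and do not reproduce the actual services, which rely on TreeTagger, Unitex cascades, and gazetteer queries.</p>
        <preformat>
```python
from functools import reduce
from typing import Callable, Dict

# Hypothetical stand-ins for the per-subtask web services.
Document = Dict[str, object]
Service = Callable[[Document], Document]

TOY_GAZETTEER = {"Lyon": (4.8357, 45.7640)}  # name -> (lon, lat), toy data


def tokenize(doc: Document) -> Document:
    """Pre-processing stand-in: split the text on whitespace."""
    tokens = [t.strip(".,;") for t in str(doc["text"]).split()]
    return {**doc, "tokens": tokens}


def tag_places(doc: Document) -> Document:
    """Recognition stand-in: keep tokens found in the toy gazetteer."""
    places = [t for t in doc["tokens"] if t in TOY_GAZETTEER]
    return {**doc, "places": places}


def resolve(doc: Document) -> Document:
    """Toponym resolution stand-in: attach coordinates to each place."""
    coords = {p: TOY_GAZETTEER[p] for p in doc["places"]}
    return {**doc, "coordinates": coords}


def compose(*services: Service) -> Service:
    """Chain services so that each consumes the previous one's output."""
    return lambda doc: reduce(lambda acc, svc: svc(acc), services, doc)


# Run only part of the chain, or the whole chain.
tagging_only = compose(tokenize, tag_places)
full_chain = compose(tokenize, tag_places, resolve)

partial = tagging_only({"text": "Je visite Lyon."})
complete = full_chain({"text": "Je visite Lyon."})
```
        </preformat>
        <p>Because each step takes and returns the same document structure, a caller can stop after tagging or continue to resolution, which mirrors the freedom to use all or only some of the services.</p>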
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Python library</title>
        <p>The Perdido<sup>6</sup> Python library is available as open source on GitHub and can be
installed through the pip package manager<sup>7</sup>. It is therefore easy to integrate
into a Python environment and requires little code to use.</p>
        <p>The library provides three main classes: <monospace>Geoparser</monospace> and <monospace>Geocoder</monospace>, which call the
corresponding web services of the API, and <monospace>Perdido</monospace>, which makes it possible to manipulate, visualize, and
export the results. Other classes are also available, such as the <monospace>PerdidoCollection</monospace> class, which
extends the role of the <monospace>Perdido</monospace> class to a set of documents processed by Perdido, and the <monospace>Token</monospace>,
<monospace>Entity</monospace>, and <monospace>Toponym</monospace> classes, which provide various attributes and methods for retrieving and
viewing the objects manipulated by the <monospace>Perdido</monospace> class.</p>
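        <p>A minimal usage sketch of the two service classes might look as follows. The module paths <monospace>perdido.geoparser</monospace> and <monospace>perdido.geocoder</monospace> are assumptions about the package layout, and the two helper functions are hypothetical wrappers introduced only for this example. The imports are placed inside the functions so that the definitions load even when the package is not installed; calling them requires the package and network access to the API.</p>
        <preformat>
```python
def geoparse_text(text):
    """Send `text` to the geoparsing service and return a Perdido object.

    Needs the perdido package (pip install perdido) and network access.
    """
    # Assumed module path; adjust if the package layout differs.
    from perdido.geoparser import Geoparser

    geoparser = Geoparser()  # default settings
    return geoparser.parse(text)


def geocode_places(place_names):
    """Geocode one place name, or a list of them, into a Perdido object."""
    # Assumed module path; adjust if the package layout differs.
    from perdido.geocoder import Geocoder

    geocoder = Geocoder()  # default settings
    return geocoder.geocode(place_names)
```
        </preformat>
        <p>For instance, the object returned by <monospace>geoparse_text("Je visite la ville de Lyon.")</monospace> would expose the XML-TEI output through its <monospace>tei</monospace> attribute.</p>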
        <p>
          The constructor of the <monospace>Geoparser</monospace> class takes several optional parameters, covering
both the geotagging and geocoding stages (the geocoding parameters are the same as those of the
constructor of the <monospace>Geocoder</monospace> class). For geotagging, the <monospace>version</monospace> parameter
selects which of the two existing versions of the annotation cascades is used:
Standard (default) or Encyclopedie. The Standard version was developed for
geotagging texts with a strong spatial dimension, such as descriptions of routes or
hikes [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. As its name indicates, the Encyclopedie version was adapted specifically for
the processing of encyclopedic articles. It annotates certain linguistic constructions
specific to encyclopedic discourse and thus improves the recognition and classification
of named entities compared to the Standard version [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For geocoding, several
parameters can be specified to filter the results and limit ambiguities when querying
gazetteers, for example the maximum number of locations returned for
each toponym (<monospace>max_rows</monospace>), a country code (<monospace>country_code</monospace>), or a bounding box (<monospace>bbox</monospace>).
        </p>
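        <p>To make the effect of the three geocoding filters concrete, the sketch below applies them to a list of gazetteer candidates held in memory. It is an illustration only: the <monospace>Candidate</monospace> record, the <monospace>filter_candidates</monospace> function, and the bounding-box order (west, south, east, north) are assumptions, and the actual services presumably apply such filters when querying the gazetteers rather than afterwards.</p>
        <preformat>
```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    """One gazetteer result for a toponym (hypothetical record layout)."""
    name: str
    lat: float
    lon: float
    country_code: str


# Assumed bounding-box order: (west, south, east, north), in degrees.
BBox = Tuple[float, float, float, float]


def filter_candidates(
    candidates: List[Candidate],
    max_rows: Optional[int] = None,
    country_code: Optional[str] = None,
    bbox: Optional[BBox] = None,
) -> List[Candidate]:
    """Keep candidates matching the country and bounding box, then truncate."""
    kept = candidates
    if country_code is not None:
        wanted = country_code.lower()
        kept = [c for c in kept if c.country_code.lower() == wanted]
    if bbox is not None:
        west, south, east, north = bbox
        kept = [
            c for c in kept
            if east >= c.lon >= west and north >= c.lat >= south
        ]
    if max_rows is not None:
        kept = kept[:max_rows]
    return kept


# Toy candidates for the ambiguous toponym "Paris" (approximate coordinates).
paris = [
    Candidate("Paris", 48.8566, 2.3522, "FR"),
    Candidate("Paris", 33.6609, -95.5555, "US"),
]
in_france = filter_candidates(paris, country_code="fr")
in_bbox = filter_candidates(paris, bbox=(-5.0, 41.0, 10.0, 51.5))
top_one = filter_candidates(paris, max_rows=1)
```
        </preformat>
        <p>Each filter removes the candidate in Texas for a different reason, which shows how any one of them can already resolve a simple ambiguity.</p>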
        <p>
          The methods <monospace>parse()</monospace> and <monospace>geocode()</monospace> of the <monospace>Geoparser</monospace> and <monospace>Geocoder</monospace> classes, respectively, call
the geoparsing and geocoding web services of the API and return a <monospace>Perdido</monospace> object. These are
the methods that are executed when an instance of <monospace>Geoparser</monospace> or <monospace>Geocoder</monospace> is called
as a function. The <monospace>parse()</monospace> method takes as its parameter the text to geoparse, and
the <monospace>geocode()</monospace> method takes a place name (or a list of place names) to geocode.
For disambiguation, the <monospace>cluster_disambiguation()</monospace> method of the <monospace>Perdido</monospace> class implements
spatial density-based clustering (DBSCAN) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which removes a large number
of ambiguities when the places mentioned in the text are close to one another (an epsilon parameter sets the
maximum distance for two points to be grouped within the same cluster).
        </p>
        <p>6. https://github.com/ludovicmoncla/perdido</p>
        <p>7. https://pypi.org/project/perdido/</p>
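        <p>The principle behind density-based disambiguation can be sketched in plain Python. This is not the library's implementation: the selection rule (keep, for each toponym, the candidate that falls in the most populated cluster), the <monospace>min_samples</monospace> parameter, the use of kilometres for epsilon, and all function names are assumptions made for the example. Coordinates are approximate.</p>
        <preformat>
```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def dbscan(points: List[Point], eps_km: float, min_samples: int) -> List[int]:
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i: int) -> List[int]:
        return [j for j, q in enumerate(points)
                if eps_km >= haversine_km(points[i], q)]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if min_samples > len(seeds):
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_samples:
                queue.extend(nb)  # j is a core point: keep expanding
    return labels


def disambiguate(candidates: Dict[str, List[Point]],
                 eps_km: float = 50.0, min_samples: int = 2) -> Dict[str, Point]:
    """Pick, per toponym, the candidate lying in the most populated cluster."""
    owners = [(name, p) for name, pts in candidates.items() for p in pts]
    labels = dbscan([p for _, p in owners], eps_km, min_samples)
    sizes = Counter(label for label in labels if label != -1)
    chosen: Dict[str, Point] = {}
    best: Dict[str, int] = {}
    for (name, point), label in zip(owners, labels):
        score = sizes.get(label, 0)  # noise points score 0
        if name not in chosen or score > best[name]:
            chosen[name], best[name] = point, score
    return chosen


# "Vienne" is ambiguous: Vienna (Austria) or Vienne (Isere, France).
toponyms = {
    "Lyon": [(45.7640, 4.8357)],
    "Vienne": [(48.2082, 16.3738), (45.5255, 4.8748)],
    "Villeurbanne": [(45.7719, 4.8902)],
}
resolved = disambiguate(toponyms)
```
        </preformat>
        <p>With an epsilon of 50 km, the French candidate for Vienne clusters with Lyon and Villeurbanne while the Austrian one is left as noise, so the former is retained. A smaller epsilon makes clusters tighter and leaves more candidates unclustered.</p>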
        <sec id="sec-2-2-1">
          <title>2.3.1. Output formats, visualization and export of results</title>
          <p>The <monospace>Perdido</monospace> class provides different attributes and methods to access the output formats and
offers several ways of visualizing the geoparsed results. For example, the <monospace>tei</monospace> attribute
gives direct access to the XML-TEI output returned by the geoparsing web service (see Figure 1).
The <monospace>tsv_format()</monospace> method of the <monospace>Token</monospace> class returns tokens in TSV format according
to the IOB (short for inside, outside, beginning) annotation scheme<sup>8</sup>. The TSV format
stores one token per line with, for each token, its index, form, lemma, part of speech, and
semantic category or categories. For display purposes, the <monospace>Perdido</monospace> class provides the <monospace>to_spacy_doc()</monospace> method,
which transforms a <monospace>Perdido</monospace> object into a spaCy <monospace>Doc</monospace><sup>9</sup> object, so that
the displaCy<sup>10</sup> library can be used for NER visualization. Two modes are available: the first displays
only named entities (i.e., proper names) (Fig. 2a), and the second displays nested named entities
(Fig. 2b). Perdido also provides the <monospace>get_folium_map()</monospace> method for visualizing results on a map
(Fig. 3).</p>
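          <p>The one-token-per-line TSV layout with IOB tags can be illustrated with a short sketch. The column order follows the description above, but the function name, the single tag column, the entity label, and the part-of-speech values are assumptions chosen for the example, not the library's actual output.</p>
          <preformat>
```python
from typing import List, Tuple

# (form, lemma, part of speech) per token; toy values, not TreeTagger output.
Token = Tuple[str, str, str]


def to_iob_tsv(tokens: List[Token], entities: List[Tuple[int, int, str]]) -> str:
    """Render tokens as TSV rows: index, form, lemma, POS, IOB tag.

    `entities` holds (start, end, label) token spans, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    rows = [
        "\t".join([str(i), form, lemma, pos, tag])
        for i, ((form, lemma, pos), tag) in enumerate(zip(tokens, tags))
    ]
    return "\n".join(rows)


tokens = [
    ("la", "le", "DET"),
    ("rivière", "rivière", "NOM"),
    ("d'", "de", "PRP"),
    ("Arques", "Arques", "NAM"),
]
# One flat entity covering the whole expression; the label is illustrative.
print(to_iob_tsv(tokens, [(0, 4, "LOC")]))
```
          </preformat>
          <p>A nested annotation, as in the second display mode, would need one tag column per nesting level; the sketch keeps a single level for brevity.</p>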
          <p>Figure 2: NER visualization modes: (a) named entities; (b) nested named entities.</p>
          <p>Finally, Perdido offers several methods to export the results of geoparsing: the
<monospace>to_xml()</monospace> method, which saves the content of the <monospace>tei</monospace> attribute to an XML file; the
<monospace>to_geojson()</monospace> method, which saves the content of the <monospace>geojson</monospace> attribute to a JSON file; and the
<monospace>to_iob()</monospace> method, which saves the named entity annotations in TSV format according
to the IOB annotation scheme. Each of these methods takes as its parameter the path where the
file should be saved.</p>
          <p>8. IOB/BIO is a common format for tagging tokens in a chunking task in computational linguistics. A token is
annotated B-&lt;tag&gt; if it begins a chunk and I-&lt;tag&gt; if it is inside a chunk. The O tag
indicates that a token belongs to no entity or chunk.</p>
          <p>9. https://spacy.io/api/doc</p>
          <p>10. https://spacy.io/universe/project/displacy</p>
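          <p>As an illustration of what a GeoJSON export of resolved toponyms can contain, the sketch below builds a FeatureCollection with the standard library and writes it to a file. The function names and the single <monospace>name</monospace> property are assumptions; the schema actually produced by the geoparsing service may carry other properties.</p>
          <preformat>
```python
import json
import os
import tempfile
from typing import Dict, List, Tuple


def toponyms_to_geojson(toponyms: List[Tuple[str, float, float]]) -> Dict:
    """Build a GeoJSON FeatureCollection from (name, lat, lon) triples."""
    features = [
        {
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        }
        for name, lat, lon in toponyms
    ]
    return {"type": "FeatureCollection", "features": features}


def save_geojson(collection: Dict, path: str) -> None:
    """Write the collection to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(collection, handle, ensure_ascii=False, indent=2)


collection = toponyms_to_geojson([("Lyon", 45.7640, 4.8357)])
path = os.path.join(tempfile.mkdtemp(), "toponyms.geojson")
save_geojson(collection, path)
```
          </preformat>
          <p>Note the longitude-first coordinate order, which is a frequent source of errors when geocoding results are stored as latitude and longitude pairs.</p>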
        </sec>
        <sec id="sec-2-2-2">
          <title>2.3.2. Datasets</title>
          <p>Two datasets are currently available in the library. The first contains 3,385 encyclopedic articles
(corresponding to volume 7 of Diderot and d’Alembert’s Encyclopedia (1751-1772)), provided
by ARTFL<sup>11</sup> within the framework of the GEODE<sup>12</sup> project. The second contains 30
hike descriptions collected within the framework of the ANR CHOUCAS<sup>13</sup> project, where each
description is associated with its GPS track.</p>
          <p>11. https://artfl-project.uchicago.edu</p>
          <p>12. https://geode-project.github.io</p>
          <p>13. http://choucas.ign.fr</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Perspectives</title>
      <p>This article describes the overall architecture of the Perdido geoparsing tool and the recent
development of its Python library. The library offers two main functions: geoparsing and
geocoding of French texts. However, it is still a work in progress, and several improvements are
planned. One proposed improvement is the implementation of a trained model for automatic
annotation of nominal entities (or unnamed entities) upstream of the existing annotation cascade.
Another improvement being considered is the use of machine learning to train models that
will be integrated with the current approach to make it more versatile for analyzing diverse
texts. Besides technical improvements, we also plan to conduct an evaluation campaign using
benchmarks and our own corpora to compare our approach with baselines.</p>
      <p>Additionally, several other options are being explored for the geocoding step, such as using
centroids, distances, or interpreting the spatial context extracted from the text to improve
toponym disambiguation.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon
for its financial support within the French program “Investments for the Future” operated by
the National Research Agency (ANR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gritta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Pilehvar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limsopatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <article-title>What's missing in geographical parsing?</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>52</volume>
          (
          <year>2018</year>
          )
          <fpage>603</fpage>
          -
          <lpage>623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Leidner</surname>
          </string-name>
          ,
          <article-title>Toponym resolution in text: Annotation, evaluation and applications of spatial grounding</article-title>
          ,
          <source>SIGIR Forum</source>
          <volume>41</volume>
          (
          <year>2007</year>
          )
          <fpage>124</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fize</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>Deep learning for toponym resolution: Geocoding based on pairs of toponyms</article-title>
          ,
          <source>ISPRS International Journal of Geo-Information</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>818</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          ,
          <article-title>Approaches to disambiguating toponyms</article-title>
          ,
          <source>SIGSPATIAL Special</source>
          <volume>3</volume>
          (
          <year>2011</year>
          )
          <fpage>16</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaio</surname>
          </string-name>
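          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <article-title>Geoparsing and geocoding places in a dynamic space context</article-title>
          ,
          <source>The Semantics of Dynamic Space in French: Descriptive, experimental and formal studies on motion expression</source>
          <volume>66</volume>
          (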
          <year>2019</year>
          )
          <fpage>354</fpage>
          -
          <lpage>386</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joliveau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Le Lay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Boeglin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-O.</given-names>
            <surname>Mazagol</surname>
          </string-name>
          ,
          <article-title>Mapping urban fingerprints of odonyms automatically extracted from French novels</article-title>
          ,
          <source>International Journal of Geographical Information Science</source>
          <volume>33</volume>
          (
          <year>2019</year>
          )
          <fpage>2477</fpage>
          -
          <lpage>2497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vigier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brenon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joliveau</surname>
          </string-name>
          ,
          <article-title>Classification des entités nommées dans l'encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)</article-title>
          , in:
          <source>7ème Congrès Mondial de Linguistique Française</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaio</surname>
          </string-name>
          ,
          <article-title>A multi-layer markup language for geospatial semantic annotations</article-title>
          ,
          in:
          <source>Proceedings of the 9th Workshop on Geographic Information Retrieval</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Halilali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gouardères</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Devin</surname>
          </string-name>
          ,
          <article-title>Geospatial web services discovery through semantic annotation of WPS</article-title>
          ,
          <source>ISPRS International Journal of Geo-Information</source>
          <volume>11</volume>
          (
          <year>2022</year>
          )
          <fpage>254</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaio</surname>
          </string-name>
          ,
          <article-title>Services web pour l'annotation sémantique d'information spatiale à partir de corpus textuels</article-title>
          ,
          <source>Revue Internationale de Géomatique</source>
          <volume>28</volume>
          (
          <year>2018</year>
          )
          <fpage>439</fpage>
          -
          <lpage>459</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Renteria-Agualimpia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nogueras-Iso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaio</surname>
          </string-name>
          ,
          <article-title>Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus</article-title>
          ,
          in:
          <source>Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems</source>
          , Dallas, TX,
          <year>2014</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>