<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Geographical and Temporal References from Journey Narratives (demo)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ignatius Ezeani</string-name>
          <email>i.ezeani@lancaster.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Rayson</string-name>
          <email>p.rayson@lancaster.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Gregory</string-name>
          <email>i.gregory@lancaster.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of History, Lancaster University</institution>
          ,
          <addr-line>Lancaster, LA1 4YT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UCREL, School of Computing and Communications, InfoLab21, Lancaster University</institution>
          ,
          <addr-line>Lancaster, LA1 4WA</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Previous approaches to understanding geographies in textual sources tend to focus on geoparsing to automatically identify place names and allocate them to coordinates. Such methods are highly quantitative and are limited to named places for which coordinates can be found, and have little concept of time. Yet, as narratives of journeys make abundantly clear, human experiences of geography are often subjective and more suited to qualitative representation. In these cases, “geography” is not limited to named places; rather, it incorporates the vague, imprecise, and ambiguous, with references to, for example, “the camp”, or “the hills in the distance”, and includes the relative locations using terms such as “near to”, “on the left”, “north of” or “a few hours' journey from”. In this demo paper, we describe our research prototype to extract and analyse qualitative and quantitative references to place and time in two corpora of English Lake District travel writing and Holocaust survivor testimonies.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Imprecise locations</kwd>
        <kwd>English Lake District corpus</kwd>
        <kwd>Holocaust survivor corpus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Human experiences that were recorded and communicated as historical text are increasingly
available as digital corpora. A major challenge for researchers in the social sciences, humanities
and computer sciences is how to use these texts in interdisciplinary settings to develop cohesive
understandings of the historical experiences described. Understanding geographies in historical
sources has received a significant amount of research interest in recent years across fields
as diverse as geographical information science (GISc), corpus linguistics, natural language
processing (NLP), human geography, literary studies, and digital humanities. The current
state of the art involves using geoparsing to automatically identify the place names in texts
and allocate them to coordinates [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Once georeferenced in this way, place names can be
read into a geographical information system for mapping and spatial analysis. Analysis can
also be conducted using techniques from corpus linguistics and NLP to see what words or
themes are associated with a place name, for example whether the place is associated with
emotional responses such as being found beautiful or inspiring fear. This combination of
approaches is known as geographical text analysis (GTA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        (Republic of Ireland), 2-April-2023
∗Corresponding author.
https://www.lancaster.ac.uk/scc/about-us/people/ignatius-ezeani (I. Ezeani);
https://www.lancaster.ac.uk/scc/about-us/people/paul-rayson (P. Rayson);
https://www.lancaster.ac.uk/staff/gregoryi/ (I. Gregory)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>While GTA provides a useful starting point for understanding the geographies within a
corpus, it is highly quantitative, is limited to named places for which coordinates can be found,
and has little concept of time. Journey narratives describing human experiences are typically
subjective, and places are framed not just in terms of named places but also in imprecise
terms describing the setting or landscape (e.g. “the camp”, “the majestic mountains”) and in
relative terms (e.g. “a quick detour along the lake”, “turn left after the inn”). These non-mappable
qualitative representations can be important features of the narratives but cannot be managed
within geospatial technologies such as GIS. To understand the ways in which humans describe
and relate to the world around them, we need to be able to visually represent and interpret the
geographies authors and interviewees describe in ways that combine the qualitative nature of
described spatial experiences with methods that render them quantitatively analysable.</p>
      <p>In this demo paper, we will illustrate how we are extending current GTA techniques and
have applied them to analyses of two large corpora: one a corpus of travel writing about the
English Lake District, predominantly written in the 18th and 19th centuries; the other, a corpus
of Holocaust survivor testimonies. Although based on very different types of journeys, leisure
travel and forced migration respectively, both corpora represent a collection of unique voices
that coalesce to generate complex cultural and experiential geographies. The NLP research
described here is part of a larger project which also incorporates methods from corpus linguistics,
Qualitative Spatio-Temporal Reasoning (QSTR), GISc, and visual analytics which can help us
understand how authors themselves represented the geographies that surrounded them and
explore the individual and aggregate representation of the sense and experience of place that
these texts contain.</p>
      <p>The overall aim of our project is to develop techniques to learn more about the spatial and
temporal information contained in a piece of writing without any prior knowledge of the
geography of the places mentioned in the text. A starting point will be to automatically identify
references to toponyms (‘Penrith’, ‘Pooley Bridge’, ‘River Lowther’) or geographical feature
nouns (‘the town’, ‘a hill’, ‘the road’). We also want to extract interesting relationships between
places (‘Pooley Bridge is about six miles away from Penrith’) as well as some sense of the place
(‘The scenery around this lake is tame, but pleasing’). The key objective of the NLP methods
shown in this demo is to identify and extract all quantitative and qualitative references to place,
time, and their relationships, as shown in Figure 1.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset and Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Corpus of Lake District Writing</title>
        <p>
          The dataset used for this work is the Corpus of Lake District Writing (CLDW)1 which comprises
eighty texts and around 1.5 million words that describe the Lake District [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The earliest texts
are from the seventeenth century and run through to the early twentieth century. The corpus includes
works by well-known Lake Poets such as William Wordsworth and Samuel Taylor Coleridge,
as well as accounts of visits to the Lake District by prominent writers such as Daniel Defoe
and Celia Fiennes and other less well-known writers. There are also a number of tourist guides
stretching from Thomas West’s (1778) “A Guide to the Lakes” to Black’s (1900) “Shilling Guide
to the English Lakes” [5]. While drawn from a variety of styles and genres, the majority of the
corpus comprises tourist guides and travel narratives.
        </p>
        <p>
          1CLDW and the gold standard dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] are available here: https://github.com/UCREL/LakeDistrictCorpus
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Extraction Pipeline</title>
        <p>This work adopts an extraction pipeline, illustrated in Figure 2, that searches for spatial elements
(toponyms, geographical feature nouns, distances, and others) in the text by combining three
text processing techniques in a systematic manner. The first approach is a basic rule-based
method that uses hand-crafted regular expression (regex)2 rules and lists of spatial items of
interest to extract and tag them in text. A list of 4,227 Lake District place names from the
CLDW project was used to extract the toponyms, while a compilation of 138 geographical
feature nouns3 was used to extract the feature nouns. Spatial prepositions and locative adverbs
(‘aboard’, ‘between’, ‘beyond’, ‘down-town’, ‘northbound’, ‘overseas’, etc.) were also extracted and tagged.</p>
        <p>2A regular expression [6] is a sequence of characters that specifies a search pattern in the text. See https://en.wikipedia.org/wiki/Regular_expression</p>
        <p>3Inflections and lemmas (e.g. road and roads) extended the list to 262.</p>
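        <p>The rule-based step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual rules: the two short lists stand in for the real place-name and feature-noun lists, and the labels PLNAME and GEONOUN, the helper compile_list and the function tag_spatial are hypothetical names chosen for this example. Place names are matched case-sensitively, so the sketch also shows why inconsistently capitalised names are missed.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical miniature lists standing in for the real place-name and
# feature-noun lists (assumption: the real lists are far larger).
PLACE_NAMES = ["Pooley Bridge", "Penrith", "River Lowther"]
FEATURE_NOUNS = ["town", "towns", "hill", "hills", "road", "roads",
                 "lake", "lakes", "bridge", "bridges", "river", "rivers"]


def compile_list(terms, flags=0):
    """Build one alternation regex, longest term first, on word boundaries."""
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + body + r")\b", flags)


PLACE_RE = compile_list(PLACE_NAMES)                 # case-sensitive
FEATURE_RE = compile_list(FEATURE_NOUNS, re.IGNORECASE)


def tag_spatial(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    spans = [(m.start(), m.end(), "PLNAME", m.group())
             for m in PLACE_RE.finditer(text)]
    taken = [(start, end) for start, end, _, _ in spans]
    for m in FEATURE_RE.finditer(text):
        # Skip feature nouns that fall inside an already tagged place name,
        # e.g. "Bridge" in "Pooley Bridge".
        inside = any(start <= m.start() and m.end() <= end
                     for start, end in taken)
        if not inside:
            spans.append((m.start(), m.end(), "GEONOUN", m.group()))
    return sorted(spans)


sentence = "Pooley Bridge is about six miles from Penrith; the road follows the lake."
for span in tag_spatial(sentence):
    print(span)
```
]]></preformat>
        <p>Sorting the terms longest first and anchoring on word boundaries keeps multi-word names such as ‘Pooley Bridge’ from being split into a shorter match plus a leftover feature noun.</p>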
        <p>However, this rule-based method has limitations. It misses some known place names
because they were not on the list, were misspelt, or were inconsistently capitalised.
Furthermore, we also needed to extract temporal references, which was not possible with
the regex method. spaCy’s4 named entity recognition (NER) feature was applied to mitigate
these challenges. An NER tool identifies entities based on their context and does not require a
gazetteer. It is also able to capture temporal references as well as other entities in text. The
application of named entity recognition in spatial analysis of unstructured text is quite common.
Medad et al. [8] compared the performance of different named entity recognition models on the
task of identifying spatial nominal entities (e.g. village, hut, church) from manually annotated Wikipedia
articles. Syed et al. [9] demonstrated that NER can be successfully applied to
relative spatial information extraction (e.g. south of Paris, 80km from Rome). Cadorel et
al. [10] also achieved over 95.7% accuracy on spatial relation extraction using an NER-based
extraction model.</p>
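        <p>One way the NER output could complement the rule-based output is sketched below. The merging rule is an assumption made for illustration (rule-based spans take precedence and non-overlapping NER spans are added); the text above does not specify how the two outputs are combined. The NER spans are written by hand rather than produced by running a model, using the GPE and DATE labels found in spaCy's English models, and span_of and merge_spans are hypothetical helper names.</p>
        <preformat><![CDATA[
```python
def span_of(text, phrase, label):
    """Locate the first occurrence of phrase and return (start, end, label)."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def merge_spans(rule_spans, ner_spans):
    """Keep all rule-based spans and add NER spans that overlap none of them."""
    merged = list(rule_spans)
    for start, end, label in ner_spans:
        clash = any(start < r_end and r_start < end
                    for r_start, r_end, _ in rule_spans)
        if not clash:
            merged.append((start, end, label))
    return sorted(merged)


text = "We left Penrith on 4 June and reached Patterdale by noon."

# Gazetteer hit: assume only Penrith is on the place-name list.
rule_spans = [span_of(text, "Penrith", "PLNAME")]

# Hand-written stand-in for NER output (not produced by running a model).
ner_spans = [
    span_of(text, "Penrith", "GPE"),
    span_of(text, "4 June", "DATE"),
    span_of(text, "Patterdale", "GPE"),
]

merged = merge_spans(rule_spans, ner_spans)
for start, end, label in merged:
    print(label, text[start:end])
```
]]></preformat>
        <p>In this toy case the NER spans contribute a place name that is missing from the list and a temporal reference, which are the two gaps in the rule-based method noted above.</p>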
        <p>One way to get a sense of a place is to identify elements that indicate the sentiments or
emotions expressed, and the activities mentioned, in discussions about that place. This needed further
development beyond the rule-based and NER methods. Hence, UCREL’s5 USAS semantic tagger
[11] was included in the pipeline specifically to identify elements that are semantically tagged
E: ‘emotion’, M: ‘movement, location, travel and transport’ and T: ‘time’.</p>
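        <p>The selection of semantically tagged elements can be illustrated with a short filter over tagger output. This is a sketch under assumptions: the input is a hand-written list of (token, tags) pairs whose USAS tags were assigned by hand for illustration, not real tagger output, and the function name sense_of_place is hypothetical. Only the first (most likely) tag of each token is inspected.</p>
        <preformat><![CDATA[
```python
# Top-level USAS fields named in the text, keyed by their letter.
FIELDS = {
    "E": "emotion",
    "M": "movement, location, travel and transport",
    "T": "time",
}


def sense_of_place(tagged_tokens):
    """Keep tokens whose first USAS tag belongs to field E, M or T."""
    kept = []
    for token, tags in tagged_tokens:
        if not tags:
            continue
        first_tag = tags[0]
        field = first_tag[0]  # the leading letter is the top-level field
        if field in FIELDS:
            kept.append((token, first_tag, FIELDS[field]))
    return kept


# Hand-assigned illustrative tags (assumption, not actual tagger output).
tagged = [
    ("We", ["Z8"]),
    ("walked", ["M1"]),
    ("to", ["Z5"]),
    ("the", ["Z5"]),
    ("lake", ["W3"]),
    ("yesterday", ["T1.1.1"]),
    ("and", ["Z5"]),
    ("felt", ["X2.1"]),
    ("afraid", ["E5-"]),
]

for token, tag, field in sense_of_place(tagged):
    print(token, tag, field)
```
]]></preformat>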
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>The performance of the extraction tool was evaluated (currently only for toponyms) with the
gold standard subset of the CLDW, which contains 28 texts that were carefully selected to be
representative of the corpus. It contains 242,000 word tokens, i.e. about one-sixth of the entire
corpus. The place names include a variety of different regional, national and
international locations, landmarks and geographical formations, marked up with a customised
tag &lt;cdplace&gt;. The evaluation results on the gold standard data show that our place name
extraction method has a precision of 100% with an overall average recall of 93.95%,
giving an F1 score of 96.88%.</p>
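      <p>The reported scores are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The sketch below shows the calculation; the function names and the toy span sets are assumptions for illustration, and exact span matching is assumed as the matching criterion, which the text above does not specify.</p>
      <preformat><![CDATA[
```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


# Reported figures: precision 100%, recall 93.95%.
print(round(100 * f1_from(1.0, 0.9395), 2))  # 96.88

# Toy example with hypothetical (start, end) character offsets.
gold = {(0, 13), (38, 45), (60, 70), (80, 90)}
predicted = {(0, 13), (38, 45), (60, 70)}
print(precision_recall_f1(gold, predicted))
```
]]></preformat>
      <p>With perfect precision, every extracted place name is correct, so the F1 score is lowered only by the gold standard place names that the pipeline misses.</p>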
      <p>4spaCy is a free and open-source Python library for general NLP [7].</p>
      <p>5UCREL is the University Centre for Computer Corpus Research on Language at Lancaster University. See https://ucrel.lancs.ac.uk/usas/ for the description of the top-level tags. Here we used PyMUSAS, an open-source Python
implementation of the semantic tagger for English and other languages: https://pypi.org/project/pymusas/</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this demo paper, we have illustrated our prototype NLP pipeline for extracting imprecise
geographical and temporal references from journey narratives, focussing on travel writing
about the English Lake District. We have made our demo available open source as a series of
Python Notebooks on our project’s GitHub repository.6</p>
      <p>In future work, we will extend the gold standard corpus to include geographical feature
nouns, spatial prepositions, and other non-mappable terms, in order to evaluate our pipeline’s
ability to identify them. We will also evaluate our pipeline on the Holocaust dataset, which we
expect to be more challenging as it consists of interview transcripts and is therefore potentially
less amenable to existing methods.</p>
      <p>Language understanding in general also faces the challenge of establishing the relationship
between the vertices of the semantic triangle: the language (symbol), the object (or referent) in
the real world that it describes, and the human thought (or reference) [12]. For spatial
language representation, Stock et al. [13] proposed an extension of the semantic triangle to
a semantic pyramid with the digital face and vertices for knowledge and structured language
representation. Our future work will also explore and compare different approaches to spatial
semantic modelling in the annotation of our datasets.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their comments on our paper submission. The project
is funded in the UK from 2022 to 2025 by ESRC, project reference: ES/W003473/1. We also
acknowledge the input and advice from the other members of the project team in generating
requirements for our research presented here. More details of the project can be found on the
website: https://spacetimenarratives.github.io/</p>
      <p>6https://github.com/SpaceTimeNarratives</p>
      <p>[5] I. Gregory, C. Donaldson, A. Hardie, P. Rayson, Modeling space in historical texts, in: The Shape of Data in the Digital Humanities, Routledge, 2018, pp. 133–149.</p>
      <p>[6] J. Goyvaerts, Regular Expressions Tutorial, https://web.archive.org/web/20161101212501/http://www.regular-expressions.info/tutorial.html, 2016. Accessed: 2022-10-10.</p>
      <p>[7] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017.</p>
      <p>[8] A. Medad, M. Gaio, L. Moncla, S. Mustière, Y. Le Nir, Comparing supervised learning algorithms for spatial nominal entity recognition, AGILE: GIScience Series 1 (2020) 15. URL: https://agile-giss.copernicus.org/articles/1/15/2020/. doi:10.5194/agile-giss-1-15-2020.</p>
      <p>[9] M. A. Syed, E. Arsevska, M. Roche, M. Teisseire, Geoxtag: Relative spatial information extraction and tagging of unstructured text, AGILE: GIScience Series 3 (2022) 16. URL: https://agile-giss.copernicus.org/articles/3/16/2022/. doi:10.5194/agile-giss-3-16-2022.</p>
      <p>[10] L. Cadorel, A. Blanchi, A. G. Tettamanzi, Geospatial knowledge in housing advertisements: Capturing and extracting spatial information from text, in: Proceedings of the 11th on Knowledge Capture Conference, 2021, pp. 41–48.</p>
      <p>[11] P. Rayson, D. Archer, S. Piao, T. McEnery, The UCREL Semantic Analysis System, in: Proceedings of the workshop on Beyond Named Entity Recognition Semantic labelling for NLP tasks in association with 4th International Conference on Language Resources and Evaluation (LREC 2004), 2004, pp. 7–12.</p>
      <p>[12] C. Ogden, I. Richards, The Meaning of Meaning: A Study of the Influence of Language upon Thought and of the Science of Symbolism. Magdalene College, University of Cambridge, 1923.</p>
      <p>[13] K. Stock, C. B. Jones, T. Tenbrink, Speaking of location: a review of spatial language research, Spatial Cognition &amp; Computation 22 (2022) 185–224. doi:10.1080/13875868.2022.2095275.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Byrne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ball</surname>
          </string-name>
          ,
          <article-title>Use of the Edinburgh geoparser for georeferencing digitized historical collections</article-title>
          ,
          <source>Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences</source>
          <volume>368</volume>
          (
          <year>2010</year>
          )
          <fpage>3875</fpage>
          -
          <lpage>3889</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gregory</surname>
          </string-name>
          ,
          <article-title>Geographical text analysis: A new approach to understanding nineteenth-century mortality</article-title>
          ,
          <source>Health &amp; Place</source>
          <volume>36</volume>
          (
          <year>2015</year>
          )
          <fpage>25</fpage>
          -
          <lpage>34</lpage>
          . doi:10.1016/j.healthplace.2015.08.010.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rayson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reinhold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Butler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Donaldson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gregory</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>A deeply annotated testbed for geographical text analysis: The corpus of lake district writing</article-title>
          ,
          <source>in: GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities, Association for Computing Machinery (ACM)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          . doi:10.1145/3149858.3149865.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. N.</given-names>
            <surname>Gregory</surname>
          </string-name>
          ,
          <article-title>Deep Mapping the Literary Lake District: A Geographical Text Analysis</article-title>
          , Rutgers University Press,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>