<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Station to Station: Linking and Enriching Historical British Railway Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariona Coll Ardanuy</string-name>
          <email>mcollardanuy@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kaspar Beelen</string-name>
          <email>kbeelen@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jon Lawrence</string-name>
          <email>j.lawrence3@exeter.ac.uk</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katherine McDonough</string-name>
          <email>kmcdonough@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Nanni</string-name>
          <email>fnanni@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joshua Rhodes</string-name>
          <email>jrhodes@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgia Tolfo</string-name>
          <email>giorgia.tolfo@bl.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel C.S. Wilson</string-name>
          <email>dwilson@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary University of London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Alan Turing Institute</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The British Library</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The University of Exeter</institution>
          ,
          <addr-line>Exeter</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <fpage>249</fpage>
      <lpage>265</lpage>
      <abstract>
        <p>The transformative impact of the railway on nineteenth-century British society has been widely recognized, but understanding that process at scale remains challenging because the Victorian rail network was both vast and in a state of constant flux. Michael Quick's reference work Railway Passenger Stations in Great Britain: a Chronology offers a uniquely rich and detailed account of Britain's changing railway infrastructure. Its listing of over 12,000 stations allows us to reconstruct the coming of rail at both micro- and macro-scales; however, being published originally as a book, this resource was not well suited for systematic linking to other geographical data. This paper shows how such a minimally-structured historical directory can be transformed into an openly available structured and linked dataset, named StopsGB (Structured Timeline of Passenger Stations in Great Britain), which will be of widespread interest across the historical, digital library and semantic web communities. To achieve this, we use traditional parsing techniques to convert the original document into a structured dataset of railway stations, with attributes containing information such as operating companies and opening and closing dates. We then identify a set of potential Wikidata candidates for each station using DeezyMatch, a deep neural approach to fuzzy string matching, and use a supervised classification approach to determine the best matching entity.</p>
      </abstract>
      <kwd-group>
        <kwd>entity linking</kwd>
        <kwd>digital humanities</kwd>
        <kwd>open science</kwd>
        <kwd>toponym resolution</kwd>
        <kwd>railway stations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The transformative impact of the railway on nineteenth-century British society has been widely
recognized, but understanding that process at scale remains challenging because the Victorian
rail network was both vast and in a state of constant flux. Several machine-readable resources
exist that include information on the British railway system. However, those that are openly
available lack both coverage and historical specificity. In contrast, Michael Quick’s
reference work Railway Passenger Stations in Great Britain: a Chronology1 offers a uniquely
rich and detailed account of Britain’s changing railway station infrastructure. It includes
over 12,000 stations with information such as their opening and closing dates and operating
companies.</p>
      <p>
        Quick’s Chronology has been an important resource for railway enthusiasts and historians.
However, being published originally as a book (with detailed station information in the form of
free text), this resource was not well suited for systematic linking to other geographical data.
In this paper, we turn the text of the Chronology into a structured dataset, linked to Wikidata
and georeferenced. In this process, we distinguish two main steps. First, we use traditional
parsing techniques to convert the minimally structured Word document into a structured
dataset. Then, we link each of the identified stations to the corresponding referent entry in
Wikidata or, if missing, to the most suitable nearby entry. To achieve this, we use DeezyMatch2
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a deep neural approach to fuzzy string matching, to identify the set of potential Wikidata
candidates for each station, and use a supervised classification approach to determine the best
matching entity. While the data processing step is dataset-specific, the linking process is largely
generalizable to other structured datasets with metadata fields containing place information
in plain text.
      </p>
      <p>Charting the growth of Britain’s rail network in relation to other geographically rich data
sources will allow us to reconstruct the coming of rail at both micro- and macro-scales, and
understand the railway in fuller context than has been previously possible. We are making the
resulting linked dataset openly available for download, thereby opening new possibilities for
data-driven research on the history of the railway network and its profound impact on society
at large.3</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Linked Open Data, the Semantic Web, and Digital Humanities</title>
        <p>
          Applications of linked open data and semantic web technologies to cultural heritage have
grown substantially and the last decade has seen the appearance of many projects dedicated to
creating and publishing linked historical data sets.4 The fruits of this labour have been intensely
explored by digital humanities (DH) scholars—for whom new types of access have created
novel ways of studying culture and history—but also by libraries, museums and archives. For
research at the interface of humanities and data science, the advantages of applying semantic
technologies are manifold: the interconnected nature of the data lends itself well to qualitative
exploration (facilitating serendipity and storytelling with data5), but also, for quantitative
approaches, it is possible to leverage linked data for more refined modeling of historical and
cultural phenomena [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Candidate Selection and Resolution on Historical Sources</title>
        <p>
          Successfully linking entities in cultural heritage data to a given knowledge base (KB) depends
on many prior decisions. The choice of KB has the most evident impact on the linking
performance: if knowledge contained in the chosen resource is incomplete or faulty, then this is likely
to be reflected in the linking process. The openly available GeoNames geographical database 6
is one of the largest and most commonly-used resources for linking geographical entities [
          <xref ref-type="bibr" rid="ref15 ref28">15,
28</xref>
          ]. GeoNames integrates geographical data from many different sources and its records are
complemented with volunteered information, resulting in a resource that contains over 11
million unique locations with a total of over 25 million associated geographical names. Resources
based on Wikipedia and other Wikimedia projects have steadily become the most popular for
generic entity linking approaches [
          <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
          ], partly due to the fact that they contain encyclopedic
knowledge formulated in natural language. Among these, Wikidata, as the central storage
for the structured data of Wikimedia projects, has in recent years emerged as an exceedingly
valuable resource for linking data across sources from different domains [
          <xref ref-type="bibr" rid="ref25 ref9">25, 9</xref>
          ].
        </p>
        <p>
          While candidate selection and ranking (the task of identifying and ranking potential
matching entities from a knowledge base) has traditionally received little attention in the
research community, it has been shown to have a significant impact on the downstream task of
entity linking (see [5] for an overview). Established entity linking systems such as DBpedia
Spotlight [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and TagMe! [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] employ very basic candidate selection strategies, which perform
sufficiently well on contemporary sources in English, but fail to address the many challenges
of working with historical documents (such as diachronic and spelling variations, OCR errors,
etc. [
          <xref ref-type="bibr" rid="ref18 ref22 ref24">18, 22, 24</xref>
          ]). Recent research in DH [
          <xref ref-type="bibr" rid="ref14 ref26">26, 14</xref>
          ] has focused on developing deep learning
approaches. In particular, Hosseini et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] recently introduced DeezyMatch, a Python
open-source library for fuzzy string matching and candidate ranking, that we have employed
in our work.
        </p>
        <p>
          After having identified a set of potential entity candidates based on a string mention, multiple
strategies have been presented to resolve the mention to the correct KB entry [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], such as
deriving relatedness and relevance measures between co-occurring entities from the networked
structure of the knowledge base (starting from [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]) or modeling the similarity of textual
content, when this is available in the KB (see for instance how Wikipedia content could be
used for the task [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]). Given the specificity of our setting, where we have entity mentions
with minimal textual content describing them, we in part follow recent studies in the field [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]
by relying on Transformers-based pre-trained models such as BERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to derive a measure
of text similarity between the mention and the candidate’s description in Wikidata, and we
combine this with more geographically-motivated strategies for entity resolution [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. British Railway Station Data</title>
        <p>Several resources exist that contain information about historical or modern stations in
England, Wales, Scotland, Northern Ireland, and sometimes also Ireland. However, of those that
are openly available, none compares to the rich detail (in terms of additional descriptors) or
extensive coverage for England, Wales, and Scotland found in the Chronology.</p>
        <p>6https://www.geonames.org [last accessed 14 September 2021].</p>
        <p>
          Martí-Henneberg et al. [
          <xref ref-type="bibr" rid="ref12 ref13 ref16">17, 12, 13</xref>
          ] released snapshots of railway station data for 1851, 1861,
and 1881 as part of their research with the Cambridge Group for the History of Population
and Social Structure (CAMPOP). These three datasets, henceforth referred to collectively as
Campop, are based on the content of a historical atlas that maps railway tracks and stations
active between 1807 and 1998 on 1-inch Ordnance Survey maps [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The snapshots available from
the UK Data Service are exports from a time-dynamic GIS of stations and tracks. Each record
contains a unique object ID and point data for each station, but no other attributes such as
names, opening or closing dates, or operators.
        </p>
        <p>
          Another key resource is a subset of the GB1900 gazetteer created through a crowdsourcing
project to transcribe labels on the second edition of the 6-inch-to-one-mile Ordnance Survey
maps for England, Wales, and Scotland [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which we henceforth refer to as GB1900.7 By
filtering for ‘station’-type labels, we created a useful dataset for
comparison with the Chronology entries. Labels represent stations on map sheets that were
printed between 1888 and 1913. Because GB1900 labels were geolocated using a point at the
bottom left-hand corner of the first character of the label text, the recorded point is often not
the actual station location. GB1900 does not provide the name of the station, as labels were
often only ‘Sta.’ or ‘Station’.
        </p>
        <p>Wikidata contains records for both modern and historical railway stations. Station entries
are geolocated, often situated within spatial hierarchies (city, county, state), and time-framed.
They may include details like the ‘operator’ (railway company), and often provide links
to domain-specific knowledge bases (such as the UK Railway Station code from National Rail).
Other interesting properties indicate where a station is located in relation to other stations on
the line, opening and closing dates, connecting lines, number of tracks, and additional external
identifiers.8 However, overall coverage of rail-specific information in Wikidata is sparse.</p>
        <p>Although other richly documented resources exist online, few of these are amenable to
computational research: the ‘Disused Stations’ website was created to ‘build up a comprehensive
database’ of closed British railway stations (currently 2230 passenger stations and 14 goods
stations);9 ‘RailScot’, ‘RAIL MAP online: Historic railways, railroads and canals’, and the
‘Register of Closed Railways’ (since 1901) do not currently have mechanisms for sharing their
underlying data.10</p>
      </sec>
      <sec id="sec-2-3b">
        <title>3. The Railway Passenger Stations dataset</title>
      </sec>
      <sec id="sec-2-4">
        <title>3.1. The source material</title>
        <p>The Railway and Canal Historical Society (R&amp;CHS) Railway Passenger Stations in Great
Britain: A Chronology was first published privately by Michael Quick in 1996 as a by-product of
his work mapping Britain’s historical railway network. Now in its fifth edition, much expanded
and online only, the Chronology has benefited greatly from the input of local and railway
historians over the past quarter-of-a-century. The Quick et al. Chronology is a directory of
every known passenger railway station in England, Scotland and Wales, past and present.</p>
        <p>7GB1900 is available from https://data.nls.uk/data/map-spatial-data/gb1900/ [last accessed 14 September
2021].</p>
        <p>8For example ‘Stevenage railway station’, https://www.wikidata.org/wiki/Q19970.
9See http://www.disused-stations.org.uk/ [last accessed 14 September 2021].</p>
        <p>10https://www.railscot.co.uk/ and https://www.railmaponline.com/ and https://www.
registerofclosedrailways.co.uk/ [last accessed 14 September 2021].</p>
        <p>Importantly, it seeks to understand the railway system ‘from the point of view of the traveller
in times past’, rather than ‘from the companies’ standpoint’, and therefore includes informal
stops used by landowners, workmen, sports enthusiasts and holiday-makers, as well as stations
identified in the railway companies’ public timetables (Chronology, 6).</p>
        <p>The Chronology began as a document listing the opening dates of British railway stations.
The content expanded significantly and now includes a range of details, such as the principal
service providers, type of station (passenger, goods, worker, private, etc.), disambiguation cues
to help locate the station if more than one station with the same name exists (e.g. ‘Ashton,
near Bristol’), opening and (where applicable) closing dates, station name at opening and
any changes, any additional notes about the station, and a shorthand reference to finding the
station on an OS map. Source information for the above is provided with meticulous detail
and is derived mainly from contemporary, primary sources including company timetables,
company reports and local newspapers, and supplemented with information from secondary
works deemed authoritative.</p>
        <p>The Chronology therefore offers a uniquely rich insight into the ebb and flow of the British
rail system from its inception to the present day. The Society has established a Railway
Chronology Group co-ordinated by Ted Cheers to collate revisions to the Chronology, which is
available to download as a pdf from its website, but is maintained as an MS Word document.
This latter version was kindly shared with us as part of our data sharing agreement with
the Society, and was used to construct a structured dataset for linking. The Word document
maintains a (mostly) regular structure from station to station, which made it a good candidate
for parsing and transforming into (explicitly) structured data.</p>
      </sec>
      <sec id="sec-2-5">
        <title>3.2. Processing</title>
        <p>Railway stations share certain formatting features in the MS Word document: they always
appear at the beginning of a new paragraph, in bold and upper case, and have the same font
size. When more than one station exists in a town, the Chronology groups them together
under a heading of that town name, underlined and of a larger font size than that of the
comprised stations. For example, the first reference to ‘Aberavon’ in Figure 1 is not a station,
but rather a kind of generic or phantom header name which sometimes lists attributes that
all stations in that place share (in this example, the operating company and a map reference).
The entries listed beneath place headings are railway stations, often with names abbreviated
to their initials when they match the place name. For example, the place Aberavon has the
following stations: A Sea Side and A Town, which should be read as Aberavon Sea Side and
Aberavon Town. The entry ‘Aberayron’ in the same figure, on the other hand, is the only
railway station in the eponymous town and, therefore, appears as a sole entry, and is preceded
by no heading.11</p>
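        <p>As a toy illustration of how this style information makes entries machine-identifiable, the sketch below distinguishes place headings from station entries by a style attribute. The XML layout, element names, and style values are invented for illustration; the actual Word document markup is more complex.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified export of the Word document: place headings and
# station entries are distinguished purely by paragraph style, mirroring the
# underline/bold conventions described above (style names are invented).
doc = ET.fromstring("""
<doc>
  <p style="PlaceHeading">ABERAVON</p>
  <p style="Station">A SEA SIDE</p>
  <p style="Station">A TOWN</p>
  <p style="Station">ABERAYRON</p>
</doc>
""")

places = [p.text for p in doc.findall(".//p[@style='PlaceHeading']")]
stations = [p.text for p in doc.findall(".//p[@style='Station']")]
print(places)    # ['ABERAVON']
print(stations)  # ['A SEA SIDE', 'A TOWN', 'ABERAYRON']
```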
        <p>The regular formatting of the document meant that we could define XPath expressions to
identify both generic places and concrete railway stations, and therefore transform the Word
document into tabular data. Were these not styled in the document, identifying them correctly
would have been extremely laborious, and would probably have required strong supervision in
the form of human annotations. We used regular expressions to expand the abbreviated names
to their full names, by matching initials to the corresponding tokens of the generic place. These
operations resulted in a structured dataset of 12,676 railway station entries in 9,667 places,
each with a unique place–station identifier pair. We set apart 491 items to annotate for the
linking experiments (see section 3.3), of which only eight had some parsing error, due to
existing, but rare, formatting inconsistencies in the MS Word document. Table 1 shows the
entries in the newly structured dataset (henceforth StopsGB, for ‘Structured Timeline of
Passenger Stations in Great Britain’) corresponding to those in Figure 1.</p>
        <p>11Text in red indicates updates to the document since it was first shared online.</p>
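        <p>The initial-expansion step described above can be sketched as follows. This is an assumed reconstruction of the logic (matching single-letter tokens against the initials of the place-heading tokens), not the project’s actual code.</p>

```python
# Sketch of expanding abbreviated station names against their place heading,
# as in 'A Sea Side' under 'Aberavon' -> 'Aberavon Sea Side' (assumed logic).
def expand_station_name(place: str, station: str) -> str:
    place_tokens = place.split()
    expanded = []
    i = 0
    for token in station.split():
        # A single capital letter matching the next place token's initial
        # is treated as an abbreviation of that token.
        if len(token) == 1 and i < len(place_tokens) and place_tokens[i].startswith(token):
            expanded.append(place_tokens[i])
            i += 1
        else:
            expanded.append(token)
    return " ".join(expanded)

print(expand_station_name("Aberavon", "A Sea Side"))  # Aberavon Sea Side
print(expand_station_name("Aberavon", "A Town"))      # Aberavon Town
```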
        <p>The content of the Chronology entries is rigorously formatted, despite being in free-text form.
With the help of punctuation (e.g. square brackets for companies and curly brackets for
map information) and other markers (e.g. op/clo preceding opening and closing dates)
or formatting conventions (e.g. capitalized full words indicating alternate station names), we were
able to parse the content with regular expressions. We extracted opening and closing dates,
operating companies, alternate names (names by which the railway station has been known
at different moments in time), referenced stations, disambiguators (additional information on
where the station is located), and a reference to an OS map location.12</p>
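        <p>A minimal sketch of this kind of punctuation-driven parsing is shown below. The entry string and field formats are simplified for illustration and do not reproduce the Chronology’s exact conventions.</p>

```python
import re

# Hypothetical Chronology-style entry (format simplified for illustration):
# company in square brackets, op/clo before dates, map reference in curly brackets.
entry = "ABERAVON SEA SIDE [RSB] op 14 March 1895; clo 3 December 1962 {map 43}"

company = re.search(r"\[(.*?)\]", entry)
map_ref = re.search(r"\{(.*?)\}", entry)
opened = re.search(r"\bop\s+([\w ]+?)(?:;|$)", entry)
closed = re.search(r"\bclo\s+([\w ]+?)(?:;|\{|$)", entry)

record = {
    "station": entry.split("[")[0].strip().title(),
    "company": company.group(1) if company else None,
    "opened": opened.group(1).strip() if opened else None,
    "closed": closed.group(1).strip() if closed else None,
    "map": map_ref.group(1) if map_ref else None,
}
print(record)
```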
        <p>12The following scores represent precision and recall respectively, on 219 entries that were manually annotated
to evaluate the parsing: alternate station names: 0.91/0.85; companies: 1.0/1.0; first opening dates: 0.98/0.98;
and last closing dates: 0.97/0.98. We also experimented with a deep learning sequential LSTM tagging
approach, which interestingly performed significantly worse (given the limited amount of training data) than the
approach based on regular expressions, which greatly benefited from the very regular formatting of the text
content.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3.3. Annotation</title>
        <p>We manually linked 491 randomly selected entries from the Chronology to Wikidata, of which
217 were used for method development, 219 were used for testing, and the rest were
discarded because they were cross-references or contained parsing errors. Wikidata has
substantial records for current and historical railway stations, even for those long in disuse. Therefore,
a large proportion of these cases could be matched directly to a Wikidata entry. Where the
Chronology entry contained a place header for major settlements rather than a specific railway
station (e.g. ‘Aberavon’ above) we signaled this by prefixing the Wikidata identifier with ppl
for ‘populated place’. The same procedure was followed for small settlements where a Wikidata
identifier could be found only for the town or village, and not for the station.</p>
        <p>There were also a small number of cases where the location of a station with no Wikidata
match could be identified with enough certainty from its name and description to find a nearby,
alternative Wikidata identifier. In these cases the identifier code was prefixed with opl, for
‘other place,’ to indicate that it was a proximate rather than direct link. For instance, there
was no match for Newcastle’s Moor Edge station, but we were able to make a proximate link
with the city’s Town Moor (Q11898308) since we know that this temporary station served race
meetings that were held on Town Moor.13</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Linking experiments and evaluation</title>
      <p>We describe the Wikidata-based resource that we use for linking in Section 4.1. The linking is
performed in two steps. First, given a query (a railway station, a place, or a station alternate
name), we narrow the full set of Wikidata candidates down to those that may potentially be
referred to by this query. This is called candidate selection and is described in Section 4.2.
The next step is to determine the correct entity given the candidates selected in the previous
step. This step is called entity resolution and is addressed in Section 4.3.</p>
      <p>For reference, Figure 2 provides a simplified overview of the linking process that is described
throughout this section, using one entry in the Chronology as an example.</p>
      <sec id="sec-3-1">
        <title>4.1. Linking resource</title>
        <p>We extracted all locations in Wikidata14 by filtering the entries that have a coordinate location
property (P625), i.e. entries that can be located on the Earth’s surface through their geographical
latitude and longitude. For each entry we kept a series of features that describe the entry
(geographically, historically, politically). This resulted in 8,094,093 entries, which we narrowed
down to those located in Great Britain, filtering them by their location within a polygon of
coordinates enclosing the island.15 The resulting dataset (henceforth the WikiPlaces gazetteer) is
composed of 671,320 entries. Next, we created a further subset composed of those entries from
the WikiPlaces gazetteer that are either instances of station-related classes or whose English
label contains the words ‘station’, ‘stop’, or ‘halt’, not preceded by ‘police’, ‘signal’, ‘power’,
‘lifeboat’, ‘pumping’, or ‘transmitting’.</p>
        <p>13In total, 55 entries were annotated as populated places and 19 as other places. There were 4 entries for
which no Wikidata match could be provided.</p>
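        <p>The polygon filter can be sketched with a simple ray-casting point-in-polygon test. The bounding polygon below is a crude illustrative box, not the Ordnance Survey Boundary-Line data actually used.</p>

```python
# Minimal ray-casting point-in-polygon test, sketching how entries with a P625
# coordinate could be kept only if they fall inside a polygon enclosing Great
# Britain (the polygon here is a crude illustrative box, not the OS boundary).
def point_in_polygon(lon, lat, polygon):
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Toggle on each polygon edge that the horizontal ray from (lon, lat) crosses.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Crude lon/lat box around Great Britain, for demonstration only.
gb_box = [(-8.0, 49.8), (2.0, 49.8), (2.0, 60.9), (-8.0, 60.9)]
print(point_in_polygon(-0.2, 51.5, gb_box))   # London -> True
print(point_in_polygon(2.35, 48.85, gb_box))  # Paris -> False
```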
        <p>14We used the 20200925 Wikidata dump from https://dumps.wikimedia.org/wikidatawiki/entities/ and
followed the approach described in https://akbaritabar.netlify.app/how_to_use_a_wikidata_dump to parse the
entities [last accessed 14 September 2021].</p>
        <p>15We have used the Ordnance Survey OpenData Boundary-Line™ ESRI Shapefile from https://osdatahub.
os.uk/downloads/open/BoundaryLine [last accessed 14 September 2021].</p>
        <p>This procedure leads to the retrieval of many false positives, but
at this point we are interested in maximizing recall at the expense of precision: we maximize
precision during the subsequent linking steps (described in sections 4.2 and 4.3). The resulting
dataset is composed of 9,361 entries, henceforth referred to as the WikiStations gazetteer.</p>
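        <p>The label-based filter can be sketched as follows; the helper name and exact matching logic are assumptions for illustration, not the project’s code.</p>

```python
import re

# Sketch of the label filter for station-like entries (assumed helper): keep
# labels containing 'station', 'stop', or 'halt' unless the keyword is
# immediately preceded by an excluded modifier such as 'police' or 'power'.
EXCLUDED = ("police", "signal", "power", "lifeboat", "pumping", "transmitting")
KEYWORD = re.compile(r"\b(station|stop|halt)\b", re.IGNORECASE)

def looks_like_railway_station(label: str) -> bool:
    for match in KEYWORD.finditer(label):
        preceding = label[:match.start()].strip().lower()
        if not preceding.endswith(EXCLUDED):
            return True  # at least one keyword without an excluded modifier
    return False

print(looks_like_railway_station("Stevenage railway station"))  # True
print(looks_like_railway_station("Holborn police station"))     # False
```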
        <p>
          We improved the Wikidata-based gazetteers in two ways. First, Wikidata provides
structured and curated sets of alternate names in the form of labels and aliases in different
languages, which are nevertheless relatively limited when compared to other resources such as
Wikipedia or GeoNames. We therefore use the links between Wikidata and Wikipedia16 and
between Wikidata and GeoNames to expand our gazetteers with alternate names from these
resources. Secondly, we make use of the linking between Wikidata and Wikipedia to obtain, for
each Wikidata entry in our gazetteers, the number of incoming links of the corresponding
Wikipedia page, if available. This measure is traditionally used as a proxy for relevance in
entity linking systems (see for instance [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]).17 The final WikiPlaces gazetteer has 670,325 entities (after filtering out unlabelled
entries) with 823,304 alternate names; the final WikiStations gazetteer has 9,361 entries with
33,156 alternate names.
        </p>
        <p>16The Wikipedia link structure has been largely exploited in the past in order to expand the alternate names
of entities in knowledge bases [
          <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
          ]. We use the Wikipedia-based gazetteer described in [6].</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Candidate selection</title>
        <p>As discussed in Section 3.2, each entry in StopsGB has a station name and a place name field
and, when available, also a list of alternate names for the station. Because one of the aims of
linking is to geolocate the entries, we decided that, in those cases in which the railway station is
not present in Wikidata (as in the case of New Tredegar Colliery railway station), we provide an
approximated location (i.e. New Tredegar, the location of this station for miners). Therefore,
in this step we aim to retrieve Wikidata entries that are potentially referred to by one of the
query fields (station, place, or alternate names). Both the station and alternate names fields
refer to stations, whereas the place field refers to more generic place names. Therefore, we
retrieve Wikidata candidates for both the station and the alternate names fields by querying
them against the WikiStations gazetteer; and retrieve Wikidata candidates for the generic
place field from the WikiPlaces gazetteer.</p>
        <sec id="sec-3-2-1">
          <title>4.2.1. Approaches</title>
          <p>
            We have experimented with three different approaches for candidate selection: (1) exact
match: Wikidata candidates are retrieved if one of the alternate names of the Wikidata entry
is identical to the query; (2) partial match: candidates are retrieved if the query is contained
in one of their alternate names (i.e. there is a string overlap), and are ranked according to the
amount of overlap; and (3) deezy match: candidates are retrieved and ranked using
DeezyMatch [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], an open-source software library for fuzzy string matching and candidate ranking
using neural networks. Both partial and deezy match allow for fuzzy string matching.18 To
gain a fuller picture of the impact of this step, we tested candidate selection
considering the set of candidates corresponding to the top-ranked one, three and five candidate
name variations of a query (henceforth nv).19
          </p>
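          <p>The exact- and partial-match baselines can be sketched over a toy gazetteer as follows. The entries and scoring are illustrative; the real gazetteers are the WikiStations and WikiPlaces resources described in section 4.1.</p>

```python
# Toy sketch of the exact- and partial-match candidate selection baselines
# (gazetteer entries, identifiers, and scores are illustrative only).
gazetteer = {
    "Q1": ["Parkgate", "Park Gate"],
    "Q2": ["Parkgate railway station"],
    "Q3": ["Aberavon Town"],
}

def exact_match(query):
    # Retrieve entries with an alternate name identical to the query.
    return [qid for qid, names in gazetteer.items() if query in names]

def partial_match(query):
    # Retrieve entries whose alternate names contain the query, ranked by
    # the share of the alternate name that the query covers.
    scored = []
    for qid, names in gazetteer.items():
        overlaps = [len(query) / len(n) for n in names if query in n]
        if overlaps:
            scored.append((qid, max(overlaps)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(exact_match("Parkgate"))    # ['Q1']
print(partial_match("Parkgate"))
```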
          <p>
            17We have employed the 20200920 English Wikipedia dump and processed it using WikiExtractor (https://
github.com/attardi/wikiextractor [last accessed 14 September 2021]), to extract single pages and their structure
in sections, as in [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ].
          </p>
          <p>18See [5] for an extensive comparison between DeezyMatch and traditional string similarity measures for
candidate selection.</p>
          <p>19To show this with an example, consider the scenario in which we choose to retrieve three name variations
(nv = 3) per query: given the query ‘PARKGATE’, DeezyMatch returns the following three most similar
candidate strings from Wikidata (scores in parentheses represent cosine distance): ‘Parkgate’ (0.0), ‘Park Gate’
(0.0152), and ‘Parkergate’ (0.0162), which are then expanded to all Wikidata candidate entries that have this
alternate name, i.e. 7 candidate entities for ‘Parkgate’ (such as Q7138469, a village in Cheshire, and Q7138470,
a village in Scotland), 4 candidate entities for ‘Park Gate’, and one for ‘Parkergate’.
</p>
          <p>4.2.2. Metrics</p>
          <p>
Given a mention, we assess the performance of each method in generating a ranked list of
name variations of potential entity candidates by reporting precision at nv (either 1, 3 or
5), meaning how many times a name variation of the correct entity appears in the retrieved
results. Note that increasing the number of potential name variations will consequently impact
the precision of the retrieved ranking, which can be taken as a measure of difficulty of the
following resolution step. In addition, we report the mean average precision20 at the same
nv: this will offer a glimpse of the quality of the ranking. Finally, we report binary retrieval
to highlight how many times at least one name variation of the correct entity is retrieved at
nv—this will set the skyline for the following resolution step (meaning that if the correct entity
is not retrieved at the selection stage, the mention cannot be resolved correctly).</p>
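          <p>For a single query, the three metrics can be sketched as follows (function names are ours; averaging the second quantity over queries gives MAP):</p>
          <preformat>
```python
def precision_at_nv(retrieved, relevant, nv):
    """Fraction of the top-nv retrieved name variations that belong
    to the correct entity."""
    return sum(1 for r in retrieved[:nv] if r in relevant) / nv

def average_precision_at_nv(retrieved, relevant, nv):
    """Average of precision at each rank where a relevant variation
    appears (0.0 if none is retrieved); the mean over queries is MAP."""
    hits, score = 0, 0.0
    for i, r in enumerate(retrieved[:nv], start=1):
        if r in relevant:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def retrieved_at_nv(retrieved, relevant, nv):
    """Binary retrieval: is at least one variation of the correct entity
    in the top nv? This bounds (sets the skyline for) resolution."""
    return any(r in relevant for r in retrieved[:nv])

# Toy ranked list for one query; 'Parkgate' and 'Park Gate' are
# variations of the gold entity.
ranked = ["Parkgate", "Park Gate", "Parkergate"]
gold = {"Parkgate", "Park Gate"}
```
          </preformat>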
        </sec>
        <sec id="sec-3-2-2">
          <title>4.2.3. Evaluation</title>
          <p>We report a comparison of the diferent approaches to select and rank potential candidates for
given query inputs in Table 2. We compare two evaluation settings: (1) strict, which assesses
the performance only on those queries for which there exists a Wikidata entry corresponding
to the station (i.e. not preceded by either ppl or opl in the annotations), which we use
on queries from the station and alternate name fields of the structured dataset; and (2) appr,
which assesses the performance on all queries, in which case a true positive is not whether the
correct railway station is found, but whether the best possible match on Wikidata (according
to the annotators) has been retrieved.</p>
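          <p>Schematically, the two settings differ only in what counts as a true positive for a query; the encoding below (prefix strings marking approximate-only entries) is our guess at the annotation format described above:</p>
          <preformat>
```python
def is_true_positive(retrieved_ids, annotation, setting):
    """Schematic strict vs. appr evaluation.

    `annotation` is (prefix, wikidata_id): prefix is '' when the
    Wikidata entry is the station itself, or 'ppl'/'opl' when only an
    approximate (place-level) best match exists.
    """
    prefix, gold_id = annotation
    if setting == "strict" and prefix in ("ppl", "opl"):
        return None  # query excluded from the strict evaluation
    # strict: the station entry itself; appr: the best possible match
    return gold_id in retrieved_ids
```
          </preformat>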
          <p>The results in Table 2 provide an interesting portrayal of the forthcoming entity resolution
task, described in section 4.3. We see that the gain of allowing more name variations than just
the most similar one is very low (the increase of retr is minimal) compared to the increase in
difficulty of the task (shown by a decrease in precision). MAP, however, stays high, indicating
the importance of string similarity confidence, especially using DeezyMatch. The retrieved
candidates and their confidence score are therefore passed on to the next step, which will
resolve each entry in StopsGB to the best matching Wikidata entity.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Entity Resolution</title>
        <p>At this point, for each entry in StopsGB we have up to three sets of candidates: a set of
candidates for the station name, one for the place name, and one for possible alternate names.
The final step of the pipeline, entity resolution, takes the retrieved candidate entities and
returns only one best match per entry. We performed our experiments on candidates selected
with DeezyMatch, because this is the approach that had the highest MAP score overall, and
the largest variation in precision depending on number of retrieved candidates. We performed
experiments with nv = 1 and nv = 5.</p>
        <sec id="sec-3-3-1">
          <title>4.3.1. Features and baselines</title>
          <p>
            We defined several features for each candidate to quantify the compatibility between the
Wikidata candidate and the Chronology entry. The features we used are the following:
• String confidence: DeezyMatch confidence score between the mention and the
candidate alternate name for (a) stations, (b) places, and (c) station alternate names. We
generated one feature for each.
• Semantic coherence: The semantic similarity between the Wikidata candidate and
the entry in StopsGB, using transformer-based sentence embeddings [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ].21
• Wikipedia relevance: Number of incoming links a Wikidata candidate has on Wikipedia
(as a proxy for entity popularity), normalized against the maximum number of incoming
links in the set of candidates.
• Wikidata class: (a) whether the candidate is an instance of a railway station class, and
(b) whether the candidate is an instance of a populated place.
• Station-to-place and place-to-station geographical compatibility: If the
candidate is a railway station, normalized geographical closeness to the closest place candidate;
if the candidate is a generic place, normalized geographical closeness to the closest station
candidate.
          </p>
          <p>20Mean average precision (MAP) is a popular metric in information retrieval that highlights how well the
ranking (overlap score in the case of perfect and partial match, and confidence score in the case of DeezyMatch)
correlates with the labels.</p>
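          <p>Concretely, each candidate can be represented as a fixed-length feature vector; the sketch below names the features listed above (field names are ours, and the relevance normalization is one direct reading of "normalized against the maximum number of incoming links"):</p>
          <preformat>
```python
from dataclasses import dataclass

@dataclass
class CandidateFeatures:
    string_conf_station: float   # DeezyMatch score vs. station name
    string_conf_place: float     # DeezyMatch score vs. place name
    string_conf_altname: float   # DeezyMatch score vs. alternate names
    semantic_coherence: float    # sentence-embedding similarity
    wikipedia_relevance: float   # inlinks / max inlinks among candidates
    is_station_class: bool       # Wikidata instance-of railway station
    is_populated_place: bool     # Wikidata instance-of populated place
    geo_compatibility: float     # closeness to nearest place/station candidate

def normalized_relevance(inlinks, all_candidate_inlinks):
    """Wikipedia relevance: inlinks normalized by the maximum over the
    candidate set (0 if no candidate has any inlinks)."""
    top = max(all_candidate_inlinks, default=0)
    return inlinks / top if top else 0.0

# Toy candidate with invented values.
feats = CandidateFeatures(0.9, 0.2, 0.8, 0.5,
                          normalized_relevance(50, [50, 200, 10]),
                          True, False, 0.7)
```
          </preformat>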
          <p>Each candidate is therefore represented as a vector of features, followed by its label (true if it is
the correct entity for a given entry, false otherwise), and whether it is an exact match (i.e. the
railway station) or an approximate match (i.e. the best possible match given that the exact
match does not exist). We use three of these features (string confidence, semantic coherence,
and relevance in Wikipedia) as baseline methods for the task, by selecting the candidate
that has the highest score from the pool of overall retrieved candidates. In the case of the
string confidence baseline, we select the top match amongst railway stations and, only if none
has been retrieved, the top match amongst places. We also compute a skyline, which is the
highest possible score reachable, given the available set of candidates.</p>
          <p>21We use the description, the historical county and administrative region information for the Wikidata
candidate; and the place, disambiguation cues, maps description, alternate names, and references for the StopsGB
entry. We have used the default pre-trained model: paraphrase-distilroberta-base-v1, which is trained on
large-scale paraphrase data.</p>
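          <p>Each single-feature baseline reduces to an argmax over the pooled candidates; the string confidence baseline adds the station-first fallback, and the skyline checks whether the correct entity was retrieved at all. A minimal sketch (the candidate representation and attribute values are ours; the Q-identifiers echo footnote 19):</p>
          <preformat>
```python
def baseline_pick(candidates, feature):
    """Single-feature baseline: pick the candidate with the highest
    value of one feature (e.g. semantic coherence or relevance)."""
    return max(candidates, key=lambda c: c[feature])

def string_confidence_pick(candidates):
    """String confidence baseline: prefer the best railway-station
    candidate; fall back to the best place candidate only if no
    station candidate was retrieved."""
    stations = [c for c in candidates if c["is_station"]]
    pool = stations if stations else candidates
    return max(pool, key=lambda c: c["string_conf"])

def skyline(candidates):
    """Highest reachable score: 1 if any retrieved candidate is the
    correct entity, else 0."""
    return int(any(c["is_gold"] for c in candidates))

# Toy pool of two candidates (attribute values invented).
cands = [
    {"id": "Q7138469", "is_station": False, "string_conf": 0.9,
     "relevance": 0.3, "is_gold": False},
    {"id": "Q7138470", "is_station": True, "string_conf": 0.7,
     "relevance": 0.1, "is_gold": True},
]
```
          </preformat>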
        </sec>
        <sec id="sec-3-3-2">
          <title>4.3.2. Supervised resolution approaches</title>
          <p>We propose a supervised approach that trains a Support Vector Machine (SVM) on the
development set (i.e. one SVM trained on all query/candidate combinations at once) and learns
whether a candidate is a correct match for a given query or not. We then apply the resulting
classifier on a query basis (i.e. on the set of possible candidates per query only, as in the
baseline methods22), return the probability score instead of returning a label, and select the most
confident match from the subset of possible candidates. We propose two different SVM
variations:23 (1) SVM simple trains the SVM on the development set using all features, without
distinguishing between strict and approximate instances; whereas (2) SVM refined is a dual
classification system: it trains an SVM classifier using all features on the subset of queries for
which there is a strict match, and an additional classifier using all features on the subset of
queries for which there is not a strict match. The idea behind SVM refined is that the learning
objective is different if the goal is to predict entities of the type ‘station’ or generic places. We
combine the two based on the confidence score of the first classifier (i.e. the station classifier):
if the confidence of a prediction is lower than a certain threshold (found based on experiments
on the development set), we will apply the second classifier (i.e. the generic place classifier).</p>
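          <p>The SVM refined cascade can be sketched with scikit-learn as follows. This is a minimal illustration, not the paper's implementation: the toy feature matrices, the 0.5 threshold, and the helper names are ours (the paper tunes both C and the threshold on the development set).</p>
          <preformat>
```python
import numpy as np
from sklearn.svm import SVC

# Toy development data: rows are query/candidate feature vectors.
rng = np.random.default_rng(0)
X_station, y_station = rng.random((40, 3)), rng.integers(0, 2, 40)
X_place, y_place = rng.random((40, 3)), rng.integers(0, 2, 40)

# One linear SVM per learning objective (cf. footnote 23), with
# probability estimates so candidates can be ranked per query.
station_clf = SVC(kernel="linear", probability=True,
                  random_state=0).fit(X_station, y_station)
place_clf = SVC(kernel="linear", probability=True,
                random_state=0).fit(X_place, y_place)

def resolve(query_candidates, threshold=0.5):
    """Pick the most confident candidate for one query; if the station
    classifier is not confident enough, defer to the generic-place
    classifier."""
    probs = station_clf.predict_proba(query_candidates)[:, 1]
    if probs.max() < threshold:
        probs = place_clf.predict_proba(query_candidates)[:, 1]
    return int(probs.argmax()), float(probs.max())
```
          </preformat>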
          <p>
            As a comparison, we employed the same features in a Learning to Rank (L2R) [
            <xref ref-type="bibr" rid="ref15">16</xref>
            ] pipeline,
using RankLib.24 The weight parameter is learned by optimizing for the precision at 1 (P@1)
using coordinate ascent with linear normalization.
          </p>
        </sec>
        <sec id="sec-3-3-3">
          <title>4.3.3. Metrics and evaluation</title>
          <p>Table 3 summarizes the results of our experiments. As in the previous step, we also provide
two evaluation scenarios: strict only accepts exact entities as true (only entities referring to
the correct railway station), whereas approximate accepts place entities if the station does
not exist as an entity in Wikidata. We present the results for the resolution task in terms of
precision (how many times the mention is correctly matched with the correct entity) as well
as approximate accuracy at 1, 5, and 10 km (Acc@km) (i.e. how many times the mention is
correctly geo-located within 1, 5, and 10 km from the gold standard coordinates).</p>
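          <p>Acc@km requires a great-circle distance between predicted and gold coordinates; below is a minimal sketch using the haversine formula (the paper does not specify the distance implementation, so this is one plausible choice):</p>
          <preformat>
```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def acc_at_km(predictions, gold, km):
    """Share of mentions geolocated within `km` of the gold coordinates."""
    hits = sum(1 for p, g in zip(predictions, gold)
               if haversine_km(*p, *g) <= km)
    return hits / len(gold)
```
          </preformat>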
          <p>An analysis of the most indicative features for both classifiers proves our assumption that
predicting stations and generic places are two diferent learning tasks. The most indicative
features for the stations classifier in the strict scenario, where nv = 1, are (ranked from higher
to lower prominence) station name string confidence, station-to-place compatibility, Wikidata
class if candidate is a railway station, and semantic coherence. The most indicative features of
the generic places classifier in the approximate scenario, where nv = 1, are Wikipedia relevance,
place-to-station compatibility, and place string confidence. This distinction between station
and place favours SVM refined especially when nv = 5 and in the approximate setting. The
results in Table 3 confirm that using a larger nv does not compensate for the resulting increased
difficulty of the task. Nevertheless, the good performance of SVM refined when nv = 5
suggests that it is a robust resolution system, which does not suffer from a higher number of
candidates, in particular in comparison with SVM simple and the Wikipedia relevance and
semantic coherence baselines.25</p>
          <p>22Note that in all cases the queries will be different between the development and the test set.
23Both are linear SVMs, where the C parameter is tuned on the development set.
24https://sourceforge.net/p/lemur/wiki/RankLib/ [last accessed 14 September 2021].</p>
          <p>Based on the results of our experiments, we applied SVM refined on the full StopsGB dataset
(i.e. 12,676 rows), using nv = 1. For each entry, we provide predictions of Wikidata entries
both for station and place, together with the confidence score of these predictions. We also
provide the Wikidata ID of the selected entity (i.e. the predicted station if the confidence score
is above a certain threshold; the predicted place if not) and its latitude and longitude.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion</title>
      <p>Linking information on railway stations serves the larger aim of enabling historical research
based on heterogeneous, interconnected data sources. This section offers a quick comparison
with the publicly available Campop data and also showcases some novel research avenues that
emerge from the enrichment and linking of historical information. The goal here is to sketch
these opportunities; more elaborate analyses will appear in future work.</p>
      <p>Compared to existing datasets, StopsGB expands our knowledge of historical stations in
many ways. Not only does it fill gaps in the current record, it also extends the time frame,
spanning almost two centuries. To compare the differences visually, we can map StopsGB and
Campop data. Figure 3 includes all stations opened up to 1999 and compares them to the
combined Campop stations (i.e. the union of 1851, 1861, and 1881 stations): it appears that
StopsGB provides a more complete picture of the station landscape (e.g. red points that are
not paired with an overlapping or adjacent white point). However, this also points to some
complexities, as neither dataset is complete, nor do they overlap in clear ways. Scrutinizing
these differences and overlaps between Campop and GB1900 (as well as with modern data) is
part of future work.</p>
          <p>25The string confidence baseline is a very strong baseline, especially in the strict evaluation scenario, and
indicates that most station names are quite unique. It is worth mentioning that both the string confidence
baseline and RankLib produce different results at each run. For this reason, the results reported are averaged
over 5 runs to present a more reliable overview.</p>
      <p>To highlight some novel research approaches made possible by StopsGB, we sketch out two
case studies that exploit links between data to understand the place of rail in industrializing
communities. Figure 4 shows how rail is embedded in the urban landscape. Focusing on Bolton,
it plots stations (red) in relation to industrial buildings, churches and schools.26 Blending linked
data with visual information (in this case, historical maps) provides new means to explore the
context of stations and rail, both quantitatively and qualitatively. This approach allows us to
explore (using more abstract measures) the spatial distribution of stations, but we can also
zoom in on specific areas for a ‘close reading’ of the spatial context of the rail. Moreover, by
exploiting information on the opening and closing of stations, we can obtain a dynamic and
detailed image of the evolution of the British rail network. Figure 5 shows the spread of the
railway during the nineteenth century.</p>
      <p>26These labels are obtained by matching entries in GB1900. Industrial terms are ‘works’, ‘mill’, ‘mills’,
‘factories’, ‘factory’, ‘workshop’, ‘wks’, ‘manufactory’. ‘Schools’ and ‘sch’ are used for plotting schools. Religious
buildings were captured by ‘church’, ‘ch’, ‘chap’, ‘chapel’ and ‘cathedral’.</p>
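          <p>The keyword matching in footnote 26 amounts to a small lookup over lower-cased GB1900 labels; a sketch (keyword sets copied from the footnote, function name ours):</p>
          <preformat>
```python
# Keyword lists from footnote 26 (lower-cased GB1900 map labels).
CATEGORIES = {
    "industrial": {"works", "mill", "mills", "factories", "factory",
                   "workshop", "wks", "manufactory"},
    "school": {"schools", "sch"},
    "religious": {"church", "ch", "chap", "chapel", "cathedral"},
}

def categorize_label(label):
    """Assign a GB1900 text label to a building category, or None
    if it matches no keyword list."""
    token = label.strip().lower()
    for category, keywords in CATEGORIES.items():
        if token in keywords:
            return category
    return None
```
          </preformat>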
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>Leveraging the links between Wikidata and the Chronology station descriptions in these
examples demonstrates the power of a station dataset that can be queried not only by location, but
also by date or any other attribute so carefully collected by Quick and other contributors from
the Railway and Canal Historical Society. Our work to translate this exceptional
community-curated resource into a geolocated dataset is an early step that will allow history and geography
researchers to craft new narratives about the railway, and the process of industrialisation it
accompanied.</p>
    </sec>
    <sec id="sec-6">
      <title>Author contributions</title>
      <p>After the first author, authors are listed in alphabetical order. The names in the following roles
are sorted by amount of contribution and, if equal, alphabetically: Conceptualization: KM,
JL, DW; Methodology: MCA, FN, KB; Implementation: MCA, FN, KB, GT; Reproducibility:
FN, MCA; Historical Analysis: KB, KM, JL, JR, DW; Data Acquisition and Curation: DW,
MCA, GT, FN; Annotation: JL, KM; Project Management: MCA; Writing and Editing: all.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank the Railway and Canal Historical Society for sharing the Microsoft Word version of
Railway Passenger Stations in Great Britain: a Chronology by Michael Quick. Work for this
paper was produced as part of Living with Machines. This project, funded by the UK Research
and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered
by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the
British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary
University of London.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Acheson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Volpi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Purves</surname>
          </string-name>
          . “
          <article-title>Machine learning for cross-gazetteer matching of natural features”</article-title>
          . In: IJGIS (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Aucott</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Southall</surname>
          </string-name>
          . “
          <article-title>Locating past places in Britain: creating and evaluating the GB1900 Gazetteer”</article-title>
          .
          <source>In: International Journal of Humanities and Arts Computing 13.1</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>69</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bunescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paşca</surname>
          </string-name>
          . “
          <article-title>Using encyclopedic knowledge for named entity disambiguation”</article-title>
          .
          <source>In: 11th Conference of the European Chapter of the Association for Computational Linguistics</source>
          . Trento, Italy: Association for Computational Linguistics,
          <year>2006</year>
          . url: https://www.aclweb.org/anthology/E06-1002.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Cobb</surname>
          </string-name>
          .
          <article-title>The railways of Great Britain, a historical atlas at the scale of 1 inch to 1 mile</article-title>
          . Shepperton, Surrey: Ian Allan Pub.,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Coll Ardanuy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>van Strien</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Nanni</surname>
          </string-name>
          . “
          <article-title>A deep learning approach to geographical candidate selection through toponym matching”</article-title>
          .
          <source>In: Proceedings of the 28th International Conference on Advances in Geographic Information Systems</source>
          .
          <year>2020</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Coll Ardanuy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>van Strien</surname>
          </string-name>
          . “
          <article-title>Resolving places, past and present: toponym resolution in historical British newspapers using multiple resources”</article-title>
          .
          <source>In: Proc. of GIR</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cucerzan</surname>
          </string-name>
          . “
          <article-title>Large-scale named entity disambiguation based on Wikipedia data”</article-title>
          .
          <source>In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)</source>
          . Prague, Czech Republic: Association for Computational Linguistics,
          <year>2007</year>
          , pp.
          <fpage>708</fpage>
          -
          <lpage>716</lpage>
          . url: https://www.aclweb.org/anthology/D07-1074.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . “
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding”</article-title>
          . In: arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Flückiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          . “
          <article-title>Extended overview of CLEF HIPE 2020: named entity processing on historical newspapers”</article-title>
          .
          <source>In: CLEF 2020 Working Notes. Conference and Labs of the Evaluation Forum</source>
          . Vol.
          <volume>2696</volume>
          .
          . CEUR.
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Scaiella</surname>
          </string-name>
          . “
          <article-title>TagMe: on-the-fly annotation of short text fragments”</article-title>
          .
          <source>In: Proc. of CIKM</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hachey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nothman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Honnibal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Curran</surname>
          </string-name>
          . “
          <article-title>Evaluating entity linking with Wikipedia”</article-title>
          .
          <source>In: Artificial intelligence 194</source>
          (
          <year>2013</year>
          ), pp.
          <fpage>130</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Henneberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Satchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. W.</given-names>
            <surname>Shaw-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Wrigley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Cobb</surname>
          </string-name>
          .
          <article-title>1861 England, Wales and Scotland railway stations</article-title>
          .
          <year>2018</year>
          . doi: 10.5255/ukda-sn-852995.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Henneberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Satchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. W.</given-names>
            <surname>Shaw-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Wrigley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Cobb</surname>
          </string-name>
          .
          <article-title>1881 England, Wales and Scotland railway stations</article-title>
          .
          <year>2018</year>
          . doi: 10.5255/ukda-sn-852996.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nanni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Coll Ardanuy</surname>
          </string-name>
          . “
          <article-title>DeezyMatch: A flexible deep learning approach to fuzzy string matching”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          .
          <year>2020</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Learning to rank for information retrieval</article-title>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Marti-Henneberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Satchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. W.</given-names>
            <surname>Shaw-Taylor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Wrigley</surname>
          </string-name>
          .
          <article-title>1851 England, Wales and Scotland railway stations</article-title>
          .
          <year>2018</year>
          . doi: 10.5255/ukda-sn-852994.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Lieberman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sankaranarayanan</surname>
          </string-name>
          . “
          <article-title>Geotagging with local lexicons to build indexes for textually-specified spatial data”</article-title>
          .
          <source>In: 2010 IEEE 26th international conference on data engineering (ICDE</source>
          <year>2010</year>
          ). IEEE.
          <year>2010</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>van de Camp</surname>
          </string-name>
          . “
          <article-title>Named entity recognition goes to Old Regime France: geographic text analysis for early modern French corpora”</article-title>
          .
          <source>In: International Journal of Geographical Information Science</source>
          <volume>33</volume>
          .12 (
          <year>2019</year>
          ), pp.
          <fpage>2498</fpage>
          -
          <lpage>2522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Silva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          . “
          <article-title>DBpedia Spotlight: shedding light on the web of documents”</article-title>
          .
          <source>In: Proc. of SEMANTiCS</source>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ashkpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mandemakers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Breure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scharnhorst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schlobach</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>van Harmelen</surname>
          </string-name>
          . “
          <article-title>Semantic technologies for historical research: A survey”</article-title>
          .
          <source>In: Semantic Web</source>
          <volume>6</volume>
          .
          <issue>6</issue>
          (
          <year>2015</year>
          ), pp.
          <fpage>539</fpage>
          -
          <lpage>564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          . “
          <article-title>Entity-aspect linking: providing fine-grained semantics of entities in context”</article-title>
          .
          <source>In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries</source>
          .
          <year>2018</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Olieman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Marx</surname>
          </string-name>
          . “
          <article-title>Good applications for crummy entity linkers? The case of corpus selection in digital humanities”</article-title>
          .
          <source>In: Proc. of SEMANTiCS</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          . “
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks”</article-title>
          .
          <source>In: Proc. of EMNLP</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rovera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Goy</surname>
          </string-name>
          . “
          <article-title>Domain-specific named entity disambiguation in historical memoirs”</article-title>
          .
          <source>In: Proc. of CLIC</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          . “
          <article-title>Falcon 2.0: An entity and relation linking tool over Wikidata”</article-title>
          .
          <source>In: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</source>
          .
          <year>2020</year>
          , pp.
          <fpage>3141</fpage>
          -
          <lpage>3148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Murrieta-Flores</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Martins</surname>
          </string-name>
          . “
          <article-title>Learning to combine multiple string similarity metrics for effective toponym matching”</article-title>
          .
          <source>In: International Journal of Digital Earth</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          . “
          <article-title>Entity linking with a knowledge base: issues, techniques, and solutions”</article-title>
          .
          <source>In: IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Isaksen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>de Soto Cañamares</surname>
          </string-name>
          . “
          <article-title>Linking early geospatial documents, one place at a time: annotation of geographic documents with Recogito”</article-title>
          .
          <source>In: e-Perimetron</source>
          <volume>10</volume>
          .
          <issue>2</issue>
          (
          <year>2015</year>
          ), pp.
          <fpage>49</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Milne</surname>
          </string-name>
          . “
          <article-title>An effective, low-cost measure of semantic relatedness obtained from Wikipedia links”</article-title>
          . In: (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>