<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Integrating Visual and Textual Inputs for Searching Large-Scale Map Collections with CLIP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jamie Mahowald</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Charles Germain Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information School, University of Washington</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Archer Center &amp; Department of Mathematics, The University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>528</fpage>
      <lpage>547</lpage>
      <abstract>
        <p>Despite the prevalence and historical importance of maps in digital collections, current methods of navigating and exploring map collections are largely restricted to catalog records and structured metadata. In this paper, we explore the potential for interactively searching large-scale map collections using natural language inputs (“maps with sea monsters”), visual inputs (i.e., reverse image search), and multimodal inputs (an example map + “more grayscale”). As a case study, we adopt 562,842 images of maps publicly accessible via the Library of Congress's API. To accomplish this, we use the multimodal Contrastive Language-Image Pre-training (CLIP) machine learning model to generate embeddings for these maps, and we develop code to implement exploratory search capabilities with these input strategies. We present results for example searches created in consultation with staff in the Library of Congress's Geography and Map Division and describe the strengths, weaknesses, and possibilities for these search queries. Moreover, we introduce a fine-tuning dataset of 10,504 map-caption pairs, along with an architecture for fine-tuning a CLIP model on this dataset. To facilitate re-use, we provide all of our code in documented, interactive Jupyter notebooks and place all code into the public domain. Lastly, we discuss the opportunities and challenges for applying these approaches across both digitized and born-digital collections held by galleries, libraries, archives, and museums.</p>
      </abstract>
      <kwd-group>
        <kwd>maps</kwd>
        <kwd>Library of Congress</kwd>
        <kwd>computing cultural heritage</kwd>
        <kwd>multimodal machine learning</kwd>
        <kwd>exploratory search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Maps represent a central collecting focus for cultural heritage institutions, comprising large
fractions of collections across the world. For example, the Library of Congress alone holds
over 5.5 million maps [2]. Efforts to digitize maps have resulted in new possibilities for
access for a wide range of patrons, from scholars to politicians to the public. However, current
methods for searching historic map collections are largely limited to structured metadata and
keyword search over extracted text via optical character recognition (OCR). As described in a
2020 survey of metadata for topographic maps, metadata is “not often connected with the way
in which users search for maps,” and metadata standards vary across institutions [17].
Furthermore, enriching metadata requires staff time and expertise, which is not always feasible.</p>
      <p>
        In recent years, scholars and practitioners within cultural heritage, the computational
humanities, and the digital humanities have begun exploring the application of computer vision
methodologies to historic maps for tasks ranging from metadata enrichment
via classification [
        <xref ref-type="bibr" rid="ref27">33</xref>
        ] to the semantic identification of visual markers such as railroad tracks
[
        <xref ref-type="bibr" rid="ref12">13</xref>
        ]. However, much remains to be explored surrounding methods for facilitating exploration
and sensemaking of map corpora using machine learning.
      </p>
      <p>In this paper, we take on this challenge by exploring possibilities for searching large-scale
map collections using multimodal machine learning. As a case study, we adopt as our collection
of choice 562,842 images of maps publicly available through the Library of Congress’s API.
To facilitate multimodal search and discovery, we generate and release embeddings for these
images using OpenAI’s CLIP model [32]. Significantly, CLIP and other multimodal approaches
have seen increasing adoption in the computational humanities community, showing great
promise for use with digital collections [3]. We build on this work to show the possibilities for
maps in particular. Namely, we leverage the shared image- and text-embedding space enabled
by CLIP to implement three different forms of interactive search: with natural language inputs
(“maps with sea monsters”), with visual inputs (i.e., reverse map search), and with multimodal
inputs (an example map + “more grayscale”). Our search code is highly responsive, capable
of searching half a million images and returning results on a consumer-grade GPU (e.g., using
a personal laptop) in less than a second. We also introduce a dataset of 10,504 map-caption
pairs, as well as code for fine-tuning CLIP with this dataset. To facilitate re-use, all of our code
is released into the public domain in the form of documented Jupyter notebooks that others
can run on their own machines. These notebooks can be found at our GitHub repository:
https://github.com/j-mahowald/clip-loc-maps. An overview of our search implementation
can be found in Figure 1.</p>
      <p>Working in collaboration with the Library of Congress’s Geography and Map division, we
present a number of example searches using our search implementation and describe the strengths
and limitations of this approach. Given the shared challenges surrounding discoverability
across digital collections, we discuss the extensibility of these results to other cultural
heritage collections, ranging from digitized materials to born-digital content. In order to ensure
that our work has been conducted ethically and responsibly, we describe our adoption of the
LC Labs AI Planning Framework throughout our research process.</p>
      <p>In summary, our paper offers five central contributions:
1. We introduce CLIP embeddings for 562,842 images of 56,554 map items held by the
Library of Congress and made available through the loc.gov API.
2. We introduce a dataset of 10,504 map-caption pairs, as well as an architecture for
fine-tuning a CLIP model on this dataset.
3. In consultation with the Geography and Map Division at the Library of Congress, we
demonstrate the utility of these embeddings for a range of search &amp; discovery tasks,
including natural language search, reverse image search, and multimodal search.
4. We release all of our code as re-usable Jupyter notebooks and place the notebooks into
the public domain. These notebooks include our pipeline for generating the CLIP
embeddings, our search implementation for all three methods, and the code for fine-tuning
CLIP. Our code can be found in our GitHub repository.
5. We discuss potential ways that CLIP embeddings could be used to improve
discoverability across digital collections.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In recent years, the “Collections as Data” movement and related efforts have demonstrated
the value of applying artificial intelligence (AI) to digital collections held by galleries, libraries,
archives, and museums in a range of contexts [
        <xref ref-type="bibr" rid="ref22 ref6">27, 29</xref>
        ]. Of particular relevance to this paper
is work that has applied computer vision to digital collections in the context of search and
discovery [
        <xref ref-type="bibr" rid="ref10 ref16 ref17 ref18 ref33 ref4 ref5 ref7">6, 23, 24, 5, 11, 38, 8, 2, 40, 21, 22</xref>
        ]. Likewise, the MapReader project and others
have demonstrated value in applying machine learning to historic maps for classification and
other tasks [
        <xref ref-type="bibr" rid="ref12 ref27 ref30 ref31 ref32">33, 13, 39, 36, 37</xref>
        ]. In this paper, we pursue the intersection of these bodies of
work in order to explore the application of machine learning methods to search and discovery
for historic maps. Surrounding the use of multimodal machine learning approaches, we build
on work by Smits &amp; Wevers [
        <xref ref-type="bibr" rid="ref4">35</xref>
        ], Smits &amp; Kestemont [
        <xref ref-type="bibr" rid="ref28">34</xref>
        ], and Barancová et al. [3]. This
collective work has applied OpenAI’s CLIP model [32] to digital collections, and we follow suit,
focusing on the application to map collections in particular. Though work such as PIGEON [10]
and StreetCLIP [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] has applied multimodal machine learning approaches to maps, the focus
has been on contemporary, born-digital maps, whereas we consider historic maps and give
attention to the specific context of cultural heritage.
      </p>
      <p>
        Significantly, much work has explored the availability and usability of metadata for historic
map collections [
        <xref ref-type="bibr" rid="ref6">17, 15, 18, 26</xref>
        ]. In this paper, we ask how search and discovery can be enriched
beyond existing metadata; we refer to this literature for further reading on the strengths and
limitations of existing metadata practices.
      </p>
      <p>
        Lastly, our work is situated within a landscape of research actively engaging with the
responsible and ethical dimensions of applying AI to cultural heritage collections. We note that
many frameworks and guidelines exist for pursuing this work [28,
        <xref ref-type="bibr" rid="ref1 ref13 ref15 ref3">4, 20, 14, 1</xref>
        ]. In this paper, we
adopt the LC Labs AI Planning Framework in particular because we utilize Library of Congress
maps for our case study [
        <xref ref-type="bibr" rid="ref8">19</xref>
        ]. In Section 3.1, we describe our dataset in more detail.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology 1: Generating CLIP Embeddings for 562,842 Images of Maps</title>
      <sec id="sec-4-1">
        <title>3.1. Our Dataset of Library of Congress Maps</title>
        <p>The Library of Congress has made publicly available over 56,000 map items comprising over
563,000 segments (images), a figure that continues to grow regularly. The map items are largely
from the Geography and Map (G&amp;M) Division and vary widely in the number of
constituent images included: some contain only one image (e.g., maps of small
towns, or standalone illustrations), while others, such as atlases or set maps, contain orders
of magnitude more. For example, the Texas General Highway Map item contains over 10,000
sheets. Of the 563,696 segments we attempted to process, 562,842 (99.85%) returned valid
requests through the International Image Interoperability Framework (IIIF) when we queried
them for our purposes. Each item is associated with one or more resources, onto which
individual segments add an identifying suffix. For instance, the resource g4031pm.gct00608,
which represents the first 2,999 sheets of a map set named “General highway map ... Texas,”
includes g4031pm.gct00608.cs000150, representing a particular sheet showing highways in
Aransas County.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Generating Embeddings</title>
        <p>We introduce a pipeline that leverages multiprocessing to efficiently generate embeddings for
the Library of Congress maps in our dataset, while retaining their metadata and structure
(Figure 2). Our embeddings generation pipeline can process over half a million images on an M3
MacBook Pro with 18GB memory in under 24 hours.</p>
        <p>To generate the embeddings using the base CLIP model, we tested a range of image widths
and patch sizes, settling on width w = 2000px and the base-size Vision Transformer (ViT) with
32x32px patches to optimize download times while retaining sufficiently high image resolution.
We then built out our pipeline around these specifications.</p>
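        <p>As a minimal illustration of the request pattern, the sketch below builds a IIIF Image API URL for a JPEG scaled to a 2000px width; the tile.loc.gov service prefix and the identifier are illustrative assumptions rather than exact values from our pipeline.</p>

```python
def iiif_image_url(iiif_id, width=2000):
    """Build a IIIF Image API 2.x request for a JPEG scaled to a given width.

    The size parameter "{width}," preserves the aspect ratio. The service
    base below follows the tile.loc.gov pattern and is illustrative only.
    """
    base = "https://tile.loc.gov/image-services/iiif"
    return f"{base}/{iiif_id}/full/{width},/0/default.jpg"


# Example with a placeholder identifier:
print(iiif_image_url("example-id"))
# https://tile.loc.gov/image-services/iiif/example-id/full/2000,/0/default.jpg
```

        <p>Requesting a fixed width rather than the full-resolution tile is what keeps download times manageable at the scale of half a million images.</p>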
        <p>
          Of the several forms of identification that belong to each Library of Congress object, we focus
on the IIIF ID for image processing and unique identification and the resource ID for metadata
extraction. The data frame merged_files.csv, accessible via Zenodo [
          <xref ref-type="bibr" rid="ref19">25</xref>
          ], gives the
loc.gov resource URL and IIIF image URL for each object. It also provides information on
an image’s file size and its context in the collection. The pipeline reads row by row from this
CSV file, creating a metadata dictionary for each image that includes API metadata. Using
the defined preprocessor, model, and IIIF request, embeddings are generated, normalized, and
appended to this dictionary as an n = 512-tuple.
        </p>
        <p>Our GitHub repository contains two script versions: embed_full.py, which incorporates
extensive metadata from the loc.gov API, ideal for further fine-tuning, and embed_stripped.py,
which includes only the IIIF image URL and its embedding. Each IIIF ID can be used to derive
its corresponding resource ID (the converse is not true), so only the IIIF ID is strictly needed
to carry the pipeline forward. The JSON files are then written to a local directory and named
for their IIIF ID to ensure uniqueness and easy derivation.</p>
        <p>After each JSON is downloaded, create_beto.py generates beto (“big embedding tensor
object”), a PyTorch tensor of size (N, n) for N image embeddings of dimension n. Though
only two-dimensional, the tensor formulation facilitates indexing and serves as an input for a
search query. A corresponding N-tuple, beto_idx, is created to associate each embedding in
beto with its respective IIIF URL by index. Diagrammatically,</p>
        <p>beto = [ [e_1,1, e_2,1, ..., e_n,1]ᵀ, [e_1,2, e_2,2, ..., e_n,2]ᵀ, ..., [e_1,N, e_2,N, ..., e_n,N]ᵀ ],
beto_idx = [u_1, u_2, ..., u_N], (1)
where the column vector [e_1,i, e_2,i, ..., e_n,i]ᵀ represents the n-tuple embedding for the i-th
image, and u_i represents the i-th image’s corresponding IIIF URL.</p>
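        <p>The construction of beto and beto_idx can be sketched as follows. This is a simplified sketch assuming per-image records with "iiif_url" and "embedding" fields; the field names are hypothetical and the released JSONs may use different keys.</p>

```python
import torch

def create_beto(records):
    # records: iterable of dicts, each holding an IIIF URL and its embedding.
    # Field names here are illustrative, not the exact keys in the JSONs.
    embeddings, beto_idx = [], []
    for rec in records:
        v = torch.tensor(rec["embedding"], dtype=torch.float32)
        embeddings.append(v / v.norm())  # normalize to unit length
        beto_idx.append(rec["iiif_url"])
    beto = torch.stack(embeddings)  # shape (N, n)
    return beto, tuple(beto_idx)
```

        <p>Pre-normalizing each row means a later cosine-similarity search reduces to a single matrix-vector product.</p>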
      </sec>
      <sec id="sec-4-3">
        <title>3.3. A Dataset &amp; Architecture for Fine-tuning</title>
        <p>
          To complement the CLIP embeddings that we have generated and released, we introduce a
dataset of 10,504 map-caption pairs for fine-tuning CLIP (available in Zenodo [
          <xref ref-type="bibr" rid="ref19">25</xref>
          ]), along with
code for performing this fine-tuning (available in our GitHub repository).
        </p>
        <p>One central goal in fine-tuning is to provide the CLIP model with a large set of map-caption
(i.e., image-text) pairs from which it can contrastively learn relevant information such as styles,
locations, dates, and other visual features. For each resource ID, the Library of Congress catalog
record yields several descriptors useful for systematically training a model on map-caption
pairs. We smooth the capitalization and punctuation through a few simple functions, and we
use this metadata to generate a descriptive, natural language caption for each map. An example
of the process is shown in Figure3, and three resulting examples are shown in Figur4e. In
total, we include 10,504 map-caption pairs by initially generating 10,000 maps with a single
associated image, adding 2,000 randomly sampled images from the Sanborn Maps collection
(which represents a disproportionate fraction of map images made publicly available online by
the Library of Congress), adding an additional 227 maps covering every present-day country
and U.S. state, and discarding a total of 1,723 samples with unresponsive image requests or
lowquality captions (for instance, those with nonsensical characters, arbitrary changes in language,
or no feature descriptors).</p>
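        <p>As a rough sketch of this smoothing and caption assembly (the template and field names here are hypothetical; our released notebooks derive the actual captions from the loc.gov catalog descriptors):</p>

```python
def smooth(text):
    # Collapse repeated whitespace, drop trailing punctuation,
    # and capitalize the first character.
    text = " ".join(text.split()).rstrip(" .,;:")
    return text[:1].upper() + text[1:]

def build_caption(title, place=None, date=None):
    # Assemble a natural language caption from catalog descriptors.
    caption = smooth(title)
    if place:
        caption += ", covering " + place
    if date:
        caption += ", " + date
    return caption + "."

print(build_caption("sanborn fire insurance map ", place="Austin, Texas", date="1905"))
# Sanborn fire insurance map, covering Austin, Texas, 1905.
```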
        <p>Our initial experiments performing our fine-tuning yielded mixed results. In Section A in the
Appendix, we describe these experiments. We also elaborate on our choice of our fine-tuning
dataset.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Methodology 2: Implementing Search &amp; Discovery with CLIP Embeddings</title>
      <p>In this section, we outline our implementation of our three different search methods using CLIP
embeddings: 1) text-input search, 2) image-input search, and 3) text- &amp; image-input search.</p>
      <sec id="sec-6-1">
        <title>4.1. Text-input Search</title>
        <p>Given a text query and a specified number of desired results k, we employ the following process
to search beto and beto_idx, as defined in Equation 1. First, we generate a normalized text
embedding for the query utilizing the same CLIP configuration used in encoding the larger
collection. This embedding is then used to compute cosine similarity scores with each embedding
in beto. We then identify the k largest scores and their corresponding indices. The similarity
scores of these top results are normalized using the softmax function, and the resulting scores,
along with their respective identifying links from beto_idx, are displayed. Cosine similarity
is extremely efficient to compute, making it possible to identify the top k scores among over
half a million images nearly instantaneously.</p>
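        <p>The steps above can be sketched in a few lines of PyTorch. Here the query embedding and the rows of beto are assumed to be unit-normalized, so a matrix-vector product yields the cosine similarities directly; the function and variable names are illustrative.</p>

```python
import torch

def search(query_emb, beto, beto_idx, k=10):
    # Rows of beto and query_emb are assumed unit-normalized, so the
    # dot products below are exactly the cosine similarity scores.
    scores = beto @ query_emb                 # (N,) similarities
    top_scores, top_indices = torch.topk(scores, k)
    probs = torch.softmax(top_scores, dim=0)  # normalize the top-k scores
    return [(beto_idx[i], p.item())
            for i, p in zip(top_indices.tolist(), probs)]
```

        <p>Because the heavy step is a single dense matrix-vector product, this scales to half a million rows with sub-second latency on consumer hardware.</p>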
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Image-input Search</title>
        <p>In this strategy, the user can input an image URL and a desired number of results to conduct
this image-input search, alternatively known as reverse image search. After the URL request is
received, the process is identical to the one outlined in text-input search because CLIP embeds
images and text in a common embedding space. The query image is embedded on the spot
as part of the search script, meaning that the user can input any image of any size and is not
limited to those from the Library of Congress catalog.</p>
      </sec>
      <sec id="sec-6-3">
        <title>4.3. Text- &amp; Image-input Search</title>
        <p>
          We introduce an experimental search strategy that accepts both a text string and image
input as a search query. The CLIP model embeds the text string and the image to the same
n-dimensional embedding space. The engine then accepts a scaling factor s that determines
how much relative weight should be assigned to the text and image inputs. We assign:
c = ((1 − s) ⋅ q + (1 + s) ⋅ t) / 2,  s ∈ [−1, 1] (2)
where q and t are the n-dimensional embeddings for the image and text queries, respectively,
and c is the combined weighted embedding (intuitively, this is a weighted centroid in the
embedding space whose weights are determined by the scaling factor). Introducing this scaling
factor satisfies the desired qualities that:
1. an input of s = 0 weighs each term equally,
2. the weight produced by a scaling term is equal to the reciprocal of the weight produced
by the negative of that scaling term (i.e., a positive input s_0 weighs in favor of the text
input exactly as much as −s_0 weighs in favor of the image input), and
3. c limits to the sole input q or t as s approaches −1 or 1, respectively.
        </p>
        <p>Our search engine then computes cosine similarity scores between this combined embedding
and each embedding in beto, returning the top scores as input by the user.</p>
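        <p>Equation 2 can be sketched directly; since cosine similarity is scale-invariant, the combined embedding need not be re-normalized before scoring against beto. Variable names are illustrative.</p>

```python
import torch

def combine(q, t, s):
    # Weighted centroid of the image embedding q and text embedding t,
    # per Equation 2; s in [-1, 1] shifts weight toward t (s positive)
    # or toward q (s negative).
    return ((1 - s) * q + (1 + s) * t) / 2
```

        <p>At s = 0 the two inputs are weighted equally, and the limits s = 1 and s = −1 recover pure text-input and pure image-input search, respectively.</p>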
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Results &amp; Discussion</title>
      <p>In this section, we introduce example search results for all three strategies described in Section
4 and reflect on the strengths and limitations of our implementation. We also describe our
utilization of the LC Labs AI Planning Framework throughout our research process.</p>
      <sec id="sec-7-1">
        <title>5.1. Search Results</title>
        <p>We begin by presenting example search results for all three approaches and reflecting on
strengths and weaknesses. These observations are derived from our conversations with staff
in the Library of Congress Geography and Map Division, who have experimented with our
search implementation and have offered feedback based on their experiences with patron
requests. As a brief observation in regard to the performance of our search implementation, we
note that our search implementation processes a query, searches over all half a million images,
and finds the most relevant results in less than a second on an M3 MacBook Pro with 18GB
memory. This indicates that our implementation is both responsive and scalable.</p>
        <sec id="sec-7-1-1">
          <title>5.1.1. Text-input Search</title>
          <p>In Figure5, we show three examples of text-input searches with natural language queries:
“tattered and worn map,” “old panoramic map surrounded by images of buildings,” and “a map with
illustrations of 19th-century ships.” We chose these three examples in order to demonstrate a
range of different searches that can be performed, from specific content within the maps (e.g.,
ships), to map styles (e.g., panoramic maps) and layouts (e.g., surrounded by images of
buildings), to time periods (e.g., 19th-century), to the material properties of the maps (e.g., tattered
and worn).</p>
          <p>Significantly, these approaches complement the metadata found in the catalog records for
these maps. For example, a text search for “celestial map” yields eight relevant celestial charts
out of the top ten results. Given that some but not all of the loc.gov JSON records for these maps
include the word “celestial” (e.g., “celestial chart,” “celestial sphere”), our search implementation
enables the user to retrieve more relevant examples than what is possible when restricted solely
to the existing metadata.</p>
          <p>Conversely, a text search for “map with cartouche” yields more mixed results. Of the
returned images, the second and third results are maps with cartouches, and the fourth result
has a more general cartouche; however, other results do not have cartouches. A text search
at https://www.loc.gov/maps for “cartouche” returned 199 results, which generally have
cartouches depicted. In this case, existing metadata proves more useful (though it should be
noted that our search implementation can easily be extended to include metadata search as
well). Another example where metadata proves more useful is “watercolor map.” Examples
where both our search implementation and metadata-based search do not have high precision
include “maps with drawings of people” and “hand drawn maps” (though the latter could be
partially constrained by searching terms in the metadata such as “pencil,” “ink,” or “watercolor”).</p>
        </sec>
        <sec id="sec-7-1-2">
          <title>5.1.2. Image-input Search</title>
          <p>In Figure 6, we show two examples of image-input, reverse-map searches. As with the
text-input search examples, we have chosen these two maps to reflect distinct styles. This type of
search can be useful in a number of settings. For example, a user may not know the proper
vocabulary for specific visual features or styles, or relevant information may not be present in a
map’s metadata. As representative of a valuable use case, staff offered successful reverse-image
searches with portolan charts – a type of early nautical map that is visually recognizable by
diagonal lines often referred to as rhumb lines or windrose lines.</p>
          <p>Additionally, though the text-input search of “map with cartouche” in Section 5.1.1 yielded
mixed results, a reverse image search of a map with a cartouche performed significantly better
(see the top search in Figure 6), returning nine relevant maps out of the top ten maps returned
for one example and seven relevant maps out of the top ten maps for another. In general,
image-input searches typically yield more accurate results than text-based searches. Indeed,
when measuring the similarity between query and results using raw cosine similarity scores
(without applying softmax normalization), image-based searches achieve scores that are almost
triple those of text-based searches.</p>
          <p>A weakness with image-input search is that the user cannot constrain what in particular
about the specified map input is most important to them. Indeed, this was a significant
motivating factor for our implementation of joint text- &amp; image-input search.</p>
        </sec>
        <sec id="sec-7-1-3">
          <title>5.1.3. Text- &amp; Image-input Search</title>
          <p>In Figure 7, we introduce an example of joint text- &amp; image-input search with a map and the
natural language query “more grayscale,” along with three different scaling factors: s = 0, 0.3,
and −0.5. As with image-input search, this method could be particularly valuable when a user
does not know the proper vocabulary, but this method offers the added affordance of enabling
the user to specify natural language in order to tune the search. Because the user can quickly
refine searches in an interactive fashion, we believe this affordance for specifying feedback is
a promising one for exploratory search with maps.</p>
          <p>Interestingly, a combined search of a map with a cartouche along with the text input “map
with cartouche” yielded better results than the equivalent searches with text-input only and
image-input only. Using the two different example maps tested in Section 5.1.2, along with
the text input “map with cartouche” and a scaling factor of s = 0, the inputs returned ten and
nine relevant maps out of the top ten returned results, respectively. However, this method
of reinforcing the search via both text and image did not improve results for other searches
such as “hand drawn maps,” “tattered map,” or “watercolor map,” likely owing to the lack of
additional information provided to the model across the two modes.</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>5.2. Applying the LC Labs AI Planning Framework</title>
        <p>
          Throughout our research, we have adhered to the LC Labs AI Planning Framework in order
to engage with the responsible and ethical dimensions of this work [19]. We selected this
framework in accordance with our use of Library of Congress materials. Created by LC Labs
at the Library of Congress in 2023, the AI Planning Framework articulates three distinct
elements (data, models, and people) across three phases (understanding, experimenting, and
implementing) of a project’s development [
          <xref ref-type="bibr" rid="ref1">31</xref>
          ]. To offer relevant considerations and facilitate
documentation during a project’s development, the LC Labs AI Planning Framework provides
three worksheets on data privacy and transparency [19]:
1. Use Case Risk Worksheet, “to assist staff in assessing the risk profile of an AI use case.”
2. Phase II Risk Analysis, “to articulate success criteria, measures, risks, and benefits for an
AI Use Case.”
3. Data Readiness Assessment, “to assess readiness and availability of data for the proposed
use case.”
        </p>
        <p>We have completed all three worksheets and included them in our GitHub repository. Based on
our reflections during our completion of the worksheets, we note a few salient points. Because
all training and search data are obtained from the Library of Congress, the overwhelming
majority of the maps included are in the public domain (for any questions pertaining to a particular
map’s copyright, its included metadata can be consulted). We note that our fine-tuning dataset,
taken directly from the larger corpus of maps described in Section 3.1, is used to fine-tune the
model and evaluate performance, as described in Section 3.3 and Section A in the Appendix. In
our worksheets, we describe the requirements and evaluations by Geography and Map staff at
the Library of Congress, who serve as proxy evaluators for the intended end-use researchers.
Lastly, we note that the absence of personally identifiable information, the low cost of mistakes
in search and discovery, and the rigorous evaluations of our process make our application a
low-risk use case according to the AI Planning Framework.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusion &amp; Future Work</title>
      <sec id="sec-8-1">
        <title>6.1. Conclusion</title>
        <p>In this paper, we have asked a central question within the computational humanities: how
might emerging methods from multimodal machine learning be utilized to facilitate searching
large-scale map collections? To address this question, we have built out a search
implementation for 562,842 images of maps publicly available via the Library of Congress’s API. In
particular, we have produced CLIP embeddings for all 562,842 images and introduced a search
implementation that enables three different kinds of search inputs: 1) natural language,
text-based inputs, 2) visual, image-based inputs, and 3) multimodal, combined text- and image-based
inputs. In consultation with staff in the Library of Congress Geography and Map Division, we
have explored example searches and demonstrated the utility of these search methods.
Moreover, we have demonstrated a commitment to responsible and ethical AI practices by following
the LC Labs AI Planning Framework. For further work with historic maps, we have released
a dataset of 10,504 map-caption pairs, along with an architecture for fine-tuning CLIP on this
dataset. To facilitate transparency and re-usability for our code by end-users such as scholars
and practitioners, we have released all of our code into the public domain as Jupyter notebooks.
In what remains, we explore the extensibility of our approaches to other digital collections held
by galleries, libraries, archives, and museums, as well as describe other future work.</p>
      </sec>
      <sec id="sec-8-2">
        <title>6.2. Toward Improved Discoverability in GLAM Collections with CLIP</title>
        <p>Digital collections continue to grow at enormous rates. Developing methods of facilitating
search and discovery is more important than ever in order to contend with the challenge of
scale. Our search implementation leveraging CLIP has demonstrated the potential for
searching maps beyond their catalog records and existing metadata. Here, natural language inputs
facilitate interactive navigation, an important component of exploratory search.</p>
        <p>With the marked improvements in multimodal machine learning over the past few years,
it is clear that there are manifold opportunities to improve access through the application of
these methods. Significantly, these methods are extensible to a wide range of digitized and
born-digital collections currently only searchable via metadata and text search. In the case of
born-digital collections such as web archives, the lack of structured metadata at the webpage
level necessitates the exploration of these methodologies. Example searches that would be
enabled by applying CLIP-like approaches range from finding heavily redacted pages
in born-digital government documents to identifying specific motifs in rare book illustrations.
We have shown that our implementation can render half a million images searchable on a
single laptop, demonstrating that such approaches are scalable to millions of items with little
modification.</p>
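        <p>To make this scale claim concrete, the core lookup can be sketched as follows: with embeddings precomputed once, each text query reduces to a single matrix-vector product over an (N, d) array. Function and variable names below are illustrative.</p>
        <preformat>
```python
# Minimal sketch of text-based search over precomputed CLIP embeddings.
# Assumes the image embeddings were produced by the same CLIP checkpoint as
# the query embedding; names and the value of k are illustrative.
import numpy as np

def top_k_matches(text_embedding, image_embeddings, k=10):
    """Return indices of the k images most similar to the query,
    ranked by cosine similarity (descending)."""
    # Normalize so that dot products equal cosine similarities.
    q = text_embedding / np.linalg.norm(text_embedding)
    m = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = m @ q  # one matrix-vector product per query
    return np.argsort(-scores)[:k]
```
        </preformat>
        <p>Because half a million 512-dimensional float32 embeddings occupy only about 1 GB, the full score computation fits comfortably in memory on a laptop.</p>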
        <p>The application of these methodologies presents challenges as well. Digital collections are
incredibly heterogeneous, spread across time periods, languages, media types, and beyond,
with different metadata fields. Consequently, ensuring that these approaches surface relevant
facets, and doing so responsibly and ethically, must be primary considerations for this work.
As one example, developing fine-tuned models for this work is important, but recognizing
the limitations and failure modes of these approaches – and when to use machine
learning-based approaches to begin with – is just as important. We therefore advocate that researchers
continue to adopt frameworks such as the LC Labs AI Planning Framework during all stages
of a project when applying AI to digital collections.</p>
      </sec>
      <sec id="sec-8-3">
        <title>6.3. Future Work</title>
        <p>Many directions of future work remain of interest for us. First, we would like to continue the
fine-tuning experiments described in Section 3.3, as well as Section A of the Appendix. Though
our initial experiments showed mixed results, we believe additional experiments surrounding
careful implementation of a fine-tuned model to the search engine could reduce some of the
noise from inaccurate searches. Indeed, given the demonstrated utility of fine-tuned
embeddings for a range of downstream tasks in machine learning more generally, we believe that
a fine-tuned CLIP model for historic maps in particular could be beneficial to the
computational humanities community. Along these lines, we are interested in exploring additional
approaches to training and fine-tuning multimodal models, such as ones that do not utilize
contrastive learning and are not restricted by the contrastive fine-tuning mechanism [41].</p>
        <p>
          Moreover, we plan to build a proper search interface for our implementation with the goal
of hosting an exploratory search system that can be publicly accessed. Given the importance
of front-end affordances and considerations from human-computer interaction, we believe
detailed analysis surrounding the best interaction mechanisms warrants further study [12]. This
is especially important for maps, where affordances for browsing must take into consideration
the specificity of viewing and interacting with the digital objects themselves, which are often
large and span multiple sheets [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ]. User studies would be beneficial for building a
system that would be most valuable to patrons. We also plan to incorporate metadata search into
this interface, with the understanding that combining our search implementation with existing
search fields would be complementary.
        </p>
        <p>Lastly, as described in Section 6.2, we believe that these multimodal, CLIP-style approaches
to search and discovery are useful for a wide range of digital collections. As a result, we have
begun exploring extensibility to other document types including web archives, born-digital
documents, digitized books, and digitized newspapers. Indeed, given the ongoing, worldwide
efforts surrounding the creation and stewardship of both digitized and born-digital collections,
continuing to refine methods for improving discoverability will only grow in importance over
the coming years.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Library of Congress Kluge Fellowship in Digital
Studies (BCGL), as well as the Archer Fellowship Program and the University of Texas System (JM).
At the Library of Congress, we would like to thank Rachel Trent, Amelia Raines, and Sundeep
Mahendra in the Geography and Map Division for their many-month collaboration
surrounding querying the digitized maps held by the LC, as well as their invaluable advice and input
surrounding relevant searches for map patrons. We would also like to thank Abigail Potter
and Brian Foo in LC Labs for their guidance surrounding the Library of Congress’s AI
Planning Framework. Lastly, we thank Michael Stratmoen and Travis Hensley in the Kluge Center
for their support of the internship and fellowship programs that made this collaboration and
paper possible.
</p>
      <p>[18] M. Kuzma and A. Moscicka. Evaluation of Metadata Describing Topographic Maps in a
National Library. 2020. url: https://www.proquest.com/working-papers/evaluation-metadata-describing-topographic-maps/docview/2532311981/se-2.
M. Kuźma and A. Mościcka. “Metadata evaluation criteria in respect to archival maps
description: A systematic literature review”. In: The Electronic Library 37.1 (2019), pp. 1–27. url: https://www.proquest.com/scholarly-journals/metadata-evaluation-criteria-respect-archival/docview/2535184851/se-2.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Appendix: A Description of Experiments for Fine-tuning CLIP</title>
      <p>
This section describes our experiments fine-tuning CLIP on map-caption pairs, which
have yielded mixed results to date. In our fine-tuning script, available in our repository, we
initialize the pre-trained model and processor as introduced in the Hugging Face transformers
library (openai/clip-vit-base-patch32) [
        <xref ref-type="bibr" rid="ref23">30</xref>
        ]. We define an image-text pair dataset
inheriting from the PyTorch Dataset class, into which we load lists of image paths and their
corresponding captions. This dataset is then fed into a PyTorch DataLoader with a custom collate
function that opens images, converts them to RGB, and uses the CLIP processor to batch and
preprocess texts and images together. We then use the Adam optimizer of Kingma &amp; Ba [16] with
standard hyperparameters (β₁ = 0.9, β₂ = 0.98, ε = 1 ⋅ 10⁻⁶, weight decay = 0.2) and a
scheduled learning rate.
      </p>
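      <p>A condensed sketch of this data pipeline follows, assuming the Hugging Face transformers CLIP classes; class and helper names are illustrative rather than taken verbatim from our script, and the processor is injected so the sketch stays self-contained.</p>
      <preformat>
```python
# Sketch of the data pipeline described above: an image-path/caption dataset
# and a collate function that opens images, converts them to RGB, and batches
# texts and images with the CLIP processor. Names are illustrative.
from PIL import Image
from torch.utils.data import Dataset

class MapCaptionDataset(Dataset):
    """Pairs of image paths and their metadata-derived captions."""
    def __init__(self, image_paths, captions):
        assert len(image_paths) == len(captions)
        self.image_paths = image_paths
        self.captions = captions

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        return self.image_paths[idx], self.captions[idx]

def make_collate_fn(processor):
    """Build a DataLoader collate function around a CLIP processor."""
    def collate(batch):
        paths, captions = zip(*batch)
        images = [Image.open(p).convert("RGB") for p in paths]
        return processor(text=list(captions), images=images,
                         return_tensors="pt", padding=True)
    return collate

# Model, processor, and optimizer initialization (downloads weights, so shown
# as comments). Betas, eps, and weight decay follow the text; the learning
# rate shown here is illustrative.
# from transformers import CLIPModel, CLIPProcessor
# import torch
# model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
#                              betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)
```
      </preformat>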
      <p>The script initiates a 16-epoch training loop. Each epoch processes 15000//32 + 1 = 469
batches of 32 image-text pairs, where gradient computations are reset and inputs are prepared
and passed through the model for each batch. The model computes logits for images and
texts and calculates a symmetric InfoNCE (noise-contrastive estimation) loss, a common choice for
contrastive learning models such as CLIP, which encourages the model to align the embeddings of
matching texts and images while distinguishing non-matching ones. InfoNCE loss is selected
against cross-entropy loss, on which the original model is trained, in light of the increased
“noise” associated with the larger caption pool in our fine-tuning set. The gradients are
backpropagated, and the optimizer updates the model’s weights. After each epoch, the average and
total losses are calculated and recorded. The model state, optimizer state, and training loss are
saved to a checkpoint file, allowing for training or evaluation to be resumed later.</p>
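      <p>The symmetric InfoNCE objective described above can be sketched as standard cross-entropy applied in both directions over the batch's image-text logit matrix, where matching pairs lie on the diagonal; the function name below is illustrative.</p>
      <preformat>
```python
# Sketch of the symmetric InfoNCE objective: cross-entropy in both directions
# over a (B, B) matrix of similarity logits, matching pairs on the diagonal.
import torch
import torch.nn.functional as F

def symmetric_info_nce(logits_per_image):
    """logits_per_image[i, j]: similarity logit of image i and text j."""
    batch_size = logits_per_image.shape[0]
    targets = torch.arange(batch_size, device=logits_per_image.device)
    loss_images = F.cross_entropy(logits_per_image, targets)      # image-to-text
    loss_texts = F.cross_entropy(logits_per_image.t(), targets)   # text-to-image
    return (loss_images + loss_texts) / 2

# After each epoch, the script saves a resumable checkpoint, e.g.:
# torch.save({"model": model.state_dict(),
#             "optimizer": optimizer.state_dict(),
#             "loss": average_loss}, "checkpoint.pt")
```
      </preformat>
      <p>With the Hugging Face model, the logit matrix is available as the logits_per_image field of the CLIPModel forward output.</p>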
      <p>For our first training regimen, we selected a random sample of 50,000 standalone
image-text pairs to feed into the fine-tuning model. Because loc.gov metadata are
typically written at the item level, we limited our sample to images that are part of items with
fewer than 10 segments to avoid vague or potentially inaccurate metadata-derived captions.
This configuration resulted in a loss reduction of about 50% over 16 epochs, with the
logarithmic decline suggesting that the average error would decrease below one only after 35 to 40
epochs. We chose neither to pursue this path nor to increase the dataset size out of caution
for overfitting (a training set of 50,000 already represents 10% of the entire collection) and for
potential inaccuracies in the metadata-derived caption generation. The accuracy for the
specific task of search and discovery across the entire collection suffers qualitatively with this
fine-tuned model as compared to the base model.</p>
      <p>
For our second training regimen, we utilize the dataset introduced in Section 3.1. From the
first iteration, we recognized that the broader map collection is heavily skewed toward the
Division’s collection of Sanborn fire insurance maps, which biased the fine-tuning data. The
dataset of 10,504 map-caption pairs was constructed with this consideration in mind. We then
record the loss reduction across five epochs for an array of learning rates and batch sizes,
finding little significant reduction (and, at times, an increase) in validation loss. This occurred
across several batch sizes (8, 16, 32, 64) and learning rates, though the decline was more modest
for smaller batch sizes. This phenomenon owes partially to the regressive nature of machine
learning for search and discovery. Whereas contrastive models tasked with classifying across
a discrete set of outputs (for instance, a list of possible years during which a photograph was
taken [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ]) largely benefit from several cycles of supervised learning, the infinite label space of
a regression problem requires the model to interpolate or extrapolate beyond the finite set of
examples seen during training.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Alkemade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Claeyssens</surname>
          </string-name>
          , G. Colavizza,
          <string-name>
            <given-names>N.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Neudecker</surname>
          </string-name>
          , G. Osti, and
          <string-name>
            <given-names>D.</given-names>
            <surname>van Strien</surname>
          </string-name>
          . “
          <article-title>Datasheets for Digital Cultural Heritage Datasets”</article-title>
          .
          In:
          <source>Journal of Open Humanities Data</source>
          (
          <year>2023</year>
          ). doi: https://doi.org/10.5334/johd.124.
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Anadol</surname>
          </string-name>
          . Archive Dreaming. http://refikanadol.com/works/archive-dreaming/.
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barancová</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wevers</surname>
          </string-name>
          , and N. van Noord.
          <source>Blind Dates: Examining the Expression of Temporality in Historical Photographs</source>
          .
          <year>2023</year>
          . arXiv: 2310.06633 [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>CordellM</surname>
          </string-name>
          .achine Learning +
          <source>Libraries: A Report on the State of the Field</source>
          .
          <year>2020</year>
          . url: https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-
          <article-title>ML-repor</article-title>
          .t.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Diagne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Barradeau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Doury</surname>
          </string-name>
          . t-SNE Map. https://experiments.withgoogle.com/t-sne-map.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Duhaime</surname>
          </string-name>
          . PixPlot. https://github.com/YaleDHLab/pix-plot.
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Edelson</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ferster</surname>
          </string-name>
          . “
          <article-title>MapScholar: A Web Tool for Publishing Interactive Cartographic Collections”</article-title>
          .
          In:
          <source>Journal of Map &amp; Geography Libraries</source>
          9.1-2 (
          <year>2013</year>
          ), pp.
          <fpage>81</fpage>
          -
          <lpage>107</lpage>
          . doi: https://doi.org/10.1080/15420353.2012.747463.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Foo</surname>
          </string-name>
          .
          <article-title>Visualizing AMNH Image Collection with Machine Learning</article-title>
          . https://github.com/amnh-sciviz/image-collection.
          <year>2020</year>
          .
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alberti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Skreta</surname>
          </string-name>
          .
          <article-title>Learning Generalized Zero-Shot Learners for OpenDomain Image Geolocalization</article-title>
          .
          <year>2023</year>
          . arXiv: 2302.00275 [cs.CV]. url: https://arxiv.org/abs/2302.00275.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skreta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alberti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          .
          <source>PIGEON: Predicting Image Geolocations</source>
          .
          <year>2023</year>
          . arXiv: 2307.05845 [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hanssen-Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bognerud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hensten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          , E. Westvang, and
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Øygard</surname>
          </string-name>
          . t-SNE Map. https://github.com/nasjonalmuseet/propinquity.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          .
          <article-title>Search User Interfaces</article-title>
          . Cambridge University Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C. S.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          . “
          <article-title>MapReader: a computer vision pipeline for the semantic exploration of maps at scale”</article-title>
          .
          In:
          <source>Proceedings of the 6th ACM SIGSPATIAL International Workshop on Geospatial Humanities</source>
          . GeoHumanities '22. Seattle, Washington: Association for Computing Machinery,
          <year>2022</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>19</lpage>
          . doi: 10.1145/3557919.3565812.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jakeway</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Algee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferriter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mears</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Potter</surname>
          </string-name>
          ,
          , and K. Zwaard.
          <source>Machine Learning + Libraries Summit Event Summary</source>
          .
          <year>2020</year>
          . url: https://labs.loc.gov/static/labs/meta/ML-Event-Summary-Final-2020-02-13.pdf. [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Janes</surname>
          </string-name>
          . “
          <article-title>Of Maps and Meta-Records: Eighty-Five Years of Map Cataloguing at The National Archives of the United Kingdom”</article-title>
          .
          In:
          <source>Archivaria 74</source>
          (
          <year>2012</year>
          ), pp.
          <fpage>119</fpage>
          -
          <lpage>165</lpage>
          . url: https://archivaria.ca/index.php/archivaria/article/view/13409. [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          . “
          <article-title>Adam: A Method for Stochastic Optimization”</article-title>
          .
          In:
          <source>International Conference on Learning Representations (ICLR)</source>
          . San Diego, CA, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [19]
          <string-name>
            <surname>LC Labs</surname>
          </string-name>
          .
          <source>LC Labs AI Planning Framework</source>
          . https://libraryofcongress.github.io/labs-ai-framework/.
          <year>2023</year>
          .
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B. C. G.</given-names>
            <surname>Lee</surname>
          </string-name>
          . “
          <article-title>The “Collections as ML Data” checklist for machine learning and cultural heritage”</article-title>
          .
          <source>In: Journal of the Association for Information Science and Technology</source>
          (
          <year>2023</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi: https://doi.org/10.1002/asi.24765.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B. C. G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mears</surname>
          </string-name>
          , E. Jakeway,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferriter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yarasavage</surname>
          </string-name>
          , D. Thomas,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zwaard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <article-title>“The Newspaper Navigator Dataset: Extracting Headlines and Visual Content from 16 Million Historic Newspaper Pages in Chronicling America”</article-title>
          .
          <source>In: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management. Cikm '20</source>
          .
          <string-name>
            <surname>Virtual</surname>
            <given-names>Event</given-names>
          </string-name>
          , Ireland: Association for Computing Machinery,
          <year>2020</year>
          , pp.
          <fpage>3055</fpage>
          -
          <lpage>3062</lpage>
          . doi: https://doi.org/10.1145/3340531.341276 7.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B. C. G.</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <article-title>“Newspaper Navigator: Open Faceted Search for 1.5 Million Images”</article-title>
          .
          <source>In: Adjunct Publication of the 33rd Annual ACM Symposium on User Interface Software and Technology. UIST '20 Adjunct</source>
          . Virtual Event, USA: Association for Computing Machinery,
          <year>2020</year>
          , pp.
          <fpage>120</fpage>
          -
          <lpage>122</lpage>
          . doi:https://doi.org/10.1145/3379350.341614 3.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [23]
          <string-name>
            <surname>I. di Lenardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L. A.</given-names>
            <surname>Seguin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          .
          <source>Visual Patterns Discovery in Large Databases of Paintings</source>
          .
          <year>2016</year>
          . url: http://infoscience.epfl.ch/record/220638. [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lincoln</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Levin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Conell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <source>National Neighbors: Distant Viewing the National Gallery of Art's Collection of Collections</source>
          . https://nga-neighbors.library.cmu.edu.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mahowald</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          .
          <source>clip-loc-maps</source>
          . Zenodo,
          <year>2024</year>
          . doi: 10.5281/zenodo.11538437. url: https://doi.org/10.5281/zenodo.11538437. [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oehrli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pridal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zollinger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Siber</surname>
          </string-name>
          . “
          <article-title>MapRank: Geographical Search for Cartographic Materials in Libraries”</article-title>
          .
          In:
          <source>D-Lib Magazine</source>
          <volume>17</volume>
          (
          <year>2011</year>
          ). doi: https://doi.org/10.1045/september2011-oehrli.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Padilla</surname>
          </string-name>
          . “
          <article-title>Collections as data: Implications for enclosure”</article-title>
          .
          In:
          <source>College &amp; Research Libraries News 79.6</source>
          (
          <year>2018</year>
          ), p.
          <fpage>296</fpage>
          . doi: https://doi.org/10.5860/crln.79.6.296. url: https://crln.acrl.org/index.php/crlnews/article/view/17003.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Padilla</surname>
          </string-name>
          .
          .
          <source>Responsible Operations: Data Science, Machine Learning, and AI in Libraries</source>
          .
          <year>2020</year>
          . doi: https://doi.org/10.25333/xk7z-9g97.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>T.</given-names>
            <surname>Padilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Frost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Potvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Russey</given-names>
            <surname>Roke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Varner</surname>
          </string-name>
          .
          <source>Final Report - Always Already Computational: Collections as Data. Version 1</source>
          .
          <year>2019</year>
          . doi: https://doi.org/10.5281/zenodo.3152935.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Patil</surname>
          </string-name>
          . CLIP.
          <year>2021</year>
          . url: https://huggingface.co/docs/transformers/en/model_doc/clip.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Potter</surname>
          </string-name>
          .
          <source>Introducing the LC Labs Artificial Intelligence Planning Framework</source>
          .
          <year>2023</year>
          . url: https://blogs.loc.gov/thesignal/2023/11/introducing-the-lc-labs-artificial-intelligence-planning-framework/. A. Radford,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <surname>and I. Sutskever.</surname>
          </string-name>
          “
          <article-title>Learning Transferable Visual Models From Natural Language Supervision”</article-title>
          .
          <source>InP:roceedings of the 38th International Conference on Machine Learning</source>
          ,
          <string-name>
            <surname>ICML</surname>
          </string-name>
          <year>2021</year>
          ,
          <volume>18</volume>
          -
          <issue>24</issue>
          <year>July 2021</year>
          ,
          <string-name>
            <given-names>Virtual</given-names>
            <surname>Event</surname>
          </string-name>
          . Ed. by
          <string-name>
            <given-names>M.</given-names>
            <surname>Meila</surname>
          </string-name>
          and
          <string-name>
            <surname>T.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          Zhang. Vol.
          <volume>139</volume>
          .
          <source>Proceedings of Machine Learning Research. Pmlr</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>doi: https://doi.org/10.48550/arXiv.2103.0002.0</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Schnürer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sieber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmid-Lanter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Öztireli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Hurni</surname>
          </string-name>
          . “
          <article-title>Detection of Pictorial Map Objects with Convolutional Neural Networks</article-title>
          ”.
          <source>In: The Cartographic Journal 58.1</source>
          (
          <year>2021</year>
          ), pp.
          <fpage>50</fpage>
          -
          <lpage>68</lpage>
          . doi: https://doi.org/10.1080/00087041.2020.1738112.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [34]
          <article-title>Towards multimodal computational humanities: using CLIP to analyze late-nineteenth century magic lantern slides</article-title>
          .
          <source>CEUR Workshop Proceedings 2989. CEUR-WS.org</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>158</lpage>
          . url: https://hdl.handle.net/10067/1833300151162165141.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.</given-names>
            <surname>Smits</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wevers</surname>
          </string-name>
          . “
          <article-title>A multimodal turn in Digital Humanities. Using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections</article-title>
          ”.
          <source>In: Digital Scholarship in the Humanities 38.3</source>
          (
          <year>2023</year>
          ), pp.
          <fpage>1267</fpage>
          -
          <lpage>1280</lpage>
          . doi: https://doi.org/10.1093/llc/fqad008. eprint: https://academic.oup.com/dsh/article-pdf/38/3/1267/51309490/fqad008.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Uhl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Duan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          . “
          <article-title>Map Archive Mining: Visual-Analytical Approaches to Explore Large Historical Map Collections</article-title>
          ”.
          <source>In: ISPRS International Journal of Geo-Information 7.4</source>
          (
          <year>2018</year>
          ). doi: https://doi.org/10.3390/ijgi7040148.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Uhl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Chiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          . “
          <article-title>Towards the automated large-scale reconstruction of past road networks from historical maps</article-title>
          ”.
          <source>In: Computers, Environment and Urban Systems 94</source>
          (
          <year>2022</year>
          ). url: https://api.semanticscholar.org/CorpusID:246705848.
        </mixed-citation>
      </ref>
      <ref id="ref31a">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vane</surname>
          </string-name>
          .
          <source>Visualising the Royal Photographic Society collection: Part 2</source>
          .
          <year>2018</year>
          . url: https://www.vam.ac.uk/blog/digital/visualising-the-royal-photographic-society-collection-part-2.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vitale</surname>
          </string-name>
          . “
          <article-title>Searching maps by words: how machine learning changes the way we explore map collections</article-title>
          ”.
          <source>In: Journal of Cultural Analytics 8.1</source>
          (Apr. 21,
          <year>2023</year>
          ). doi: https://doi.org/10.22148/001c.74293.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wevers</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Smits</surname>
          </string-name>
          . “
          <article-title>The visual digital turn: Using neural networks to study historical images</article-title>
          ”.
          <source>In: Digital Scholarship in the Humanities 35.1</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>194</fpage>
          -
          <lpage>207</lpage>
          . doi: https://doi.org/10.1093/llc/fqy085.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Clifton</surname>
          </string-name>
          . “
          <article-title>Multimodal Learning With Transformers: A Survey</article-title>
          ”.
          <source>In: IEEE Transactions on Pattern Analysis &amp; Machine Intelligence</source>
          <volume>45</volume>
          .10 (
          <year>2023</year>
          ), pp.
          <fpage>12113</fpage>
          -
          <lpage>12132</lpage>
          . doi: https://doi.org/10.1109/TPAMI.2023.3275156.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          . “
          <article-title>From Washington to the world: maps and digital archives at the Library of Congress</article-title>
          ”.
          <source>In: International Journal of Humanities and Arts Computing 6.1-2</source>
          (
          <year>2012</year>
          ), pp.
          <fpage>100</fpage>
          -
          <lpage>110</lpage>
          . doi: https://doi.org/10.3366/ijhac.2012.0041.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>