<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taylor Arnold</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lauren Tilton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CHR 2024: Computational Humanities Research Conference</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Science &amp; Linguistics, University of Richmond</institution>
          ,
          <country country="US">U.S.A</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Rhetoric &amp; Communication Studies, University of Richmond</institution>
          ,
          <country country="US">U.S.A</country>
        </aff>
      </contrib-group>
      <fpage>559</fpage>
      <lpage>574</lpage>
      <abstract>
<p>Many cultural institutions have made large digitized visual collections available online, often under permissive re-use licenses. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation without the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to addressing privacy and ethical concerns. Through a case study using a collection of documentary photographs, we provide several metrics showing the efficacy and possibilities of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>explainable AI</kwd>
        <kwd>multimodal large language models (LLMs)</kwd>
        <kwd>recommender system</kwd>
        <kwd>cultural heritage</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Overview</title>
      <p>
        Numerous cultural organizations have digitized extensive visual collections and offered them
online with licenses allowing flexible reuse [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. These include national archives, major art
museums such as the Rijksmuseum and the Louvre, and private institutions such as the Getty
Museum and the Metropolitan Museum of Art [
        <xref ref-type="bibr" rid="ref12 ref15">12, 15</xref>
        ]. Third-party institutions, such as the
MediaWiki project, the Google Art Project, and the Internet Archive, have also led efforts to
produce visual corpora of cultural artifacts. These efforts correspond with movements within
academic research to move beyond textual analysis toward visual and multimodal methods
[
        <xref ref-type="bibr" rid="ref20 ref32 ref8">8, 20, 32, 47</xref>
        ]. Searching for keywords or individual works of art within (and across) these
extensive collections according to existing structured metadata is relatively straightforward. But
how do institutions help the public explore the breadth and depth of large visual collections as
visual archives [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]?
      </p>
      <p>
        It is quite an undertaking to build generous interfaces — what Whitelaw describes as “rich,
browsable interfaces that reveal the scale and complexity of digital heritage collections” [48] —
for visual cultural heritage collections [
        <xref ref-type="bibr" rid="ref45">49</xref>
        ]. Unlike digitized textual records, visual data does
not come with the kinds of built-in search and similarity metrics that can be derived from
word and n-gram counts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. One of two methods is typically used to overcome this difficulty.
The first approach starts by selecting a set of pre-specified tags to describe each image. For
example, we might tag images with their dominant colors, the number of people in the frame,
or a list of the detected objects. These tags can be generated by manual tagging, crowd-sourced
methods, or, more commonly, through the automatic application of computer vision algorithms
[
        <xref ref-type="bibr" rid="ref15 ref17">15, 17, 40</xref>
        ]. Alternatively, abstract objects known as image embeddings can be used to associate
each image with a sequence of numbers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. While each of the numbers is not individually
meaningful, images with similar sequences of numbers will share common features [
        <xref ref-type="bibr" rid="ref37">36</xref>
        ]. Image
embeddings are most commonly built using the internal representations of images within deep
learning models built for object recognition [
        <xref ref-type="bibr" rid="ref19 ref36">19, 35</xref>
        ].
      </p>
      <p>Distance metrics derived from either of these methods can be used to produce generous
interfaces through the use of approaches such as cluster analysis and recommender systems.
Building a generous interface from explicitly produced tags has the benefit of being able to
explain the resulting structures. For example, suppose we tag images with the number of
people present in the frame of the image. In that case, we can allow users to select images by
the number of people in the image and expose this as an option in a faceted search interface.
Using image embeddings, on the other hand, has the benefit of finding novel connections that can
cut across existing categorization methods. However, relationships determined by image
embeddings do not come with an immediately available description of why a set of images
is associated with one another, making it challenging to use image embeddings for faceted
search. Embedding-based connections also have the potential to produce connections between
images that suggest or reinforce stereotypes and other implicit biases.</p>
      <p>
        Recent advances in multimodal models offer the possibility of avoiding the choice between
using fixed but explainable image annotations and flexible but abstract representations of visual
data as embeddings. For example, Smits and Wevers recently showed the power of zero-shot
learning for exploring historic collections [
        <xref ref-type="bibr" rid="ref42">41</xref>
        ]. They used the CLIP model to build classification
algorithms for arbitrary tags without specifically training a model for a given category [37].
While the focus of their case studies was the analysis of specific subcategories (indoor/outdoor,
family-based tags, and scene detection), they note the potential for a “new kind of bottom-up
access to visual collections” through the application of multimodal models without the need
for extensive manual annotations [
        <xref ref-type="bibr" rid="ref42">41</xref>
        ].
      </p>
      <p>
        Over the past twelve months (mid-2023 through mid-2024), the integration of large language
models (LLMs) and generative computer vision models has allowed for a radical increase in the
capabilities of multimodal methods [
        <xref ref-type="bibr" rid="ref1 ref24 ref46">1, 24, 50</xref>
        ]. Current iterations of multimodal LLMs, such
as Google’s Gemini, OpenAI’s GPT-4 Turbo and GPT-4o, and Apple’s Ferret, allow users to
submit an image and a textual prompt and receive a free-text response in return [53]. The
results are not entirely free of errors [45]; however, the outputs have been shown to meet or
exceed human annotations on a variety of sub-tasks, even without the need for customized
finetuning [
        <xref ref-type="bibr" rid="ref21 ref48">21, 46, 52</xref>
        ]. Importantly, these multimodal LLMs far outperform previous methods for
automatically captioning images and photographs [
        <xref ref-type="bibr" rid="ref25 ref4 ref40">4, 39, 25, 34, 38</xref>
        ]. This opens the possibility
of combining the benefits of explainable tag-based methods with those of unconstrained, open-ended
embedding-based methods for exploring large collections of digitized images.
      </p>
      <p>In this paper, we present a general approach to using multimodal LLMs to search and
discover vast image repositories. Our method first generates a set of automated captions for each
image in the collection. Then, classical techniques from textual analysis are used to generate
meaningful descriptions of the connections between images. We introduce a case study to
evaluate how connections derived from multimodal captions compare to those derived from visual embeddings. In the
next section, we describe our approach in more detail and outline how we applied it to our
selected collection. Then, in the following three sections, we offer qualitative and quantitative
analyses of our approach by comparing it to image embedding-based techniques and showing
the ability of the multimodal models to generate explainable connections. We conclude with a
brief discussion showing how our approach can be extended and generalized.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>
        A typical workflow for working with an extensive collection of images is to use computer vision
to either map each image into structured annotations (e.g., the number of people present) or to
directly map the image into an abstract embedding space [
        <xref ref-type="bibr" rid="ref2 ref9">2, 9</xref>
        ]. Our method takes an
alternative approach by using multimodal LLMs to produce rich captions as an intermediate surrogate.
Text-based algorithms can then be applied to the resulting captions to produce similarity
metrics, text-embeddings, and other summarizations. Conceptually, this can be described by the
following flow of information:
      </p>
      <p>image → caption → text embedding + top terms</p>
      <p>A significant amount of customization can be applied to this framework based on the needs of
particular applications. The captions, for instance, could be exposed through a digital interface
to allow for full-text search and increase accessibility. Or, if there is a concern that
automatically generated captions may not be up to the metadata standards of the institution, they can
be hidden from view and used only as the backend underlying a clustering analysis or
recommender system. Different information can also be captured through prompt engineering and
the choices of the models used.</p>
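      <p>To make this flow concrete, the sketch below (in Python) shows one way the pipeline could be organized. The helper names caption_image and embed_text are placeholders for the concrete API calls described later in this section, not part of any particular library.</p>
      <preformat>
# Minimal sketch of the information flow: image -> caption -> text embedding.
# caption_image and embed_text are hypothetical helpers defined further below.

def process_image(path):
    caption = caption_image(path)      # multimodal LLM turns the image into a caption
    embedding = embed_text(caption)    # the caption, not the pixels, is embedded
    return {"path": path, "caption": caption, "embedding": embedding}

# Downstream steps (similarity metrics, clustering, top-term extraction)
# operate only on the captions and their embeddings.
      </preformat>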
      <p>
        In the remainder of this article, we show how this general approach can be applied to a
collection of nearly sixteen thousand digitized documentary photographs created during the
1970s by the U.S. federal government as part of the Documerica project [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For our case study,
we used the OpenAI API for the caption creation and the text embedding. The total cost of
producing the results in this paper was $287. The costs should scale linearly with the number
of images and could be reduced by a factor of four or more by using the batch-based API and
replacing intermediate steps with local techniques.
      </p>
      <p>
        We started by taking each of the images in the collection and scaling them to have the largest
dimension no greater than 1024 pixels and the smallest dimension no greater than 768. These
sizes were chosen to optimize the price of the API request while being close to the maximum
allowed size (testing suggested that smaller resolutions of the images produced much less
accurate captions). We then made an API request using the GPT-4 Turbo model (version 2024-04-09)
by submitting the image along with the query “Provide a detailed plain-text description of the
objects, activities, people, background and/or composition of this photograph” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The specific
query was manually engineered after some trial-and-error using a test set of 25 images to get
a complete description of different aspects of the image with a minimal amount of subjective
commentary. We requested that the captions be a maximum of 500 tokens. Finally, we
submitted the automatically generated captions to the OpenAI text embedding API (version 3). The
API generated textual embeddings in a 3072-dimensional space. We then generated similarity
scores between pairs of images using the cosine similarity between the textual embeddings.
To provide a point of comparison, we also passed each image through the EfficientNet
embedding using an open-source implementation [
        <xref ref-type="bibr" rid="ref7">44, 7</xref>
        ], generating a similar set of cosine similarity
scores based only on the visual image.
      </p>
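      <p>As a concrete illustration, the following sketch shows how the caption and embedding requests described above could be issued. It assumes the openai Python package (v1 interface) and Pillow; the model identifiers gpt-4-turbo-2024-04-09 and text-embedding-3-large are our assumptions based on the version and the 3072-dimensional space reported above, and error handling, batching, and caching are omitted.</p>
      <preformat>
import base64
import io

from openai import OpenAI
from PIL import Image

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Provide a detailed plain-text description of the objects, activities, "
          "people, background and/or composition of this photograph")

def caption_image(path, model="gpt-4-turbo-2024-04-09"):
    # Resize so the longest side is at most 1024px and the shortest at most 768px.
    img = Image.open(path)
    box = (1024, 768) if img.width >= img.height else (768, 1024)
    img.thumbnail(box)
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model=model,
        max_tokens=500,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + b64}},
        ]}],
    )
    return resp.choices[0].message.content

def embed_text(text, model="text-embedding-3-large"):
    # Returns a 3072-dimensional embedding of the caption text.
    resp = client.embeddings.create(model=model, input=text)
    return resp.data[0].embedding
      </preformat>
      <p>Cosine similarities between pairs of caption embeddings (or, analogously, between pairs of EfficientNet embeddings) are then simply the dot products of the normalized vectors.</p>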
      <p>Ultimately, we generated a rich caption and associated embedding for each image in the
collection using a multimodal LLM. Using these embeddings, we were able to measure the
distance between any pair of images. In the following section, we compare these with distances
derived from an embedding computed directly from the image.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Qualitative Analysis and Global Structure</title>
      <p>
        We ran the entire set of Documerica images through the method described in the previous
section. Our analysis used the color-corrected images that account for the degradation of the
online digitized photos [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. On average, the automatically generated captions used 236 tokens
(sd=47.1), corresponding to 197 words (sd=38.5). Two of the images had captions that could
not fit within the 500-token limit specified in the API request. We also had two images that
triggered the following warning message: “Your input image may contain content that is not
allowed by our safety system,” with no further output. One of the rejected images showed a
scene with heavy fog. The other was a small object floating in a pool of a purple-colored liquid.
      </p>
      <p>Two examples of the generated captions are shown in Fig. 1 and Fig. 2. The displayed
captions are indicative of those found for all of the images. Captions typically start with a
one-sentence overview of the scene shown in the photograph. Then, several sentences dive into
specific objects, activities, and lighting conditions. When the model needs to make an
inference based on partial information, the output often includes hedge phrases such as “appears
to be” or “possibly”. Over 80% of the captions include at least one of these phrases. Towards
the end of the caption, the algorithm becomes more subjective, here giving comments about
the “utilitarian” and “gloomy or overcast” ambiance of the photographs. Also, as seen in these
examples, over half of the captions end with a statement summing up what the
algorithm believes to be the main message of the image. While most of the text included in the
captions appears to be both relevant and accurate, the captions are by no means foolproof. For example,
the caption in Fig. 1 predicts that the worker is female, despite that not being at all clear from
the image. The same caption also describes the objects in the foreground as “plastic”, despite
being made of glass.<sup>1</sup></p>
      <p>
        <sup>1</sup>The entire set of captions can be downloaded for further analysis from our website: https://distantviewing.org/downloads.
      </p>
      <p>
        One way to understand the global structure of an embedding in a large vector space is to
plot the output in a smaller dimension using dimensionality reduction techniques. A
common choice for this is the UMAP dimensionality reduction projection. This algorithm tries to
approximate the local structure of points in a high-dimensional space (here, the embedding
space) in a lower-dimensional space [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Fig. 3 shows two-dimensional UMAP projections for
the multimodal LLM and the embeddings directly derived from the visual input. The visual
embedding displays larger continuous blocks of points, in contrast to the multimodal embedding,
which has more corners, bridges, and distinct islands. These features indicate that the
multimodal embedding identifies more distinct features. In the following section, we will investigate
quantitative ways of measuring the differences between the two sets of recommendations.
      </p>
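      <p>For readers wishing to reproduce this kind of overview plot, a minimal sketch using the umap-learn and matplotlib packages follows; the array name caption_embeddings and the parameter choices are illustrative assumptions rather than the exact settings used here.</p>
      <preformat>
import matplotlib.pyplot as plt
import numpy as np
import umap  # provided by the umap-learn package

# caption_embeddings: array of shape (n_images, 3072), one embedding per image
caption_embeddings = np.load("caption_embeddings.npy")  # hypothetical file

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
proj = reducer.fit_transform(caption_embeddings)

plt.scatter(proj[:, 0], proj[:, 1], s=2, alpha=0.5)
plt.title("UMAP projection of the caption embeddings")
plt.show()
      </preformat>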
    </sec>
    <sec id="sec-4">
      <title>4. Recommender System</title>
      <p>
        How can we use the information in a set of embeddings to increase the access and
discoverability of large collections? One common approach that has generally produced promising results
across many collections is recommender systems [
        <xref ref-type="bibr" rid="ref13 ref22 ref29 ref3 ref47">3, 13, 22, 29, 51</xref>
        ]. Typically, recommender
systems work by first allowing a user to pick an image (or providing one at random), and then
suggesting a set of additional thumbnails of other related photos that may also be of interest.
Clicking on a thumbnail shows a full version of the selected image and a new set of
recommendations. Moving iteratively through a sequence of recommendations provides a unique,
user-generated tour of a curated subset of a collection. At their best, the recommendations
provide meaningful connections between images while avoiding getting users stuck within a
small subset of the collection.
      </p>
      <p>
        One way to build a recommender system is to provide recommendations based on the most
similar images defined through similarity scores [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. We already have two different sets of
embeddings, those based on the captions and those from the visual embedding. We can create
similarity scores by computing the cosine similarity between the embedding vectors. These allow
us to generate a set of k recommendations for each image using the k most similar images
for any positive integer k [33]. Building a recommendation system for a large set of images is
an unsupervised learning task. There is no specific metric that we are trying to optimize for
or ground truth that we are trying to reproduce. Therefore, we cannot reduce the comparison
between our two recommendation methods to a single number. Instead, we examine several
indirect measurements to compare the image-based and multimodal recommendation systems.
      </p>
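      <p>A brute-force version of this nearest-neighbour computation is feasible for a collection of this size (roughly sixteen thousand images); the sketch below assumes the embeddings are stored as a NumPy array, and an approximate nearest-neighbour index could be substituted for larger collections.</p>
      <preformat>
import numpy as np

def top_k_recommendations(embeddings, k=5):
    """For each image, return the indices of its k most similar images
    under cosine similarity (the image itself is excluded)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T                    # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)   # never recommend an image to itself
    return np.argsort(-sim, axis=1)[:, :k]

# e.g. caption_neighbours = top_k_recommendations(caption_embeddings, k=5)
#      visual_neighbours  = top_k_recommendations(visual_embeddings, k=5)
      </preformat>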
      <p>Fig. 4 shows six sets of example recommendations. The photographs on the left-hand side
show the starting images, with the five most similar multimodal recommendations on the top
row and the five most similar image-based recommendations on the bottom row. Both sets of
recommendations yield reasonably interesting results for these six selected images. The
recommendations for the final image of a bird, for example, are very similar. However, the
multimodal results generally offer recommendations that are both more precise and more diverse.
For example, the fourth set starts with an image of three people with bicycles looking off into
the distance. The visual recommendations only pick up on the bicycles, whereas the
multimodal model also finds images with water in the background, including one image that does
not even include bicycles. Similarly, for the fifth image of a house, the visual
recommendations include rows of houses and a church; the multimodal recommendations only include
single houses with similar architecture.</p>
      <p>Another method of measuring the structure of the recommendations is to look at how
often we have symmetric recommendations. In other words, if a specific image A recommends
an image B, we want to know how likely it is that image B will recommend back to image
A. Having symmetric recommendations is generally a good feature because it indicates that
the distance metric is meaningful and that we have a fairly uniform set of recommendations.
Table 4 shows the proportion of symmetric recommendations for the two models based on the
number of recommendations made. These proportions increase as the number of neighbors
increases because there are more chances for them to map back into the original. In general, the
image-based recommendations have a lower percentage of symmetric recommendations, with
rates ranging from 22-29%, compared to the 36-50% rates of the multimodal recommendations.
These correspond with the visualization shown in Fig. 3, which shows that the multimodal
recommendations have many more tightly connected corners and clusters while still being able
to bridge between different parts of the corpus. These results indicate that the multimodal
recommendations do a better job of finding tightly associated clusters. For this corpus, it finds
these clusters without becoming too stuck in one particular part of the collection.</p>
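      <p>The symmetry proportion reported in Table 4 can be computed directly from the neighbour indices produced by the earlier sketch; the helper below is one way to express that calculation.</p>
      <preformat>
def symmetry_rate(neighbours):
    """Proportion of recommendations i -> j for which image j also recommends image i."""
    n, k = neighbours.shape
    neighbour_sets = [set(row) for row in neighbours]
    symmetric = sum(1 for i in range(n) for j in neighbours[i] if i in neighbour_sets[j])
    return symmetric / (n * k)
      </preformat>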
      <p>We can also directly compare how often the image-based and multimodal recommendations
overlap. In Table 4, we show the proportion of the recommendations from each of the two
methods that are the same as a function of the total number of neighbors. As we saw in the
small set of examples in Fig. 4, there are a small number of overlapping recommendations.
When using a recommendation size of ten, we average just over one matching recommendation.
At the same time, the recommendations are not entirely disjoint. When we use a size of
twenty-five, only 13.9% of images have no overlapping recommendations, with an average overlap of
about 3.5. Based on these metrics, we see that the caption-based method produces noticeably
different results from the image-based technique while preserving some similar structures.</p>
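      <p>The overlap statistics can be computed in the same way from the two arrays of neighbour indices; a sketch follows.</p>
      <preformat>
def mean_overlap(neighbours_a, neighbours_b):
    """Average number of shared recommendations per image between two systems."""
    shared = [len(set(a).intersection(b)) for a, b in zip(neighbours_a, neighbours_b)]
    return sum(shared) / len(shared)

def no_overlap_rate(neighbours_a, neighbours_b):
    """Proportion of images whose two recommendation lists share no images at all."""
    empty = [set(a).isdisjoint(b) for a, b in zip(neighbours_a, neighbours_b)]
    return sum(empty) / len(empty)
      </preformat>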
    </sec>
    <sec id="sec-5">
      <title>5. Explainable Recommendations</title>
      <p>A significant advantage of using captions as an intermediate step in the embeddings behind
a recommender system is that we can use the captions to describe the rationale for
associating two images. Specifically, once we have selected a fixed number of recommendations for
each image, we can use the generated captions to produce a label that describes the set of
relationships. Our approach was to first run the captions through an open-source NLP pipeline
that performed tokenization, lemmatization, and part-of-speech tagging [43]. Then, we used
log-likelihood scores to identify the nouns that most strongly differentiated the set of
recommendations from the remainder of the corpus [42]. We selected the top five most strongly
associated terms to label each set of recommendations.</p>
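      <p>The log-likelihood (G²) keyness score used to select the distinguishing terms can be sketched as follows; we assume the captions have already been tokenized, lemmatized, and filtered to nouns by the NLP pipeline described above, so the function receives plain lists of lemmas.</p>
      <preformat>
import math
from collections import Counter

def top_key_terms(target_lemmas, reference_lemmas, top_n=5):
    """Rank terms by log-likelihood (G2) keyness of the captions in a
    recommendation set (target) against the rest of the corpus (reference)."""
    tgt, ref = Counter(target_lemmas), Counter(reference_lemmas)
    n1, n2 = sum(tgt.values()), sum(ref.values())
    scores = {}
    for term, a in tgt.items():
        b = ref.get(term, 0)
        e1 = (a + b) * n1 / (n1 + n2)   # expected count in the target set
        e2 = (a + b) * n2 / (n1 + n2)   # expected count in the reference set
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[term] = g2
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
      </preformat>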
      <p>We ran an experiment to test how well the generated labels correspond to the connections.
First, we took a random set of 120 images and found the five closest recommendations for each
from both methods. Next, we constructed the five most indicative terms for each set, creating
separate sets for both recommendation methods. Then, we manually classified the proportion of
recommendations that accurately corresponded to one of the terms. As a comparison baseline, we
also took a random set of the generated terms from our set and counted the proportion of 500
randomly selected images that matched a given term. The results are shown in Table 5. The
image-based tags matched at rates in the high 80s, whereas the multimodal tags matched in the
mid-to-high 90s. These are all significantly higher than the randomly selected tags, indicating
that the matches are not primarily a result of simply supplying generic terms. The biggest
difference between the two recommendation systems is shown in the final column. Nearly 4%
of the images have no matching term under the image-based recommendations, while only two of the
multimodal-based sets match none of their terms. These results show that the top terms produced by the
captions are relatively accurate and precise. While they can be used to add context to image-based
recommendations, they perform noticeably better when applied to recommendations based on
the captions’ embeddings.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Clustering Analysis</title>
      <p>Whereas recommender systems offer a way to explore similar images within a collection, how
can the output of multimodal LLMs enable understanding the general themes within a
collection of visual objects in the first place? Another application of caption text embeddings is to
apply clustering algorithms that group together similar captions. Clustering has the advantage
of being connected to the recommender system in the sense that images within a given cluster
will tend to recommend other images within the same cluster. Also, similar to the approach
in the previous section, we can use natural language processing techniques to find key terms
that distinguish one cluster from all the others [42].</p>
      <p>
        We applied a hierarchical clustering algorithm to the complete set of captions generated by
our multimodal LLM [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. The algorithm produced a set of 32 clusters, each tagged with the
six terms that most distinguished it from all of the other clusters. These are shown in Table 6.
The benefit of hierarchical clustering is that it allows us to generate a global structure on the
clusters. Clusters in the table are ordered hierarchically so that clusters near each other on the
table are more closely related than those farther away from one another. Those at either end
of the table are the most unique and furthest away from the others.
      </p>
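      <p>One plausible way to produce and order such clusters is sketched below with scikit-learn and SciPy; the choice of average linkage over cosine distances is our assumption, as the exact configuration is not specified here.</p>
      <preformat>
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from sklearn.cluster import AgglomerativeClustering

# Assign each caption embedding to one of 32 clusters.
model = AgglomerativeClustering(n_clusters=32, metric="cosine", linkage="average")
labels = model.fit_predict(caption_embeddings)

# Order the clusters so that neighbouring rows of the table are similar clusters:
# cluster the 32 centroids hierarchically and read them off in leaf order.
centroids = np.vstack([caption_embeddings[labels == i].mean(axis=0) for i in range(32)])
cluster_order = leaves_list(linkage(centroids, method="average", metric="cosine"))
      </preformat>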
      <p>Reading through the generated topics, starting at the top of Table 6, gives an understanding
of the general structure of the Documerica collection. At the top are clusters associated with
the detrimental effects of humans on the environment, such as pollution, waste, and junkyards.
Then, we move to forms of transportation and into more productive transformations of the
earth in the form of agriculture. We then transition into pure nature photos (cluster 15). Next,
we see landscapes showing urban skylines and cityscapes. These move into other ways humans
interact directly in their environment, such as hiking outdoors (cluster 28) and skiing (cluster
29). The final clusters correspond to particular shooting sets from parades, within laboratories,
and photographs of trains and train stations.</p>
      <table-wrap id="tab6">
        <label>Table 6</label>
        <caption>
          <p>The 32 caption clusters in hierarchical order, the six terms that most distinguish each cluster, and the number of photographs in each cluster.</p>
        </caption>
        <table>
          <thead>
            <tr><th>ID</th><th>Cluster Description</th><th>Num. Photos</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>landfill; environmental; waste; pollution; debris; garbage</td><td>438</td></tr>
            <tr><td>2</td><td>old; decay; junkyard; car; destruction; scrapyard</td><td>371</td></tr>
            <tr><td>3</td><td>helicopter; airport; urban; rainy; aircraft; cockpit</td><td>140</td></tr>
            <tr><td>4</td><td>train; railway; track; railroad; station; maintenance</td><td>220</td></tr>
            <tr><td>5</td><td>aerial; landscape; river; view; waterfall; natural</td><td>1261</td></tr>
            <tr><td>6</td><td>industrial; facility; large; smoke; treatment; aerial</td><td>1354</td></tr>
            <tr><td>7</td><td>outdoor; man; activity; group; people; picnic</td><td>534</td></tr>
            <tr><td>8</td><td>man; elderly; portrait; older; technical; middle</td><td>595</td></tr>
            <tr><td>9</td><td>man; elderly; conversation; candid; moment; couple</td><td>240</td></tr>
            <tr><td>10</td><td>agricultural; rural; field; farm; crop; farming</td><td>445</td></tr>
            <tr><td>11</td><td>flower; close; plant; up; cluster; delicate</td><td>406</td></tr>
            <tr><td>12</td><td>sign; gas; store; market; billboard; advertisement</td><td>514</td></tr>
            <tr><td>13</td><td>architectural; building; church; house; cemetery; story</td><td>613</td></tr>
            <tr><td>14</td><td>car; parking; lot; vehicle; vintage; garage</td><td>395</td></tr>
            <tr><td>15</td><td>bird; flight; close; surface; rock; deer</td><td>660</td></tr>
            <tr><td>16</td><td>coastal; serene; beach; tranquil; lakeside; picturesque</td><td>651</td></tr>
            <tr><td>17</td><td>landscape; forest; tree; sunset; dramatic; mountainous</td><td>1311</td></tr>
            <tr><td>18</td><td>urban; cityscape; bridge; city; high; view</td><td>611</td></tr>
            <tr><td>19</td><td>aerial; suburban; area; view; coastal; development</td><td>172</td></tr>
            <tr><td>20</td><td>residential; house; suburban; street; story; neighborhood</td><td>223</td></tr>
            <tr><td>21</td><td>highway; street; urban; busy; traffic; bustling</td><td>628</td></tr>
            <tr><td>22</td><td>fishing; fish; underwater; net; water; coral</td><td>375</td></tr>
            <tr><td>23</td><td>boat; sailboat; sailing; maritime; water; marina</td><td>624</td></tr>
            <tr><td>24</td><td>beach; lakeside; day; activity; people; sunny</td><td>343</td></tr>
            <tr><td>25</td><td>fountain; pool; public; park; urban; plaza</td><td>87</td></tr>
            <tr><td>26</td><td>child; young; boy; girl; moment; playground</td><td>351</td></tr>
            <tr><td>27</td><td>construction; industrial; site; mining; worker; machinery</td><td>592</td></tr>
            <tr><td>28</td><td>woman; hiker; outdoor; young; individual; park</td><td>420</td></tr>
            <tr><td>29</td><td>ski; resort; winter; snowy; snow; hockey</td><td>89</td></tr>
            <tr><td>30</td><td>event; parade; street; public; vibrant; people</td><td>544</td></tr>
            <tr><td>31</td><td>laboratory; room; woman; indoor; scientific; elderly</td><td>335</td></tr>
            <tr><td>32</td><td>train; subway; station; interior; indoor; bus</td><td>369</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The clusters generated here can be integrated into a digital platform that provides a generous
interface for exploring the Documerica collection. Imagine, for example, a grid of thumbnails
showing one image randomly selected from each cluster along with the associated keywords.
Clicking on the thumbnail would create a page with a larger image version, archival metadata,
and the recommender system described in the previous section. An option to return to the grid
of clusters would be included prominently somewhere on the page. Such an interface would
allow users to explore the expanse of the collection through each of the clusters while seeing
the diversity within a cluster through the recommender system. Iteratively exploring the
collection through these global and local connections would allow for a better understanding of
the structure and overall message conveyed through the archive.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>
        There are enormous possibilities for increasing modes of access, discovery, and analysis for
visual collections through the automated generation of textual descriptions using multimodal
LLMs. In this paper, we have introduced a general framework by which images can be
converted into textual descriptions and text-based embeddings, opening them up to previously
unavailable techniques. We applied an LLM, generated a certain kind of caption, and then
used a recommender system and image clustering based on the text embedding of the caption.
We showed how this approach could be applied to a collection of documentary photographs to
produce an explainable recommender system and clustering-based descriptions of the themes
within the collection. The present study is just one straightforward application of rich
LLM-based multimodal methods. We expect to see a wide range of further applications of this general
approach in the coming years, particularly as open-source models follow their usual pattern
of catching up to the state-of-the-art results currently attainable through closed,
commercial systems [
        <xref ref-type="bibr" rid="ref10 ref26 ref48">10, 26, 52</xref>
        ].
      </p>
      <p>
        We close with two specific extensions that highlight potential avenues of application for our
framework. First, it is possible to add additional layers of safeguards to the recommendations,
an important task when building interfaces to cultural heritage collections [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This can be
done through further prompt engineering or the filtering (or replacing) of terms before the
embedding step. For example, we noticed that many terms in the captions, such as ‘man’ and
‘girl’, are gendered. As a result, the recommender system has a tendency to associate photos
of people that it believes are the same gender, which in the case of people in the background is
frequently based on inaccurate stereotypes [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Associations such as these can be mitigated,
though never entirely avoided, by automatically replacing gendered terms with neutral terms
before running the text embedding. A second extension that can be implemented with the
automatically generated captions would be to offer an interface for full-text search, allowing
for new modes of accessibility [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Full-text search could be implemented in a way that avoids displaying the (not
entirely correct) full captions themselves, or could expose them to end users along with a
disclaimer about their autogenerated nature.
      </p>
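      <p>As a minimal illustration of the first safeguard, the snippet below replaces a hypothetical list of gendered nouns before the embedding step; a production system would instead operate on the POS-tagged output of the NLP pipeline rather than whitespace tokens, and the word list here is only an example.</p>
      <preformat>
# Hypothetical mapping of gendered nouns to neutral alternatives.
GENDER_NEUTRAL = {
    "man": "person", "woman": "person", "men": "people", "women": "people",
    "boy": "child", "girl": "child", "boys": "children", "girls": "children",
}

def neutralize(caption):
    """Swap gendered nouns for neutral terms before computing the text embedding."""
    return " ".join(GENDER_NEUTRAL.get(tok.lower(), tok) for tok in caption.split())

# embedding = embed_text(neutralize(caption))
      </preformat>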
      <p>[42] A. Stefanowitsch. Corpus Linguistics: A Guide to the Methodology. Language Science Press, 2020.</p>
      <p>[43] M. Straka, J. Hajic, and J. Straková. “UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing”. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016, pp. 4290–4297.</p>
      <p>[44] M. Tan and Q. Le. “EfficientNet: Rethinking model scaling for convolutional neural networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 6105–6114.</p>
      <p>[45] S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. “Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024, pp. 9568–9578.</p>
      <p>[46] A. Verma, A. K. Yadav, M. Kumar, and D. Yadav. “Automatic image caption generation using deep learning”. In: Multimedia Tools and Applications 83.2 (2024), pp. 5309–5325.</p>
      <p>[53] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. “Ferret: Refer and Ground Anything Anywhere at Any Granularity”. In: arXiv preprint arXiv:2310.07704 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.
          <source>“GPT-4 technical report”</source>
          .
          <source>In: arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>M. M. Adnan</surname>
            ,
            <given-names>M. S. M.</given-names>
          </string-name>
          <string-name>
            <surname>Rahim</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rehman</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Mehmood</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Saba</surname>
            , and
            <given-names>R. A.</given-names>
          </string-name>
          <string-name>
            <surname>Naqvi</surname>
          </string-name>
          . “
          <article-title>Automatic image annotation based on deep learning models: a systematic review and future challenges”</article-title>
          .
          <source>In: IEEE Access 9</source>
          (
          <year>2021</year>
          ), pp.
          <fpage>50253</fpage>
          -
          <lpage>50264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Afzal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Hittawe</surname>
            ,
            <given-names>S. F.</given-names>
          </string-name>
          <string-name>
            <surname>Rashid</surname>
            ,
            <given-names>O. M.</given-names>
          </string-name>
          <string-name>
            <surname>Knio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hadwiger</surname>
            ,
            <given-names>and I. Hoteit.</given-names>
          </string-name>
          “
          <article-title>Visualization and visual analytics approaches for image and video datasets: A survey”</article-title>
          .
          <source>In: ACM Transactions on Interactive Intelligent Systems 13.1</source>
          (
          <issue>2023</issue>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Anitha Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mouneeshwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Udhaya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jasmitha</surname>
          </string-name>
          . “
          <article-title>Automated image captioning for flickr8k dataset”</article-title>
          .
          <source>In: Proceedings of International Conference on Artificial Intelligence, Smart Grid and Smart City Applications: AISGSC 2019</source>
          . Springer.
          <year>2020</year>
          , pp.
          <fpage>679</fpage>
          -
          <lpage>687</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>U. S. N.</given-names>
            <surname>Archives</surname>
          </string-name>
          . DOCUMERICA:
          <article-title>The Environmental Protection Agency's Program to Photographically Document Subjects of Environmental Concern,</article-title>
          <year>1972</year>
          -
          <fpage>1977</fpage>
          . https://catalog.ar chives.gov/id/542493.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tilton</surname>
          </string-name>
          . “
          <article-title>Automated Image Color Mapping for a Historic Photographic Collection”</article-title>
          .
          <source>In: CHR 2024: Computational Humanities Research Conference. CEUR Workshop Proceedings</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tilton</surname>
          </string-name>
          . “
          <article-title>Distant viewing toolkit: A python package for the analysis of visual culture”</article-title>
          .
          <source>In: Journal of Open Source Software</source>
          <volume>5</volume>
          .45 (
          <year>2020</year>
          ), p.
          <year>1800</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tilton</surname>
          </string-name>
          . “Distant Viewing:
          <article-title>Analyzing Large Visual Corpora”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 34.Supplement_1</source>
          (
          <issue>2019</issue>
          ), pp.
          <fpage>i3</fpage>
          -
          <lpage>i16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tilton</surname>
          </string-name>
          . Distant Viewing:
          <article-title>Computational Exploration of Digital Images</article-title>
          . MIT Press,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ravaut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          . “
          <article-title>ChatGPT's Oneyear Anniversary: Are Open-Source Large Language Models Catching up?”</article-title>
          <source>In: arXiv preprint arXiv:2311.16989</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Coleman</surname>
          </string-name>
          . “
          <article-title>Managing bias when library collections become data”</article-title>
          .
          <source>In: International Journal of Librarianship 5.1</source>
          (
          <issue>2020</issue>
          ), pp.
          <fpage>8</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuntz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Heald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahli</surname>
          </string-name>
          . “
          <article-title>Digitization and Availability of Artworks in Online Museum Collections”</article-title>
          .
          <source>In: World Intellectual Property Organization (WIPO) Economic Research Working Paper Series</source>
          <volume>75</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Deal</surname>
          </string-name>
          . “
          <article-title>Visualizing digital collections”</article-title>
          .
          <source>In: Technical Services Quarterly 32.1</source>
          (
          <issue>2015</issue>
          ), pp.
          <fpage>14</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Ç. Demiralp</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          <string-name>
            <surname>Scheidegger</surname>
            ,
            <given-names>G. L.</given-names>
          </string-name>
          <string-name>
            <surname>Kindlmann</surname>
            ,
            <given-names>D. H.</given-names>
          </string-name>
          <string-name>
            <surname>Laidlaw</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Heer</surname>
          </string-name>
          . “
          <article-title>Visual embedding: A model for visualization”</article-title>
          .
          <source>In: IEEE Computer Graphics and Applications</source>
          <volume>34</volume>
          .1 (
          <issue>2014</issue>
          ), pp.
          <fpage>10</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I. Di</given-names>
            <surname>Lenardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L. A.</given-names>
            <surname>Seguin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          . “
          <article-title>Visual patterns discovery in large databases of paintings”</article-title>
          .
          <source>In: Digital Humanities</source>
          <year>2016</year>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Pisoni</surname>
          </string-name>
          . “
          <article-title>Accessible cultural heritage through explainable artificial intelligence”</article-title>
          .
          <source>In: Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization</source>
          .
          <year>2020</year>
          , pp.
          <fpage>317</fpage>
          -
          <lpage>324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Flueckiger</surname>
          </string-name>
          and
          <string-name>
            <surname>G. Halter.</surname>
          </string-name>
          “
          <article-title>Methods and Advanced Tools for the Analysis of Film Colors in Digital Humanities</article-title>
          .”
          <source>In: DHQ: Digital Humanities Quarterly 14.4</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>K. C. Fraser</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kiritchenko</surname>
            ,
            <given-names>and I. Nejadgholi. “</given-names>
          </string-name>
          <article-title>A friendly face: Do text-to-image systems rely on stereotypes when the input is under-specified?”</article-title>
          <source>In: arXiv preprint arXiv:2302.07159</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gefen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saint-Raymond</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Venturini</surname>
          </string-name>
          . “
          <article-title>AI for digital humanities and computational social sciences”</article-title>
          .
          <source>In: Reflections on Artificial Intelligence for Humanity</source>
          (
          <year>2021</year>
          ), pp.
          <fpage>191</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hiippala</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bateman</surname>
          </string-name>
          . “
          <article-title>Semiotically-grounded distant viewing of diagrams: insights from two multimodal corpora”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 37.2</source>
          (
          <issue>2022</issue>
          ), pp.
          <fpage>405</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bharani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Yeo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Samaan</surname>
          </string-name>
          . “
          <article-title>GPT-4V passes the BLS and ACLS examinations: An analysis of GPT-4V's image recognition capabilities”</article-title>
          .
          <source>In: Resuscitation</source>
          <volume>195</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I.</given-names>
            <surname>Klinkert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>McDonnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Luxembourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Maarten</given-names>
            <surname>Altelaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. R.</given-names>
            <surname>Amstalden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Piersma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Heeren</surname>
          </string-name>
          . “
          <article-title>Tools and strategies for visualization of large image data sets in high-resolution imaging mass spectrometry”</article-title>
          .
          <source>In: Review of scientific instruments 78.5</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>B. C. G. Lee. “</surname>
          </string-name>
          <article-title>The “Collections as ML Data” checklist for machine learning and cultural heritage”</article-title>
          .
          <source>In: Journal of the Association for Information Science and Technology</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Shan</surname>
          </string-name>
          . “
          <article-title>LICO: explainable models with language-image consistency”</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cui</surname>
          </string-name>
          , W. Ma, and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          . “
          <article-title>Feature fusion via multi-target learning for ancient artwork captioning”</article-title>
          .
          <source>In: Information Fusion</source>
          <volume>97</volume>
          (
          <year>2023</year>
          ), p.
          <fpage>101811</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          . “
          <article-title>Visual instruction tuning”</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          . “
          <article-title>UMAP: Uniform manifold approximation and projection for dimension reduction”</article-title>
          .
          <source>In: arXiv preprint arXiv:1802.03426</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Meinecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hall</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Jänicke</surname>
          </string-name>
          . “
          <article-title>Towards enhancing virtual museums by contextualizing art through interactive visualizations”</article-title>
          .
          <source>In: ACM Journal on Computing and Cultural Heritage 15.4</source>
          (
          <year>2022</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Moreux</surname>
          </string-name>
          . “
          <article-title>Intelligence artificielle et indexation des images”</article-title>
          .
          <source>In: Journées du patrimoine écrit: “L'image aura-t-elle le dernier mot? Regards croisés sur les collections iconographiques en bibliothèques”</source>
          .
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>C.</given-names>
            <surname>Morse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Landau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lallemand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wieneke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Koenig</surname>
          </string-name>
          . “
          <article-title>From #museumathome to #athomeatthemuseum: Digital museums and dialogical engagement beyond the COVID19 pandemic”</article-title>
          .
          <source>In: ACM Journal on Computing and Cultural Heritage (JOCCH) 15.2</source>
          (
          <year>2022</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>F.</given-names>
            <surname>Murtagh</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Legendre</surname>
          </string-name>
          . “
          <article-title>Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion?</article-title>
          ”
          <source>In: Journal of classification 31</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>274</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>R.</given-names>
            <surname>Paiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chefer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          . “
          <article-title>No token left behind: Explainability-aided image classification and generation”</article-title>
          .
          <source>In: European Conference on Computer Vision</source>
          . Springer.
          <year>2022</year>
          , pp.
          <fpage>334</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Petukhova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Matos-Carvalho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Fachada</surname>
          </string-name>
          . “
          <article-title>Text clustering with LLM embeddings”</article-title>
          .
          <source>In: arXiv preprint arXiv:2403.15112</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Puscasiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-I.</given-names>
            <surname>Gota</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Valean</surname>
          </string-name>
          . “
          <article-title>Automated image captioning”</article-title>
          .
          <source>In: 2020 IEEE international conference on automation, quality and testing, robotics (AQTR)</source>
          . IEEE.
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khorram</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fuxin</surname>
          </string-name>
          . “
          <article-title>Embedding deep networks into visual explanations”</article-title>
          .
          <source>In: Artificial Intelligence</source>
          <volume>292</volume>
          (
          <year>2021</year>
          ), p.
          <fpage>103435</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          . “
          <article-title>Learning explainable embeddings for deep networks”</article-title>
          .
          <source>In: NIPS Workshop on Interpretable Machine Learning</source>
          . Vol.
          <volume>31</volume>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al. “
          <article-title>Learning transferable visual models from natural language supervision”</article-title>
          .
          <source>In: International conference on machine learning. PMLR</source>
          .
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Tommasino</surname>
          </string-name>
          . “
          <article-title>Automatic image captioning combining natural language processing and deep neural networks”</article-title>
          .
          <source>In: Results in Engineering</source>
          <volume>18</volume>
          (
          <year>2023</year>
          ), p.
          <fpage>101107</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sheng</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          . “
          <article-title>Generating captions for images of ancient artworks”</article-title>
          .
          <source>In: Proceedings of the 27th ACM international conference on multimedia</source>
          .
          <year>2019</year>
          , pp.
          <fpage>2478</fpage>
          -
          <lpage>2486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>N.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          . “
          <article-title>Cutting the Frame: An In-Depth Look at the Hitchcock Computer Vision Dataset”</article-title>
          .
          <source>In: Journal of open humanities data 10.1</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>T.</given-names>
            <surname>Smits</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wevers</surname>
          </string-name>
          . “
          <article-title>A multimodal turn in Digital Humanities. Using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 38.3</source>
          (
          <year>2023</year>
          ), pp.
          <fpage>1267</fpage>
          -
          <lpage>1280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wevers</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Smits</surname>
          </string-name>
          . “
          <article-title>The visual digital turn: Using neural networks to study historical images”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 35.1</source>
          (
          <year>2020</year>
          ), pp.
          <fpage>194</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>M.</given-names>
            <surname>Whitelaw</surname>
          </string-name>
          . “
          <article-title>Generous interfaces for digital cultural collections”</article-title>
          .
          <source>In: Digital humanities quarterly 9.1</source>
          (
          <year>2015</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>F.</given-names>
            <surname>Windhager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Federico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Schreder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Glinka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dörk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miksch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Mayr</surname>
          </string-name>
          . “
          <article-title>Visualization of cultural heritage collection data: State of the art and future challenges”</article-title>
          .
          <source>In: IEEE transactions on visualization and computer graphics 25.6</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>2311</fpage>
          -
          <lpage>2330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          . “
          <article-title>GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?”</article-title>
          <source>In: arXiv preprint arXiv:2311.15732</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Zeng</surname>
          </string-name>
          . “
          <article-title>VISAtlas: An image-based exploration and query system for large visualization collections via neural image embedding”</article-title>
          .
          <source>In: IEEE Transactions on Visualization and Computer Graphics</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          . “
          <article-title>A Survey on Multimodal Large Language Models”</article-title>
          .
          <source>In: arXiv preprint arXiv:2306.13549</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>