Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models Taylor Arnold1,∗,† , Lauren Tilton2,† 1 Data Science & Linguistics, University of Richmond, U.S.A 2 Rhetoric & Communication Studies, University of Richmond, U.S.A Abstract Many cultural institutions have made large digitized visual collections available online, often under per- missible re-use licences. Creating interfaces for exploring and searching these collections is difÏcult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state- of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation with- out the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to addressing privacy and ethical concerns. Through a case study using a collection of documentary photographs, we provide several metrics showing the efÏcacy and possibilities of our approach. Keywords explainable AI, multimodal large language models (LLMs), recommender system, cultural heritage 1. Overview Numerous cultural organizations have digitized extensive visual collections and offered them online with licenses allowing flexible reuse [30]. These include national archives, major art museums such as the Rijksmuseum and the Louvre, and private institutions such as the Getty Museum and the Metropolitan Museum of Art [12, 15]. Third-party institutions, such as the MediaWiki project, the Google Art Project, and the Internet Archive, have also led efforts to produce visual corpora of cultural artifacts. These efforts correspond with movements within academic research to move beyond textual analysis toward visual and multimodal methods [8, 20, 32, 47]. Searching for keywords or individual works of art within (and across) these extensive collections according to existing structure metadata is relatively straightforward. But how do institutions help the public explore the breadth and depth of large visual collections as visual archives [28]? CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus University, Denmark ∗ Corresponding author. † These authors contributed equally. £ tarnold2@richmond.edu (T. Arnold); ltilton@richmond.edu (L. Tilton) ç https://statsmaths.github.io (T. Arnold); https://laurentilton.com/ (L. Tilton) ȉ 0000-0003-0576-0669 (T. Arnold); 0000-0003-4629-8888 (L. Tilton) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 559 CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings It is quite an undertaking to build generous interfaces — what Whitelaw describes as “rich, browsable interfaces that reveal the scale and complexity of digital heritage collections” [48] — for visual cultural heritage collections [49]. Unlike digitized textual records, visual data does not come with the kinds of built-in search and similarity metrics that can be dervied from word and n-gram counts [9]. One of two methods is typically used to overcome this difÏculty. The first approach starts by selecting a set of pre-specified tags to describe each image. For example, we might tag images with their dominant colors, the number of people in the frame, or a list of the detected objects. These tags can be generated by manual tagging, crowd-sourced methods, or, more commonly, through the automatic application of computer vision algorithms [15, 17, 40]. Alternatively, abstract objects known as image embeddings can be used to associate each image with a sequence of numbers [14]. While each of the numbers is not individually meaningful, images with similar sequences of numbers will share common features [36]. Image embeddings are most commonly built using the internal representations of images within deep learning models built for object recognition [19, 35]. Distance metrics derived from either of these methods can be used to produce generous interfaces through the use of approaches such as cluster analysis and recommender systems. Building a generous interface from explicitly produced tags has the benefit of being able to explain the resulting structures. For example, suppose we tag images with the number of peo- ple present in the frame of the image. In that case, we can allow users to select images by the number of people in the image and expose this as an option in a faceted search interface. Using image embeddings, on the other hand, benefits by finding novel connections that can cut across existing categorization methods. However, relationships determined by image em- beddings do not correspond to an immediately available description of why a set of images are associated with one another, making it challenging to use image embeddings for faceted search. Embedding-based connections also have the potential to produce connections between images that suggest or reinforce stereotypes and other implicit biases. Recent advances in multimodal models offer the possibility of avoiding the choice between using fixed but explainable image annotations and flexible but abstract representations of visual data as embeddings. For example, Smits and Weavers recently showed the power of zero-shot learning for exploring historic collections [41]. They used the CLIP model to build classification algorithms for arbitrary tags without specifically training a model for a given category [37]. While the focus of their case studies was the analysis of specific subcategories (indoor/outdoor, family-based tags, and scene detection), they note the potential for a “new kind of bottom-up access to visual collections” through the application of multimodal models without the need for extensive manual annotations [41]. Over the past twelve months (mid-2023 through mid-2024), the integration of large language models (LLMs) and generative computer vision models has allowed for a radical increase in the capabilities of multimodal methods [1, 24, 50]. Current iterations of multimodal LLMs, such as Google’s Gemini, OpenAI’s GPT-4-Turbo and GTP-4o, and Apple’s FERRT, allow users to submit an image and a textual prompt and receive a free-text response in return [53]. The results are not entirely free of errors [45], however the outputs have been shown to meet or exceed human annotations on a variety of sub-tasks, even without the need for customized fine- tuning [21, 46, 52]. Importantly, these multimodal LLMs far outperform previous methods for automatically captioning images and photographs [4, 39, 25, 34, 38]. This opens the possibilty 560 for combining the benefit of explainable tag-based methods and unconstrained open-ended embedding-based methods for exploring large collections of digitized images. In this paper, we present a general approach to using multimodal LLMs to search and dis- cover vast image repositories. Our method first generates a set of automated captions for each image in the collection. Then, classical techniques from textual analysis are used to generate meaningful descriptions of the connections between images. We introduce a case study to eval- uate how multimodal-based captions compare to those generated by visual embeddings. In the next section, we describe our approach in more detail and outline how we applied it to our selected collection. Then, in the following three sections, we offer qualitative and quantitative analyses of our approach by comparing it to image embedding-based techniques and showing the ability of the multimodal models to generate explainable connections. We conclude with a brief discussion showing how our approach can be extended and generalized. 2. Method A typical workflow for working with an extensive collection of images is to use computer vision to either map each image into structured annotations (e.g., the number of people present) or to directly map the image into an abstract embedding space [2, 9]. Our method takes an alterna- tive approach by using multimodal LLMs to produce rich captions as an intermediate surrogate. Text-based algorithms can then be applied to the resulting captions to produce similarity met- rics, text-embeddings, and other summarizations. Conceptually, this can be described by the following flow of information: image → caption → text embedding + top terms A significant amount of customization can be applied to this framework based on the needs of particular applications. The captions, for instance, could be exposed through a digital interface to allow for full-text search and increase accessibility. Or, if there is a concern that automati- cally generated captions may not be up to the metadata standards of the institution, they can be hidden from view and used only as the backend underlying a clustering analysis or recom- mender system. Different information can also be captured through prompt engineering and the choices of the models used. In the remainder of this article, we show how this general approach can be applied to a collection of nearly sixteen thousand digitized documentary photographs created during the 1970s by the U.S. federal government as part of the Documerica project [5]. For our case-study, we used the OpenAI API for the caption creation and the text embedding. The total cost of producing the results in this paper were $287. The costs should scale linearly with the number of images and could be reduced by a factor of four or more by using the batch-based API and replacing intermediate steps with local techniques. We started by taking each of the images in the collection and scaling them to have the largest dimension no greater than 1024 pixels and the smallest dimension no greater than 768. These sizes were chosen to optimize the price of the API request while being close to the maximum allowed size (testing suggested that smaller resolutions of the images produced much less accu- rate captions). We then made an API request using the GPT-4 Turbo model (version 2024-04-09) 561 by submitting the image along with the query “Provide a detailed plain-text description of the objects, activities, people, background and/or composition of this photograph” [1]. The specific query was manually engineered after some trial-and-error using a test set of 25 images to get a complete description of different aspects of the image with a minimal amount of subjective commentary. We requested that the captions be a maximum of 500 tokens. Finally, we submit- ted the automatically generated captions to the OpenAI text embedding API (version 3). The API generated textual embeddings in a 3072-dimensional space. We then generated similarity scores between pairs of images using the cosine similarity between the textual embeddings. To provide a point of comparison, we also passed each image through the EfÏcentNet embed- ding using an open-source implementation [44, 7], generating a similar set of cosine similarity scores based only on the visual image. Ultimately, we generated a rich caption and associated embedding for each image in the collection using a multimodal LLM. Using these embeddings, we were able to measure the dis- tance between any pair of images. In the following section, we compared these with distances generated through an embedding generated directly from the image. 3. Qualitative Analysis and Global Structure We ran the entire set of Documerica images through the method described in the previous section. Our analysis used the color-corrected images that account for the degradation of the online digitized photos [6]. On average, the automatically generated captions used 236 tokens (sd=47.1), corresponding to 197 words (sd=38.5). Two of the images had captions that could not fit within the 500 token limit specified in the search query. We also had two images that triggered the following warning message: “Your input image may contain content that is not allowed by our safety system,” with no further output. One of the rejected images showed a scene with heavy fog. The other was a small object floating in a pool of a purple-colored liquid. Two examples of the generated captions are shown in Fig. 1 and Fig. 2. The displayed cap- tions are indicative of those found for all of the images. Captions typically start with a one- sentence overview of the scene shown in the photograph. Then, several sentences dive into specific objects, activities, and lighting conditions. When the model needs to make an infer- ence based on partial information, the output often includes hedge phrases such as “appears to be” or “possibly”. Over 80% of the captions include at least one of these phrases. Towards the end of the caption, the algorithm becomes more subjective, here giving comments about the “utilitarian” and “gloomy or overcast” ambiance of the photographs. Also, as seen in these examples, over half of the captions end with a summarizing statement that sums up what the algorithm believes to be the main message of the image. While most of text included in the captions appear to be both relevant and accurate, they are by no means foolproof. For example, the caption in Fig. 1 predicts that the worker is female, despite that not being at all clear from the image. The same caption also describes the objects in the foreground as “plastic”, despite being made of glass.1 One way to understand the global structure of an embedding in a large vector space is to 1 The entire set of captions can be downloaded for further analysis from our website: https://distantviewing.org/d ownloads. 562 Figure 1: Documerica image (ID: 558353) with an automatically generated caption: “This photograph captures a scene inside an industrial or agricultural processing facility, possibly a dairy. The image focuses on a female worker engaged in operating machinery used for filling milk bottles. She is wearing glasses and a black top, and appears to be adjusting or handling the bottles or the machine. The worker is surrounded by a complex arrangement of stainless steel machinery and equipment. On the right side of the frame, there’s a large metal vat or container, possibly holding milk or another liquid, with its lid partially visible. In the foreground, we see a series of white plastic milk bottles, each adorned with a red label. These bottles are lined up on a conveyor belt or a filling station, which seems to be part of a mechanized assembly line designed for packaging. The machinery includes various knobs, valves, metal structures, and hoses participating in the bottling process. The interior setting is lit with what appears to be fluorescent lighting, adding a utilitarian ambience to the workspace. The overall scene conveys a moment of daily operations wherein the worker is focused on ensuring the efficiency and correctness of the bottling process.” Figure 2: Documerica image (ID: 548671) with an automatically generated caption: “This photograph captures an open outdoor scene in a flat, barren park-like area with sparse vegetation. In the foreground, a wooden picnic table is situated prominently. The surface on which the table rests appears to be gravel, and it is located near a few scattered small bushes and very young trees. Beyond the picnic table, the landscape opens into a wider, desolate area that features numerous similar young trees and scattered bushes, all of which are relatively small and sparse, suggesting relatively young plantings in a possibly harsh environment. The soil looks rocky, further implying tough growing conditions. The sky above is cloudy, contributing to a somewhat gloomy or overcast ambiance. Overall, the setting appears to be a designed park area undergoing early stages of development, with the focus on newly planted vegetation in an otherwise empty, flat terrain.” 563 Figure 3: UMAP projection showing the structure of the text-based and multimodal embeddings. Each point corresponds to an image in the corpus; points near one another in the plot are also close together in the corresponding high-dimensional embedding spaces. plot the output in a smaller dimension using dimensionality reduction techniques. A com- mon choice for this is the UMAP dimensionality reduction projection. This algorithm tries to approximate the local structure of points in a high-dimensional space (here, the embedding space) in a lower-dimensional space [27]. Fig. 3 shows two-dimensional UMAP projections for the multimodal LLM and the embeddings directly derived from the visual input. The visual em- bedding displays larger continuous blocks of points, in contrast to the multimodal embedding, which has more corners, bridges, and distinct islands. These features indicate that the multi- modal embedding identifies more distinct features. In the following section, we will investigate quantitative ways of measuring the differences between the two sets of recommendations. 4. Recommender System How can we use the information in a set of embeddings to increase the access and discoverabil- ity of large collections? One common approach that has generally produced promising results across many collections is recommender systems [3, 13, 22, 29, 51]. Typically, recommender systems work by first allowing a user to pick an image (or providing one at random), and then suggesting a set of additional thumbnails of other related photos that may also be of interest. Clicking on a thumbnail shows a full version of the selected image and a new set of recom- mendations. Moving iteratively through a sequence of recommendations provides a unique, user-generated tour of a curated subset of a collection. At their best, the recommendations provide meaningful connections between images while avoiding getting users stuck within a small subset of the collection. One way to build a recommender system is to provide recommendations based on the most similar images defined through similarity scores [23]. We already have two different sets of embeddings, those based on the captions and those from the visual embedding. We can create 564 Figure 4: Six example images are shown on the left-hand side of the figure. Directly to the right of each of these images are the five closest recommendations based on the multimodal search. Below these recommendations are the five nearest recommendations based on the image-based recommendations. 565 Table 1 Percentage of symmetric recommendations—the recommendation to one image will include a back recommendation to the starting image—based on the total number of recommendations made. Num. Recommendations: 1 5 10 15 20 25 Image-Based 22.4% 24.9% 26.2% 27.3% 28.2% 28.9% Multimodal-Based 36.5% 45.5% 47.9% 48.6% 49.3% 49.8% Table 2 Correspondence between the image-based and multimodal recommendations as a function of the num- ber of recommendations. The table shows the average number of recommendations that are the same, the proportion of images that have no overlap between the two sets, and the proportion that have at most one overlap. Num. Recommendations Avg. Overlap No-Overlap Overlap <= 1 1 0.11 88.8% 100.0% 5 0.59 58.7% 86.7% 10 1.24 38.5% 67.0% 15 1.97 26.3% 51.4% 20 2.74 19.1% 39.5% 25 3.57 13.9% 30.7% similar scores by computing the cosine similarity between the embedding vectors. These allow us to generate a set of 𝑁 recommendations for each image using the 𝑁 most similar images for any positive integer 𝑁 [33]. Building a recommendation system for a large set of images is an unsupervised learning task. There is no specific metric that we are trying to optimize for or ground truth that we are trying to reproduce. Therefore, we cannot reduce the summary between our two recommendation methods to a single number. Instead, we examine at several indirect measurements to compare the image-based and multimodal recommendation systems. Fig. 4 shows six sets of example recommendations. The photographs on the left-hand side show the starting images, with the five most similar multimodal recommendations on the top row and the five most similar text-based recommendations on the bottom row. Both sets of recommendations yield reasonably interesting results for these six selected images. The rec- ommendations for the final image of a bird, for example, are very similar. However, the mul- timodal results generally offer recommendations that are both more precise and more diverse. For example, the fourth set starts with an image of three people with bicycles looking off into the distance. The visual recommendations only pick up on the bicycles, whereas the multi- modal model also finds images with water in the background, including one image that does not even include bicycles. Similarly, for the fifth image of a house, the visual recommenda- tions include rows of houses and a church; the multimodal recommendations only include single houses with similar architecture. Another method of measuring the structure of the recommendations is to look at how of- ten we have symmetric recommendations. In other words, if a specific image 𝐴 recommends an image 𝐵, we want to know how likely it is that image 𝐵 will recommend back to image 566 Table 3 Proportion of terms related to the referenced image based on the position of the term (top 5), with randomly selected images used as a reference point. The final column counts the proportion of recom- mendations that match at least one of the terms. Term # 1 Term # 2 Term # 3 Term # 4 Term # 5 Any Image-Based 87.0% 86.3% 88.5% 88.8 % 84.5 % 96.2% Multimodal-Based 97.8% 96.1% 95.8% 94.3 % 93.0 % 99.6% Baseline (Random) 5.8% 5.8% 3.3% 0.8 % 5.0 % ⋅ 𝐴. Having symmetric recommendations is generally a good feature because it indicates that the distance metric is meaningful and that we have a fairly uniform set of recommendations. Table 4 shows the proportion of symmetric recommendations for the two models based on the number of recommendations made. These proportions increase as the number of neighbors increase because there are more chances for them to map back into the original. In general, the image-based recommendations have a lower percentage of symmetric recommendations, with rates ranging from 22-29%, compared to the 36-50% rates of the multimodal recommendations. These correspond with the visualization shown in Fig. 3, which shows that the multimodal recommendations have many more tightly connected corners and clusters while still able abil- ity to bridge between different parts of the corpus. These results indicate that the multimodal recommendations do a better job of finding tightly associated clusters. For this corpus, it finds these clusters without becoming too stuck in one particular part of the collection. We can also directly compare how often the image-based and multimodal recommendations overlap. In Table 4, we show the proportion of the recommendations from each of the two methods that are the same as a function of the total number of neighbors. As we saw in the small set of examples in Fig. 4, there are a small number of overlapping recommendations. When using a recommendation size of ten, we average just over one matching recommendation. At the same time, the recommendations are not entirely disjoint. When we use a size of twenty- five, only 13.9% of images have no overlapping recommendations, with an average overlap of about 3.5. Based on these metrics, we see that the caption-based method produces noticeably different results from the image-based technique while preserving some similar structures. 5. Explainable Recommendations A significant advantage of using captions as an intermediate step in the embeddings behind a recommender system is that we can use the captions to describe the rationale for associat- ing two images. Specifically, once we have selected a fixed number of recommendations for each image, we can use the generated captions to produce a label that describes the set of re- lationships. Our approach was to first run the captions through an open-source NLP pipeline that performed tokenization, lemmatization, and part-of-speech tagging [43]. Then, we used log-likelihood scores to identify the nouns that most strongly differentiated the set of recom- mendations from the remainder of the corpus [42]. We selected the top five most strongly associated terms to label each set of recommendations. 567 We ran an experiment to test how well the generated labels correspond to the connections. First, we took a random set of 120 images and found the five closest recommendations for each from both methods. Then, we constructed the five most indicative terms for each set, creating separate sets for both recommendations. Then, we manually classified the proportion of rec- ommendations that accurately corresponded to one of the terms. As a comparison baseline, we also took a random set of the generated terms from our set and counted the proportion of 500 randomly selected images that matched a given term. The results are shown in Table 5. The image-based tags matched at rates in the high 80s, whereas the multimodal tags matched in the high-to-mid 90s. These are all significantly higher than the randomly selected tags, indicating that the matches are not primarily a result of simply supplying generic terms. The biggest difference between the two recommendation systems is shown in the final column. Nearly 4% of the images fail to have any associated matching term. Only two of the multimodal-based terms match none of the terms. These results show that the top terms produced by the cap- tions are relatively accurate and precise. While they can be used to add context to image-based recommendations, they perform noticeably better when applied to recommendations based on the captions’ embeddings. 6. Clustering Analysis Whereas recommender systems offer a way to explore similar images within a collection, how can the output of multimodal LLMs enable understanding the general themes within a collec- tion of visual objects in the first place? Another application of caption text embeddings is to apply clustering algorithms that group together similar captions. Clustering has the advantage of being connected to the recommender system in the sense that images within a given cluster will tend to recommend other images within the same cluster. Also, similar to the approach in the previous section, we can use natural language processing techniques to find key terms that distinguish one cluster from all the others [42]. We applied a hierarchical clustering algorithm to the complete set of captions generated by our multimodal LLM [31]. The algorithm produced a set of 32 clusters, each tagged with the six terms that most distinguished it from all of the other clusters. These are shown in Table 6. The benefit of hierarchical clustering is that it allows us to generate a global structure on the clusters. Clusters in the table are ordered hierarchically so that clusters near each other on the table are more closely related than those farther away from one another. Those at either end of the table are the most unique and furthest away from the others. Reading through the generated topics, starting at the top of Table 6, gives an understanding of the general structure of the Documerica collection. At the top are clusters associated with the detrimental effects of humans on the environment, such as pollution, waste, and junkyards. Then, we move to forms of transportation and into more productive transformations of the earth in the form of agriculture. We then transition into pure nature photos (cluster 15). Next, we see landscapes showing urban skylines and cityscapes. These move into other ways humans interact directly in their environment, such as hiking outdoors (cluster 28) and skiing (cluster 29). The final clusters correspond to particular shooting sets from parades, within laboratories, and photographs of trains and train stations. 568 Table 4 Clusters derived from a hierarchical cluster analysis using the text embeddings of the automatically generated captions. The labels are generated from log probabilities of the terms within each cluster. Clusters are ordered hierarchically so that clusters near each other on the table are more closely related than those farther away. ID Cluster Description Num. Photos 1 landfill; environmental; waste; pollution; debris; garbage 438 2 old; decay; junkyard; car; destruction; scrapyard 371 3 helicopter; airport; urban; rainy; aircraft; cockpit 140 4 train; railway; track; railroad; station; maintenance 220 5 aerial; landscape; river; view; waterfall; natural 1261 6 industrial; facility; large; smoke; treatment; aerial 1354 7 outdoor; man; activity; group; people; picnic 534 8 man; elderly; portrait; older; technical; middle 595 9 man; elderly; conversation; candid; moment; couple 240 10 agricultural; rural; field; farm; crop; farming 445 11 flower; close; plant; up; cluster; delicate 406 12 sign; gas; store; market; billboard; advertisement 514 13 architectural; building; church; house; cemetery; story 613 14 car; parking; lot; vehicle; vintage; garage 395 15 bird; flight; close; surface; rock; deer 660 16 coastal; serene; beach; tranquil; lakeside; picturesque 651 17 landscape; forest; tree; sunset; dramatic; mountainous 1311 18 urban; cityscape; bridge; city; high; view 611 19 aerial; suburban; area; view; coastal; development 172 20 residential; house; suburban; street; story; neighborhood 223 21 highway; street; urban; busy; traffic; bustling 628 22 fishing; fish; underwater; net; water; coral 375 23 boat; sailboat; sailing; maritime; water; marina 624 24 beach; lakeside; day; activity; people; sunny 343 25 fountain; pool; public; park; urban; plaza 87 26 child; young; boy; girl; moment; playground 351 27 construction; industrial; site; mining; worker; machinery 592 28 woman; hiker; outdoor; young; individual; park 420 29 ski; resort; winter; snowy; snow; hockey 89 30 event; parade; street; public; vibrant; people 544 31 laboratory; room; woman; indoor; scientific; elderly 335 32 train; subway; station; interior; indoor; bus 369 The clusters generated here can be integrated into a digital platform that provides a generous interface for exploring the Documerica collection. Imagine, for example, a grid of thumbnails showing one image randomly selected from each cluster along with the associated keywords. Clicking on the thumbnail would create a page with a larger image version, archival metadata, and the recommender system described in the previous section. An option to return to the grid of clusters would be included prominently somewhere on the page. Such an interface would allow users to explore the expanse of the collection through each of the clusters while seeing 569 the diversity within a cluster through the recommender system. Iteratively exploring the col- lection through these global and local connections would allow for a better understanding of the structure and overall message conveyed through the archive. 7. Conclusions There are enormous possibilities for increasing modes of access, discovery, and analysis for visual collections through the automated generation of textual descriptions using multimodal LLMs. In this paper, we have introduced a general framework by which images can be con- verted into textual descriptions and text-based embeddings, opening them up to previously unavailable techniques. We applied an LLM, generated a certain kind of caption, and then used a recommender system and image clustering based on the text embedding of the caption. We showed how this approach could be applied to a collection of documentary photographs to produce an explainable recommender system and clustering-based descriptions of the themes within the collection. The present study is just one straightforward application of rich LLM- based multimodal methods. We expect to see a wide range of further applications of this general approach in the coming years, particularly as open-source models follow their usual pattern of catching up to the current state-of-the-art results currently attainable through closed, com- mercial systems [10, 26, 52]. We close with two specific extensions that highlight potential avenues of application for our framework. First, it is possible to add additional layers of safeguards to the recommendations, an important task when building interfaces to cultural heritage collections [11] This can be done through further prompt engineering or the filtering (or replacing) of terms before the embedding step. For example, we noticed that many terms in the captions, such as ‘man’ and ‘girl’, are gendered. As a result, the recommender system has a tendency to associate photos of people that it believes are the same gender, which in the case of people in the background is frequently based on inaccurate stereotypes [18]. Associations such as these can be mitigated, though never entirely avoided, by automated replacing gendered terms with neutral terms before running the text embedding. A second extension that can be implemented with the automatically generated captions would be offering an interface for a full-text search, allowing for new modes of accessibility [16]. Full-text search could be implemented to avoid the (not entirely correct) full captions themselves, or could expose these to end users along with a disclaimer about their autogenerated nature. References [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Al- tenschmidt, S. Altman, S. Anadkat, et al. “GPT-4 technical report”. In: arXiv preprint arXiv:2303.08774 (2023). [2] M. M. Adnan, M. S. M. Rahim, A. Rehman, Z. Mehmood, T. Saba, and R. A. Naqvi. “Auto- matic image annotation based on deep learning models: a systematic review and future challenges”. In: IEEE Access 9 (2021), pp. 50253–50264. 570 [3] S. Afzal, S. Ghani, M. M. Hittawe, S. F. Rashid, O. M. Knio, M. Hadwiger, and I. Hoteit. “Visualization and visual analytics approaches for image and video datasets: A survey”. In: ACM Transactions on Interactive Intelligent Systems 13.1 (2023), pp. 1–41. [4] K. Anitha Kumari, C. Mouneeshwari, R. Udhaya, and R. Jasmitha. “Automated image cap- tioning for flickr8k dataset”. In: Proceedings of International Conference on Artificial Intel- ligence, Smart Grid and Smart City Applications: AISGSC 2019. Springer. 2020, pp. 679– 687. [5] U. S. N. Archives. DOCUMERICA: The Environmental Protection Agency’s Program to Pho- tographically Document Subjects of Environmental Concern, 1972–1977. https://catalog.ar chives.gov/id/542493. [6] T. Arnold and L. Tilton. “Automated Image Color Mapping for a Historic Photographic Collection”. In: CHR 2024: Computational Humanities Research Conference. CEUR Work- shop Proceedings, 2024. [7] T. Arnold and L. Tilton. “Distant viewing toolkit: A python package for the analysis of visual culture”. In: Journal of Open Source Software 5.45 (2020), p. 1800. [8] T. Arnold and L. Tilton. “Distant Viewing: Analyzing Large Visual Corpora”. In: Digital Scholarship in the Humanities 34.Supplement_1 (2019), pp. i3–i16. [9] T. Arnold and L. Tilton. Distant Viewing: Computational Exploration of Digital Images. MIT Press, 2023. [10] H. Chen, F. Jiao, X. Li, C. Qin, M. Ravaut, R. Zhao, C. Xiong, and S. Joty. “ChatGPT’s One- year Anniversary: Are Open-Source Large Language Models Catching up?” In: arXiv preprint arXiv:2311.16989 (2023). [11] C. N. Coleman. “Managing bias when library collections become data”. In: International Journal of Librarianship 5.1 (2020), pp. 8–19. [12] A. Cuntz, P. J. Heald, and M. Sahli. “Digitization and Availability of Artworks in On- line Museum Collections”. In: World Intellectual Property Organization (WIPO) Economic Research Working Paper Series 75 (2023). [13] L. Deal. “Visualizing digital collections”. In: Technical Services Quarterly 32.1 (2015), pp. 14–34. [14] Ç. Demiralp, C. E. Scheidegger, G. L. Kindlmann, D. H. Laidlaw, and J. Heer. “Visual embedding: A model for visualization”. In: IEEE Computer Graphics and Applications 34.1 (2014), pp. 10–15. [15] I. Di Lenardo, B. L. A. Seguin, and F. Kaplan. “Visual patterns discovery in large databases of paintings”. In: Digital Humanities 2016. 2016. [16] N. Dı́az-Rodrı́guez and G. Pisoni. “Accessible cultural heritage through explainable arti- ficial intelligence”. In: Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization. 2020, pp. 317–324. [17] B. Flueckiger and G. Halter. “Methods and Advanced Tools for the Analysis of Film Col- ors in Digital Humanities.” In: DHQ: Digital Humanities Quarterly 14.4 (2020). 571 [18] K. C. Fraser, S. Kiritchenko, and I. Nejadgholi. “A friendly face: Do text-to-image systems rely on stereotypes when the input is under-specified?” In: arXiv preprint arXiv:2302.07159 (2023). [19] A. Gefen, L. Saint-Raymond, and T. Venturini. “AI for digital humanities and compu- tational social sciences”. In: Reflections on Artificial Intelligence for Humanity (2021), pp. 191–202. [20] T. Hiippala and J. A. Bateman. “Semiotically-grounded distant viewing of diagrams: in- sights from two multimodal corpora”. In: Digital Scholarship in the Humanities 37.2 (2022), pp. 405–425. [21] R. C. King, V. Bharani, K. Shah, Y. H. Yeo, and J. S. Samaan. “GPT-4V passes the BLS and ACLS examinations: An analysis of GPT-4V’s image recognition capabilities”. In: Resuscitation 195 (2024). [22] I. Klinkert, L. A. McDonnell, S. L. Luxembourg, A. Maarten Altelaar, E. R. Amstalden, S. R. Piersma, and R. Heeren. “Tools and strategies for visualization of large image data sets in high-resolution imaging mass spectrometry”. In: Review of scientific instruments 78.5 (2007). [23] B. C. G. Lee. “The “Collections as ML Data” checklist for machine learning and cultural heritage”. In: Journal of the Association for Information Science and Technology (2023). [24] Y. Lei, Z. Li, Y. Li, J. Zhang, and H. Shan. “LICO: explainable models with language-image consistency”. In: Advances in Neural Information Processing Systems 36 (2024). [25] F. Liu, M. Zhang, B. Zheng, S. Cui, W. Ma, and Z. Liu. “Feature fusion via multi-target learning for ancient artwork captioning”. In: Information Fusion 97 (2023), p. 101811. [26] H. Liu, C. Li, Q. Wu, and Y. J. Lee. “Visual instruction tuning”. In: Advances in neural information processing systems 36 (2024). [27] L. McInnes, J. Healy, and J. Melville. “UMAP: Uniform manifold approximation and pro- jection for dimension reduction”. In: arXiv preprint arXiv:1802.03426 (2018). [28] C. Meinecke, C. Hall, and S. Jänicke. “Towards enhancing virtual museums by contextual- izing art through interactive visualizations”. In: ACM Journal on Computing and Cultural Heritage 15.4 (2022), pp. 1–26. [29] J.-P. Moreux. “Intelligence artificielle et indexation des images”. In: Journées du patri- moine écrit:“L’image aura-t-elle le dernier mot? Regards croisés sur les collections icono- graphiques en bibliothèques”. 2023. [30] C. Morse, B. Landau, C. Lallemand, L. Wieneke, and V. Koenig. “From #museumathome to #athomeatthemuseum: Digital museums and dialogical engagement beyond the COVID- 19 pandemic”. In: ACM Journal on Computing and Cultural Heritage (JOCCH) 15.2 (2022), pp. 1–29. [31] F. Murtagh and P. Legendre. “Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion?” In: Journal of classification 31 (2014), pp. 274–295. 572 [32] R. Paiss, H. Chefer, and L. Wolf. “No token left behind: Explainability-aided image clas- sification and generation”. In: European Conference on Computer Vision. Springer. 2022, pp. 334–350. [33] A. Petukhova, J. P. Matos-Carvalho, and N. Fachada. “Text clustering with LLM embed- dings”. In: arXiv preprint arXiv:2403.15112 (2024). [34] A. Puscasiu, A. Fanca, D.-I. Gota, and H. Valean. “Automated image captioning”. In: 2020 IEEE international conference on automation, quality and testing, robotics (AQTR). Ieee. 2020, pp. 1–6. [35] Z. Qi, S. Khorram, and L. Fuxin. “Embedding deep networks into visual explanations”. In: Artificial Intelligence 292 (2021), p. 103435. [36] Z. Qi and F. Li. “Learning explainable embeddings for deep networks”. In: NIPS Workshop on Interpretable Machine Learning. Vol. 31. 2017. [37] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. “Learning transferable visual models from natural language supervision”. In: International conference on machine learning. Pmlr. 2021, pp. 8748–8763. [38] A. M. Rinaldi, C. Russo, and C. Tommasino. “Automatic image captioning combining natural language processing and deep neural networks”. In: Results in Engineering 18 (2023), p. 101107. [39] S. Sheng and M.-F. Moens. “Generating captions for images of ancient artworks”. In: Proceedings of the 27th ACM international conference on multimedia. 2019, pp. 2478–2486. [40] N. Siddiqui. “Cutting the Frame: An In-Depth Look at the Hitchcock Computer Vision Dataset”. In: Journal of open humanities data 10.1 (2024). [41] T. Smits and M. Wevers. “A multimodal turn in Digital Humanities. Using contrastive machine learning models to explore, enrich, and analyze digital visual historical collec- tions”. In: Digital Scholarship in the Humanities 38.3 (2023), pp. 1267–1280. [42] A. Stefanowitsch. Corpus linguistics: A guide to the methodology. Language Science Press, 2020. [43] M. Straka, J. Hajic, and J. Straková. “UDPipe: trainable pipeline for processing CoNLL- U files performing tokenization, morphological analysis, pos tagging and parsing”. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). 2016, pp. 4290–4297. [44] M. Tan and Q. Le. “EfÏcientNet: Rethinking model scaling for convolutional neural net- works”. In: International conference on machine learning. Pmlr. 2019, pp. 6105–6114. [45] S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. “Eyes wide shut? exploring the visual shortcomings of multimodal llms”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024, pp. 9568–9578. [46] A. Verma, A. K. Yadav, M. Kumar, and D. Yadav. “Automatic image caption generation using deep learning”. In: Multimedia Tools and Applications 83.2 (2024), pp. 5309–5325. 573 [47] M. Wevers and T. Smits. “The visual digital turn: Using neural networks to study histor- ical images”. In: Digital Scholarship in the Humanities 35.1 (2020), pp. 194–207. [48] M. Whitelaw. “Generous interfaces for digital cultural collections”. In: Digital humanities quarterly 9.1 (2015), pp. 1–16. [49] F. Windhager, P. Federico, G. Schreder, K. Glinka, M. Dörk, S. Miksch, and E. Mayr. “Vi- sualization of cultural heritage collection data: State of the art and future challenges”. In: IEEE transactions on visualization and computer graphics 25.6 (2018), pp. 2311–2330. [50] W. Wu, H. Yao, M. Zhang, Y. Song, W. Ouyang, and J. Wang. “GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?” In: arXiv preprint arXiv:2311.15732 (2023). [51] Y. Ye, R. Huang, and W. Zeng. “VISAtlas: An image-based exploration and query system for large visualization collections via neural image embedding”. In: IEEE Transactions on Visualization and Computer Graphics (2022). [52] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen. “A Survey on Multimodal Large Language Models”. In: arXiv preprint arXiv:2306.13549 (2023). [53] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. “Ferret: Refer and Ground Anything Anywhere at Any Granularity”. In: arXiv preprint arXiv:2310.07704 (2023). 574