<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1109/iccv51070.2023.01100</article-id>
      <title-group>
        <article-title>Viability of Zero-shot Classification and Search of Historical Photos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Erika Maksimova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Mari-Anna Meimer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Mari Piirsalu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Priit Järv</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Software Science, Tallinn University of Technology</institution>
          ,
          <country country="EE">Estonia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>11941</fpage>
      <lpage>11952</lpage>
      <abstract>
        <p>Multimodal neural networks are models that learn concepts in multiple modalities. The models can perform tasks like zero-shot classification: associating images with textual labels without specific training. This promises both easier and more flexible use of digital photo archives, e.g. annotating and searching. We investigate whether existing multimodal models can perform these tasks, when the data differs from the typical computer vision training sets, on historical photos from a cultural context outside the English speaking world.</p>
      </abstract>
      <kwd-group>
        <kwd>zero-shot learning</kwd>
        <kwd>digital heritage</kwd>
        <kwd>multimodal models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Supervised machine learning models come with associated costs – the human labor involved in the preparation of the training data and the operation
and maintenance of the model.</p>
      <p>
        Pre-trained multimodal neural networks come with the promise of removing these costs. As
an example, CLIP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a relatively lightweight model that can encode both natural language
input and images into a shared multimodal vector representation. For example, the text “cow”
and a picture of a cow would have a very similar representation. This allows the model to
perform zero-shot classification: when presented with an example of an image and a label, the
model can immediately predict whether the label is associated with the image, without needing
to see any other examples. The implication is that CLIP and other similar multimodal models
can replace supervised computer vision models, like CNNs, without requiring large training
data sets.
      </p>
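The zero-shot step described above can be sketched in code (a minimal example using the HuggingFace Transformers CLIP API; the checkpoint name, image path and label texts are illustrative, not this paper's exact setup):

```python
import torch

def best_label(logits_per_image: torch.Tensor, labels: list[str]) -> str:
    """Pick the label whose text embedding scores highest against the image."""
    probs = logits_per_image.softmax(dim=1)
    return labels[probs.argmax().item()]

if __name__ == "__main__":
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Illustrative checkpoint and image path.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    labels = ["a photo of a cow", "a photo of a building"]
    inputs = processor(text=labels, images=Image.open("photo.jpg"),
                       return_tensors="pt", padding=True)
    print(best_label(model(**inputs).logits_per_image, labels))
```

No per-class training happens anywhere: the label set can be changed at query time, which is what makes the approach "zero-shot".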
      <p>
        There are limits to how well machine learning algorithms generalize to unseen data.
Ultimately, a model can only learn some representation of its training data, so the usefulness of the
model depends on how similar the distribution of data in the application is to the distribution
of the training data. For example, a study by de Vries et al. measured a 15-20% difference in the
classification accuracy of household items like soap between images from the United States and
from Somalia and Burkina Faso [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. They associate this with representational bias, as the
majority of images in many computer vision datasets originate from Western countries. The historical
appearance of locations, items, situations and common activities is similarly underrepresented.
Therefore, existing successful use cases of multimodal models on modern photographs, or their
evaluations on standard computer vision datasets like ImageNet [10] and CIFAR100 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are not
reliable indicators of their usefulness for historical photos.
      </p>
      <p>In this paper, we investigate whether using off-the-shelf multimodal models is a viable method for
classifying photos from Ajapaik. To understand what trade-offs or drawbacks this involves,
we compare them with supervised computer vision models. The multimodal vector
representations from the CLIP model can also be adapted for searching images based on their visual
content. This would be a very useful functionality for image collections, so we include an
evaluation of multimodal search in our experiment.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The main inspiration for our work comes from the paper by Smits and Wevers [12]. They
demonstrate the capabilities of the CLIP model [9] on collections of magic lantern slides and
children’s books illustrations, originating from the 19th century to approximately 1940. In the
task of classifying indoor and outdoor images, CLIP was slightly less accurate than a
convolutional neural network trained specifically for the task. Smits and Wevers identify several
forms of bias that cause the model to make mistakes, like mis-identifying modern concepts
in historic images, and applying sex-role stereotypes. Their finding is that the differences in
visual representation do not impact the performance negatively. As an example, the concept
of family is recognized from illustrations of both people and anthropomorphic animals.</p>
      <p>
        The report on CLIP by Radford et al. [9] also evaluates the model’s zero-shot performance
against a fully supervised neural network model. CLIP with no task-specific training
outperforms a ResNet50 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that was separately trained for each specific task, on several
frequently used computer vision datasets, like ImageNet and CIFAR100. Radford et al. caution
that their evaluation set may be co-aligned with the capabilities of the model, which means
that the high performance is not guaranteed to carry over to applications. For example, in
their paper CLIP underperforms in the specialized task of classifying satellite images. The
authors observe that the natural language interface may be unsuited to specifying more complex
tasks.
      </p>
      <p>
        We refer to the papers by Aske and Giardinetti [1] and Männistö et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for a wider
overview of machine learning for visual archives and omit the discussion of methods that
do not involve large multimodal models here. Few papers explicitly investigate multimodal
models for historical image classification and retrieval. Barancová et al. explore the dating
of historical photos, concluding that zero-shot classification is relatively inefficient [2]. They
achieved better results by training a classifier on top of the multimodal model’s image
representation. Tschirschwitz et al. propose evaluation datasets and a framework for historical
image classification and retrieval [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In their study, zero-shot CLIP achieves the highest
performance; however, they also report that additional qualitative evaluation did not confirm the
quantitative results. Springstein et al. present methods for the classification of art-historical
images in a hierarchical schema of visual themes [13]. Their work does not include off-the-shelf
multimodal models in the zero-shot setting.
      </p>
      <p>
        Our paper uses the CLIP, SigLIP [1
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and BLIP-2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] models that combine vision and language
transformers and can encode images and text into a shared multimodal vector representation.
There are many more multimodal models, with over 60 models cited in a recent survey [16].
This number is constantly growing. Most of these models are designed for “downstream” tasks
like image and text generation, and do not necessarily provide documented interfaces for
classification or for accessing shared multimodal representations.
      </p>
      <p>
        The contribution of our paper, as compared to the evaluations in [9, 12], is that we use a dataset
that is superficially similar (photos of people, everyday life, buildings) to mainstream computer
vision training datasets but different in two aspects. The pictures are from different historical
eras, and from outside the English-speaking cultural sphere, which is significant because the
representations of concepts in models are learned through language modelling. Our
contribution is complementary to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as it is a similar investigation on a different dataset.
We report results for both classification and search. We break down the search evaluation to
differentiate between very general concepts, and entities and objects that are distinctly local
to the cultural context of Ajapaik.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>We present two experiments, covering two common use cases of a photo collection:
classification and search. The source code of the experiments is available at https://github.com/priitj/chr2024/.
Because the copyright of most of the photos used in the experiments is held privately,
we cannot reproduce the photos in the paper and do not distribute the datasets.</p>
      <p>Our classification experiment measures the capability of multimodal models to automatically
label photos in a collection with a fixed set of categories. In the search experiment, a query
text is given and a multimodal model is used to retrieve a set of matching images. We measure
how well the models rank the images by the relevance of their visual content to the query text.</p>
      <p>
        We evaluate three multimodal models. CLIP was used in previous research on historical
images [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref2">12, 2, 14, 13</xref>
        ], and is generally widely adopted and cited, so we include it as a reference
multimodal model. SigLIP is very similar to CLIP, but uses a different training objective. In
the evaluation done by its authors, SigLIP outperforms different variations of CLIP in all
classification tests [17]. Based on this, we included SigLIP in our study. For both of these models,
we use the HuggingFace Transformers (https://huggingface.co/) implementations.
      </p>
      <p>BLIP-2 is another model with an architecture optimized for efficient training. It is reported
to outperform CLIP in text-to-image retrieval [6]. For this model we used the implementation
from the Salesforce LAVIS (https://github.com/salesforce/LAVIS) language-vision package.</p>
      <p>The nomenclature of these models used in the machine learning literature also includes the
description of their vision transformer component. A ViT-B vision transformer is the “base” size
with 12 transformer layers, while a ViT-L is the “large” transformer, with 24 layers and
increases in other settings as well. The patch size describes how the input image is partitioned
before feeding it to the transformer: a “patch 32” model uses 32x32 pixel rectangles. Therefore,
the input fed to the transformer of a 16x16 patch model is actually four times larger, making
it more computationally expensive. These variations have an impact on the performance of
the models, so we include three different configurations of both CLIP and SigLIP in our
experiments.</p>
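The token-count arithmetic behind this cost difference can be made explicit (a small sketch assuming the common 224x224 input resolution of these checkpoints):

```python
def num_patches(image_size: int, patch_size: int) -> int:
    # A vision transformer splits the image into a grid of
    # patch_size x patch_size squares; each square becomes one input token.
    return (image_size // patch_size) ** 2

print(num_patches(224, 32))  # ViT-B/32: 7x7 grid = 49 tokens
print(num_patches(224, 16))  # ViT-B/16: 14x14 grid = 196 tokens, 4x more
```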
      <sec id="sec-3-1">
        <sec id="sec-3-1-1">
          <title>3.1. Classification Experiment</title>
          <p>In the classification experiment, we use a sample of 17042 photos from the Ajapaik collection.
The images are annotated with the scene category, the viewpoint elevation category, or both.
The sample was balanced towards less frequent categories like “raised”, such that each category has at least 3000 photos. This was
done to obtain better performance with the supervised baseline models and to ensure enough
test examples in those categories. For the remainder of the paper, we refer to this sample as
the classification set.</p>
          <p>We test scene category and viewpoint elevation category classification separately. With the
supervised baselines, we use 5-fold cross-validation. In each round of cross-validation, the
images are split 75:5:20 between train, validation and test parts. The 5% validation part is only
used to automatically select the best model during training. All the measurements reported
in the paper are done on the 20% test parts, which cover the entire set of images in a given
category over the 5 rounds of cross-validation.</p>
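This splitting scheme can be sketched as follows (a toy illustration with scikit-learn; the array sizes and random seeds are arbitrary):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

indices = np.arange(100)        # stand-in for image indices
labels = np.array([0, 1] * 50)  # stand-in for category labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(indices, labels):   # 80:20 per round
    # Carve the 5% validation part out of the 80% train part -> 75:5:20 overall.
    train_idx, val_idx = train_test_split(
        train_idx, test_size=0.0625, stratify=labels[train_idx], random_state=0)
    assert (len(train_idx), len(val_idx), len(test_idx)) == (75, 5, 20)
```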
          <p>With CLIP and SigLIP, classifications for all images in a given category are computed
directly by the model from the input of an image and a set of prompts. We use the category
labels “interior”, “exterior”, “ground”, “raised” and “aerial” as initial prompts for their
respective categories.</p>
          <p>
            Both Radford et al. [9] and Smits and Wevers [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] observe that the selection of prompts to
represent the classes has a noticeable effect on classification performance. With the Ajapaik
classification set, this effect should also be expected. The single word “raised” does not describe
the class very precisely and only becomes meaningful if we know that the context is viewpoint
elevation. Accordingly, we expand our sets of prompts for classification with different natural
language phrases describing the categories (Tables 6 and 8).
          </p>
          <p>To the best of our knowledge, BLIP-2 does not include a classification model. We implement
a simple nearest neighbors classifier on top of the BLIP-2 vector representation. We compute an image
vector v and a text vector t with BLIP-2 for an image and a prompt. The class of the image
is the one that maximises the cosine similarity of the vectors:
sim(v, t) = v ⋅ t / (‖v‖ ‖t‖)
(1)</p>
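This nearest-neighbour rule is easy to state in code (a minimal NumPy sketch; the embeddings below are toy stand-ins for the vectors the BLIP-2 encoders would produce):

```python
import numpy as np

def cosine_sim(v, t):
    # Equation 1: cosine similarity of an image vector and a text vector.
    return float(np.dot(v, t) / (np.linalg.norm(v) * np.linalg.norm(t)))

def classify(image_vec, prompt_vecs):
    """Return the class label whose prompt vector is most similar to the image."""
    return max(prompt_vecs, key=lambda c: cosine_sim(image_vec, prompt_vecs[c]))

# Toy stand-in embeddings for two class prompts:
prompts = {"interior": np.array([1.0, 0.1]), "exterior": np.array([0.1, 1.0])}
print(classify(np.array([0.9, 0.2]), prompts))  # -> interior
```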
          <p>For baselines, we use convolutional neural networks (CNNs) and transfer learning. All
selected models are pre-trained on ImageNet. We train a shallow classifier on top of the
pretrained CNNs. Scene category classifiers and viewpoint elevation classifiers are trained
separately.</p>
          <p>
            For the CNN architectures, we selected ResNet18 and ResNet50, as they were used as baselines
in the papers [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] and [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], respectively. Additionally, we selected DenseNet121 [4] to represent
a deeper architecture, and MobileNetV2 [11] as a more modern, lightweight computer vision
model.
          </p>
          <p>We measure the classification performance using the per-class F1-score. For a given class, true
positives (TP) is the number of images that were correctly predicted by the model to be in
this class. False positives (FP) is the number of images that were incorrectly predicted to be in
the class. False negatives (FN) is the number of images that belong to the class, but the model
predicted a different class label. The F1-score penalizes both false positives and false negatives:
F1 = 2TP / (2TP + FP + FN)
(2)</p>
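Computed directly from the counts above (a small self-contained helper, not the paper's evaluation code; the labels are illustrative):

```python
def per_class_f1(true_labels, predicted, cls):
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == cls and p == cls for t, p in pairs)  # correctly assigned to cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly assigned to cls
    fn = sum(t == cls and p != cls for t, p in pairs)  # missed members of cls
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

true = ["interior", "interior", "exterior", "exterior"]
pred = ["interior", "exterior", "exterior", "exterior"]
print(per_class_f1(true, pred, "interior"))  # tp=1, fp=0, fn=1 -> 2/3
```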
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Search Experiment</title>
          <p>For the search experiment, we downloaded the metadata of photos from the Ajapaik API
(https://opendata.ajapaik.ee/). We then randomly selected 11000 photos that had a textual description. Because of downloading
and image file format errors, our final sample has 10846 images. We will refer to this sample
as the search set.</p>
          <p>We evaluate the search performance by letting the models rank the images by their relevance
to search terms. To ensure that at least one relevant photo exists for each search term, we
extracted the search terms from the textual descriptions of the images in the search set. We
translated the descriptions of the images to English using the deep-translator package
(https://github.com/nidhaloff/deep-translator). We then POS tagged and lemmatized the words in the descriptions, and detected named entities using the
spaCy library (https://spacy.io).</p>
          <p>We select the search terms with the assumption that the users would mostly search for
objects (such as “boat”), events (“exhibition”), activities (“riding”) or named entities, like a place
name or a person. These search terms are also unambiguous enough that we can
decide whether a retrieved photo is relevant to the search term. We select English language
nouns as examples of objects, events and phenomena. Verbs are examples of activities. In total,
we used 8 different categories, with 10 terms in each category. The selected search terms are
listed in Table 2.</p>
          <p>Common and rare sets are included so that both easier and more difficult searches are
represented. In the common sets, we selected the 10 terms that occurred most frequently in the descriptions.
All common search terms occurred with at least 20 photos, and sometimes with hundreds.
In the rare sets, we selected 10 random terms from among those that occurred only once.</p>
          <p>The random English word selection is used to diversify the search terms and to reduce any
unintentional bias from the frequency-based selection of the other terms. The same random words
are human-translated to Estonian. This set is used to test whether the models give any useful
results with non-English input.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>[Table 2: The selected search terms by category – common and rare objects, events, activities and named entities, plus random English words (e.g. “winter”, “milk”, “horse”) with their Estonian translations (“talv”, “piim”, “hobune”).]</p>
        <p>The terms selected by automatic criteria required some manual changes. For example, with
named entities, we decided not to include names of countries, because “Estonia” would match
with the majority of photos. An excluded search term was replaced with the next most frequent
term for the common categories, and a new randomly selected term for the other categories.</p>
        <p>We implement search by computing text-to-image similarity. For an image and a search term,
the similarity is computed using Equation 1 from the multimodal representation vectors v and t.
The search results for a search term are the k most similar images, sorted by sim(v, t). Our
implementation uses the Voyager approximate nearest neighbors index to find and sort the
most similar images.</p>
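The ranking step amounts to sorting images by Equation 1. A brute-force NumPy sketch of it follows (Voyager replaces this exact scan with an approximate index; the embeddings are toy stand-ins for the models' outputs):

```python
import numpy as np

def top_k(query_vec, image_vecs, k=10):
    """Indices of the k images most similar to the query text vector,
    ranked by cosine similarity, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    m = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]

images = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_k(np.array([1.0, 0.0]), images, k=3))  # -> [0 2 1]
```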
        <p>We evaluate the ability of models to rank relevant results above irrelevant ones. We selected
mean average precision (MAP) as the measurement of the quality of search results [7, pp.
155–161]. MAP is a robust measure that is not sensitive to the number of relevant documents in the
search set. If there are not enough relevant photos, MAP does not penalize filling the remainder
of the search results with irrelevant photos.</p>
        <p>We report the measurements for the top-k results, considering that this is what the user will
see in a practical application. For a query q belonging to a set of queries Q, let R_k(q) be the set
of relevant results among the first k results. Let r_n be a search result at position n. Average
precision is then
AP@k(q) = (1 / |R_k(q)|) ∑_{r_n ∈ R_k(q)} |R_n(q)| / n
(3)</p>
        <p>AP rewards ranking relevant results above irrelevant results. For example, if the first 4
among the top 10 results are relevant, then AP@10 = 1.0. If the 4 relevant results are ranked
below 6 irrelevant results, AP@10 = (1/4)(1/7 + 2/8 + 3/9 + 4/10) ≈ 0.28. If there are no relevant results,
AP = 0. Mean average precision (MAP) is calculated over a set of queries:
MAP = (1 / |Q|) ∑_{q ∈ Q} AP@k(q)
(4)</p>
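The worked example above can be checked with a short implementation (relevance flags are given in ranked order; this is an illustration, not the paper's evaluation code):

```python
def ap_at_k(relevance, k=10):
    """Average precision over the top-k results; relevance is a list of 0/1 flags."""
    positions = [n for n in range(1, min(k, len(relevance)) + 1) if relevance[n - 1]]
    if not positions:
        return 0.0
    # |R_n| / n is the precision at each relevant position n.
    return sum((i + 1) / n for i, n in enumerate(positions)) / len(positions)

def mean_ap(relevance_lists, k=10):
    return sum(ap_at_k(r, k) for r in relevance_lists) / len(relevance_lists)

print(ap_at_k([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]))          # -> 1.0
print(round(ap_at_k([0, 0, 0, 0, 0, 0, 1, 1, 1, 1]), 2))  # -> 0.28
```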
        <p>While the search set contains some positive labels to evaluate relevance of images to queries,
there are no negative labels. For example, if a description of a photo includes the word “boy”
we could count it as relevant towards a query “boy”, however there is no information about
what the photo does not depict. Therefore, the relevance of each photo in search results to each
search term was evaluated by human judges.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We begin with the performance of the supervised baselines to provide a frame of reference for
the results obtained with multimodal models. Table 3 lists the classification results with the
CNN models trained specifically for the scene category and viewpoint elevation classification
tasks. We observe that firstly, the “interior”/“exterior” classification is the easier one of the two
tasks shown. Secondly, the “raised” category is ambiguous as a textual description, but based
on Table 3 it is also ambiguous visually. The per-class F1-score for “raised” is 20-30 percentage
points lower than for the other classes, indicating that the supervised models struggle to generalize
this concept.</p>
      <p>With multimodal models, we first report per-class classification performance when using
class labels “as is”. For example, when classifying the viewpoint elevation category, the image
is matched with the texts “ground”, “raised” and “aerial”.</p>
      <p>Table 4 lists the per-class F1 scores for the scene and viewpoint elevation categories. As with
the supervised baselines, the scene category classification works better. Also similarly to the
supervised baselines, the “raised” class is the most difficult. However, compared to Table 3, the
multimodal models do much worse, with F1-scores ranging from 0.02 to 0.24.</p>
      <p>
        Several outcomes were unexpected. Contrary to the evaluations by the authors of the models
[
        <xref ref-type="bibr" rid="ref9">9, 17</xref>
        ], smaller versions of the models perform equally well or better than bigger ones. Unlike in the
paper by Zhai et al. [17], SigLIP does not clearly outperform CLIP in classification; in fact
the small CLIP ViT-B/32 model is the best in predicting three classes out of five. It is also
surprising that the performance in the aerial category is low, because we would expect
aerial photographs to be visually distinct.
      </p>
      <p>The reason behind the lower performance for the aerial category is revealed by the confusion
matrices in Figure 2. The rows are the true classes and the columns are the classes that the
model predicted. The CLIP ViT-B/32 and BLIP-2 models labeled 98% of the aerial pictures correctly;
only 2% of them were labeled as ground. Their low performance is caused by false positives,
as they heavily tend towards the aerial category and also label other photos as aerial.</p>
      <p>In comparison, the SigLIP models tend heavily towards predicting the ground category. Due
to having fewer false positives for aerial, the per-class score is higher. The trade-off is that the
per-class score for ground is lowered. SigLIP ViT-B/16 has the most balanced performance in
the viewpoint elevation category, thanks to being able to detect raised elevation photos better
than the other models.</p>
      <p>Using more descriptive prompts allowed the multimodal models to reach higher performance,
but the overall impact of prompt engineering was mixed. We provide the full results in
Appendix A. Tables 6–7 give the prompts and mean F1-scores for the scene category
classification task. Tables 8–9 give the prompts and results for the viewpoint elevation classification
task.</p>
      <p>A brief summary: with the prompts “indoor scene” and “outdoor scene”, the SigLIP ViT-B/16
model achieves F1 = 0.92 for the interior and F1 = 0.97 for the exterior scenes. This is the
only instance out of all combinations of models and prompts where a multimodal model
outperforms any of the supervised baselines. However, out of 42 tests using the more descriptive
prompts in scene category classification, in 24 instances the performance dropped, compared
to using the class labels directly. The more descriptive prompts had a positive impact for the CLIP
model in viewpoint elevation classification. In total, we did 84 tests with prompt engineering
and in exactly half (42) the performance improved or remained the same. In the remaining 42
test instances the performance dropped.</p>
      <p>Figure 3 compares the best performing multimodal models and supervised baselines. We
selected SigLIP ViT-B/16 as the representative of multimodal models, as it achieved multiple
highest per-class and mean F1-scores, including the one result that outperformed the baselines.
We selected ResNet50 as the representative of CNNs. The best results of multimodal models
are competitive with supervised baselines in scene classification and below the baselines by a
large margin in viewpoint elevation classification.</p>
      <p>The search experiments are summarized in Table 5. Two weak categories are clearly visible
– the rare named entities and the Estonian language search terms. The models perform best
when searching for objects, then activities, and are overall weakest when searching for named
entities.</p>
      <p>As in the classification experiment, the models’ performance relative to each other differs from
the evaluations in previous literature. Surprisingly, the best performing model is clearly SigLIP
SO-400m/14, which is optimized for classification [17]. BLIP-2, which we included due to prior
strong results in text-to-image retrieval, did not perform as well.</p>
      <p>The MAP scores in Table 5 do not tell us directly whether there were many correct results and
the ranking within the top-10 did not matter, or if the models were able to precisely rank relevant
photos above irrelevant ones. We analyze the ranking ability in the left graph of Figure 4. The
AP@10 results are plotted against the number of relevant photos in the top-10. With SigLIP
SO-400m/14, the best performing model, we still see that the best AP@10 score drops below 1.0
when there were fewer than 4 relevant results. In other words, in searches with 1-3 matches,
there were always irrelevant photos ranked above relevant ones.</p>
      <p>The second question is whether the models can find all search terms, or whether there are gaps that
the aggregate MAP scores do not show. We present the distribution of all AP@10 scores in
the right graph of Figure 4. Over all models, the most likely result of a search is either a complete
success or a complete failure, with intermediate results much less likely.</p>
      <p>Clearly, many of the low AP@10 results come from the named entity or Estonian language
searches. When we look at the other five categories of regular English language words, a more
interesting result emerges. For each model, there are 3-10 terms where AP@10 was under
0.3. At the same time, for each term at least one model achieved AP@10 &gt; 0.5, except the
word “score”, where the best AP@10 = 0.38. Therefore, each model has gaps in its knowledge,
but these gaps lie in different places for different models, including different sizes of CLIP and
SigLIP.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        Prior evaluations on mainstream computer vision datasets do not generalize well to the case of
Ajapaik. Previously established rankings of which models perform better in classification and
search ([17] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], respectively) were not reproduced in our experiments. This implies that
if the users want to ensure good performance, they need to test their own particular use case,
which would involve human evaluation or annotation of data.
      </p>
      <p>The multimodal models did well in the scene category classification task and less so in
viewpoint elevation classification. Prompt engineering closed the gap to supervised baselines, but
the multimodal models responded to prompts unpredictably. Using more descriptive prompts,
like “elevated view” instead of “raised”, was equally likely to increase or to decrease the
performance. This is important, because there was no prior indication of what model and prompt set
would be best. We only know that SigLIP ViT-B/16 performed well in scene category thanks
to having the annotated classification set.</p>
      <p>
        The evidence from the prompt engineering tests shows that the difficulties encountered in the
classification have more to do with having to specify the classification task precisely through
the natural language interface, as was in fact anticipated by Radford et al. when they discuss
applications of CLIP [9]. The only practical way of mitigating this is to move away from the
zero-shot setting. Training a classifier on top of the multimodal representations, as done in [2],
may require similar amounts of annotated data as with CNNs. However, Radford et al. show
improved performance with few-shot learning, where far fewer examples are needed (up
to 16 in the paper) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The impact of the cultural sphere on the search results was clear. The Estonian language,
localities and persons are not well represented in the models. When we remove this
requirement of out-of-domain knowledge and look at the 50 search terms of common English words, the
models still have gaps, with 3-10 searches per model failing. Importantly, which terms fail
differs by model. On the positive side, this unpredictability is not as pronounced as with the
classification. The top search results are clearly populated with relevant photos for common
objects, activities and random English words, independent of the model.</p>
      <p>Our paper has multiple limitations. The relevance of a photo to a given search term in the search
results was validated by one person each. Methodologically it would be preferable if several
persons validated each result. However, the photo–term pairs were distributed between the
judges in a way that was, for practical purposes, random. Each set of photos returned by a
model was therefore validated by multiple judges, which should dilute the effect of possible
bias.</p>
      <p>In the search experiment, we used the Voyager approximate nearest neighbors index. It
is possible that Voyager had some impact on the search results, if it did not return the exact
k nearest neighbors set each time. To validate this, the search experiment would have to be repeated
without an index; we omitted this because of the additional labor needed to validate the results.</p>
      <p>There are many other multimodal models that we did not include in our experiment.
Additional investigation is needed to determine which of those, if any, provide an interface to the
shared multimodal vector representation of text and images. Finally, we used only one dataset,
so our results may not generalize to other similar datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We investigated the viability of zero-shot classification and search of historical photos in
Ajapaik. We found that this application domain is different enough from the models’ training
data that expectations based on previous model evaluations do not hold. Multimodal models
can successfully search the photos for common everyday concepts. However, zero-shot usage
on this archive is problematic. Firstly, the models have unpredictable gaps in knowledge of
common English words, which appear depending on model size. Secondly, knowledge of
Estonian words and names is mostly missing. Thirdly, classification performance is below
supervised baselines and cannot easily be improved with prompt engineering. Therefore, in the
context of historical visual archives, multimodal models do not deliver on the promise of
removing dataset annotation costs. Thorough evaluation is recommended to ensure viability
in each use case.</p>
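For reference, the zero-shot classification scheme evaluated in this work can be sketched as follows; synthetic vectors stand in for the model's text and image encoders, and the prompt-embedding step is assumed rather than shown:

```python
import numpy as np

def zero_shot_classify(image_embs, label_embs):
    """Assign each image to the label whose (prompted) text embedding is
    most cosine-similar; no task-specific training is involved."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    lab = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return np.argmax(img @ lab.T, axis=1)

rng = np.random.default_rng(2)
# Stand-ins: one text embedding per class label (in practice, the encoded
# prompt, e.g. "a photo of a <label>"), and images as slightly perturbed
# copies of labels 0, 3, 3 so the expected assignment is known.
labels = rng.normal(size=(5, 64))
images = labels[[0, 3, 3]] + 0.01 * rng.normal(size=(3, 64))
pred = zero_shot_classify(images, labels)
```

The gaps reported above arise exactly here: if a model's text encoder lacks a concept (an Estonian place name, or even a common English word), its prompt embedding lands nowhere near the relevant photos.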
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their comments that helped to improve the paper. We
are grateful to Anna Grund for providing the annotated classification set and to the contributors
of Ajapaik, who made writing this paper possible.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Prompt Engineering</title>
      <p>The appendix contains the prompt sets and the corresponding results for scene classification
(Tables 6–7) and viewpoint elevation classification (Tables 8–9).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] <string-name>K. Aske</string-name> and <string-name>M. Giardinetti</string-name>. “<article-title>(Mis)Matching Metadata: Improving Accessibility in Digital Visual Archives through the EyCon Project</article-title>”. In: <source>ACM Journal on Computing and Cultural Heritage</source> 16.4 (<year>2023</year>), 76:1-76:20. doi: 10.1145/3594726.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name>A. Barancová</string-name>, <string-name>M. Wevers</string-name>, and <string-name>N. van Noord</string-name>. “<article-title>Blind Dates: Examining the Expression of Temporality in Historical Photographs</article-title>”. In: <source>Proceedings of the Computational Humanities Research Conference</source>. Ed. by A. Sela, F. Jannidis, and I. Romanowska. Vol. <volume>3558</volume>. CEUR Workshop Proceedings. Paris, France: CEUR-WS.org, <year>2023</year>, pp. <fpage>490</fpage>-<lpage>499</lpage>. url: https://ceur-ws.org/Vol-3558/paper5790.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name>K. He</string-name>, <string-name>X. Zhang</string-name>, <string-name>S. Ren</string-name>, and <string-name>J. Sun</string-name>. “<article-title>Deep Residual Learning for Image Recognition</article-title>”. In: <source>2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR</source>. Las Vegas, NV, USA: IEEE Computer Society, <year>2016</year>, pp. <fpage>770</fpage>-<lpage>778</lpage>. doi: 10.1109/cvpr.2016.90.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name>F. N. Iandola</string-name>, <string-name>M. W. Moskewicz</string-name>, <string-name>S. Karayev</string-name>, <string-name>R. B. Girshick</string-name>, <string-name>T. Darrell</string-name>, and <string-name>K. Keutzer</string-name>. <article-title>DenseNet: Implementing Efficient ConvNet Descriptor Pyramids</article-title>. <source>arXiv:1404.1869 [cs.CV]</source>. <year>2014</year>. doi: 10.48550/arXiv.1404.1869.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name>A. Krizhevsky</string-name>. <article-title>Learning Multiple Layers of Features from Tiny Images</article-title>. Technical Report, University of Toronto. <year>2009</year>. url: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name>J. Li</string-name>, <string-name>D. Li</string-name>, <string-name>S. Savarese</string-name>, and <string-name>S. C. H. Hoi</string-name>. “<article-title>BLIP-2: Bootstrapping Language-Image Pretraining with Frozen Image Encoders and Large Language Models</article-title>”. In: <source>International Conference on Machine Learning, ICML</source>. Ed. by A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett. Vol. <volume>202</volume>. Proceedings of Machine Learning Research. Honolulu, Hawaii, USA: PMLR, <year>2023</year>, pp. <fpage>19730</fpage>-<lpage>19742</lpage>. url: https://proceedings.mlr.press/v202/li23q.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name>C. D. Manning</string-name>, <string-name>P. Raghavan</string-name>, and <string-name>H. Schütze</string-name>. <source>Introduction to Information Retrieval</source>. Cambridge, UK: Cambridge University Press, <year>2008</year>. doi: 10.1017/cbo9780511809071.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] <string-name>A. Männistö</string-name>, <string-name>M. Seker</string-name>, <string-name>A. Iosifidis</string-name>, and <string-name>J. Raitoharju</string-name>. <article-title>Automatic Image Content Extraction: Operationalizing Machine Learning in Humanistic Photographic Studies of Large Visual Archives</article-title>. <source>arXiv:2204.02149 [cs.CV]</source>. <year>2022</year>. doi: 10.48550/arXiv.2204.02149.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <string-name>A. Radford</string-name>, <string-name>J. W. Kim</string-name>, <string-name>C. Hallacy</string-name>, <string-name>A. Ramesh</string-name>, <string-name>G. Goh</string-name>, <string-name>S. Agarwal</string-name>, <string-name>G. Sastry</string-name>, <string-name>A. Askell</string-name>, <string-name>P. Mishkin</string-name>, <string-name>J. Clark</string-name>, <string-name>G. Krueger</string-name>, and <string-name>I. Sutskever</string-name>. “<article-title>Learning Transferable Visual Models From Natural Language Supervision</article-title>”. In: <source>Proceedings of the 38th International Conference on Machine Learning, ICML</source>. Ed. by M. Meila and T. Zhang. Vol. <volume>139</volume>. Proceedings of Machine Learning Research. Virtual Event: PMLR, <year>2021</year>, pp. <fpage>8748</fpage>-<lpage>8763</lpage>. url: http://proceedings.mlr.press/v139/radford21a.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <string-name>O. Russakovsky</string-name>, <string-name>J. Deng</string-name>, <string-name>H. Su</string-name>, <string-name>J. Krause</string-name>, <string-name>S. Satheesh</string-name>, <string-name>S. Ma</string-name>, <string-name>Z. Huang</string-name>, <string-name>A. Karpathy</string-name>, <string-name>A. Khosla</string-name>, <string-name>M. S. Bernstein</string-name>, <string-name>A. C. Berg</string-name>, and <string-name>L. Fei-Fei</string-name>. “<article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>”. In: <source>Int. J. Comput. Vis.</source> 115.3 (<year>2015</year>), pp. <fpage>211</fpage>-<lpage>252</lpage>. doi: 10.1007/s11263-015-0816-y.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <string-name>M. Sandler</string-name>, <string-name>A. G. Howard</string-name>, <string-name>M. Zhu</string-name>, <string-name>A. Zhmoginov</string-name>, and <string-name>L. Chen</string-name>. “<article-title>MobileNetV2: Inverted Residuals and Linear Bottlenecks</article-title>”. In: <source>2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR</source>. Salt Lake City, UT, USA: Computer Vision Foundation / IEEE Computer Society, <year>2018</year>, pp. <fpage>4510</fpage>-<lpage>4520</lpage>. doi: 10.1109/cvpr.2018.00474.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] <string-name>T. Smits</string-name> and <string-name>M. Wevers</string-name>. “<article-title>A multimodal turn in Digital Humanities. Using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections</article-title>”. In: <source>Digit. Scholarsh. Humanit.</source> 38.3 (<year>2023</year>), pp. <fpage>1267</fpage>-<lpage>1280</lpage>. doi: 10.1093/llc/fqad008.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] <string-name>M. Springstein</string-name>, <string-name>S. Schneider</string-name>, <string-name>J. Rahnama</string-name>, <string-name>J. Stalter</string-name>, <string-name>M. Kristen</string-name>, <string-name>E. Müller-Budack</string-name>, and <string-name>R. Ewerth</string-name>. “<article-title>Visual Narratives: Large-scale Hierarchical Classification of Art-historical Images</article-title>”. In: <source>IEEE/CVF Winter Conference on Applications of Computer Vision, WACV</source>. Waikoloa, HI, USA: IEEE, <year>2024</year>, pp. <fpage>7195</fpage>-<lpage>7205</lpage>. doi: 10.1109/wacv57701.2024.00705.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] <string-name>D. Tschirschwitz</string-name>, <string-name>F. Klemstein</string-name>, <string-name>H. Schmidgen</string-name>, and <string-name>V. Rodehorst</string-name>. “<article-title>Drawing the Line: A Dual Evaluation Approach for Shaping Ground Truth in Image Retrieval Using Rich Visual Embeddings of Historical Images</article-title>”. In: <source>Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, HIP 2023</source>. San Jose, CA, USA: ACM, <year>2023</year>, pp. <fpage>13</fpage>-<lpage>18</lpage>. doi: 10.1145/3604951.3605524.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] <string-name>T. de Vries</string-name>, <string-name>I. Misra</string-name>, <string-name>C. Wang</string-name>, and <string-name>L. van der Maaten</string-name>. “<article-title>Does Object Recognition Work for Everyone?</article-title>” In: <source>IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>. Long Beach, CA, USA: Computer Vision Foundation / IEEE, <year>2019</year>, pp. <fpage>52</fpage>-<lpage>59</lpage>. url: http://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/de_Vries_Does_Object_Recognition_Work_for_Everyone_CVPRW_2019_paper.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] <string-name>S. Yin</string-name>, <string-name>C. Fu</string-name>, <string-name>S. Zhao</string-name>, <string-name>K. Li</string-name>, <string-name>X. Sun</string-name>, <string-name>T. Xu</string-name>, and <string-name>E. Chen</string-name>. <article-title>A Survey on Multimodal Large Language Models</article-title>. <source>arXiv:2306.13549 [cs.CV]</source>. <year>2023</year>. doi: 10.48550/arXiv.2306.13549.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>