=Paper=
{{Paper
|id=Vol-3834/paper20
|storemode=property
|title=Viability of Zero-shot Classification and Search of Historical Photos
|pdfUrl=https://ceur-ws.org/Vol-3834/paper20.pdf
|volume=Vol-3834
|authors=Erika Maksimova,Mari-Anna Meimer,Mari Piirsalu,Priit Järv
|dblpUrl=https://dblp.org/rec/conf/chr/MaksimovaMPJ24
}}
==Viability of Zero-shot Classification and Search of Historical Photos==
Erika Maksimova, Mari-Anna Meimer, Mari Piirsalu and Priit Järv∗
Institute of Software Science, Tallinn University of Technology, Estonia
Abstract
Multimodal neural networks are models that learn concepts in multiple modalities. The models can per-
form tasks like zero-shot classification: associating images with textual labels without specific training.
This promises both easier and more flexible use of digital photo archives, e.g. annotating and search-
ing. We investigate whether existing multimodal models can perform these tasks, when the data differs
from the typical computer vision training sets, on historical photos from a cultural context outside the
English speaking world.
Keywords
zero-shot learning, digital heritage, multimodal models
1. Introduction
Cultural heritage archives may contain millions of photos. For efficient searching, the images
need descriptions, like categories and captions. Traditionally, these are provided by human
annotators. Ajapaik¹ is a crowd-sourced digital photo archive. It contains historical photos
mainly from and related to Estonia and neighboring countries. The users of the archive upload
and annotate the photos. Multiple collections from museums and the national archive have
also been added. The earliest photos are dated before 1875, but the majority are taken from
1918 until present day. At the time of writing, Ajapaik contains 1181273 photos.
Figure 1 shows a screen capture from the website with image categorizations. The scene
category can be either “exterior” or “interior”. The viewpoint elevation category is “ground”,
“raised” or “aerial”. In the beginning of 2024, the scene category was specified for 43% and
the viewpoint for 37% of the images. For the last three years, the number of images has been
growing faster than the number of annotated images, meaning that the crowd of volunteers
cannot keep up with the growth of the archive. The existing categories are somewhat limited
and arbitrary, but adding new categorizations would further increase the annotation workload
of the volunteers.
Training convolutional neural networks (CNN) to recognize the categories has been the conventional
approach to automated annotation of images. Such task specific models have associated costs: the
human labor involved in preparation of the training data and the operation and maintenance of the model.
CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
∗ Corresponding author.
Email: erika.maksimova@gmail.com (E. Maksimova); marianna.meimer@taltech.ee (M. Meimer); mari.piirsalu@taltech.ee (M. Piirsalu); priit.jarv1@taltech.ee (P. Järv)
ORCID: 0000-0001-7725-543X (P. Järv)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://ajapaik.ee/
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Figure 1: Ajapaik photo view with image categorizations. Photo CC BY 4.0, Johannes Pääsuke, ERM
Fk 214:136, Eesti Rahva Muuseum, https://opendata.muis.ee/object/605721
Pre-trained multimodal neural networks come with the promise of removing these costs. As
an example, CLIP [9] is a relatively lightweight model that can encode both natural language
input and images into a shared multimodal vector representation. For example, the text “cow”
and a picture of a cow would have a very similar representation. This allows the model to
perform zero-shot classification: when presented with an example of an image and a label, the
model can immediately predict whether the label is associated with the image, without needing
to see any other examples. The implication is that CLIP and other similar multimodal models
can replace supervised computer vision models, like CNNs, without requiring large training
data sets.
There are limits to how well machine learning algorithms generalize to unseen data. Ulti-
mately, a model can only learn some representation of its training data, so the usefulness of the
model depends on how similar the distribution of data in the application is to the distribution
of the training data. For example, a study by de Vries et al. measured a 15-20% difference in the
classification accuracy of household items like soap between images from the United States and
from Somalia and Burkina Faso [15]. They associate this with representational bias, as the ma-
jority of images in many computer vision datasets originate from Western countries. Historical
appearance of locations, items, situations and common activities is similarly underrepresented.
Therefore, existing successful use cases of multimodal models on modern photographs, or their
evaluations on standard computer vision datasets like ImageNet [10] and CIFAR100 [5] are not
reliable indicators of their usefulness for historical photos.
In this paper, we investigate if using off-the-shelf multimodal models is a viable method for
classifying photos from Ajapaik. To understand what trade-offs or drawbacks this involves,
we do a comparison with supervised computer vision models. The multimodal vector repre-
sentations from the CLIP model can also be adapted for searching images based on their visual
content. This would be a very useful functionality for image collections, so we include the
evaluation of multimodal search in our experiment.
2. Related Work
The main inspiration for our work comes from the paper by Smits and Wevers [12]. They
demonstrate the capabilities of the CLIP model [9] on collections of magic lantern slides and
children’s books illustrations, originating from the 19th century to approximately 1940. In the
task of classifying indoor and outdoor images, CLIP was slightly less accurate than a convo-
lutional neural network trained specifically for the task. Smits and Wevers identify several
forms of bias that cause the model to make mistakes, like mis-identifying modern concepts
in historic images, and applying sex-role stereotypes. Their finding is that the differences in
visual representation do not impact the performance negatively. As an example, the concept
of family is recognized from illustrations of both people and anthropomorphic animals.
The report on CLIP by Radford et al. [9] also evaluates the model’s zero-shot performance
against a fully supervised neural network model. CLIP with no task-specific training outper-
forms ResNet50 [3] that was separately trained for each specific task, in classifying several
frequently used computer vision datasets, like ImageNet and CIFAR100. Radford et al. caution
that their evaluation set may be co-aligned with the capabilities of the model, which means
that the high performance is not guaranteed to carry over to applications. For example, in
their paper CLIP underperforms in the specialized task of classifying satellite images. The au-
thors observe that the natural language interface may be unsuited to specify more complex
tasks.
We refer to the papers by Aske and Giardinetti [1] and Männistö et al. [8] for a wider
overview of machine learning for visual archives and omit the discussion of methods that
do not involve large multimodal models here. Few papers explicitly investigate multimodal
models for historical image classification and retrieval. Barancová et al. explore the dating
of historical photos, concluding that zero-shot classification is relatively inefficient [2]. They
achieved better results by training a classifier on top of the multimodal model’s image rep-
resentation. Tschirschwitz et al. propose evaluation datasets and a framework for historical
images classification and retrieval [14]. In their study, zero-shot CLIP achieves the highest per-
formance, however they also report that additional qualitative evaluation did not confirm their
quantitative results. Springstein et al. present methods of classification of art-historical im-
ages in a hierarchical schema of visual themes [13]. Their work does not include off-the-shelf
multimodal models in the zero-shot setting.
Our paper uses the CLIP, SigLIP [17] and BLIP-2 [6] models that combine vision and language
transformers and can encode images and text into a shared multimodal vector representation.
There are many more multimodal models, with over 60 models cited in a recent survey [16].
This number is constantly growing. Most of these models are designed for “downstream” tasks
like image and text generation, and do not necessarily provide documented interfaces for clas-
sification or for accessing shared multimodal representations.
The contribution of our paper, as compared to evaluations in [9], is that we use a dataset
that is superficially similar (photos of people, everyday life, buildings) to mainstream computer
vision training datasets but different in two aspects. The pictures are from different historical
eras, and from outside the English speaking cultural sphere, which is significant because the
representations of concepts in models are learned through language modelling. Our contribu-
tion is complementary to [12] and [14] as it is a similar investigation on a different dataset.
We report results for both classification and search. We break down the search evaluation to
differentiate between very general concepts, and entities and objects that are distinctly local
to the cultural context of Ajapaik.
3. Methods
We present two experiments, covering two common use cases of a photo collection: classification
and search. The source code of the experiments is available at https://github.com/priitj/chr2024/.
Because the copyright of most of the photos used in the experiments is held privately,
we cannot reproduce the photos in the paper and do not distribute the datasets.
Our classification experiment measures the capability of multimodal models to automatically
label photos in a collection with a fixed set of categories. In the search experiment, a query
text is given and a multimodal model is used to retrieve a set of matching images. We measure
how well the models rank the images by the relevance of their visual content to the query text.
We evaluate three multimodal models. CLIP was used in previous research on historical im-
ages [12, 2, 14, 13], and is generally widely adopted and cited, so we include it as a reference
multimodal model. SigLIP is very similar to CLIP, but uses a different training objective. In
the evaluation done by its authors, SigLIP outperforms different variations of CLIP in all
classification tests [17]. Based on this, we included SigLIP in our study. For both of these models,
we use HuggingFace Transformers² implementations.
BLIP-2 is another model with an architecture optimized for efficient training. It is reported
to outperform CLIP in text-to-image retrieval [6]. For this model we used implementations
from the Salesforce LAVIS³ language-vision package.
The nomenclature of these models used in machine learning literature also includes the de-
scription of their vision transformer component. A ViT-B vision transformer is the “base” size
with 12 transformer layers, while a ViT-L is the “large” transformer, with 24 layers and in-
creases in other settings as well. The patch size describes how the input image is partitioned
before feeding it to the transformer: a “patch 32” model uses 32x32 pixel squares. Therefore,
at the same image resolution, the input fed to the transformer of a 16x16 patch model contains
four times as many patches as that of a 32x32 patch model, making
it more computationally expensive. These variations have an impact on the performance of
the models, so we include three different configurations of both CLIP and SigLIP in our exper-
iments.
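The fourfold difference between patch sizes can be checked by counting patches. The following is a minimal sketch, assuming a square 224x224 pixel input (an illustrative resolution, not a value stated in this paper) and a hypothetical helper `num_patches`:

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches that tile a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: 224 is an illustrative input resolution, not taken from the paper.
IMAGE_SIZE = 224
tokens_p32 = num_patches(IMAGE_SIZE, 32)  # 7 patches per side
tokens_p16 = num_patches(IMAGE_SIZE, 16)  # 14 patches per side
ratio = tokens_p16 / tokens_p32           # how many times longer the input is
```

Halving the patch side doubles the number of patches along each image dimension, so the transformer input grows by a factor of four regardless of the resolution chosen.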
² https://huggingface.co/
³ https://github.com/salesforce/LAVIS
Table 1
The number of images in the classification set
Category             Category label  Photos
Scene                interior          3056
                     exterior          8614
Viewpoint elevation  ground           10662
                     raised            3057
                     aerial            3053
3.1. Classification Experiment
In the classification experiment, we use a sample of 17042 photos from the Ajapaik collection.
The images are annotated with the scene category, the viewpoint elevation category, or both.
Table 1 gives the number of photos in each category. Additional photos were added to less
frequent categories like “raised”, such that each category has at least 3000 photos. This was
done to obtain better performance with the supervised baseline models and to ensure enough
test examples in those categories. For the remainder of the paper, we refer to this sample as
the classification set.
We test scene category and viewpoint elevation category classification separately. With the
supervised baselines, we use 5-fold cross-validation. In each round of cross-validation, the
images are split 75:5:20 between train, validation and test parts. The 5% validation part is only
used to automatically select the best model during training. All the measurements reported
in the paper are done on the 20% test parts, which cover the entire set of images in a given
category over the 5 rounds of cross-validation.
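The splitting scheme can be sketched as follows. This is an illustration under assumptions, not the code used in the experiments: the helper `five_fold_splits` is hypothetical, the shuffle is not stratified by class, and the validation part is simply taken from the non-test images.

```python
import random


def five_fold_splits(items, seed=0, n_folds=5, val_fraction=0.05):
    """Yield (train, val, test) lists so that each item is tested exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Deal the shuffled items into n_folds disjoint folds of near-equal size.
    folds = [items[k::n_folds] for k in range(n_folds)]
    n_val = round(val_fraction * len(items))
    for k in range(n_folds):
        test = folds[k]
        rest = [x for j, fold in enumerate(folds) if j != k for x in fold]
        # The validation part is carved out of the images not used for testing.
        val, train = rest[:n_val], rest[n_val:]
        yield train, val, test


# Toy run on 1000 integer "image ids": 750 train, 50 validation, 200 test per round.
splits = list(five_fold_splits(range(1000)))
```

Because the five test parts are disjoint and together cover all items, measurements on the test parts cover the entire set once.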
With CLIP and SigLIP, classifications for all images in a given category are computed di-
rectly by the model from the input of an image and a set of prompts. We use the category
labels “interior”, “exterior”, “ground”, “raised” and “aerial” as initial prompts for their respec-
tive categories.
Both Radford et al. [9] and Smits and Wevers [12] observe that the selection of prompts to
represent the classes has a noticeable effect on classification performance. With the Ajapaik
classification set, this effect should also be expected. The single word “raised” does not describe
the class very precisely and only becomes meaningful if we know that the context is viewpoint
elevation. Accordingly, we expand our sets of prompts for classification with different natural
language phrases describing the categories (Tables 6 and 8).
To the best of our knowledge, BLIP-2 does not include a classification model. We implement
a simple nearest neighbors classifier on top of BLIP-2 vector representation. We compute image
vectors 𝐼𝑖 and text vectors 𝑇𝑗 with BLIP-2 for an image 𝑖 and a prompt 𝑗. The class of the image
is the one that maximises the similarity of vectors:
$$\mathrm{sim}(i, j) = \frac{I_i \cdot T_j}{\lVert I_i \rVert \, \lVert T_j \rVert} \qquad (1)$$
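A nearest neighbors classifier of this kind can be sketched in a few lines. The example below is illustrative only: the three-dimensional vectors are made-up stand-ins for the model's image and text representations, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors, as in Equation 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_vec, prompt_vecs):
    """Return the label whose prompt vector is most similar to the image vector."""
    return max(prompt_vecs, key=lambda label: cosine(image_vec, prompt_vecs[label]))


# Toy vectors (assumption): real representations have hundreds of dimensions.
prompts = {
    "ground": [1.0, 0.1, 0.0],
    "raised": [0.1, 1.0, 0.0],
    "aerial": [0.0, 0.1, 1.0],
}
image = [0.2, 0.1, 0.9]
predicted = classify(image, prompts)
```

With several prompts per class, one option would be to keep the best-scoring prompt of each class before taking the maximum; whether that matches the experiments here is not stated.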
For baselines, we use convolutional neural networks (CNNs) and transfer learning. All se-
lected models are pre-trained on ImageNet. We train a shallow classifier on top of the pre-
trained CNNs. Scene category classifiers and viewpoint elevation classifiers are trained sepa-
rately.
For CNN architectures, we selected ResNet18 and ResNet50, as they were used as baselines
in papers [12] and [9], respectively. Additionally, we selected DenseNet121 [4] to represent
a deeper architecture, and MobileNetV2 [11] as a more modern, lightweight computer vision
model.
We measure the classification performance using per-class F1-score. For a given class, true
positives (TP) is the number of images that were correctly predicted by the model to be in
this class. False positives (FP) is the number of images that were incorrectly predicted to be in
the class. False negatives (FN) is the number of images that belong to the class, but the model
predicted a different class label. The F1-score penalizes both false positives and false negatives:
$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \qquad (2)$$
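As a small worked example of Equation 2, the sketch below counts TP, FP and FN for one class from paired lists of true and predicted labels. The labels are invented toy data and `per_class_f1` is a hypothetical helper name.

```python
def per_class_f1(y_true, y_pred, label):
    """Per-class F1-score computed from TP, FP and FN counts (Equation 2)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denominator = 2 * tp + fp + fn
    # Convention (assumption): a class that never occurs and is never predicted scores 0.
    return 2 * tp / denominator if denominator else 0.0


# Toy labels, not data from the experiments.
y_true = ["ground", "ground", "raised", "aerial", "aerial", "raised"]
y_pred = ["ground", "aerial", "ground", "aerial", "aerial", "raised"]

f1_ground = per_class_f1(y_true, y_pred, "ground")  # TP=1, FP=1, FN=1
f1_raised = per_class_f1(y_true, y_pred, "raised")  # TP=1, FP=0, FN=1
f1_aerial = per_class_f1(y_true, y_pred, "aerial")  # TP=2, FP=1, FN=0
```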
3.2. Search Experiment
For the search experiment, we downloaded the metadata of photos from the Ajapaik API.⁴ We
then randomly selected 11000 photos that had a textual description. Because of downloading
and image file format errors, our final sample has 10846 images. We will refer to this sample
as the search set.
We evaluate the search performance by letting the models rank the images by their relevance
to search terms. To ensure that at least one relevant photo exists for each search term, we
extracted the search terms from the textual descriptions of the images in the search set. We
translated descriptions of images to English using the deep-translator⁵ package. We then
POS-tagged and lemmatized the words in the descriptions, and detected named entities using the
SpaCy⁶ library.
We select the search terms with the assumption that the users would mostly search for ob-
jects (such as “boat”), events (“exhibition”), activities (“riding”) or named entities, like a place
name or a person. These search terms would also be non-ambiguous enough so that we can
decide whether a retrieved photo is relevant to the search term. We select English language
nouns as examples of objects, events and phenomena. Verbs are examples of activities. In total,
we used 8 different categories, with 10 terms in each category. The selected search terms are
listed in Table 2.
Common and rare sets are included so that both easier and more difficult searches are represented.
In common sets, we selected 10 terms that occurred most frequently in descriptions.
All common search terms had occurred with at least 20 photos, and sometimes with hundreds.
In rare sets, we selected 10 random terms from among those that had occurred once.
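The frequency-based selection can be illustrated as follows. This is a sketch under assumptions: the input is taken to be one list of lemmas per photo for a single term category, `select_terms` is a hypothetical helper, and the manual adjustments to the term lists are not modelled.

```python
import random
from collections import Counter


def select_terms(descriptions, n=10, seed=0):
    """Pick the n most frequent terms and n random terms that occur with one photo.

    descriptions: one list of lemmas per photo (assumed input format).
    """
    # Count each term once per photo, so the count is the number of photos.
    photo_counts = Counter(term for lemmas in descriptions for term in set(lemmas))
    common = [term for term, _ in photo_counts.most_common(n)]
    once = sorted(term for term, count in photo_counts.items() if count == 1)
    rare = random.Random(seed).sample(once, min(n, len(once)))
    return common, rare


# Toy descriptions, not data from the search set.
descriptions = [
    ["boat", "view"],
    ["view", "farm"],
    ["view", "rumba"],
    ["farm", "heath"],
]
common, rare = select_terms(descriptions, n=2)
```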
The random English words selection is used to diversify the search terms and to reduce any
unintentional bias from the frequency based selection of other terms. The same random words
are human translated to Estonian. This set is used to test whether the models give any useful
results with non-English input.
⁴ https://opendata.ajapaik.ee/
⁵ https://github.com/nidhaloff/deep-translator
⁶ https://spacy.io
Table 2
Selected search terms
Objects, events, phenomena
  Common: view, farm, building, group, anniversary, portrait, exhibition, child, school, competition
  Rare: venue, wetland, stomach, tricolor, rumba, detective, lore, heath, population, score
Activities
  Common: commissioning, performing, leaving, speaking, taking, giving, working, making, sitting, standing
  Rare: fertilizing, hoeing, windmilling, watering, mourning, seamstressing, illustrating, schoolmastering, stitching, binding
Named entities
  Common: Tallinn, the University of Tartu, Tartu, Viljandi, Narva, Tartu University, Harju, V. Kingissepa, Rakvere, Moscow
  Rare: Jaan Kadakas, Jüri Randla, U.K. Kekkonen, Ralf Allikvee, A. Rosenberg, Setumaa, Karl Parts, Sara Teitelbaum, Margit Tooman, Valeri Kirss
Random words
  English: winter, interior, milk, village, boy, driving, education, fence, lecturer, horse
  Estonian: talv, interjöör, piim, küla, poiss, sõitmine, haridus, aed, lektor, hobune
The terms selected by automatic criteria required some manual changes. For example, with
named entities, we decided not to include names of countries, because “Estonia” would match
with the majority of photos. An excluded search term was replaced with the next most frequent
term for the common categories, and a new randomly selected term for the other categories.
We implement search by computing text-to-image similarity. For an image 𝑖 and search term
𝑗, the similarity is computed using Equation 1 from the multimodal representation vectors 𝐼𝑖
and 𝑇𝑗 . The search results for a search term 𝑗 are 𝑘 most similar images, sorted by 𝑠𝑖𝑚(𝑖, 𝑗). Our
implementation uses the Voyager⁷ approximate nearest neighbors index to find and sort the
most similar images.
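For illustration, the same ranking can be written as an exact, brute-force search. This sketch deliberately does not use the Voyager index: it normalizes all vectors once, so that the cosine similarity of Equation 1 reduces to a dot product, and then keeps the 𝑘 best images. The vectors and function names are hypothetical.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(image_vecs):
    """Pre-normalize image vectors; image_vecs maps image id -> vector."""
    return {image_id: normalize(vec) for image_id, vec in image_vecs.items()}


def search(index, query_vec, k=10):
    """Return the ids of the k images most similar to the query, best first."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(vec, q)), image_id)
        for image_id, vec in index.items()
    )
    return [image_id for _, image_id in heapq.nlargest(k, scored)]


# Toy two-dimensional vectors, not real model output.
index = build_index({"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]})
top2 = search(index, [1.0, 0.0], k=2)
```

An exact search like this scans every image for every query; an approximate index trades a small risk of missing a true neighbor for much faster lookups on large collections.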
We evaluate the ability of models to rank relevant results above irrelevant ones. We selected
mean average precision (MAP) as the measurement of the quality of search results [7, p. 155-
161]. MAP is a robust measure that is not sensitive to the number of relevant documents in the
search set. If there are not enough relevant photos, MAP does not penalize filling the remainder
of the search results with irrelevant photos.
We report the measurements for the top-𝑘 results, considering that this is what the user will
see in practical application. For a query 𝑞 belonging to a set of queries 𝑄, let 𝑅𝑞 (𝑘) be the set
of relevant results among the first 𝑘 results. Let 𝑟𝑖 be a search result at position 𝑖. Average
precision for 𝑘 results, or AP@𝑘, is calculated
$$\mathrm{AP@}k(q) = \frac{1}{|R_q(k)|} \sum_{r_i \in R_q(k)} \frac{|R_q(i)|}{i} \qquad (3)$$
⁷ https://github.com/spotify/voyager
Table 3
Classification performance of supervised baselines (per-class F1)
Model        Interior  Exterior  Ground  Raised  Aerial
ResNet18     0.87      0.96      0.92    0.63    0.90
ResNet50     0.89      0.96      0.94    0.70    0.92
DenseNet121  0.87      0.96      0.92    0.66    0.92
MobileNetV2  0.86      0.95      0.93    0.71    0.91
AP rewards ranking relevant results above irrelevant results. For example, if the first 4
among top 10 results are relevant, then AP@10 = 1.0. If the 4 relevant results are ranked
below 6 irrelevant results, AP@10 = $\frac{1}{4}\left(\frac{1}{7} + \frac{2}{8} + \frac{3}{9} + \frac{4}{10}\right) \approx 0.28$.
If there are no relevant results, AP = 0. Mean average precision (MAP) is calculated over a set of queries:
$$\mathrm{MAP@}k(Q) = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP@}k(q) \qquad (4)$$
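The two worked cases above can be reproduced with a short sketch of Equations 3 and 4. The function names are hypothetical, and each query is represented simply as a ranked list of relevance judgments.

```python
def average_precision_at_k(relevance, k=10):
    """AP@k for one query; relevance is a ranked list of booleans (Equation 3)."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            # Precision at this position: relevant results so far / position.
            precisions.append(hits / position)
    # By the convention in the text, AP is 0 when nothing relevant is retrieved.
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision_at_k(queries, k=10):
    """MAP@k over a list of ranked relevance lists (Equation 4)."""
    return sum(average_precision_at_k(r, k) for r in queries) / len(queries)


relevant_first = [True] * 4 + [False] * 6
relevant_last = [False] * 6 + [True] * 4
nothing_relevant = [False] * 10

ap_first = average_precision_at_k(relevant_first)
ap_last = average_precision_at_k(relevant_last)
ap_none = average_precision_at_k(nothing_relevant)
map_all = mean_average_precision_at_k([relevant_first, relevant_last, nothing_relevant])
```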
While the search set contains some positive labels to evaluate relevance of images to queries,
there are no negative labels. For example, if a description of a photo includes the word “boy”
we could count it as relevant towards a query “boy”, however there is no information about
what the photo does not depict. Therefore, the relevance of each photo in search results to each
search term was evaluated by human judges.
4. Results
We begin with the performance of the supervised baselines to provide a frame of reference to
the results obtained with multimodal models. Table 3 lists the classification results with the
CNN models trained specifically for the scene category and viewpoint elevation classification
tasks. We observe that firstly, the “interior”/“exterior” classification is the easier one of the two
tasks shown. Secondly, the “raised” category is ambiguous as a textual description, but based
on Table 3 it is also ambiguous visually. The per-class F1-score for “raised” is 20-30 percentage
points lower than for the other classes, indicating that the supervised models struggle to generalize
this concept.
With multimodal models, we first report per-class classification performance when using
class labels “as is”. For example, when classifying the viewpoint elevation category, the image
is matched with the texts “ground”, “raised” and “aerial”.
Table 4 lists the per-class F1 scores for scene and viewpoint elevation categories. As with
the supervised baselines, the scene category classification works better. Also similarly to the
supervised baselines, the “raised” class is the most difficult. However, compared to Table 3, the
multimodal models do much worse on this class, with F1-scores ranging from 0.02 to 0.24.
Several outcomes were unexpected. Contrary to the evaluations by the authors of the models
[9, 17], smaller versions of the models perform as well as or better than bigger ones. Unlike in the
paper by Zhai et al. [17], SigLIP does not clearly outperform CLIP in classification; in fact
Table 4
Classification performance with class labels as prompts (per-class F1)
Model   Transformer  Patch  Interior  Exterior  Ground  Raised  Aerial
        size         size
CLIP    ViT-B        32     0.82      0.94      0.82    0.06    0.62
CLIP    ViT-B        16     0.81      0.93      0.79    0.02    0.64
CLIP    ViT-L        14     0.78      0.90      0.79    0.11    0.73
SigLIP  ViT-B        16     0.76      0.88      0.77    0.24    0.80
SigLIP  ViT-L        16     0.77      0.89      0.74    0.17    0.74
SigLIP  SO-400m      14     0.79      0.90      0.76    0.05    0.73
BLIP-2  ViT-L        14     0.65      0.78      0.80    0.20    0.69
CLIP ViT-B/32
        Ground  Raised  Aerial
Ground  0.78    0.06    0.16
Raised  0.39    0.03    0.58
Aerial  0.02    0.00    0.98
SigLIP ViT-B/16
        Ground  Raised  Aerial
Ground  0.80    0.20    0.00
Raised  0.67    0.23    0.10
Aerial  0.27    0.00    0.73
SigLIP SO-400m/14
        Ground  Raised  Aerial
Ground  0.85    0.15    0.00
Raised  0.94    0.04    0.02
Aerial  0.41    0.00    0.59
BLIP-2
        Ground  Raised  Aerial
Ground  0.77    0.10    0.14
Raised  0.47    0.15    0.38
Aerial  0.02    0.00    0.98
Figure 2: Confusion matrices for viewpoint elevation. Rows: true classes, columns: predicted classes.
the small CLIP ViT-B/32 model is the best in predicting three classes out of five. It is also
surprising that the performance in the aerial category is low, because we would expect the
aerial photographs to be visually distinct.
The reason behind the lower performance for the aerial category is revealed by the confusion
matrices in Figure 2. The rows are the true classes and the columns are the classes that the
model predicted. The CLIP ViT-B/32 and BLIP-2 models labeled 98% of aerial pictures correctly;
only 2% of them were labeled as ground. Their low per-class score is caused by the false positives,
as they heavily tend towards the aerial category and also label other photos as aerial.
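The effect of these false positives can be checked approximately by combining the confusion matrix with the class sizes. The sketch below assumes that the matrix rows for CLIP ViT-B/32 in Figure 2 are fractions of each true class and that the class sizes in Table 1 apply; because the matrix entries are rounded, the result only approximates the F1-score in Table 4. The helper `scores` is hypothetical.

```python
CLASSES = ["ground", "raised", "aerial"]

# CLIP ViT-B/32 confusion matrix from Figure 2: rows are true classes,
# columns are predicted classes, entries are fractions of the true class.
ROWS = {
    "ground": [0.78, 0.06, 0.16],
    "raised": [0.39, 0.03, 0.58],
    "aerial": [0.02, 0.00, 0.98],
}

# Viewpoint elevation class sizes from Table 1 (assumed to apply to Figure 2).
COUNTS = {"ground": 10662, "raised": 3057, "aerial": 3053}


def scores(label):
    """Approximate precision, recall and F1 for one class from rounded fractions."""
    j = CLASSES.index(label)
    tp = ROWS[label][j] * COUNTS[label]
    fn = COUNTS[label] - tp
    fp = sum(ROWS[other][j] * COUNTS[other] for other in CLASSES if other != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


precision_aerial, recall_aerial, f1_aerial = scores("aerial")
```

Under these assumptions, recall for the aerial class stays at 0.98 while precision falls below one half, which is what pulls the F1-score down to roughly the value reported in Table 4.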
In comparison, the SigLIP models tend heavily towards predicting the ground category. Due
to having fewer false positives for aerial, the per-class score is higher. The trade-off is that the
per-class score for ground is lowered. SigLIP ViT-B/16 has the most balanced performance in
the viewpoint elevation category, thanks to being able to detect raised elevation photos better
than the other models.
Using more descriptive prompts allowed multimodal models to reach higher performance,
but the overall impact of prompt engineering was mixed. We provide the full results in Appendix A.
Tables 6 and 7 give the prompts and mean F1-scores for the scene category classification
task. Tables 8 and 9 give the prompts and results for the viewpoint elevation classification
task.
A brief summary: with the prompts “indoor scene” and “outdoor scene”, SigLIP ViT-B/16
[Figure 3: two bar charts of per-class F1-score. Panel “Scene” (Interior, Exterior) and panel “Viewpoint elevation” (Ground, Raised, Aerial), each comparing SigLIP ViT-B/16 (class labels), SigLIP ViT-B/16 (best prompt) and ResNet50.]
Figure 3: Comparison of multimodal and supervised baseline classifiers.
model achieves F1 = 0.92 for the interior and F1 = 0.97 for the exterior scenes. This is the
only instance out of all combinations of models and prompts, where a multimodal model out-
performs any of the supervised baselines. However, out of 42 tests using the more descriptive
prompts in scene category classification, in 24 instances the performance dropped, compared
to using class labels directly. The more descriptive prompts had a positive impact for the CLIP
model in viewpoint elevation classification. In total, we did 84 tests with prompt engineering
and in exactly half (42) the performance improved or remained the same. In the remaining 42
test instances the performance dropped.
Figure 3 compares the best performing multimodal models and supervised baselines. We
selected SigLIP ViT-B/16 as the representative of multimodal models, as it achieved multiple
highest per-class and mean F1-scores, including the one result that outperformed the baselines.
We selected ResNet50 as the representative of CNNs. The best results of multimodal models
are competitive with supervised baselines in scene classification and below the baselines by a
large margin in viewpoint elevation classification.
The search experiments are summarized in Table 5. Two weak categories are clearly visible
– the rare named entities and the Estonian language search terms. The models perform the best
when searching for objects, then activities and are overall weakest when searching for named
entities.
As in the classification experiment, the relative performance of the models differs from
evaluations in previous literature. Surprisingly, the best performing model is clearly SigLIP
SO-400m/14, which is optimized for classification [17]. BLIP-2, which we included due to prior
strong results in text-to-image retrieval, did not perform as well.
The MAP scores in Table 5 do not tell us directly whether there were many correct results and
the ranking within top-10 did not matter, or if the models were able to precisely rank relevant
photos above irrelevant ones. We analyze the ranking ability in the left graph of Figure 4. The
AP@10 results are plotted against the number of relevant photos in top-10. With SigLIP SO-
400m/14, the best performing model, we still see that the best AP@10 score drops below 1.0
when there were fewer than 4 relevant results. In other words, in searches with 1-3 matches,
there were always irrelevant photos ranked above relevant ones.
Table 5
Search results by search term categories, mean average precision (MAP@10)
                            Objects         Activities      Named entities  Random
Model   Transformer  Patch  Common  Rare    Common  Rare    Common  Rare    English  Estonian
        size         size
CLIP    ViT-B        32     0.83    0.45    0.79    0.47    0.39    0.05    0.85     0.19
CLIP    ViT-B        16     0.85    0.40    0.68    0.60    0.47    0.01    0.89     0.04
CLIP    ViT-L        14     0.83    0.43    0.69    0.47    0.54    0.06    0.90     0.23
SigLIP  ViT-B        16     0.92    0.53    0.79    0.63    0.39    0.01    0.90     0.40
SigLIP  ViT-L        16     0.94    0.55    0.79    0.75    0.50    0.00    0.92     0.39
SigLIP  SO-400m      14     0.95    0.59    0.84    0.73    0.74    0.02    0.96     0.57
BLIP-2  ViT-L        14     0.84    0.47    0.71    0.47    0.19    0.05    0.85     0.12
[Figure 4: left panel “SigLIP SO-400m/14”, AP@10 plotted against the number of relevant photos in top-10; right panel “All models”, number of searches plotted against AP@10.]
Figure 4: Left: ranking performance, AP@10 by the number of relevant results in top-10. Right: overall
distribution of AP@10.
The second question is whether the models can find all search terms, or whether there are gaps that
the aggregate MAP scores do not show. We present the distribution of all AP@10 scores in
Figure 4, right graph. Over all models, the most likely result of a search is either a complete
success or a complete failure, with intermediate results much less likely.
Clearly, many of the low AP@10 results come from the named entity or Estonian language
searches. When we look at the other five categories of regular English language words, a more
interesting result emerges. For each model, there are 3-10 terms where AP@10 was under
0.3. At the same time, for each term at least one model achieved AP@10 > 0.5, except the
word “score” where the best AP@10 = 0.38. Therefore, each model has gaps in its knowledge,
but these gaps lie in different places for different models, including different sizes of CLIP and
SigLIP.
5. Discussion
Prior evaluations on mainstream computer vision datasets do not generalize well to the case of
Ajapaik. Previously established rankings of which models perform better in classification and
search ([17] and [6], respectively) were not reproduced in our experiments. This implies that
if the users want to ensure good performance, they need to test their own particular use case,
which would involve human evaluation or annotation of data.
The multimodal models did well in the scene category classification task and less so in view-
point elevation classification. Prompt engineering closed the gap to supervised baselines, but
the multimodal models responded to prompts unpredictably. Using more descriptive prompts,
like “elevated view” instead of “raised”, was equally likely to increase or to decrease the perfor-
mance. This is important, because there was no prior indication of what model and prompt set
would be best. We only know that SigLIP ViT-B/16 performed well in scene category thanks
to having the annotated classification set.
The evidence from prompt engineering tests shows that the difÏculties encountered in the
classification have more to do with having to specify the classification task precisely through
the natural language interface, as was in fact anticipated by Radford et al. when they discuss
applications of CLIP [9]. The only practical way of mitigating this is to move away from the
zero-shot setting. Training a classifier on top of multimodal representations, as done in [2],
may require amounts of annotated data similar to those needed for CNNs. However, Radford et al. show
improved performance with multi-shot learning where far fewer examples are needed (up
to 16 in the paper) [9].
The impact of the cultural sphere on the search results was clear. The Estonian language,
localities and persons are not well represented in the models. When we remove this requirement
of out-of-domain knowledge and look at 50 search terms of common English words, the
models still have gaps, with 3-10 searches per model failing. Importantly, which terms fail
differs by model. On the positive side, this unpredictability is not as pronounced as in
classification. The top search results are clearly populated with relevant photos for common
objects, activities and random English words, independent of the model.
Our paper has multiple limitations. Each photo–term pair in the search results was judged
for relevance by one person only. Methodologically, it would be preferable for several
people to validate each result. However, the photo–term pairs were distributed between the
judges in a way that was, for practical purposes, random. Each set of photos returned by a
model was therefore validated by multiple judges, which should dilute the effect of possible
bias.
In the search experiment, we used the Voyager approximate nearest neighbors index. It
is possible that Voyager had some impact on the search results if it did not return the exact
set of 𝑘 nearest neighbors each time. To verify this, the search experiment would have to be
repeated without an index; we omitted this because of the additional labor needed to validate
the results.
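The part of such a check that needs no human judgement can be sketched as follows: compute the exact 𝑘 nearest neighbors by brute force and measure how many of them the approximate index returned. The vectors and the approximate result list below are hypothetical, and the sketch does not call Voyager itself.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def exact_knn(query, vectors, k):
    """Brute-force search: ids of the k vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Share of the exact neighbors that the approximate index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical image embeddings and a text query embedding.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]

exact = exact_knn(query, vectors, k=3)
approx = [0, 1, 3]  # hypothetical output of an approximate index
print(exact)  # [0, 1, 2]
print(round(recall_at_k(approx, exact), 2))  # 0.67
```

A recall of 1.0 over all search terms would indicate that the index had no effect on the result lists, in which case the existing relevance judgements would carry over unchanged.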
There are many other multimodal models that we did not include in our experiment.
Additional investigation is needed to determine which of those, if any, provide an interface to
a shared multimodal vector representation of text and images. Finally, we used only one dataset,
so our results may not generalize to other similar datasets.
6. Conclusions
We investigated the viability of zero-shot classification and search of historical photos in
Ajapaik. We found that this application domain is different enough from the models’ training
data that expectations based on previous evaluations of model performance do not hold.
Multimodal models can successfully search for common everyday concepts from the photos.
However, zero-shot usage on this archive is problematic. Firstly, models have unpredictable
gaps in their knowledge of common English words, and these gaps differ between models and
model sizes. Secondly, knowledge of Estonian language words and names is mostly missing.
Thirdly, classification performance is below supervised baselines and cannot be easily improved
with prompt engineering. Therefore, in the context of historical visual archives, the multimodal
models do not deliver on the promise of removing the costs of dataset annotation. Thorough
evaluation is recommended to ensure viability in each use case.
Acknowledgments
We thank the anonymous reviewers for their comments that helped to improve the paper. We
are grateful to Anna Grund for providing the annotated classification set and to the contributors
of Ajapaik, who made writing this paper possible.
References
[1] K. Aske and M. Giardinetti. “(Mis)Matching Metadata: Improving Accessibility in Digital
Visual Archives through the EyCon Project”. In: ACM Journal on Computing and Cultural
Heritage 16.4 (2023), 76:1–76:20. doi: 10.1145/3594726.
[2] A. Barancová, M. Wevers, and N. van Noord. “Blind Dates: Examining the Expression of
Temporality in Historical Photographs”. In: Proceedings of the Computational Humanities
Research Conference. Ed. by A. Sela, F. Jannidis, and I. Romanowska. Vol. 3558. CEUR
Workshop Proceedings. Paris, France: CEUR-WS.org, 2023, pp. 490–499. url: https://ceur-ws.org/Vol-3558/paper5790.pdf.
[3] K. He, X. Zhang, S. Ren, and J. Sun. “Deep Residual Learning for Image Recognition”. In:
2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. Las Vegas, NV,
USA: IEEE Computer Society, 2016, pp. 770–778. doi: 10.1109/cvpr.2016.90.
[4] F. N. Iandola, M. W. Moskewicz, S. Karayev, R. B. Girshick, T. Darrell, and K. Keutzer.
DenseNet: Implementing Efficient ConvNet Descriptor Pyramids. arXiv:1404.1869 [cs.CV].
2014. doi: 10.48550/arXiv.1404.1869.
[5] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical Report,
University of Toronto. 2009. url: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[6] J. Li, D. Li, S. Savarese, and S. C. H. Hoi. “BLIP-2: Bootstrapping Language-Image Pre-
training with Frozen Image Encoders and Large Language Models”. In: International Con-
ference on Machine Learning, ICML. Ed. by A. Krause, E. Brunskill, K. Cho, B. Engelhardt,
S. Sabato, and J. Scarlett. Vol. 202. Proceedings of Machine Learning Research. Honolulu,
Hawaii, USA: PMLR, 2023, pp. 19730–19742. url: https://proceedings.mlr.press/v202/li23q.html.
[7] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cam-
bridge, UK: Cambridge University Press, 2008. doi: 10.1017/cbo9780511809071.
[8] A. Männistö, M. Seker, A. Iosifidis, and J. Raitoharju. Automatic Image Content Extraction:
Operationalizing Machine Learning in Humanistic Photographic Studies of Large Visual
Archives. arXiv:2204.02149 [cs.CV]. 2022. doi: 10.48550/arXiv.2204.02149.
[9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. “Learning Transferable Visual Models
From Natural Language Supervision”. In: Proceedings of the 38th International Conference
on Machine Learning, ICML. Ed. by M. Meila and T. Zhang. Vol. 139. Proceedings of Machine
Learning Research. Virtual Event: PMLR, 2021, pp. 8748–8763. url: http://proceedings.mlr.press/v139/radford21a.html.
[10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A.
Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei. “ImageNet Large Scale Visual Recognition
Challenge”. In: Int. J. Comput. Vis. 115.3 (2015), pp. 211–252. doi: 10.1007/s11263-015-0816-y.
[11] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. “MobileNetV2: Inverted
Residuals and Linear Bottlenecks”. In: 2018 IEEE Conference on Computer Vision and Pat-
tern Recognition, CVPR. Salt Lake City, UT, USA: Computer Vision Foundation / IEEE
Computer Society, 2018, pp. 4510–4520. doi: 10.1109/cvpr.2018.00474.
[12] T. Smits and M. Wevers. “A multimodal turn in Digital Humanities. Using contrastive
machine learning models to explore, enrich, and analyze digital visual historical collec-
tions”. In: Digit. Scholarsh. Humanit. 38.3 (2023), pp. 1267–1280. doi: 10.1093/llc/fqad008.
[13] M. Springstein, S. Schneider, J. Rahnama, J. Stalter, M. Kristen, E. Müller-Budack, and
R. Ewerth. “Visual Narratives: Large-scale Hierarchical Classification of Art-historical
Images”. In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV.
Waikoloa, HI, USA: IEEE, 2024, pp. 7195–7205. doi: 10.1109/wacv57701.2024.00705.
[14] D. Tschirschwitz, F. Klemstein, H. Schmidgen, and V. Rodehorst. “Drawing the Line: A
Dual Evaluation Approach for Shaping Ground Truth in Image Retrieval Using Rich Vi-
sual Embeddings of Historical Images”. In: Proceedings of the 7th International Workshop
on Historical Document Imaging and Processing, HIP@ICDAR 2023. San Jose, CA, USA: ACM,
2023, pp. 13–18. doi: 10.1145/3604951.3605524.
[15] T. de Vries, I. Misra, C. Wang, and L. van der Maaten. “Does Object Recognition Work for
Everyone?” In: IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Long Beach, CA, USA: Computer Vision Foundation / IEEE, 2019, pp. 52–59. url: http://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/de_Vries_Does_Object_Recognition_Work_for_Everyone_CVPRW_2019_paper.html.
[16] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen. A Survey on Multimodal Large
Language Models. arXiv:2306.13549 [cs.CV]. 2023. doi: 10.48550/arXiv.2306.13549.
[17] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. “Sigmoid Loss for Language Image Pre-
Training”. In: IEEE/CVF International Conference on Computer Vision, ICCV. Paris, France:
IEEE, 2023, pp. 11941–11952. doi: 10.1109/iccv51070.2023.01100.
Table 6
Prompts for scene classification
Prompt set  Interior           Exterior
𝑃11         interior           exterior
𝑃12         interior scene     exterior scene
𝑃13         interior view      outdoors
𝑃14         indoor scene       outdoor scene
𝑃15         indoors            outdoors
𝑃16         inside a building  outside
𝑃17         inside a room      outside
Table 7
Scene category classification with different prompt sets
                                  Mean F1-score
Model   Transf. size  Patch size  𝑃11   𝑃12   𝑃13   𝑃14   𝑃15   𝑃16   𝑃17
CLIP    ViT-B         32          0.88  0.87  0.58  0.92  0.79  0.79  0.88
CLIP    ViT-B         16          0.87  0.88  0.37  0.76  0.71  0.73  0.86
CLIP    ViT-L         14          0.84  0.79  0.47  0.86  0.69  0.76  0.92
SigLIP  ViT-B         16          0.82  0.92  0.76  0.95  0.76  0.81  0.76
SigLIP  ViT-L         16          0.83  0.65  0.69  0.94  0.69  0.80  0.74
SigLIP  SO-400m       14          0.85  0.91  0.71  0.94  0.78  0.69  0.65
BLIP-2  ViT-L         14          0.71  0.80  0.81  0.88  0.87  0.80  0.83
Table 8
Prompts for viewpoint elevation classification
Prompt set  Ground        Raised              Aerial
𝑃21         ground        raised              aerial
𝑃22         ground view   raised view         aerial view
𝑃23         street level  view from building  view from airplane
𝑃24         ground view   elevated view       bird’s eye view
𝑃25         ground        elevated view       aerial
𝑃26         ground level  elevated view       aerial
𝑃27         ground level  elevated view       aerial view
A. Prompt Engineering
The appendix contains the prompt sets and the corresponding results for scene classification
(Tables 6–7) and viewpoint elevation classification (Tables 8–9).
Table 9
Viewpoint elevation classification with different prompt sets
                                  Mean F1-score
Model   Transf. size  Patch size  𝑃21   𝑃22   𝑃23   𝑃24   𝑃25   𝑃26   𝑃27
CLIP    ViT-B         32          0.50  0.40  0.51  0.50  0.68  0.70  0.73
CLIP    ViT-B         16          0.49  0.58  0.57  0.60  0.65  0.66  0.68
CLIP    ViT-L         14          0.54  0.61  0.67  0.55  0.67  0.70  0.70
SigLIP  ViT-B         16          0.60  0.70  0.54  0.49  0.30  0.29  0.34
SigLIP  ViT-L         16          0.55  0.52  0.52  0.69  0.49  0.52  0.66
SigLIP  SO-400m       14          0.51  0.49  0.59  0.61  0.39  0.39  0.55
BLIP-2  ViT-L         14          0.56  0.50  0.49  0.54  0.61  0.47  0.51