Towards Multimodal Computational Humanities. Using CLIP to Analyze Late-Nineteenth Century Magic Lantern Slides

Thomas Smits, Mike Kestemont
University of Antwerp, Prinsstraat 13, 2000, Antwerpen, Belgium

Abstract
The introduction of the CLIP model signaled a breakthrough in multimodal deep learning. This paper examines whether CLIP can be fruitfully applied to a (binary) classification task in the Humanities. We focus on a historical collection of late-nineteenth century magic lantern slides from the Lucerna database. Based on the available metadata, we evaluate CLIP's performance on classifying slide images into 'exterior' and 'interior' categories. We compare the performance of several textual prompts for CLIP to two conventional mono-modal models (textual and visual), which we train and evaluate on the same stratified set of 5,244 magic lantern slides and their captions. We find that the textual and multimodal models achieve a respectable performance (∼0.80 accuracy) but are still outperformed by a vision model that was fine-tuned to the task (∼0.89). We flag three methodological issues that might arise from the application of CLIP in the (computational) humanities. First, the lack of (need for) labelled data makes it hard to inspect and/or interpret the performance of the model. Second, CLIP's zero-shot capability only allows for classification tasks to be simulated, which makes it doubtful whether standard metrics can be used to compare its performance to text and/or image models. Third, the lack of effective prompt engineering techniques makes the performance of CLIP (highly) unstable.

Keywords
CLIP, classification, prompt engineering, multimodality, visual culture, magic lantern slides

CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands
thomas.smits@uantwerpen.be (T. Smits); mike.kestemont@uantwerpen.be (M. Kestemont)
ORCID: 0000-0001-8579-824X (T. Smits); 0000-0003-3590-693X (M. Kestemont)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction
Following the development of deep learning models that are trained on expressions of a single sensory modality, mostly hearing (text) and seeing (images), researchers have recently focused on multimodal applications: models that process and relate information from multiple modalities [10]. While there are many different multimodal configurations, Baltrušaitis et al. (2019) note that text to image description (and, conversely, image to text), where the model is trained on image and text combinations, has emerged as the primary task of the subfield [2]. In January 2021, the introduction of the CLIP (Contrastive Language-Image Pre-training) model signaled a breakthrough in the field of multimodal machine learning [12]. Trained on a dataset of 400M image/text pairs collected from the internet, CLIP must predict, given an image, which out of a set of 32,768 randomly sampled text snippets it was paired with in the dataset. Radford et al. (2021) suggest that CLIP approaches this task by identifying visual concepts in the images and associating them with textual descriptions [12]. As a result, the model can be applied to a wide variety of broad zero-shot 'text to image' and 'image to text' tasks.
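To make this pre-training objective concrete, the sketch below shows a simplified version of CLIP's symmetric contrastive loss, with a small batch standing in for the 32,768 sampled text snippets; the encoders are omitted and all names are illustrative, so this is a minimal sketch of the idea rather than the model's actual implementation.

```python
# Simplified sketch of CLIP's contrastive objective (Radford et al. 2021):
# given a batch of N image/text pairs, the model is trained so that the
# correct pairing scores highest among the N x N possible combinations.
# Batch size, dimensionality and the missing encoders are placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalise both embedding sets so the dot product is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix, scaled by a temperature
    logits = image_features @ text_features.T / temperature

    # The i-th image belongs with the i-th text: the diagonal is the target
    targets = torch.arange(logits.size(0))

    # Symmetric cross-entropy: classify the text given an image and vice versa
    loss_img = F.cross_entropy(logits, targets)
    loss_txt = F.cross_entropy(logits.T, targets)
    return (loss_img + loss_txt) / 2

# Toy example: a batch of 8 pairs with 512-dimensional embeddings
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

At inference time, the same image-text similarity scores can be ranked to pick the most plausible caption for an image, which is the mechanism behind the zero-shot classification explored below.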
While computer vision models have frequently been reported to outperform humans, they are optimized for performance on the specific task and data of the benchmark. As a result, their performance cannot be compared to the highly contextual vision of humans [15]. Radford et al. (2021) report that CLIP matches the performance of computer vision models on thirty existing computer vision benchmarks, such as ImageNet, without being trained on the data of these benchmarks. CLIP thus shows a high performance 'in the wild' on tasks and datasets for which it was not optimized via training [12].

Building on recent discussions about the 'visual digital turn' [17], audio-visual Digital Humanities [1] and the connection between multimodality theory and digital humanities research [5, 6, 14, 18], this paper examines the application of a multimodal model to a (binary) classification task in the humanities. We focus on a historical collection of 40K magic lantern slides from the late-nineteenth century. The set includes digital reproductions of the slides (as a flat image), the titles/captions (text), as well as metadata (year of publication, mode of production). Recently recognized as a highly multimodal medial form [8, 16, 19, 7], this collection of lantern slides provides an opportunity to evaluate the possible benefits of multimodal models for the (computational) humanities.

Based on the available metadata for the slides, we evaluate CLIP's performance on recognizing images of exterior/interior scenes. Although these categories seem purely visual in nature, multimodality theory argues that text, such as captions, plays a crucial role in producing them [3]. We compare the performance of CLIP to mono-modal text and image models, which we train and evaluate on a stratified set of 5,244 labelled magic lantern slides (and their captions) of exterior and interior locations. While the image model achieves the highest accuracy (∼0.898), we find that the best-performing textual prompt for CLIP ('exterior/interior') is competitive with the textual models (∼0.807 CLIP/∼0.806 BERT).

We flag three methodological issues that might arise from a possible widespread application of CLIP in the (computational) humanities. First, the lack of (need for) labelled data makes it hard to inspect and/or interpret the performance of the model. Second, even if labelled data is available, CLIP's zero-shot capability only allows for classification tasks to be simulated. As a result, it is doubtful whether accuracy and other standard metrics can be used to meaningfully compare CLIP to text and/or image models. Finally, the lack of methods to find the right, let alone the optimal, textual prompt(s) makes the performance of CLIP (highly) unstable. As a result, 'prompt engineering' [12] should be a major concern for future research that applies CLIP in the (computational) humanities.

This paper is part of the larger History of Implicit Bias project at the University of Antwerp, which applies machine learning to identify patterns of (implicit) bias in several nineteenth-century digital collections. Multimodal machine learning could provide a breakthrough for this kind of research, which seeks to analyze large-scale and complex patterns of meaning in (historical) data.
Models like CLIP could not only offer researchers the opportunity to study categories, such as 'the family', that are highly multimodal in nature, but also, in conjunction with mono-modal techniques, flesh out the distribution of different modalities in meaning-making. This exploratory paper tests the robustness of CLIP to provide a sound basis for such research in the future.

Figure 1: Example of the exterior category. 'Sphinxen-Allee, Karnack', slide 7 of Bilder aus Ägypten (year unknown).
Figure 2: Example of the interior category. 'Christie tells Treffy only a month', slide 11 of Christie's old organ (1875).
Figure 3: Example of a fictional exterior location. 'Poor Robin cannot fly', slide 5 of Down by the ferry (1903).
Figure 4: Example of a fictional exterior location. 'The girl's footprints...', slide 24 of The two golden lilies (1893).
1 Copyright of Figures 1-4. Reproduced by permission via Lucerna Magic Lantern Web Resource. Figure 1: private collection, digital image © 2016 Anke Napp. Figure 2: The Hive, digital image © 2018 Worcestershire County Council. Figure 3: Philip and Rosemary Banham Collection, digital image © 2016 Philip and Rosemary Banham. Figure 4: private collection, digital image © 2006 Ludwig Vogl-Bienek.

2. Material and methods
The study of the magic lantern has been stimulated by the increasing digital accessibility of lantern slides. The Lucerna Magic Lantern Web Resource was the first digital repository of digitized lantern slides. At the time of writing, it contained 42,019 digital slides, up from 38,000 in 2019, most of them uploaded and annotated by Lucerna's founder Richard Crangle [7]. We collected the digitized slides, their captions and several other metadata fields. The resulting dataset contains the URL, filename, title, year of publication, format, people connected to the slide, type of image, dimensions, materials, production process, person shown, image content tags, image location and collection for 42,019 slides (dataset to be released with the camera-ready paper).

To compare the performance of CLIP to mono-modal models on the exterior/interior classification task, we used the 'type of image' field to produce a stratified .60/.20/.20 train, validation and test split of exterior and interior images with captions. As Table 1 shows, Lucerna's slides were manually labelled for several types describing the physical setting captured on the slide. We combined the types 'photograph of exterior location' and 'photograph of life models in exterior location' to collect slides showing exterior locations (Fig. 1), and the 'photograph of interior location' and 'photograph of life models in interior location' types to collect slides of interior locations (Fig. 2). Initially, we also included the 'photograph of life models in studio set' type in the collection of interior slides. However, as Fig. 3 and Fig. 4 show, this category often contains fictional 'outdoor' scenes.1 This demonstrates that seemingly binary categories, such as outdoor/indoor, often prove to be far less rigid in actual practice. To enable comparison to a purely textual model, we only included slides with captions, discarding those without captions or with frequently recurring or generic ones, such as 'Intro(duction)' or 'Title'. To create a balanced set, we included all the remaining slides of the interior category (2,622) and an equally-sized random sample of slides from the exterior category (5,244 slides in total).
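As an illustration, the selection, balancing and splitting described above can be sketched in a few lines of pandas and scikit-learn; the file name and column names below are assumptions about the Lucerna export rather than the actual fields used in our pipeline.

```python
# Illustrative sketch of the dataset construction described above.
# File and column names ("type of image", "title") are assumptions, not
# the exact identifiers used in our pipeline.
import pandas as pd
from sklearn.model_selection import train_test_split

slides = pd.read_csv("lucerna_metadata.csv")  # hypothetical export of the 42,019 records

exterior_types = {"photograph of exterior location",
                  "photograph of life models in exterior location"}
interior_types = {"photograph of interior location",
                  "photograph of life models in interior location"}

slides["label"] = slides["type of image"].map(
    lambda t: "exterior" if t in exterior_types
    else "interior" if t in interior_types else None)
slides = slides.dropna(subset=["label"])

# Keep only slides with an informative caption
generic = {"Intro", "Introduction", "Title"}
slides = slides[slides["title"].notna() & ~slides["title"].isin(generic)]

# Balance the classes: all interior slides plus an equally sized exterior sample
interior = slides[slides["label"] == "interior"]
exterior = slides[slides["label"] == "exterior"].sample(n=len(interior), random_state=42)
balanced = pd.concat([interior, exterior])

# Stratified .60/.20/.20 train, validation and test split
train, rest = train_test_split(balanced, test_size=0.4,
                               stratify=balanced["label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5,
                             stratify=rest["label"], random_state=42)
```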
We compared the zero-shot performance of CLIP for several (apparently) binary prompts (Table 2) to a visual and a textual model (Table 3). The main advance of CLIP is that it does not need labeled training data to achieve competitive performance on a wide variety of classification tasks. However, this zero-shot capability means that we can only simulate a classification task. First, textual prompts have to be picked that are (apparently) mutually exclusive terms, phrases, or sentences. However, this does not exclude the possibility that both prompts are (un)likely textual descriptions of the same image. In contrast to models that are trained for a binary classification task, we do not ask CLIP a single question (Is this A or B?) but rather normalize the answers to two questions (Is this A?/Is this B?). Following earlier work, to calculate the accuracy of CLIP on a classification task, we use the softmax function to normalize the output of the model for the two prompts into a single probability distribution. While most deep learning models use softmax to normalize their output into a probability distribution, we argue that its application is conceptually different in the case of CLIP.
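As an illustration of this procedure, the following sketch scores one image against the two prompts and normalizes the two similarities with a softmax. It uses the Hugging Face CLIP interface (OpenAI's own clip package works analogously); the checkpoint name and the file name are placeholders rather than our exact setup.

```python
# Minimal sketch of the two-prompt, zero-shot "simulated" classification:
# CLIP scores each prompt against the image and a softmax turns the two
# similarities into a single probability distribution. Checkpoint and file
# names are placeholders, not the exact configuration used in the paper.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["exterior", "interior"]   # the best-performing prompt pair (Table 2)
image = Image.open("slide.jpg")      # a digitized lantern slide

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each prompt;
# softmax normalizes the two similarities into one probability distribution
probs = outputs.logits_per_image.softmax(dim=-1)
prediction = prompts[probs.argmax().item()]
print(prediction, probs.tolist())
```

Note that both prompts always receive a similarity score, so the 'decision' is produced entirely by the normalization; this is the conceptual difference flagged above.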
To compare CLIP's zero-shot capabilities to mono-modal models, we used relatively simple transfer learning methods. For the vision model, we applied the fast.ai framework to train a ResNet 18, a relatively simple convolutional neural network, pretrained on the ImageNet dataset. Instead of manually selecting hyperparameters, for example by determining the learning rate, we resorted to fast.ai's default finetune method and its default parameters (for four epochs).

For the text-only model, we first used a run-of-the-mill text classification approach [13], implemented in the established scikit-learn framework [11]. We represented the documents in the train and test sets under a bag-of-words model. All features were normalized via the standard TF-IDF procedure (fitted on the training data only) to boost the weight of document-specific features. We report results for a word unigram model and a character trigram model. We applied a single-layer, linear classifier that is optimized via gradient descent to minimize a log-loss objective. We did not optimize the hyperparameter settings and resorted to the defaults with an unpruned vocabulary (4,290 word unigrams; 6,358 character trigrams). The captions are primarily in English, but there are some rare instances of other Western European languages (Dutch or German), which were not explicitly removed in order to increase the realism of the task.

Table 1: Absolute frequency distribution for the 'Type of image' field in Lucerna's full, original metadata. Only categories marked by an asterisk (*) were included.
*photograph of exterior location: 17,064
drawing / painting / print: 11,789
photograph of life models in studio set: 4,705
photograph: 2,367
*photograph of interior location: 2,075
*photograph of life models in exterior location: 1,473
unknown: 1,318
text: 523
*photograph of life models in interior location: 194
photograph of life models: 174
NA: 133
photograph of studio set: 78
drawing / painting / print of exterior location: 59
other: 40
unknown of exterior location: 20
physical object: 6
text of exterior location: 1
Total: 42,019

We supplemented this, potentially naive, classifier with a generic, pretrained BERT model for sequence classification. We started from the uncased, multilingual model from the Transformers library, which we finetuned as a binary exterior/interior classifier on the training set (monitoring on the development set via early stopping) and evaluated on the test set. The motivation for this was twofold. First, because we could bootstrap from a pretrained model, we expected it to capture more subtle semantic aspects of the textual descriptions that are not obvious from the lexical surface level (e.g. synonyms). Second, we started from the multilingual model that is available for this architecture because our data is not exclusively monolingual, which could have given the BERT classifier a modest edge. A drawback of this neural approach is that model criticism through feature inspection is less straightforward.
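The bag-of-words baseline, by contrast, can be inspected directly. The following is a minimal, illustrative sketch of that baseline as described above (TF-IDF-weighted word unigrams and a log-loss linear classifier); the toy captions and default settings are stand-ins for the actual data and configuration, and the final lines show how the top-weighted features per class, of the kind visualized in Fig. 5, can be read off the fitted coefficients.

```python
# Illustrative sketch of the bag-of-words baseline: TF-IDF word unigrams
# and a single-layer linear classifier trained with log loss via SGD.
# The toy captions/labels stand in for the real train and test splits.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

train_captions = ["Interior of Exeter cathedral", "View from the bridge"]
train_labels = ["in", "ex"]
test_captions, test_labels = ["The nave and choir"], ["in"]

# analyzer="char", ngram_range=(3, 3) would give the character trigram variant
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_captions)   # fitted on training data only
X_test = vectorizer.transform(test_captions)

clf = SGDClassifier(loss="log_loss", random_state=42)
clf.fit(X_train, train_labels)
print("accuracy:", clf.score(X_test, test_labels))

# Feature inspection: the most negative/positive coefficients are the
# strongest lexical cues for either class (cf. Figure 5)
features = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("strongest cues for", clf.classes_[0], ":", features[order[:15]])
print("strongest cues for", clf.classes_[1], ":", features[order[-15:]])
```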
3. Results
Table 3 shows that all models achieve a respectable accuracy, but that the vision-only model outperforms both CLIP and the textual models (almost a ∼50% error reduction). While it is a major advantage that CLIP does not require the labor-intensive and time-consuming process of producing labelled data and of training and fitting models, the model is not competitive for this specific classification task. It depends on the questions of the humanities researcher whether a (possible) loss in accuracy is problematic. Researchers will have to decide whether possible improvements in accuracy warrant the investment needed to produce labelled data. Next to this pragmatic consideration, we argue that multimodal models also come with a new set of pitfalls. By comparing the performance of the text, vision and multimodal models, we flag three issues.

Table 3: Accuracy of the textual, visual and multimodal models on the test set.
Textual (word unigrams): 0.798
Textual (character trigrams): 0.777
Textual (BERT): 0.806
Visual (ResNet 18, ImageNet weights): 0.898
Multimodal (CLIP, 'exterior/interior'): 0.807

Table 2: Accuracy of prompts on the exterior/interior categories. Each row gives, for a prompt pair, the accuracy on the exterior slides, on the interior slides, and on all slides.
exterior/interior: 0.902, 0.711, 0.807
a photograph of an exterior location / a photograph of an interior location: 0.717, 0.877, 0.797
outside/inside: 0.609, 0.931, 0.769
outdoor/indoor: 0.498, 0.964, 0.730
outdoors/indoors: 0.668, 0.944, 0.806
exterior/indoor: 0.768, 0.577, 0.673
street/interior: 0.501, 0.898, 0.699

Starting with the performance of CLIP, Table 2 shows that different prompts lead to different accuracy scores: from ∼0.96 for 'indoors' in the 'outdoors/indoors' prompt to worse than guessing, ∼0.49 for 'outdoor' in the 'outdoor/indoor' prompt. Similar to the 'prompt engineering' discussions surrounding GPT-3 [4], Radford et al. (2021) note that determining the right prompt(s) can significantly improve the performance of CLIP [12]. The difference in accuracy between 'outdoor' and 'outdoors' (Table 2) is a good example of this.

In relation to prompt engineering, Radford et al. (2021) note that images are rarely paired with a single word [12]. As a result, they suggest that prompts that include contextual information achieve higher accuracy on several benchmarks. For example, 'a photograph of a German Shepherd, a type of dog' performs better than 'German Shepherd'. For our classification task, which seeks to distinguish between two high-level visual concepts that are themselves already contextual, it is unclear what kind of information could improve the prompts. For example, the difference in performance between 'exterior/interior' and 'a photograph of an exterior/interior location' is limited (Table 2).

Figure 5: Top-scoring 15 weights for either class (ex/in) from the linear model for the token unigrams.

The limited increase in accuracy of adding 'a photograph of' to the prompts might be partly a result of the 'temporal bias' [15] of CLIP. The model was trained on 400M combinations of high-definition photographs and texts extracted from the internet. Although all the slides in our set are photographs, they look very different from present-day images made by high-definition cameras. The fact that a large number of them are colored in (Fig. 1) might be the most striking visual difference. CLIP might not recognize (all of) our images as photographs, making it less beneficial to add this information to the prompts.

Looking at Table 2, we hypothesized that combining high-performing words or snippets from different prompts might lead to better results. However, this is not the case. While 'exterior' achieves high accuracy in the 'exterior/interior' prompt, its performance drops when combined with 'indoors,' which achieved high accuracy in the 'outdoors/indoors' combination and experiences an even more dramatic drop in accuracy when combined with 'exterior' (Table 2). This can be explained by the fact that we normalize the output of the model for two prompts into a single probability distribution.

Regarding the textual models, a number of observations can be made. First, they score on par with the multimodal model, which is striking because the latter was neither trained nor finetuned on this specific dataset and task. Second, the visual model outperforms the textual models, suggesting that the textual modality is less relevant for this classification task. Interestingly, the word unigram model outperforms the one based on character trigrams: this is an atypical result for a common text classification task and suggests that most of the useful cues in the title data are actually realised at the simple lexical level of atomic tokens. The visualization (Fig. 5) of the word unigram model's highest weights for either class supports this hypothesis. Apart from the telltale feature 'interior', the indoor vocabulary is dominated by lexis related to the interior of church buildings ('misericord', 'nave', 'choir', etc.); Exeter cathedral, in particular, might be over-represented in the data. The outdoor vocabulary, on the other hand, clearly points to more panoramic, landscape-related or aquatic features (e.g. 'bridge', 'lake', 'canal', 'harbour') or urban scenery (e.g. 'street', 'town', 'gate'). The fixed expression 'view from' is also recognized by the model as a powerful lexical predictor of the exterior category. The fact that clear lexical clues are doing all the hard discriminatory work is also suggested by the unimpressive performance of BERT: given its pretrained nature, and in spite of the limited size of the training data set, we expected BERT to be able to harness at least some of its pre-existing linguistic knowledge, but that hardly seems to be the case. Concerning prompt engineering, we hypothesized that the highest weights for the two classes might yield relevant prompts for CLIP. However, as Table 2 shows, the combination street/interior does not lead to particularly good results.

Next to looking at the accuracy metric, we can use the top errors of CLIP and the visual model to compare them (Fig. 6); a sketch of how these are extracted follows below.
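Extracting these top errors is straightforward once per-slide probabilities are available. The sketch below is illustrative rather than our exact analysis code: it assumes aligned arrays of gold labels and predicted probabilities for the 'interior' class and ranks the misclassified slides by the model's confidence.

```python
# Rank misclassified test items by the model's confidence to obtain the
# "top errors" of the kind shown in Figure 6. `y_true` and `p_interior` are
# assumed to be aligned arrays of gold labels and predicted P(interior).
import numpy as np

def top_errors(y_true, p_interior, k=4):
    y_true = np.asarray(y_true)
    p_interior = np.asarray(p_interior)
    y_pred = np.where(p_interior >= 0.5, "interior", "exterior")
    confidence = np.where(y_pred == "interior", p_interior, 1 - p_interior)
    wrong = np.flatnonzero(y_pred != y_true)
    # Sort the errors by descending confidence and keep the k worst offenders
    worst = wrong[np.argsort(-confidence[wrong])][:k]
    return [(int(i), y_pred[i], y_true[i], float(confidence[i])) for i in worst]

# Toy example; each tuple reads: index, prediction, actual, probability
print(top_errors(["interior", "exterior", "interior"], [0.05, 0.98, 0.60]))
```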
Clearly, the models have difficulties with different kinds of slides. The errors of the vision model seem to result from a lack of sky. The top error of CLIP is the result of mislabeling: while its caption ('in a Javanese home') suggests the interior category, the image shows a family outside their house. CLIP wrongly attributed the other images to the exterior category, while they show details inside Exeter cathedral.

4. Discussion
Multimodal models hold the promise of a 'practical revolution' in computational humanities research [9]. Instead of spending time (and money) on labelling datasets and training and fitting models, the zero-shot capabilities of CLIP could leave researchers free to apply deep learning techniques to more and different kinds of research questions and to focus on the interpretation of results rather than on the methods themselves. However, while CLIP has been shown to be competitive on a large number of benchmarks, this paper demonstrates that this is not necessarily a given for all classification tasks. Relatively simple and easy-to-apply mono-modal models might significantly outperform CLIP for specific tasks. The fact that any textual prompt will yield a result, when not properly thresholded, might lead humanities scholars to expect too much. Future research should develop standardized practices to assess whether results obtained with CLIP are reliable and meaningful. The fact that classification tasks can only be tackled indirectly, as we show in this exploratory paper, could pose a significant hurdle for future work. Traditional metrics, such as accuracy, might not be suitable to compare the performance of CLIP to other models. In line with this, the performance and reliability of CLIP could be significantly improved by better and more stable prompt engineering.

Acknowledgments
We would like to thank Ruben Ros for his help with the data collection and Melvin Wevers for helping to set up CLIP.

References
[1] T. Arnold, S. Scagliola, L. Tilton, and J. V. Gorp. "Introduction: Special Issue on AudioVisual Data in DH". In: Digital Humanities Quarterly 15.1 (2021).
[2] T. Baltrušaitis, C. Ahuja, and L.-P. Morency. "Multimodal Machine Learning: A Survey and Taxonomy". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41.2 (2019), pp. 423–443. doi: 10.1109/tpami.2018.2798607.
[3] J. A. Bateman. Text and Image: A Critical Introduction to the Visual/Verbal Divide. London; New York: Routledge, 2014.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. "Language Models Are Few-Shot Learners". In: arXiv:2005.14165 [cs] (2020). arXiv: 2005.14165 [cs].
[5] T. Hiippala. "Distant Viewing and Multimodality Theory: Prospects and Challenges". In: Multimodality & Society (2021). doi: 10.1177/26349795211007094.
[6] T. Hiippala and J. A. Bateman. "Semiotically-Grounded Distant Viewing of Diagrams: Insights from Two Multimodal Corpora". In: arXiv:2103.04692 [cs] (2021). arXiv: 2103.04692 [cs].
[7] J. Kember. "The Magic Lantern: Open Medium". In: Early Popular Visual Culture 17.1 (2019), pp. 1–8. doi: 10.1080/17460654.2019.1640605.
[8] F. Kessler and S. Lenk. "Projecting Faith: French and Belgian Catholics and the Magic Lantern Before the First World War". In: Material Religion 16.1 (2020), pp. 61–83. doi: 10.1080/17432200.2019.1696560.
[9] B. Nicholson. "The Digital Turn". In: Media History 19.1 (2013), pp. 59–73.
[10] L. Parcalabescu, N. Trost, and A. Frank. "What Is Multimodality?" In: arXiv:2103.06304 [cs] (2021). arXiv: 2103.06304 [cs].
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. "Learning Transferable Visual Models From Natural Language Supervision". In: arXiv:2103.00020 [cs] (2021). arXiv: 2103.00020 [cs].
[13] F. Sebastiani. "Machine Learning in Automated Text Categorization". In: ACM Comput. Surv. 34.1 (2002), pp. 1–47. doi: 10.1145/505282.505283.
[14] T. Smits and R. Ros. "Quantifying Iconicity in 940K Online Circulations of 26 Iconic Photographs". In: Proceedings of the Workshop on Computational Humanities Research (CHR 2020). Ed. by F. Karsdorp, B. McGillivray, A. Nerghes, and M. Wevers. Vol. 2723. Amsterdam: CEUR-WS, 2020, pp. 375–384.
[15] T. Smits and M. Wevers. "The Agency of Computer Vision Models as Optical Instruments". In: Visual Communication, Online First (2021). doi: 10.1177/1470357221992097.
[16] K. Vanhoutte and N. Wynants. "On the Passage of a Man of the Theatre through a Rather Brief Moment in Time: Henri Robin, Performing Astronomy in Nineteenth Century Paris". In: Early Popular Visual Culture 15.2 (2017), pp. 152–174. doi: 10.1080/17460654.2017.1318520.
[17] M. Wevers and T. Smits. "The Visual Digital Turn. Using Neural Networks to Study Historical Images". In: Digital Scholarship in the Humanities 35.1 (2020), pp. 194–207. doi: 10.1093/llc/fqy085.
[18] M. Wevers, T. Smits, and L. Impett. "Modeling the Genealogy of Imagetexts: Studying Images and Texts in Conjunction Using Computational Methods".
[19] D. Yotova. "Presenting "The Other Half": Jacob Riis's Reform Photography and Magic Lantern Spectacles as the Beginning of Documentary Film". In: Visual Communication Quarterly 26.2 (2019), pp. 91–105. doi: 10.1080/15551393.2019.1598265.

Figure 6: Top 4 errors (prediction, actual, probability) for CLIP (column 1) and the visual model (column 2). CLIP: exterior, interior, 0.995; exterior, interior, 0.984; exterior, interior, 0.981; exterior, interior, 0.980. Visual model: interior, exterior, 1.000 (all four errors).
Copyright of Figure 6, column 1 (top to bottom), reproduced by permission via Lucerna Magic Lantern Web Resource: Digital image © 2017 Manchester Museum / Digital image © 2014 Royal Albert Memorial Museum and Art Gallery, Exeter / Digital image © 2015 Royal Albert Memorial Museum and Art Gallery, Exeter / Digital image © 2015 Royal Albert Memorial Museum and Art Gallery, Exeter.
Copyright of Figure 6, column 2 (top to bottom), reproduced by permission via Lucerna Magic Lantern Web Resource: Digital image © 2019 Manchester Museum / Digital image © 2015 Royal Albert Memorial Museum and Art Gallery, Exeter / Digital image © 2014 Royal Albert Memorial Museum and Art Gallery, Exeter / Digital image © 2018 Worcestershire County Council.