=Paper=
{{Paper
|id=Vol-3602/paper7
|storemode=property
|title=Exploring Naming Inventories for Architectural Elements for Use in Multi-modal Machine Learning Applications
|pdfUrl=https://ceur-ws.org/Vol-3602/paper7.pdf
|volume=Vol-3602
|authors=Ronja Utescher,Aaron Pattee,Ferdinand Maiwald,Jonas Bruschke,Stephan Hoppe,Sander Münster,Florian Niebling,Sina Zarrieß
|dblpUrl=https://dblp.org/rec/conf/comhum/UtescherPMBHMNZ22
}}
==Exploring Naming Inventories for Architectural Elements for Use in Multi-modal Machine Learning Applications==
Exploring Naming Inventories for Architectural Elements for Use in Multi-modal Machine Learning Applications

Ronja Utescher (1,3), Aaron Pattee (2), Ferdinand Maiwald (1), Jonas Bruschke (4), Stephan Hoppe (2), Sander Münster (1), Florian Niebling (4) and Sina Zarrieß (3,*)

(1) Friedrich-Schiller-University Jena, Fürstengraben 1, 07743 Jena, Germany
(2) Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, 80539 Munich, Germany
(3) Bielefeld University, Universitätsstraße 25, 33615 Bielefeld, Germany
(4) Julius-Maximilians-University of Würzburg, Sanderring 2, 97070 Würzburg, Germany

COMHUM 2022: Workshop on Computational Methods in the Humanities, June 09–10, 2022, Lausanne, Switzerland
* Corresponding author.

Abstract

Computer vision models are increasingly relevant and useful to Digital History. Alongside increasingly complex neural models, data and data selection are an integral part of this process. In this paper, we examine and extend the data collection practices of a major recent paper in the domain of architectural element classification. We collected an image-text dataset for a selection of 56 Baroque landmarks to be analysed in the same manner. This different architectural domain yielded insights into the transferability of the original model and data collection procedures. Notably, the architectural domain also has an impact on the availability of classes of architectural elements as well as on the performance of the models classifying them.

Keywords: Architecture, Art History, Computer Vision, Machine Learning

1. Introduction

The study of architectural art history has greatly benefited from innovative, computer-aided approaches in recent years. From high-resolution two-dimensional (2D) photos of building edifices to three-dimensional (3D) models of entire structures, these emerging techniques are laying the foundation for new methodologies in researching architecture [1, 2]. Given the three-dimensional nature of buildings, research projects have appropriately focused on techniques that produce digital replicas of their forms, such as Structure-from-Motion (SfM) photogrammetry and Terrestrial Laser Scanning (TLS) [3]. However, one critical aspect in the study of architecture has largely been dormant since the emergence of these technologies, namely the computer-aided description of architectural elements. 2D images and 3D models were the logical first steps in devising new methodologies for identifying architectural elements, providing precise calculations of their dimensions and structures, buttressed by the unique capability to virtually study a building. What normally follows is a traditional text description of the discoveries achieved by the employment of these techniques, relegating these digital applications to mere means to a more enlightened end. The lamentable result is a digital purgatory of 3D models awaiting their fate in repositories or online databases.

Figure 1: Examples from the facade and tower classes in Wikiscenes (top) and WikiscenesBaroque (bottom)
This paper presents one aspect of a larger project seeking to utilise 2D images and 3D models as essential components of a search engine for architectural elements. These digital objects can serve as reference points for future research in architectural art history and archaeology. What is required is a systematic identification of the elements themselves using text descriptors, so that the digital representations of the elements can be explored efficiently. In many ways, this avenue of research is an evolution of [4], which had expert and non-expert annotators choose phrases from longer textual descriptions with which to index paintings in a collection. For this purpose, we implement the methodology of recent research [5] as the foundation of the link between text and digital representation. The work by [5] represents an important step in integrating 3D models, collections of photographs, and written descriptions of a historic building using state-of-the-art machine learning methods. Given a multi-modal collection of cathedrals, the model learns to detect and classify 10 classes of architectural objects, for example portals and columns. These classes of objects are, however, limited to terms that are frequently associated with images in the original source.

In this paper, we take a closer look at the first two steps in the workflow of [5]: (1) curating images and their descriptions from large, open-source collections and (2) selecting the vocabulary of architectural elements for the machine learning model to classify. We argue that these are important design decisions that have ramifications for the output of the model down the line. Both [5] and the authors of this paper source their images from Wikimedia Commons. While Commons is a free and abundant source of images, the indexing and naming it provides for individual images is comparatively limited and unsystematic. Our case study, similar to that of [5], is abstract in that it is neither limited to a specific building nor to a design implemented by a specific architect. Rather, it covers a collection of Baroque monumental buildings, largely built between the late 17th and mid-18th centuries by a large network of architects, and ranging geographically from Portugal to Russia. In effect, we construct a new collection which parallels the one introduced in [5]. The majority of the collection consists of 2D images, though this could be supplemented in future research by existing high-resolution TLS models of historic buildings in the German city of Dresden, such as the iconic Zwinger.

For the purposes of this paper, we define a domain according to architectural style. [5] do not explicitly frame the issue this way, but we briefly discuss it here since it is important from a historical and architectural perspective. Wikiscenes and WikiscenesBaroque differ both in architectural style (Gothic/Baroque) and in building function (house of worship/residence of high-ranking dignitaries). [5] define their domain more by function, although the cathedrals lean towards the Gothic style.

The paper is structured as follows: In Section 2, we discuss the method of the paper, including the Wikiscenes data curation policy and classification model as well as our modifications to said policy. In Section 3, we examine the resulting new dataset, WikiscenesBaroque.
Section 4 details the setup and results of the experiments we conducted in order to assess the influence of this different data on the classification model. Finally, Section 5 summarises the lessons learnt from the case study.

2. Background

Our goal is to build a machine learning-based system that labels architectural elements in images of buildings of particular architectural styles, as assumed in the study of architectural art history. In Section 2.1, we take a brief look at the uses and requirements of machine learning methods in art history. Sections 2.2 and 2.3 describe the approach of [5] and our own, respectively.

Figure 2: Image with its Wikimedia Commons category hierarchy (Category:Hofburg > Category:Heldenplatz, Vienna > Category:Leopoldinischer Trakt) and its text description ("The northwestern facade of the Leopold Wing of the Hofburg Imperial Palace in Vienna."). The gold label facade is sourced from the description.

2.1. Image Classification for Architectural Categories

The documentation of historical built works often goes hand in hand with large collections of image materials, especially for landmarks which are popular objects of study. Machine learning (ML) methods have the potential to greatly benefit research on these historical landmarks, since they allow the automatic processing of large amounts of data. These ML applications can take the form of labelling data according to predefined categories, or of interactive settings like image retrieval. This paper focuses on modelling and data collection in classification settings.

Image and object classification models have undergone significant development in the last 10 years. Although there have been a number of new models for image classification and other vision tasks, most models in image processing use a convolutional neural network (CNN) as a backbone (cf. [6, 7, 8]). Image classification models generally work with a fixed classification vocabulary, requiring labelled training data. Datasets in digital history are comparatively small and difficult to obtain. There are standard vocabularies for categorising architectural elements, but these are not connected to the datasets. Domain specificity in classification models runs the gamut from models trained on generic vocabularies [9, 10, 11], to domain-specifically trained or fine-tuned models, to models trained on specific landmarks. Our approach does not rely on natural language processing as such, but on language in the sense of the vocabulary which we use to talk about real-world objects and their visual representations.

The method for creating the 3D reconstructions in [5] is unsupervised, except for the previously mentioned choice of input images for each model. The reconstruction tool, COLMAP, also creates a match between the 3D space of the model and the 2D images. [5] exploited these matches in their image classification and segmentation, using a loss function which rewards the model for consistent classification of points in the landmark across different images.

2.2. Landmarks and Image Categories in Wikiscenes

[5] use a combination of automatic data collection and manual refinement which utilises noisy but freely available data. The authors mine this data for semantic concepts which act as classes for classification and segmentation models based on [12]. This approach bypasses the need for costly manual annotation and exploits the inherent structure of the data.

[5] draw upon Wikimedia Commons as the source of image-and-text data. Any interested party can contribute images and add them to the appropriate Wikimedia Commons (WM) category. This serves as an alternative to other annotation paradigms, such as expert annotation or crowd-sourced annotation. The data is noisy in several ways: if users utilise the caption and description fields at all, the content and length tend to vary wildly. Still, these user-based annotations are more sophisticated than what could reasonably be collected from laypeople in a crowdworking setting. In this paper, we aim to go into detail about the curation process of a subset of Wikimedia Commons for a number of Baroque architectural objects. Wikimedia Commons provides us with a large number of user-uploaded images with open-source licensing. Users are given the opportunity to annotate their images with captions, descriptions, and a selection of other metadata such as the geolocation and camera specifications.
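To make this data source concrete, the following sketch retrieves the members of a landmark's category via the public MediaWiki API of Wikimedia Commons. This is our illustration of the general procedure, not the tooling used by [5]; the category name at the bottom is a placeholder.

```python
import requests

API = "https://commons.wikimedia.org/w/api.php"

def category_members(category, member_type="file|subcat"):
    """Yield titles of files and subcategories of a Wikimedia Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:  # no further result pages
            break
        params.update(data["continue"])  # continuation token for the next page

# Placeholder category name; actual landmark categories vary.
for title in category_members("Category:Zwinger, Dresden"):
    print(title)
```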
2.3. AE Classes in Wikiscenes

The classes in the original dataset are decided upon bottom-up: if an architectural element has a significant number of instances available for training, it is selected as one of the classes for the model. In other words, [5] compute the frequency of all terms from categories, descriptions and captions and manually select a number of architectural element classes from the most frequent terms. Figure 1 showcases examples of two of these classes, facade and tower.

[5] use image descriptions as well as Wikimedia Commons categories. On Wikimedia Commons, users can add descriptions or captions to their images. However, only a small portion of images have captions or descriptions. This makes category pages the most comprehensive source of text describing images. The images of a landmark are organised into subcategories, sub-subcategories, and so on. Figure 2 shows an example of a Wikimedia Commons image and its category tree. In this example, the facade label can be sourced from the image description. As described in [5], the Wikimedia Commons categories themselves are a significant source of concept terms: if a term is used in a category name, all direct member images of the category can be assigned the term as an architectural element (AE) class.
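To illustrate this bottom-up selection, the sketch below counts term frequencies over category names and descriptions. It is a minimal approximation of our reading of [5]; the tokenisation and the stop-word list are placeholder simplifications.

```python
from collections import Counter
import re

def candidate_ae_terms(records, top_k=50):
    """Count terms in category names and descriptions across image records.

    records: iterable of (category_names, description) pairs, one per image.
    Returns the top_k most frequent terms as candidates for manual selection.
    """
    stopwords = {"of", "the", "in", "and", "a", "von", "der"}  # placeholder list
    counts = Counter()
    for categories, description in records:
        text = " ".join(categories) + " " + (description or "")
        tokens = re.findall(r"[a-zäöüß]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(top_k)

# The most frequent terms (e.g. "facade", "tower") are then inspected by hand,
# and a subset is chosen as architectural element (AE) classes.
```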
3. Dataset

The original Wikiscenes dataset [5] contains data for 99 Gothic cathedrals. We build an analogous dataset of 56 Baroque landmarks, mostly palaces, and investigate how the approach of [5] transfers to modelling architectural elements in a different domain.

Table 1: Numbers of instances in the train and test sets for WikiscenesBaroque

        garden       facade       hall         statue       court
total   4291 (0.23)  3822 (0.20)  3344 (0.18)  2208 (0.12)  2064 (0.11)
train   3844 (0.23)  3548 (0.21)  2980 (0.18)  1969 (0.12)  1871 (0.11)
test     447 (0.24)   274 (0.15)   364 (0.20)   239 (0.13)   193 (0.10)

        stair        gallery      tower        column       fountain
total    929 (0.05)   712 (0.04)   657 (0.04)   375 (0.02)   318 (0.02)
train    829 (0.05)   645 (0.04)   580 (0.03)   323 (0.02)   271 (0.02)
test     100 (0.05)    67 (0.04)    77 (0.04)    52 (0.03)    48 (0.03)

3.1. Curating Landmarks and Image Categories

[5] build two point cloud models per landmark: one for the outside and one for the inside of the landmark. The inside model is computed from images in the "Interior of [landmark]" category and its leaf categories, while the outside model uses the "Exterior of [landmark]" and "Views of [landmark]" categories. These three subcategories are present throughout the original dataset's landmarks; however, we did not find this to be universal in our collection of Baroque landmarks. Out of the 56 landmarks, 22 had only one subcategory that matched the original inside/outside selection process. Overall, there were 38 "interior" categories and 23 "exterior" ones (14 "exterior" + 9 "views").

Table 2: Numbers of class instances in the original train set and in our cross-domain test set; for big sets, 1500 instances were randomly sampled for our test set

                    facade       window      chapel       organ       nave
Wu et al. (train)   1352 (.14)   874 (.09)   1050 (.11)   628 (.06)   1063 (.11)
cross-test          1500 (.42)    22 (.01)    191 (.05)   119 (.03)      0 (.00)

                    tower        choir       portal       altar       statue
Wu et al. (train)    932 (.10)   1029 (.11)   957 (.10)   956 (.10)    949 (.10)
cross-test           629 (.17)      0 (.00)   173 (.05)     0 (.00)    909 (.26)

We selected 56 palaces based upon their architectural similarities to the Zwinger in Dresden, all of which exhibit building phases in the late 17th and early 18th centuries. Additionally, the architects and designers of the selected buildings all had connections as part of a larger network in which ideas and designs were shared, as many of the major construction efforts occurred between 1700 and 1730 A.D. It was during this time that the Great Northern War raged, in which the Baltic and Black Seas, as well as the entire east of Europe, served as the battleground for the various armies. Many palatial architects were active military officers involved in constructing bastions and fortresses for and against artillery pieces. As a result, there was a major exchange of ideas and designs during this period which even translated into the construction plans of palaces.

Each landmark has its own hierarchy of categories in Wikimedia Commons. We use the set of all immediate subcategories - the landmark's category being the root - as the basis for compiling a list of terms to blacklist. For each of the 432 categories in this manual selection, we select all of its subcategories as well, unless they are in the blacklist. We use the blacklist sparingly, i.e. we include as many images as sensible in this step of the annotation. The blacklist consists of the terms stamp, in art, painting, collections, plans, aerial, panoramic, signs, maps, things, history and events, leading to the exclusion of 53 of the 432 subcategories. This list is aimed at excluding objects which are not part of the landmark but located in it (collections, paintings, signs) or associated with it (in art, stamp, things). We also exclude non-photographic images (plans, maps) and images with an atypical perspective (aerial, panoramic). Finally, the last two keywords in the blacklist exclude historical images and images depicting certain events. This allows us to utilise a larger set of images compared to only collecting the images uploaded in the "Exterior", "Interior" and "Views of" categories. This initial selection yields 85,629 images, while using the method of [5] would yield 28,794 images. From these images, we select instances of the AE classes for our cross-domain test set as well as our data for the Baroque-specific model (cf. Section 4).
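The following is a minimal sketch of this curation step, reusing the hypothetical category_members helper from Section 2.2. The substring-based blacklist test is our simplification of the procedure, not a verified reimplementation.

```python
BLACKLIST = ["stamp", "in art", "painting", "collections", "plans",
             "aerial", "panoramic", "signs", "maps", "things",
             "history", "events"]

def is_blacklisted(category_title):
    """Substring-based blacklist test (a simplification)."""
    lowered = category_title.lower()
    return any(term in lowered for term in BLACKLIST)

def collect_images(root_category):
    """Collect image titles from a landmark's category tree, skipping blacklisted branches."""
    images, queue, seen = [], [root_category], set()
    while queue:
        category = queue.pop()
        if category in seen:  # Commons category graphs can contain cycles
            continue
        seen.add(category)
        for title in category_members(category):
            if title.startswith("Category:"):
                if not is_blacklisted(title):
                    queue.append(title)
            else:
                images.append(title)
    return images
```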
3.2. Selecting AE Classes

We followed [5] and selected 10 AE classes from the most frequent terms present in the WM categories and descriptions. This led to a different set of classes than in the original dataset, as shown in Table 3. Out of the 10 AE classes, only 3 (facade, statue, tower) are present in both datasets. Out of the available images, we were able to link 17,115 with AE classes for training the model in the new domain setting (Section 4.2). In order to evaluate the original model of [5], we also construct an alternate test set using instances of the AE classes from the original Wikiscenes dataset. As listed in Table 2, the availability of instances in our Baroque domain is mixed. The classes nave and choir are entirely absent, while the data contains comparatively many statues and facades.

4. Experiments

In this section, we introduce the image classification model without 3D loss from [5] and use it to perform experiments in two novel classification setups.

4.1. Model

The model described in [5] performs two general steps: (1) for each landmark, generate a 3D model from photographs, and (2) assemble an architectural vocabulary for identifying elements of the landmark. The model of [5] is based on [13]. The implementation uses a ResNet-50 backbone [14] with ImageNet [15] pretraining. The backbone extracts both low-level and high-level image features, which are unified in a pooling layer followed by a Global Cue Injection (GCI) module. In [5], the 3D loss provides a small performance boost (3.3% across all images). In this paper, we generally forego the use of the 3D loss proposed by [5]. We argue that using the baseline classification model is adequate to judge the overall feasibility, since the 3D loss does not fundamentally change the results, instead giving a small overall boost in accuracy.

Table 3: Image-level classification precision w/o 3D loss on WikiscenesBaroque (WSB)

            garden  facade  hall  statue  court  stair  gallery  tower  column  fountain
precision     60.3    87.8  77.9    45.8   57.6   47.3     74.0   30.2    58.4      45.6

4.2. Classification Setups

We perform experiments in two classification setups. The Baroque domain setting is analogous to [5]; the cross-domain setting tests the performance of the original model on images from a different architectural style.

Baroque Domain. In the Baroque domain task, we train a new image classification model without 3D loss, using our collection of Baroque Wikimedia Commons images. We use a batch size of 16 and 10 epochs for training the model. Like [5], we use a 9:1 split on landmark level, so that the images seen at test time belong to unseen landmarks.
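As an illustration of this setup, the PyTorch sketch below mirrors the stated configuration (ResNet-50 backbone with ImageNet pretraining, batch size 16, 10 epochs, landmark-level 9:1 split). It is a simplified stand-in: the actual model of [5] adds feature pooling and a GCI module on top of the backbone, the multi-label loss is our assumption, and the DataLoader construction is omitted.

```python
import random
import torch
from torch import nn
from torchvision import models

NUM_CLASSES = 10  # garden, facade, hall, statue, court, stair, gallery, tower, column, fountain

def landmark_split(landmarks, ratio=0.9, seed=0):
    """9:1 split on landmark level: test images come from unseen landmarks."""
    landmarks = sorted(landmarks)
    random.Random(seed).shuffle(landmarks)
    cut = int(len(landmarks) * ratio)
    return landmarks[:cut], landmarks[cut:]

# ResNet-50 backbone with ImageNet pretraining, as in the baseline of [5].
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new classification head

criterion = nn.BCEWithLogitsLoss()  # assuming multi-hot image-level labels (floats)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train(train_loader, epochs=10):
    """Plain training loop; batch size 16 is set when building train_loader."""
    model.train()
    for epoch in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```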
Cross-Domain. In this experiment, we evaluate the performance of the original model by [5] on images of the landmarks in our data. We refer to this as the cross-domain setup because a model trained on Gothic cathedrals is used to classify images of Baroque palaces. We construct a cross-test set with instances from our 56 Baroque landmarks. In Table 2, we provide statistics on the distribution of classes in our data. Some classes had no or very few instances in the Baroque dataset, likely due to the domain itself. For the very common class facade, we randomly sampled 1500 instances for the test set. Analogously to [5], we evaluate by calculating the precision per class, as well as the mean average precision (mAP) over all instances and the mAP over all landmarks (mAP*).

Table 4: Image-level classification precision w/o 3D loss (baseline) on known Wikiscenes landmarks (WS-K), unknown Wikiscenes landmarks (WS-U) and our cross-domain test set

Test set    Model     mAP    mAP*   facade  window  chapel  organ
WS-K        baseline  70.8   77.7   87.2    89.2    60.2    89.7
WS-U        baseline  48.3   64.0   71.0    92.2    10.7    57.3
cross-test  baseline  44.2   60.0   88.9    50.8    20.7    30.5

Test set    Model     nave   tower  choir   portal  altar   statue
WS-K        baseline  85.8   64.1   61.5    68.0    50.0    52.0
WS-U        baseline  71.0   53.4   43.6    31.1    25.8    27.1
cross-test  baseline  -      55.4   -       29.5    -       33.8
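A sketch of this evaluation, assuming per-image prediction scores are available. We use scikit-learn's average_precision_score as a stand-in, and how mAP* aggregates per landmark is our assumption; [5] do not fully specify the metric implementation in the text.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(y_true, y_score, class_names, landmark_ids):
    """Per-class average precision, mAP over all instances, and mAP* over landmarks.

    y_true:       (n_images, n_classes) binary gold labels
    y_score:      (n_images, n_classes) model scores
    landmark_ids: (n_images,) landmark each image belongs to
    Assumes every class occurs at least once in the overall test set.
    """
    per_class = {name: average_precision_score(y_true[:, i], y_score[:, i])
                 for i, name in enumerate(class_names)}
    map_all = float(np.mean(list(per_class.values())))

    # mAP*: average precision computed per landmark, then averaged (our reading).
    per_landmark = []
    for lm in np.unique(landmark_ids):
        mask = landmark_ids == lm
        aps = [average_precision_score(y_true[mask, i], y_score[mask, i])
               for i in range(len(class_names))
               if y_true[mask, i].sum() > 0]  # skip classes absent at this landmark
        if aps:
            per_landmark.append(np.mean(aps))
    map_star = float(np.mean(per_landmark))
    return per_class, map_all, map_star
```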
4.3. Results

Baroque Domain. Figure 3 shows some examples of correct and incorrect judgements of the model. Note that while (a) has the gold label fountain and there is a fountain in the right foreground of the image, most of the space in the image is taken up by the facade of the building, which is the class predicted by the model. In Table 3, we report the precision of the model trained on WikiscenesBaroque (WSB) for each AE class. Similarly to the Wikiscenes baseline model (cf. Table 4), this model achieves a high precision for the class facade, but also for hall and gallery. Even though the WSB model is tested on unseen landmarks, its precision on statues is higher than that of the Wikiscenes baseline model (45.8% vs. 33.8%). The model performs worst for tower at 30.2%, and under 50% for statue (45.8%), fountain (45.6%) and stair (47.3%).

Cross-Domain. In Table 4, we report the mean average precision (mAP) for each AE class as well as across all images. In [5], the baseline model achieves a higher overall mAP on known landmarks (WS-K) than on images of unknown landmarks (WS-U), with 70.8% vs. 48.3%. The mAP for the cross-test set is lower than for the WS-U test set; however, the difference is much smaller (48.3% vs. 44.2%). In the cross-test set, the model performs worst for the chapel AE at 20.7%, which is substantially lower than the 60.2% in the WS-K set, but higher than the 10.7% in the WS-U set. On the other hand, the baseline model performs consistently well on the facade class. For the tower and portal classes, the model performance is very similar in the cross-test and WS-U sets.

Figure 3: Classification examples. (a) gold label: fountain, predicted: facade; (b) gold label: portal, predicted: tower; (c) gold label: portal, predicted: portal; (d) gold label: statue, predicted: statue.

5. Discussion

The automatic annotation of basic architectural elements is beneficial to the digital documentation of historical heritage sites. Architectural classification models like that of [5] are adept at utilising large amounts of already existing, noisily annotated data, taking concepts from the text annotations and locating them in a 3D model of the landmarks. Approaches like these nonetheless require the attention of the researcher handling the model. In this paper, we have examined the various stages of data curation and their effect on the model's performance. For more specific vocabularies, the text provided in Wikimedia Commons is not a suitable source of labelled data, as there are comparatively few instances which are directly annotated.

Training a model with more classes and more diverse training data is possible in principle. However, domain-specific models have the advantage of needing less computing power. Additionally, domain experts can advise during the data curation process and steer the model's AE classes towards what is of interest to them. The results of our experiments suggest that classification and detection of architectural elements work best within domain, with a moderate gap to the cross-domain performance. Images of known landmarks which are seen during training appear to be easier to classify overall, comparing the WS-K to the WS-U and cross-test sets as listed in Table 4. Architectural elements and their visual styles can be highly domain-specific, or even very individualised to particular buildings. For well-documented landmarks, it may well be feasible to have classification models solely for a particular landmark. However, less-resourced landmarks can still benefit from more general models.

Besides under-resourced landmarks, we want to point out the issue of more fine-grained AE classes as a topic for future research. The AE classes which can be mined from resources like Wikimedia Commons are useful for basic classification of images and segmentation of 3D models. There are a great number of distinctions of AE classes within the study of architectural art history, many of which are catalogued in resources like the Art & Architecture Thesaurus [16]. Building machine learning models which can identify more elaborate concepts would open up more detailed analysis and documentation of the landmarks and would be an interesting direction for future research.

6. Conclusion

In summary, this paper as well as [5] suggest that the practice of using online image collections, in particular Wikimedia Commons, yields data that can be used to train models which in turn classify architectural elements. We curate our own dataset, focusing on Baroque landmarks that were constructed in the late 17th and early 18th centuries. The experiments in Section 4 suggest that, in our case study, model precision decreases for images of unknown landmarks, especially if they come from a stylistically different domain.

Acknowledgements

This work was supported by a grant from the Federal Ministry of Education and Research (BMBF, grant No. 01UG2120).

References

[1] J.-E. Lutteroth, S. Hoppe, Schloss Friedrichstein 2.0: von digitalen 3D-Modellen und dem Spinnen eines semantischen Graphen, in: Computing Art Reader: Einführung in die digitale Kunstgeschichte, 2018, pp. 184–198.
[2] P. Sapirstein, Accurate measurement with photogrammetry at large sites, Journal of Archaeological Science 66 (2016) 137–145.
[3] N. Lercari, Terrestrial laser scanning in the age of sensing, in: Digital Methods and Remote Sensing in Archaeology, 2016, pp. 3–33.
[4] R. J. Passonneau, T. Lippincott, T. Yano, J. L. Klavans, Relation between agreement measures on human labeling and machine learning performance: results from an art history domain, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008.
[5] X. Wu, H. Averbuch-Elor, J. Sun, N. Snavely, Towers of Babel: Combining images, language, and 3D geometry for learning multimodal vision, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 428–437.
[6] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. URL: https://arxiv.org/abs/1409.1556. doi:10.48550/ARXIV.1409.1556.
[7] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. URL: https://arxiv.org/abs/1512.03385. doi:10.48550/ARXIV.1512.03385.
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. URL: https://arxiv.org/abs/2103.00020. doi:10.48550/ARXIV.2103.00020.
[9] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[10] H. Caesar, J. Uijlings, V. Ferrari, COCO-Stuff: Thing and stuff classes in context, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[11] J. Redmon, A. Farhadi, YOLO9000: Better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] N. Araslanov, S. Roth, Single-stage semantic segmentation from image labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4253–4262.
[13] N. Araslanov, S. Roth, Single-stage semantic segmentation from image labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4253–4262.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. URL: https://arxiv.org/abs/1512.03385. doi:10.48550/ARXIV.1512.03385.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[16] P. Harpring, Development of the Getty vocabularies: AAT, TGN, ULAN, and CONA, Art Documentation: Journal of the Art Libraries Society of North America 29 (2010) 67–72.