<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Automatic Classification of Portraits: Application of Transformer and CNN Based Models for an Art Historic Dataset</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sebastian</forename><surname>Diem</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Hildesheim</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Mandl</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Hildesheim</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Automatic Classification of Portraits: Application of Transformer and CNN Based Models for an Art Historic Dataset</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">828898E80FAB399724686B5E212640DB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Digital Humanities</term>
					<term>Portraits</term>
					<term>Image Processing</term>
					<term>Deep Learning</term>
					<term>CNN</term>
					<term>Computer Vision</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This research compares the performance of a Visual Transformer to a ResNet50 on a small art historical dataset. The ResNet is a widely used model based on Convolutional Neural Networks (CNNs) and has achieved good performance in a variety of computer vision experiments. Our experiments show how the relatively novel Visual Transformer performs compared to ResNet50 on a dataset from the Digital Humanities. We use a large collection of portraits from the 15th to the 19th century and select the 10 most frequent artists for a classification task. Portraits reveal social values and artistic styles over the centuries. Like many other collections in the Humanities, they lack annotations and require automatic methods for generating metadata. We observe that the Visual Transformer achieves a top-1 accuracy of 87.09 %, in contrast to the ResNet's 46.13 % accuracy. Analysing features like the printing technique and the active period of the artist in question shows that these features could be important for explaining the models' inference process. Other features, like the portrait type, seem to have less impact. To further analyze the performance of the models, we applied the Centered Kernel Alignment method, Gradient-weighted Class Activation Maps (GradCAMs) and attention map visualizations. On the one hand, the importance of the printing technique is further emphasized when visualizing the models' hidden layers, where both models seem to attend to the portrait backgrounds, as these parts may make the distinctive printing patterns easiest to distinguish. On the other hand, the Visual Transformer tends to focus on the portrayed person, who seems to be important for the artist classification.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>[LWDA'23: Lernen, Wissen, Daten, Analysen. October 09-11, 2023, Marburg, Germany. diem@uni-hildesheim.de (S. Diem); mandl@uni-hildesheim.de (T. Mandl). ORCID: 0009-0000-9196-4814 (S. Diem); 0000-0002-8398-969 (T. Mandl)]</p><p>Since the Iconic Turn, research involving pictures and visual media has been established in the Digital Humanities. Images are very important in the spread of knowledge. For example, the invention of lithography in the 19th century resulted in declining manufacturing costs for printed images, giving an increasing number of people access to a wealth of visual information <ref type="bibr" target="#b0">[1]</ref>. In the Digital Humanities, the potential of image processing capabilities has also gained significance. As libraries and museums further digitize their collections, more researchers gain access to art historical data for conducting experiments <ref type="bibr" target="#b1">[2]</ref>. Diachronic developments within image collections are particularly fascinating, because research might reveal trends in stylistic and aesthetic representation. The creation of suitable tools and techniques for distant viewing, i.e. the automatic analysis of massive volumes of visual data using computer vision technologies, is crucial for the Digital Humanities. Often, the tasks and collections within the Digital Humanities are not well suited for generating annotations. However, future search and analysis systems need to provide many more options than current tools. As a consequence, the automatic generation of metadata for large datasets is one way to improve the research opportunities within image collections. This is also necessary to overcome critical positions within the Humanities toward digital methods (e.g. <ref type="bibr" target="#b2">[3]</ref>). Such automatically generated metadata can be used in retrieval tools for more flexible access.
However, there is still much doubt about the quality of such data.</p><p>In this research, the performance of a Transformer based model is compared to the results of an established CNN model on an art historical dataset. Previously, Convolutional Neural Networks were the state of the art for a variety of Computer Vision tasks. The introduction of the Transformer architecture to the field of Computer Vision might represent an alternative to CNNs, as models like the Visual Transformer achieve results comparable to modern CNNs <ref type="foot" target="#foot_0">1</ref> . In order to observe how the new Transformer based models perform on an art historical dataset, this work compares the Visual Transformer with the widely used, CNN-based ResNet50. Additionally, this research uses a small custom art historical dataset to compare top artists who created printed portraits.</p><p>We utilize different methods to explain which features are important for the individual models in their classification process. It can be observed that the printing technique used influences the prediction quality, as different printing techniques have significantly different detection rates. This is supported by visualizations of different hidden layers, where lower layers attend to these local features. Furthermore, the epoch in which an artist was active seems to be an important factor, as artists active at the same point in time are harder to distinguish than others. In the related work section, we introduce the medium of printed portraits before looking at related art historical applications of computer vision models. Lastly, we outline the current situation in computer vision following the introduction of the Transformer architecture. The experiment setup explains how the dataset was created, how the models were trained, and how we evaluate the results, before we present and discuss the findings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Printed Portraits</head><p>The medium of printed portraits only began to be acknowledged in scientific studies in the last few decades, as it was previously regarded as mere copies of popular artworks. In early modern Europe (1450 until the end of the 18th century), printed portraits were a widespread medium. Over time, the profession of the printer was established <ref type="bibr" target="#b3">[4]</ref>. Printers were mainly regarded as illustrators and produced commissioned work without necessarily being painters themselves. This led to increased production of printed portraits, as more social groups could afford portraits of themselves (e.g. aristocrats, scholars, craftsmen and wealthy citizens). This popularized the trend of collecting and trading portraits among individuals and even led to portraits being cut out of books to expand private collections, thus removing them from their original historical context <ref type="bibr" target="#b3">[4]</ref>. With this acknowledgement, more research has been conducted to analyse printed portraits. One of the most popular forms of analysing portraits is iconography. Iconography analyses the content and style of an image and interprets it to gain historical and art historical insights. One interesting observation is how popular motifs change over the centuries as the zeitgeist changes conventions of depiction <ref type="bibr" target="#b4">[5]</ref>. This can best be seen in the reinterpretation of popular motifs, where the clothing or gestures stay mostly the same while other elements of the artwork are changed to suit the tastes of the current epoch <ref type="bibr" target="#b5">[6]</ref>. In this form of analysis, the visual representation of a motif can only be described.
This extraction of information limits the comparability of visual elements, as they might possess more nuanced information that gets lost in a description.</p><p>For visual interpretation, an attempt is made to contextualize a portrait based on recurring features or other conventions of representation typical for a period. Different epochs usually share comparable representations, such as the occurrence of certain objects or clothes <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b5">6]</ref>. These and other typical elements can give insights into the social status of the depicted person or the stylistic conventions of the time. In Figure <ref type="figure" target="#fig_0">1</ref> this can be seen for different depictions of scholars. All of them have similar clothes and objects which show their status in society. Often, further insights can be gained about the origins of a portrait based on the recurrence of objects and other elements. Printers often reused elements like the portrait frame or common objects like books to reduce production time and cost. Additionally, different printing techniques have been used over the years, from wood and copper engravings to advanced techniques like lithography, which are distinguishable by their individual properties (e.g. wood grain).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Computer Vision</head><p>Computer Vision has advanced significantly in recent years, particularly due to progress in deep learning and representation learning. These data-driven methods have been successful for a variety of tasks and frequently perform better than conventional image processing techniques focusing, for instance, on color and form analysis. Contemporary Deep Learning techniques identify pertinent features in images and learn their own representation schemes <ref type="bibr" target="#b6">[7]</ref>. Multi-layer Convolutional Neural Networks (CNNs) in particular have proven quite successful <ref type="bibr" target="#b7">[8]</ref>.</p><p>The ResNet architecture, especially ResNet50, is currently one of the most relevant models in Computer Vision. It was developed in 2015 and has since been used as a baseline in a variety of research papers <ref type="bibr" target="#b8">[9]</ref>. It introduced the skip connection and used the ReLU activation function for its hidden layers to achieve state of the art results with 76.1 % top-1 accuracy on the ImageNet dataset. In 2021, a research team revisited the original ResNet50 architecture and used novel optimization and data augmentation techniques without changing the architecture to achieve a top-1 accuracy of 80.4 % on the ImageNet dataset, which emphasizes its relevance even today <ref type="bibr" target="#b9">[10]</ref>.</p><p>The Visual Transformer (ViT) was presented in late 2020 and introduced the successful Transformer architecture to Computer Vision. With its introduction in the field of Natural Language Processing, the Transformer architecture, with well-known models like BERT or GPT, became the new standard for a variety of tasks. As the Transformer was built to handle sequences of text, the architecture of the ViT differs considerably from that of a Convolutional Neural Network (CNN).
The ViT splits an image into 16x16 pixel patches and realigns them into a sequence of patches. Spatial information is retained by the position embedding of each patch. It uses the self-attention function to focus on parts of the image and has multiple attention heads per layer. The ViT-H (Huge) variant achieved 88.55 % top-1 accuracy on the ImageNet dataset <ref type="bibr" target="#b10">[11]</ref>.</p><p>Methods for a deeper understanding of how Computer Vision models work are being developed with different approaches. Visualization techniques like Class Activation Maps allow looking into a deep learning model's inner representations and utilize the activations to see which part of an image is attended to the most <ref type="bibr" target="#b11">[12]</ref>. Other visual approaches use clustering algorithms to differentiate classes into potential clusters. While processing an image, a CNN creates an embedding vector to capture the information about the image before using this embedding for its prediction. With dimensionality reduction algorithms like t-SNE or UMAP, these embedding vectors can be reduced to a visualizable number of dimensions. Such cluster visualizations help to distinguish which classes are easier or harder to differentiate <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>.</p><p>In 2021, research compared how ViT and ResNet utilize image information for classification tasks. The authors used Centered Kernel Alignment to save the hidden states of all layers and calculate a similarity score between all possible combinations of layers <ref type="bibr" target="#b14">[15]</ref>. They observed that ViTs have a more uniform representation across all model layers, whereas the similarity between lower and higher model layers of a ResNet is weaker <ref type="bibr" target="#b15">[16]</ref>.
Comparing every layer of the ResNet with every ViT layer shows that the first 30 ViT layers have the most similar representations to the first 60 ResNet layers. The higher the models' representations, the lower the similarity between them. This implies that local information aggregation, which is mostly captured in early layers, is important for both architectures, while later, more abstract representations are used for the final classification.</p></div>
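The patch-splitting step of the ViT described above can be sketched in a few lines. This is a minimal NumPy illustration under our own naming, not the original implementation; the position embeddings are random placeholders here, whereas in the real model they are learned parameters.

```python
import numpy as np

def image_to_patch_sequence(image, patch_size=16):
    """Split an image (H, W, C) into non-overlapping patches and
    flatten each patch into a vector, yielding a sequence of patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_h, n_w = h // patch_size, w // patch_size
    patches = (image
               .reshape(n_h, patch_size, n_w, patch_size, c)
               .transpose(0, 2, 1, 3, 4)          # group patches together
               .reshape(n_h * n_w, patch_size * patch_size * c))
    return patches

# A 224x224 RGB image yields a sequence of 14*14 = 196 patch vectors,
# each of length 16*16*3 = 768.
img = np.zeros((224, 224, 3))
seq = image_to_patch_sequence(img)
# Position embeddings (learned in the real model, random here) retain
# the spatial location of each patch after flattening.
pos_emb = np.random.randn(seq.shape[0], seq.shape[1])
tokens = seq + pos_emb
```

The resulting token sequence is what the Transformer's self-attention layers consume in place of word embeddings.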
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Applications of Digital Humanities for Art History</head><p>In recent years, multiple tools have emerged for the automatic analysis of art <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. They often utilized Computer Vision models and focused on very different aspects of art <ref type="bibr" target="#b18">[19]</ref>. The predominant dataset for these studies is the WikiArt dataset, which consists of around 250,000 artworks by over 3,000 artists and provides metadata regarding style, date, artist, genre and more for the individual pieces. Other commonly used, freely available datasets are the Web Gallery of Art (WGA) <ref type="foot" target="#foot_1">2</ref> and the TICC Printmaking dataset <ref type="foot" target="#foot_2">3</ref> . The WGA dataset consists of 52,867 pieces and, like the WikiArt dataset, includes artworks from many epochs and a variety of different media. The TICC dataset, with 58,630 images and 210 artists, is more specific in comparison: it focuses on printed artworks from the Netherlands State Museum (Rijksmuseum) and excludes other media.</p><p>The observed studies mainly focus on classification tasks. They include the differentiation of aspects like art style, genre, artist and painting style <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24]</ref>. They use Support Vector Machines and different iterations of CNNs like CaffeNet, ResNet18, ResNet50 or the All Convolutional Net. The results for the WikiArt dataset range from 33.62 % to 79.1 % top-1 accuracy for artist classification <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b19">20]</ref>.
For the WGA dataset, artist classification reached a score of 69.6 % (top-1 accuracy), and on the TICC dataset 76.2 % top-1 accuracy and 82.12 % mean class accuracy <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b23">24]</ref>. In another study, a maximum accuracy of 80 % was obtained <ref type="bibr" target="#b24">[25]</ref>. Many other experiments have been published; however, they were applied to diverse datasets.</p><p>With regard to printed media from the early modern period, only a few examples have been found. The TICC dataset includes printed portraits in its artist classification but does not specify findings or challenges regarding this medium, such as the influence of the printing technique used. Different printing techniques, like woodcuts, wood engravings and copper engravings, have been shown to have varying detection rates for CNN based models <ref type="bibr" target="#b25">[26]</ref>. These properties also seem to influence the quality of applications on printed media datasets, such as the detection of objects in early modern children's and youth books. Beyond classification experiments, similarity has often been considered an important concept in art history <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref>.</p><p>Utilizing visualization techniques like Class Activation Maps reveals that algorithms might not consider the content parts of an image and rather focus on other parts with more distinctive patterns, like the frame of an image <ref type="bibr" target="#b28">[29]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiment Setup</head><p>For this research, a comparison between two deep learning models is conducted. First, the dataset for this experiment is introduced. Afterwards, the models used and the training process are briefly described. The last part explains the evaluation process used to examine the classification results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dataset</head><p>The dataset used in this experiment is part of an art historical collection of printed portraits from the Herzog August Library in Wolfenbüttel <ref type="foot" target="#foot_3">4</ref> . The collection consists of nearly 32,000 portraits, of which roughly 28,000 have been digitized heterogeneously over the last decade. Based on the metadata of the collection, the ten most frequently occurring printmaking artists were selected for this classification experiment. In total, 2834 images can be associated with these ten artists. As the distribution across artists is highly uneven, with 631 portraits for the most prevalent artist and 156 examples for the least prevalent, the training dataset was limited to 140 randomly selected images per artist. Other studies used between 96 and 500 artworks per artist, which indicates that this number of examples is sufficient <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22]</ref>. The training dataset was split into 80 % training and 20 % test data per artist.</p></div>
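The per-artist sampling and split described above could look roughly as follows. This is a sketch under our own assumptions about the data layout (a mapping from artist name to image identifiers); the function name is hypothetical and the exact procedure used in the experiments may differ.

```python
import random

def sample_and_split(images_by_artist, per_artist=140, train_frac=0.8, seed=0):
    """Cap each artist at `per_artist` randomly chosen images, then split
    each artist's sample 80/20 into train and test (a stratified split)."""
    rng = random.Random(seed)
    train, test = [], []
    for artist, imgs in images_by_artist.items():
        sample = rng.sample(imgs, min(per_artist, len(imgs)))
        cut = int(len(sample) * train_frac)
        train += [(artist, img) for img in sample[:cut]]
        test += [(artist, img) for img in sample[cut:]]
    return train, test
```

Splitting per artist rather than globally keeps the class balance identical in the training and test partitions.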
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Models and Training Process</head><p>For this research, the performance of a ResNet50 and a Visual Transformer is compared on the classification dataset. The ResNet is, as previously described, one of the most renowned architectures and has been featured in a variety of comparison studies. The other model is a large Visual Transformer with 16x16 pixel patch size (ViT-L/16). Both models are trained with different hyperparameter configurations. The images were resized in a preprocessing step to fit the models' expected input size. The best models based on validation top-1 accuracy and validation loss are selected. Previous art historical works described that the ResNet is prone to overfitting <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b22">23]</ref>. To counteract this tendency, early stopping is implemented. As the ViT-L is a much bigger model than the ResNet50, with 307 million parameters compared to 26 million, its possible overfitting tendencies are also monitored. For the ResNet, 12 different hyperparameter configurations are tested. For the ViT, the ViT-B/16 variant is also tested to observe possible differences in the training process. All models were trained from scratch at a resolution of 224x224 or 384x384 pixels.</p></div>
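The early stopping used to counteract overfitting can be sketched as a small patience-based helper. The class name and the concrete patience and threshold values here are illustrative assumptions, not necessarily the ones used in the experiments.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training
        should stop."""
        if self.best - val_loss > self.min_delta:   # improvement
            self.best = val_loss
            self.bad_epochs = 0
        else:                                       # no improvement
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` is called once per epoch after evaluating on the validation set, and the loop breaks as soon as it returns True.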
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Evaluation Process</head><p>To evaluate the performance of the models beyond top-1 accuracy, additional data sources and analysis tools are used. The metadata of the printed portrait collection includes further information regarding the context and content of each portrait. This information includes the printing technique used in the creation of the portrait and also which section of the person was depicted (e.g. half-length portrait or chest up portrait). Additionally, the metadata includes the year of origin for 1656 portraits. One of the artists (Georg Fennitzer) was excluded from the productive periods comparison, as only three of his portraits possess date information. Afterwards, the median year per artist was calculated and used as a reference value for wrongly classified images. This way, it can be observed whether a wrongly assigned portrait was created during the assigned artist's productive period. Before analysing the individual features, a chi-squared significance test was performed. Its results showed that there is a statistical dependency in the data, as the null hypothesis was rejected.</p><p>For insights into the models' inner processes, the previously mentioned Centered Kernel Alignment (CKA) method, Gradient-weighted Class Activation Maps (GradCAMs) and attention map visualizations are used. As only the core principle of CKA is implemented in the demonstrations of previous works, the implementation in this work might differ. Unlike the comparison of ResNet50 and ViT-L proposed in related work, only the outputs of the individual blocks have been used, as the calculation per layer is computationally demanding. This results in 50 outputs from the ResNet50 and 24 outputs from the ViT-L. To get standardized results, the CKA is calculated on 25 examples per artist, 250 examples in total. The example representations are averaged layer-wise before calculating the CKA.
For the GradCAMs, the last hidden layer of every ResNet block is visualized and manually compared to the results from the attention maps. The attention maps of the ViT are created by taking the sequence length x number of patches matrix and averaging it layer-wise over all attention heads<ref type="foot" target="#foot_4">5</ref> .</p></div>
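The core principle of the CKA comparison can be sketched with the linear variant, computed on activations saved for the same batch of examples at two layers. As noted above, the implementation used in this work may differ from previous demonstrations; the sketch below assumes each layer's activations have been flattened to one vector per example.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices
    X (n, d1) and Y (n, d2) recorded for the same n examples.
    Returns a similarity score in [0, 1], where 1 is identical."""
    X = X - X.mean(axis=0)                      # center features
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)
```

Evaluating this for every pair of block outputs yields the similarity grids shown in the model analysis, with the diagonal equal to 1 because each layer is compared to itself.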
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>This chapter summarizes the main findings of the research. The final results on the test dataset with 1274 datapoints reached a top-1 accuracy of 46.13 % for the ResNet and 87.09 % for the Visual Transformer. Evaluated on all 2834 examples, the ResNet50 achieved a top-1 accuracy of 51.9 % with 1471 hits and the ViT-L 92.27 % with 2615 hits. The distribution of errors is comparable for both results. To have more datapoints for the following comparisons, the full dataset of 2834 images is used. The accuracy per artist is summarized in Table <ref type="table" target="#tab_0">1</ref>. For the ResNet, the best prediction is for the artist Tobias Stimmer with 92.95 % and the lowest accuracy for Johann Martin Bernigeroth with only 17.39 %. The ViT also achieved its highest accuracy for Tobias Stimmer with 99.36 % and its lowest for Martin Bernigeroth with 84.79 %.</p><p>Evaluating the detection rate in relation to the printing technique shows that the ResNet had the highest accuracy for the wood engraving technique with 93.55 %, when excluding mezzotint/etching, which occurs only twice in the whole dataset. The ViT also has its highest detection rate, 100 %, for wood engraving. Both models have their lowest accuracy for the combination of etching/copper engraving, with 37.34 % for the ResNet and 89.21 % for the ViT (see Table <ref type="table">2</ref>).</p><p>The mean difference between the artists' productive time and wrongly assigned portraits is 34.6 years for the ResNet and 20.4 years for the ViT. Furthermore, 75 % of the false predictions fall within a time difference of 51.0 years for the ResNet model and 31.3 years for the ViT model. Transferring the detection rate to the portrait type, the highest detection rate for the ResNet was 73.17 % and 97.56 % for the ViT; both of these results are for the headpiece portrait type.
The lowest detection rate for the ResNet was 36.36 % and 81.81 % for the ViT; both results are for the chest up portraits, as seen in Table <ref type="table">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>The results of the presented experiments show the performance of two very different deep learning architectures on a rather small dataset. They also show an extreme difference in accuracy between the ResNet50 model and the ViT model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Classification of Artists</head><p>Previous works utilizing ResNet variants achieved results ranging from 49.4 % top-1 accuracy for style detection to 80.0 % for artist classification, which indicates that there might be room for improvement <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b24">25]</ref>. As previously mentioned, the ResNet's training process was prone to overfitting. A bigger dataset with more classes or more examples per class might reduce the risk of overfitting. The results of the ViT-L, at 92.27 %, stand in contrast to this. The developers of the Vision Transformer claim that ViT models are more prone to overfitting and perform worse on small datasets in comparison to ResNets <ref type="bibr" target="#b10">[11]</ref>. The final tests are conducted over the whole dataset of 2834 examples, of which only 1120 were used for the actual training. This indicates that the ViT did not overfit in our experiments and extracted valid internal representations of the dataset.</p><p>Comparing the experiments' results with further information from the metadata, additional trends can be observed. The first observation indicates different detection rates between artists. For the ResNet, this trend is very strong, with the best detection rate for Tobias Stimmer at 92.95 % accuracy and only 17.39 % for Martin Bernigeroth. This clearly shows that different artists possess differentiable features. This trend can also be observed for the ViT model, although the gap is smaller, between 84.78 % for Martin Bernigeroth and 99.5 % for Tobias Stimmer. This supports the thesis that the portraits of Tobias Stimmer are easier to differentiate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Analysis of Metadata</head><p>In Figure <ref type="figure" target="#fig_1">2</ref>, the productive periods of the artists with available date information are shown. Tobias Stimmer is the only one of the nine artists active in the late 16th century, whereas Martin Bernigeroth was active in the first half of the 18th century together with four other artists. Overall, the performance of the ResNet is worst in this period, with accuracies between 17.39 % and 54.65 %. The ViT again shares this trend with smaller extremes, between 84.78 % and 94.19 %. For both models, Johann Georg Mentzel is the easiest artist to identify in this period. Another good indicator of the differentiability between the artists seems to be the printing technique. The ResNet assigned 93.5 % of the 155 wood engravings to the correct artist, and the ViT achieved 100 % accuracy for this printing technique. As it is one of the older printing techniques, the only artist who used it is Tobias Stimmer. Another example is the mezzotint technique, which was used nearly exclusively by Georg Fennitzer. For this technique, the ResNet assigned 81.65 % of 447 mezzotint portraits correctly; the ViT classified 96.87 % correctly. This could indicate that exclusively used printing techniques are a good indicator for high classification accuracy. Contrary to this, the ViT achieved its second highest prediction accuracy of 97.54 % for the etching technique, which was used by three different artists to a certain extent. This shows that good accuracy can also be achieved without relying solely on the printing technique. This can be further emphasized for copper engravings, the most used printing technique with 1503 portraits, frequently used by 8 of the 10 artists.
Both models were able to achieve high accuracy for Matthias van Somer, who used copper engravings in 212 of his 236 portraits in this dataset, with 83.47 % for the ResNet and 96.61 % for the ViT. This could indicate that he used a different style or other common elements in his portraits, which made it easier to identify his works. All in all, as both the ViT and the ResNet acquire a large amount of low-level information in the first few layers, the printing technique might be a good indicator for classification accuracy when comparing portraits from multiple centuries <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16]</ref>.</p><p>The last metadata feature used, the portrait type, seems to have less influence on the accuracy, although fluctuations can be observed here too. This could mainly be due to the fact that 2221 portraits belong to the two main portrait styles (chest up and half figure portraits). In contrast to the previous comparisons, the ResNet's detection rate had less divergent extremes, with a difference of only 37 % (see Table <ref type="table">3</ref>). For the ViT, the difference amounted to 23 % between the best and the worst detection rate. This indicates that this is a harder feature for the models to focus on, which could very well be due to the unbalanced distribution of portrait styles in the used dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Model Analysis</head><p>Figure <ref type="figure" target="#fig_2">3</ref> shows the Centered Kernel Alignment representations for the ResNet on the left and the ViT on the right. As previously mentioned, the CKA method determines how similar the representations of the individual layers are. This is measured by a similarity score ranging from 0 to 1, where 1 is the highest similarity. As both axes of the graphs display the layers of the model, the diagonal always has a similarity of 1, as each layer is compared to itself. For the ResNet, it is clear that layers in closer proximity share more similar representations than layers further apart, due to the nature of its architecture. This is in line with the findings of previous work <ref type="bibr" target="#b14">[15]</ref>. The grid pattern that can be observed in Figure <ref type="figure" target="#fig_2">3a</ref> arises from the architecture of the model <ref type="bibr" target="#b14">[15]</ref>. In previous studies, this specific pattern in a Transformer model was attributed to the skip connection <ref type="bibr" target="#b15">[16]</ref>. ResNet's architecture also includes a skip (shortcut) connection <ref type="bibr" target="#b8">[9]</ref>, which could imply that this connection is also visible in the ResNet's CKA. Contrary to previous studies, Figure <ref type="figure" target="#fig_2">3 (b)</ref> shows that the CKA of the ViT model has the highest similarity in the last third of the layers and in a few corresponding layers in the middle part of the model. A possible explanation was given by Raghu et al.: if the ViT is not supplied with enough data, it cannot learn local representations in early layers. This could result in lower similarity, as displayed in Figure <ref type="figure" target="#fig_2">3</ref>. Visual Transformers have transitional phases where the representation shifts from lower layers to higher layers.
Lower layers attend to local as well as global information, whereas higher layers attend only to global features <ref type="bibr" target="#b15">[16]</ref>. This can be supported by visualizing the GradCAMs and attentions.</p><p>In Figure <ref type="figure" target="#fig_3">4</ref> the visualizations show which parts of the image are most important for the classification. The overlay color indicates how important a region is: yellow regions have a big impact on the classification, green regions a slight influence, and blue regions (the base color) are irrelevant. In the early layers it is possible to see that the ResNet focuses on lower-level details, as it highlights parts of the clothes, background and portrait frame, possibly to determine the structures of the printing technique; previous art historical work observed similar behaviour (Figure <ref type="figure" target="#fig_3">4</ref> (a)) <ref type="bibr" target="#b28">[29]</ref>. This can also be observed for the ViT, although in less detail, as it seems to attend first to all parts of the portrait around the portrayed person (Figure <ref type="figure" target="#fig_3">4 (b)</ref>). In higher layers the ResNet keeps its focus and propagates the detected local features to global features for a final classification (Figure <ref type="figure" target="#fig_3">4 (c)</ref>). For the ViT the attention shifts completely and nearly exclusively attends to the portrayed person for its classification (Figure <ref type="figure" target="#fig_3">4 (d)</ref>). This supports the finding that there is no similarity between the lower and higher layers in the CKA analysis (the dark areas of Figure <ref type="figure" target="#fig_2">3 (b)</ref>). These differences can be observed for multiple examples. The ViT's focus on the portrayed person means that this part of the artwork must contain important information. This could be due to a multitude of reasons. 
One explanation could be that the artists have distinctive styles. It might also be due to the social classes of the customers portrayed by an artist. Lastly, the depiction conventions of the different time periods could also be of significance.</p></div>
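The layer-wise similarity underlying Figure 3 can be sketched in a few lines. The following is the linear CKA variant described by Kornblith et al. [15]; the paper does not publish its implementation, so the function names and shapes here are illustrative assumptions:

```python
import numpy as np

def centered_gram(X):
    """Centered Gram matrix of activations X, shape (n_examples, n_features)."""
    G = X @ X.T
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return H @ G @ H

def linear_cka(X, Y):
    """Linear CKA similarity in [0, 1] between two layers' activations
    recorded on the same n examples (feature dimensions may differ)."""
    Gx, Gy = centered_gram(X), centered_gram(Y)
    hsic = (Gx * Gy).sum()  # HSIC estimator, up to a constant scaling
    return hsic / (np.linalg.norm(Gx) * np.linalg.norm(Gy))

# A layer compared with itself yields similarity 1 -- the diagonal in Figure 3
acts = np.random.default_rng(0).normal(size=(64, 128))
assert np.isclose(linear_cka(acts, acts), 1.0)
```

Computing `linear_cka` for every pair of layers and plotting the resulting matrix as a heatmap produces grids like Figure 3; the normalization makes the score invariant to isotropic scaling of either layer's activations.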
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>This research shows the applicability of modern deep learning models to an art historical dataset. On the one hand, it demonstrates how newer state-of-the-art Transformer-based models perform in comparison to established CNN-based models. This is especially significant with regard to the high top-1 accuracy of over 87 % which the Visual Transformer achieved despite a very small dataset. This performance shows that this architecture might be useful for niche or other art historical classification problems, potentially outperforming older models and thus supporting the work of art historians more reliably. It would also be interesting to see how the ViT performs with a bigger dataset and more artists. For this, the dataset could be expanded to include more artists or an eleventh "other" class, which could be useful for metadata generation. It should be noted that a different CNN model such as EfficientNet could achieve superior results <ref type="bibr" target="#b29">[30]</ref>; this needs to be analyzed in future work. On the other hand, this research utilized a variety of tools to analyse both the prediction results and the model representations. For this artist classification task it can be seen that other features of the portraits and even historical aspects are significant indicators. Features like the printing technique have a noticeable impact on the prediction quality, especially for the ResNet model. The year of a portrait's creation also seems to be important, as depiction trends could influence how persons are portrayed. The visualization techniques showed that both models seem to focus on the background in lower layers, as they might be attending to the small distinctive features of the printing technique. The ViT often focused on the displayed person to determine the portrait's artist. 
This indicates that the portrayed person might provide valuable information for the classification process. Further methods for analysing deep networks could be used in future work <ref type="bibr" target="#b30">[31]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Three examples from the top artists depicting similarly dressed scholars (a) from Johann Georg Mentzel (b) from Johann Georg Mentzel (c) from Johann Christoph Sysang Artworks part of the printed portrait collection of the Herzog August Library Source: http://portraits.hab.de/</figDesc><graphic coords="3,111.59,84.19,96.25,127.56" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2:This graph shows the share of the top 10 artists over all portraits from the Herzog August Library's collection of printed portraits per decade. Fennitzer, Georg has been removed as previously mentioned as no dates were available</figDesc><graphic coords="10,89.29,84.19,416.69,212.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Centered Kernel Alignment visualization (a) Similarity between the ResNet layers for all 50 layers (b) Similarity between the ViT Layers for all 24 layers</figDesc><graphic coords="11,113.47,84.19,180.01,141.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: GradCAM and Attention visualization (a) Middle layer of ResNet (b) Second layer of ViT (c) Last layer of ResNet (d) Third to last layer of ViT Artworks part of the printed portrait collection of the Herzog August Library Source: http://portraits.hab.de/</figDesc><graphic coords="12,148.47,230.68,110.01,127.56" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Top-1 accuracy for the ten artists</figDesc><table><row><cell>Artist</cell><cell></cell><cell cols="3">Top-1 accuracy in % Quantity</cell></row><row><cell></cell><cell></cell><cell>ResNet</cell><cell>ViT</cell></row><row><cell cols="2">Johann Martin Bernigeroth</cell><cell>17,39</cell><cell>87,96</cell><cell>299</cell></row><row><cell>Martin Bernigeroth</cell><cell></cell><cell>31,70</cell><cell>84,79</cell><cell>631</cell></row><row><cell>Georg Fennitzer</cell><cell></cell><cell>84,49</cell><cell>97,61</cell><cell>419</cell></row><row><cell>Johann Franck</cell><cell></cell><cell>89,38</cell><cell>97,50</cell><cell>160</cell></row><row><cell>Wolfgang Philipp Kilian</cell><cell></cell><cell>38,77</cell><cell>91,63</cell><cell>227</cell></row><row><cell>Johann Friedrich Leonart</cell><cell></cell><cell>53,68</cell><cell>95,24</cell><cell>231</cell></row><row><cell>Johann Georg Mentzel</cell><cell></cell><cell>54,65</cell><cell>94,19</cell><cell>172</cell></row><row><cell>Matthias van Somer</cell><cell></cell><cell>83,47</cell><cell>96,61</cell><cell>236</cell></row><row><cell>Tobias Stimmer</cell><cell></cell><cell>92,95</cell><cell>99,36</cell><cell>156</cell></row><row><cell>Johann Christoph Sysang</cell><cell></cell><cell>24,42</cell><cell>92,08</cell><cell>303</cell></row><row><cell>Total</cell><cell></cell><cell>51,91</cell><cell>92,27</cell><cell>2834</cell></row><row><cell>Table 2</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">Detection rate per printing technique combinations</cell><cell></cell></row><row><cell>Printing technique</cell><cell></cell><cell cols="3">Detection rate in % Quantity</cell></row><row><cell></cell><cell></cell><cell>ResNet</cell><cell>ViT</cell></row><row><cell>Copper 
Engraving</cell><cell></cell><cell>42,78</cell><cell>90,22</cell><cell>1503</cell></row><row><cell>Etching</cell><cell></cell><cell>55,33</cell><cell>97,54</cell><cell>244</cell></row><row><cell cols="2">Etching/ Copper Engraving</cell><cell>37,34</cell><cell>89,21</cell><cell>482</cell></row><row><cell>Mezzotint</cell><cell></cell><cell>81,66</cell><cell>96,89</cell><cell>447</cell></row><row><cell>Mezzotint/ Etching</cell><cell></cell><cell>100,00</cell><cell>100,00</cell><cell>2</cell></row><row><cell>Wood engraving</cell><cell></cell><cell>93,55</cell><cell>100,00</cell><cell>155</cell></row><row><cell>Table 3</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Detection rate per portrait type</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Portrait type</cell><cell cols="4">Detection rate in % Quantity</cell></row><row><cell></cell><cell cols="2">ResNet</cell><cell>ViT</cell></row><row><cell>Bust</cell><cell></cell><cell>36,36</cell><cell>81,82</cell><cell>11</cell></row><row><cell>Full length</cell><cell></cell><cell>40,74</cell><cell>74,07</cell><cell>27</cell></row><row><cell>Half length</cell><cell></cell><cell>40,57</cell><cell>89,69</cell><cell>456</cell></row><row><cell>Head piece</cell><cell></cell><cell>73,17</cell><cell>97,56</cell><cell>41</cell></row><row><cell>Kit-cat</cell><cell></cell><cell>44,86</cell><cell>91,83</cell><cell>1090</cell></row><row><cell>one-quarter length</cell><cell></cell><cell>63,57</cell><cell>94,25</cell><cell>1131</cell></row><row><cell>three-quarter length</cell><cell></cell><cell>41,54</cell><cell>87,69</cell><cell>65</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://paperswithcode.com/sota/image-classification-on-imagenet</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.wga.hu/index_database.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://auburn.uvt.nl/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Herzog August Bibliothek Wolfenbüttel, Germany, (www.hab.de)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">Code from: https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey of computational methods for iconic image analysis</title>
		<author>
			<persName><forename type="first">N</forename><surname>Van Noord</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqac003</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="1316" to="1338" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Digital art history and the computational imagination</title>
		<author>
			<persName><forename type="first">G</forename><surname>Mercuriali</surname></persName>
		</author>
		<idno type="DOI">10.11588/dah.2018.3.47287</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal for Digital Art History</title>
		<imprint>
			<biblScope unit="page">141</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Debating digital art history</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bentkowska-Kafel</surname></persName>
		</author>
		<idno type="DOI">10.11588/dah.2015.1.21634</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal for Digital Art History</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Porträtgalerien auf Papier. Sammeln und Ordnen von druckgrafischen Porträts am Beispiel Kaiser Franz&apos; I. von Österreich und anderer fürstlicher Sammler</title>
		<author>
			<persName><forename type="first">P</forename><surname>Poch</surname></persName>
		</author>
		<idno type="DOI">10.7767/9783205208556</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Böhlau Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<author>
			<persName><forename type="first">S</forename><surname>Skowronek</surname></persName>
		</author>
		<ptr target="https://books.google.de/books?id=_Jpx8FkObicC" />
	</analytic>
	<monogr>
		<title level="m">Autorenbilder: Wort und Bild in den Porträtkupferstichen von Dichtern und Schriftstellern des Barock</title>
		<title level="s">Würzburger Beiträge zur deutschen Philologie</title>
		<imprint>
			<publisher>Königshausen &amp; Neumann</publisher>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Niedermeier</surname></persName>
		</author>
		<ptr target="https://kunstgeschichte-kongress.de/programm/programm-2022/" />
		<title level="m">Visuelle Ähnlichkeit als relationaler Formbegriff: Automatische Bilderkennung von Reproduktionen frühneuzeitlicher Porträtgrafik</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">C 5 Bild-und Video-Retrieval</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Diem</surname></persName>
		</author>
		<idno type="DOI">10.1515/9783110769043-035</idno>
	</analytic>
	<monogr>
		<title level="m">Grundlagen der Informationswissenschaft</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Kuhlen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lewandowski</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Semar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Womser-Hacker</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Boston</addrLine></address></meeting>
		<imprint>
			<publisher>De Gruyter Saur</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="413" to="422" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Neural Networks and Deep Learning A Textbook</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Aggarwal</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-94463-0</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Springer</publisher>
			<pubPlace>Cham</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<idno>CoRR abs/1512.03385</idno>
		<ptr target="http://arxiv.org/abs/1512.03385" />
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Resnet strikes back: An improved training procedure in timm</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wightman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
		<idno>CoRR abs/2110.00476</idno>
		<ptr target="https://arxiv.org/abs/2110.00476" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno>CoRR abs/2010.11929</idno>
		<ptr target="https://arxiv.org/abs/2010.11929" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Selvaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vedantam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cogswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<idno>CoRR abs/1610.02391</idno>
		<ptr target="http://arxiv.org/abs/1610.02391" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Visualizing data using t-sne</title>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v9/vandermaaten08a.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="2579" to="2605" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Mcinnes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Healy</surname></persName>
		</author>
		<idno>CoRR abs/1802.03426</idno>
		<ptr target="http://arxiv.org/abs/1802.03426" />
		<title level="m">UMAP: uniform manifold approximation and projection for dimension reduction</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Similarity of neural network representations revisited</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<idno>CoRR abs/1905.00414</idno>
		<ptr target="http://arxiv.org/abs/1905.00414" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Do vision transformers see like convolutional neural networks?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Raghu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<idno>CoRR abs/2108.08810</idno>
		<ptr target="https://arxiv.org/abs/2108.08810" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Hastik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hegel</surname></persName>
		</author>
		<idno type="DOI">10.17169/refubium-30108</idno>
		<title level="m">Bilddaten in den Digitalen Geisteswissenschaften</title>
				<meeting><address><addrLine>Berlin</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Distant viewing: analyzing large visual corpora</title>
		<author>
			<persName><forename type="first">T</forename><surname>Arnold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tilton</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqz013</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="3" to="16" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The visual digital turn: Using neural networks to study historical images</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wevers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Smits</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqy085</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="194" to="207" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Fine-tuning convolutional neural networks for fine art classification</title>
		<author>
			<persName><forename type="first">E</forename><surname>Cetinic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lipic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Grgic</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.eswa.2018.07.026</idno>
	</analytic>
	<monogr>
		<title level="j">Expert Syst. Appl</title>
		<imprint>
			<biblScope unit="volume">114</biblScope>
			<biblScope unit="page" from="107" to="118" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Artist Identification with Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Viswanathan</surname></persName>
		</author>
		<ptr target="http://cs231n.stanford.edu/reports/2017/pdfs/406.pdf" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Large-scale classification of fine-art paintings: Learning the right metric on the right feature</title>
		<author>
			<persName><forename type="first">B</forename><surname>Saleh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename></persName>
		</author>
		<idno>CoRR abs/1505.00855</idno>
		<ptr target="http://arxiv.org/abs/1505.00855" />
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Recognizing art style automatically in painting with deep learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lecoutre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Négrevergne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yger</surname></persName>
		</author>
		<ptr target="http://proceedings.mlr.press/v77/lecoutre17a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings Asian Conference on Machine Learning, ACML 2017</title>
				<meeting>Asian Conference on Machine Learning, ACML 2017<address><addrLine>Seoul, Korea</addrLine></address></meeting>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2017">Nov. 15-17. , 2017</date>
			<biblScope unit="page" from="327" to="342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Learning scale-variant and scale-invariant features for deep image classification</title>
		<author>
			<persName><forename type="first">N</forename><surname>Van Noord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">O</forename><surname>Postma</surname></persName>
		</author>
		<idno>CoRR abs/1602.01255</idno>
		<ptr target="http://arxiv.org/abs/1602.01255" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Painter prediction from artworks with transfer learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Cömert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Ozbayoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kasnakoğlu</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICMRE51691.2021.9384828</idno>
	</analytic>
	<monogr>
		<title level="m">7th Intl. Conference on Mechatronics and Robotics Engineering (ICMRE)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="204" to="208" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Deep learning for historical books: classification of printing technology for digitized images</title>
		<author>
			<persName><forename type="first">C</forename><surname>Im</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11042-021-11754-7</idno>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">81</biblScope>
			<biblScope unit="page" from="5867" to="5888" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Wie sich die Bilder ähneln</title>
		<author>
			<persName><forename type="first">W</forename><surname>Helm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schmideler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Im</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kollmann</surname></persName>
		</author>
		<author>
			<persName><surname>Müller</surname></persName>
		</author>
		<idno type="DOI">10.26298/melusina.8f8w-y749-wsdb</idno>
	</analytic>
	<monogr>
		<title level="m">Fabrikation von Erkenntnis: Experimente in den Digital Humanities</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Attesting similarity: Supporting the organization and study of art image collections with computer vision</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ommer</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqy006</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="845" to="856" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Applying computer vision systems to historical book illustrations: Challenges and first results</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Im</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schmideler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Helm</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-2865/poster7.pdf" />
	</analytic>
	<monogr>
		<title level="m">Post-Proceedings 5th Conference Digital Humanities in the Nordic Countries (DHN 2020)</title>
				<meeting><address><addrLine>Riga, Latvia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">Oct. 21-23. 2020</date>
			<biblScope unit="volume">2865</biblScope>
			<biblScope unit="page" from="255" to="260" />
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">EfficientNet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno>CoRR abs/1905.11946</idno>
		<ptr target="http://arxiv.org/abs/1905.11946" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">An empirical survey on explainable AI technologies: Recent trends, use-cases, and categories from technical and application perspectives</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nagahisarchoghaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cummins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Karimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nandanwar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhattacharyya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rahimi</surname></persName>
		</author>
		<idno type="DOI">10.3390/electronics12051092</idno>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
