From Visual Forms to Metaphors – Targeting Cultural Competence in Image Analysis

Lars Oestreicher 1 and Jan von Bonsdorff 2

1 Department of Information Technology, Uppsala University, Box 337, 751 05 Uppsala, Sweden
2 Department of Art History, Uppsala University, Box 630, 751 26 Uppsala, Sweden

Abstract
Image analysis has taken a large step forward with the development of machine learning. Today, recognizing images, as well as constituent parts of images (faces, objects, etc.), is a relatively common task for machine learning systems. However, there is still a big difference between recognizing the content of a picture and understanding the meaning of the image. In the current project we have chosen an interdisciplinary approach to this problem, combining art history, machine learning and computational linguistics. Current approaches pay close attention to details of the image when trying to describe what is in the picture, with the result that, for example, smiling faces support an interpretation of the image as “positive” or “happy”, even if the picture itself shows a scary scene. Other problematic issues are irony and other polyvalent messages, whose high degree of ambiguity enables, for example, humorous interpretations of a picture. As a starting point, we have chosen to identify visual agency, i.e., how and why pictures, when regarded as acting agents, effectively may catch the attention of the viewer. Our objective for this first phase of the project is to investigate multi-modal models’ capacity for recognizing such high-level image content as, for example, context, agency, visual narration, and metaphors. Ultimately, the goal is to improve the cultural competence and visual literacy of neural networks through art-historical and humanities expertise. In this paper we describe our current approach, the general ideas behind it, and the methods that will be used.

Keywords
Multi-modal machine learning, high-level image content, visual metaphors, cultural competence, pictorial conventions

1. Introduction

Gottfried Boehm tells us that the image includes a duality; it points away from itself, while still maintaining a materiality: “Bilder sind spannungsgeladene, real-irreale Körper” (Images are bodies fraught with tension, simultaneously material and immaterial) [1]. In this way, it is more or less obvious that a picture is in some sense “larger” than the bare sum of its parts. Let us call this the Jack-in-the-box quality: when you open the “box” of the image, you tend to get more than you asked for. This is our conception of the image: any well-conceived image is an active interpellative instance, speaking, violently interrupting, yelling, and tugging at the observer’s sleeve. For us the image, or the work of art, can work as a kind of golem, a being with restricted life-like properties. Still, if this golem wants to communicate, it has to use some kind of sign system known to others. It is this systematized will-to-communication that we want to single out and convey to the neural networks.

Today, we have a new player in the field of image interpretation, namely systems based on artificial intelligence, primarily in the shape of machine learning within the area of image analysis. The systems being developed can give more and more precise descriptions of the content of pictures, in terms of what objects an image contains.
It is even possible to detect physical relations (even in depth, such as distance) or shallow relationships [2] between the objects in an image. By shallow, we mean relationships that can be described by simple expressions, such as “Man wears glasses” or “Woman holds spanner”. Even so, the higher-level message of the image is still not addressed by these systems. These shallow interpretations still reside on a relatively simple level, mostly providing the superficial identification of objects mentioned above. In this way, most machine learning applications would, for example, have serious problems making sense of some of the more complex pictures by Albrecht Dürer (see Figure 1). Identifying the objects in the picture is quite simple (as long as the image recognition software has been trained on the types of objects that occur in the picture).

Figure 1. Albrecht Dürer, Melencolia I (engraving, 1514).

For example, it would be relatively easy for an artificial intelligence to identify discrete, well-known elements in the engraving: the two angels (one large, one small), the dog, the scientific instruments, the ladder, the rainbow. Projects such as Saint George on a Bike (Barcelona Supercomputing Center) have shown a working model for making discrete iconographic motifs in medieval fine art accessible to machine learning and artificial intelligence [3]. Furthermore, the open Web platform iART (https://labs.tib.eu/iart) specializes in differentiated fine art searches, using a complex modular system architecture [4]. The iART search engine masters iconographically based classification principles that, e.g., examine objects for biblical motifs or general genre themes.

But back to Dürer’s engraving, where uncertainties appear at all levels. What about the planet in the background, or is it a comet? What kind of space does the angel reside in? Is it a terrace high above sea level? Do the architectural elements imply an open or a closed room? What kind of geometrical form is the weird polyhedron? The AI could perhaps work this out, if equipped with a specialized database. The same applies to the rich and well-documented iconographic tradition of personifications and allegories of Melancholy and Geometry. But what about making a relevant synthesis of the vast corpus of literary and scientific knowledge connected with the concept of Melancholy, and more specifically with the engraving, trying to reach the level of “what is it all about”? Typical confident statements, at times summarizing years of art-historical studies, such as “This engraving is an allegory of the creative mind in repose”, “This is a mind-map of Dürer’s scientific thinking”, or “The engraver is fooling the audience with nonsensical content”, are certainly beyond the current capacity of the machine. We do not strive to reach any confident closures, but rather a tool that can make worthwhile suggestions, see unexpected connections, and follow mental leaps.
2. Theoretical Background

The project depends heavily on both the latest developments within artificial intelligence and current theoretical trends within the field of art history. The intersection between these areas has attracted much attention lately, with AI discussed both as a producing artist and as an analytic perceiver. However, the perception of an image by a machine learning system still operates on a fairly low level, and there is a lack of research on the deeper levels of the understanding and interpretation of art, which is the focus of the project presented in this paper.

2.1. Art History and Meaning

Most images do not merely portray a certain scene, but rather try to tell a (visual) story. Our approach therefore does not start at the formal level of object detection, but commences at the instant when images start to do more than just show things, i.e., when they engage the observer and start to tell stories. These visual stories are difficult to describe in words, since linguistic expressions do not seem capable of adequately capturing all the aspects of an image. Many aspects of an image also depend on the observer’s background, cultural context, and so on. For example, the interpretation of a painting with a religious scene benefits from an understanding of the religious belief that lies beneath the motif of the painting [2]. Different details of a painting will trigger different associative patterns depending on whether the image is viewed from a cultural perspective or through a more direct interpretation of the scene. Of course, the respective details of the picture can be described with a similar understanding of the elements. However, the interpretation of the visual story will most likely differ significantly.

Figure 2. A first sample advertisement from the material collected in the pilot study (from 1956).

If the interpretation framework is not known, the visual story or narrative will often not be recognised, or only recognised with great difficulty [5]. Of course, not all possible contexts of an image can be known. The amount of unknown and unreadable context seems to grow with distance in time. Cave paintings from different periods, as well as Neolithic rock art, can seem impenetrable except in the most general sense. Many of the beliefs and incentives of these long-lost cultures are known only through the visual records, with no possibility of comparative corroboration through other historical sources. Interestingly enough, the difficulty of closely reading a culture also pertains, to a degree, to our chosen visual sources: the visual ephemera of advertisements from the 20th century. Some content may be readily identified, like the smiling lady in the toothpaste ad from 1956 (see Figure 2). The visually stated metaphor (to be precise, a simile) reads: “The lady’s teeth are like pearls”. In this connection, it seems fruitful to turn to conceptual metaphor theory as formulated by, among many others, Lakoff and Johnson [6]. Visual metaphors are wonderfully suggestive and effective in engaging the viewer’s interest and starting up narratives of the kind we are looking for; the viewer has to make mental leaps, and the effort of the resulting combinations seems to bring the audience immediate reward in terms of an expected story. This suggests a very swift accumulation of meaning through nearly instantaneous interpretation.
This is the “Jack-in-the-box” quality we mentioned earlier. The attribute is a device typical of visual metaphors. An attribute can be an aspect of the interpreted thing itself: “white” teeth can be made “pearly and lustrous”. But usually, a metaphor (verbal or visual) contains two domains that are not the same but share overlapping features, like Shakespeare’s famous metaphor of the “eye of heaven”, that is, the sun. In this humble but quite suggestive toothpaste advertisement, “teeth” are given positive attributes; they are as “shiny” as light (the Swedish word “tänder” can mean both “light up” and “teeth”). Providing the woman with well-known attributes firmly stages a possible story-to-be-told. The pointers to strong connotations work as a kind of semantic reinforcement or amplification [5]. When an object is clearly depicted with familiar attributes, this aids identification, securely establishes the objecthood of the thing in the physical world, and, in this case, is meant to impress. The agent (here the young lady) is endowed with a stronger presence when bestowed with amplified characteristics, here “lustrous, shiny teeth”.

This is as far as we can go with the general content of the image. What cannot be known without other sources is that the attributes of “light” and “gleam” also allude to a real-world fact: the toothpaste brand “Stomatol” was the text of the first electric sign advertisement in Stockholm in 1909, a circumstance still well known to the readers of Swedish magazines at the time of the advertisement in 1956, but not so familiar today. Thus, some content is more “general”, some more “special”. In our annotation practice we may not be able to grasp some of the more specialized meanings. In these cases, the general meanings will have to do. But we will come a long way with specialized art-historical background knowledge and terminology, so we feel well equipped to tackle the aspect of content loss over time.

2.2. Artificial Intelligence for Meaning Detection

In the area of artificial intelligence, the project will utilize a combination of methods from image analysis and natural language processing. In the following, we will look at each area in turn and then discuss the possibilities of a combined method. Starting from a set of annotated images, the main idea is that the system will be able to analyse the annotations as a means of categorizing the images into themes or topic areas. The topic areas can then be anchored onto the themes in terms of, for example, a more pragmatic interpretation (in the end even considering metaphorical meanings in the images).

2.2.1. Computational Image Analysis

The use of computers within the visual arts has been researched for almost as long as computers have had the capacity to handle large arrays (of pixel values). The programs at that time mostly served as supportive tools for artists, in the shape of, for example, Adobe Photoshop and the more artistically oriented Corel Painter. Image generation and manipulation was also a popular field of development in the early 1980s, mostly with miniature programs such as “Pico” [7]. Computers were from a very early time considered potent actors as producers in the visual arts (see, e.g., the volume by Spalter [8]). However, the capacity for automated analysis of images was still lacking, both in terms of hardware and software resources.
With increasing computational capacity, the area of computational image analysis attracted great interest, especially within medical applications, for example as a tool for the early diagnosis of illnesses through the identification of cell anomalies, such as cancer cells. Initially this analysis was based on traditional software solutions, using large statistical packages. Image analysis had grown into an expansive area of research even before the introduction of deep learning. Since then, the application area has grown further, and research within general image analysis has taken a large step forward in the last decades, much due to the progressive development of machine learning and deep learning, currently reaching image recognition and object detection precision of more than 90 percent in the average case.

As an example, we will look at the simplest principles for image analysis, where the analysis is based on the recognition of gradually more and more complex features found in the image. The algorithms used in this type of system depend on finding features in the pictures at higher and higher levels. The analysis methods applied here are basically of a statistical kind, but in a less guided form than traditional statistical analysis methods. While training a machine learning model, the analysis is directed towards finding possible patterns at different levels of abstraction in the training data (the pictures), which are themselves reduced to matrices of numbers, representing images, sentences, data structures or any other kind of data. This means that the training and use of an ML system is, in one respect, insensitive to the type of data: it only tries to find recurring patterns in the numbers provided as input.

At the lower levels, the simplest features, e.g., lines or points, are recognized; these are then clustered and recognized at higher and higher levels of abstraction above the basic pixel level. In a mid-level feature detector, we find shapes that can be seen as belonging to parts of the motif, such as wheels (circular shapes) and sweeping arches, which are then even more recognizable in the high-level feature map. The classifiers then detect the features in the images and combine them into clusters that signify object types. Through this principle the networks, when trained, can detect different kinds of objects in the images, and through these also draw conclusions about image themes, such as whether the picture shows a big city or a mountain village, a harbour or a supermarket car park, for example. Through these methods of feature analysis, it is also possible to train networks for the analysis of anomalies, as in the methods used to detect cancer cells in tissue samples, or retinal changes in the case of certain diseases. There are a large number of application areas where this type of analysis has opened up huge possibilities. There are also constant improvements within these application areas coming from public competitions, such as those run on the competition-oriented Kaggle website (https://www.kaggle.com). Today the simpler methods for image analysis with deep learning can be applied successfully even in hobby programming, and what were previously research problems are today often not even suitable for theses at the lower levels of education.
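To make the layered feature-extraction principle above concrete, here is a minimal sketch in PyTorch. It is an illustration under our own assumptions, not the pipeline of any particular production system: three convolutional stages stand in for low-, mid- and high-level feature detectors, and a linear classifier combines the detected features into classes.

```python
import torch
import torch.nn as nn

class TinyFeatureNet(nn.Module):
    """Illustrative three-stage feature hierarchy with a classifier head."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Low-level features: lines, points, simple gradients.
        self.low = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Mid-level features: arcs and circular shapes (e.g., wheels).
        self.mid = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # High-level features: object parts and whole-object patterns.
        self.high = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Classifier: combines detected features into object/theme classes.
        self.classify = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.high(self.mid(self.low(x)))
        return self.classify(x.flatten(1))

# To the network, the picture is just a matrix of numbers:
logits = TinyFeatureNet()(torch.randn(1, 3, 224, 224))
```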
However, it is also a fact that more complex image analysis applications still require a large amount of work, both in programming and in runtime processing. The application of machine learning to art in general, and to imagery more specifically, is also an expansive field, with quite a few interesting applications, such as the detection of art styles and the re-application of art styles to pictures and even photographs, for example in the shape of the numerous photo-manipulation applications available on mobile devices. These techniques are now so widespread that they have become a large ethical issue, with, for example, fake pictures of celebrities being spread [9].

With this plethora of applications combining machine learning and images, we are tempted to assume that most of the problems have been approached in one way or another. Yet, as outlined above, most of the image analysis and manipulation applications are still based primarily on a massive statistical analysis of features. Image analysis and interpretation from the perspective of art is much more complex than just detecting denotative features or the general drawing style [10]. Images contain other, subtler communicative aspects that are essential to their interpretation, but which are difficult to capture with current image analysis methods. These less obvious nuances may sometimes not even be covered by a sampling of the objects in the picture, or by the manner of painting, but depend on more subtle aspects of the picture. In fact, some of the factual clues to an interpretation of a picture may not even be possible to specify in words or expressions. Nevertheless, human observers are often able to classify images according to their message from this perspective.

2.2.2. Natural Language Processing for Images

Within the research in computational linguistics, too, there has been a large development following the introduction of machine learning. Previous systems were mostly based on semantic or conceptual network representations of knowledge, where texts were translated into large, interconnected networks of nodes (see for example [11], [12]). More grammatically oriented approaches were also used, for example those based on case grammar descriptions [13]. These approaches have more or less disappeared, in favour of statistical methods. Many early language translation systems used statistical methods as a base. The precision and capacity of these early systems were low; with increasing hardware and software capacities, the precision improved greatly over time.

A major change came with the introduction of neural networks, and the field has expanded in many different directions, from language translation to more specific systems used for summarizing, question answering and sentiment analysis (for example, of movie reviews). Many of these systems make use of pretrained language models, which can be fine-tuned for new tasks (also referred to as transfer learning). A prominent, recent example of such systems is BERT [14], which aims at building a model for language understanding in different domains. A later development resulted in Visual-Linguistic BERT (VL-BERT), which combines methods for image analysis with the BERT system [15]. VL-BERT is intended to facilitate visual common-sense reasoning as well as the answering of questions about images.
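The transfer-learning idea can be made concrete with a minimal sketch. We assume the Hugging Face transformers library here (the original BERT release [14] ships with a different toolkit); the pretrained encoder is reused as-is, while only a small classification head is new and needs task-specific training.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Reuse the pretrained BERT encoder; only the classification head on top
# is newly initialized and must be fine-tuned for the task at hand
# (e.g., sentiment analysis of movie reviews, with num_labels=2).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

inputs = tokenizer("A wonderfully suggestive advertisement.", return_tensors="pt")
logits = model(**inputs).logits  # meaningful only after fine-tuning
```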
Such visual-linguistic models are one step towards the higher-level analysis that is the goal of our project. Another major change is the advent of transformer-based, multimodal neural networks built on massive amounts of raw data. In January 2021, OpenAI introduced CLIP (Contrastive Language-Image Pre-Training) [16]. CLIP is a neural network which efficiently learns visual concepts from natural language supervision: it learns directly from raw text about images, without resorting to manually labelled data. CLIP relies on 400 million image-text pairs from the internet, that is, images with captions. CLIP models can be applied to nearly any visual classification task without needing additional training examples (as a classifier trained on a fixed dataset like ImageNet would). Furthermore, CLIP allows researchers to design their own classifiers and removes the need for task-specific training data. The significance of the approach used in CLIP is that, in contrast to most other systems, it will not only tell which class a certain object in the image belongs to, but can also provide an adequate result for a natural language prompt describing what the image depicts. In this way it adds some (still superficial) contextual knowledge to the picture, as one step in the direction in which we would like to work.

We will not go into details about the technical aspects of the CLIP system here, but there are some features worth noting. The system has, for example, been trained on text/image pairs such as can be gained from social media like Instagram, where people publish photos together with their own taglines, which often describe the motif in more detail than just “a cat”. The system also does not “kill” the connotations through the creation of distinct classes (which can be translated into non-meaningful but computable entities, such as numbers or token strings). The CLIP system is built in such a way that it can work in two directions: either providing a description close to natural language for a given image, or retrieving a picture that shows what a given description entails. According to OpenAI, the system consists of two types of encoders (one for images and one for texts) in combination with “zero-shot transfer, natural language supervision, and multimodal learning” (https://openai.com/blog/clip/).

It is important to note in this context that neither CLIP nor the natural language application GPT-3 (produced by the same company) can be said to really understand the expressions inherent in the images or texts that they produce or analyse. The output of this type of system is essentially a statistical prediction, based on the large number of calculations made over huge amounts of data samples. In short, the system analyses a very large number of pictures together with descriptions of the pictures made by humans. From this input data it generates a prediction of what would be the most likely description given by a human to a new image. However, it cannot be said to analyse or interpret either the images or the texts in the human sense. The situation is almost identical to the dilemma described by Searle in his “Chinese room” example [17], where he, in short, states that if an entity (human or machine) learns and uses a large number of rules for the handling of a language, it can produce perfect sentences and even answers to questions without understanding a single word of that language.
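As an illustration of what this prediction-based capability looks like in practice, here is a minimal zero-shot classification sketch. We assume the Hugging Face transformers port of CLIP; the image file name and the candidate prompts are hypothetical examples from our own domain.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate natural-language descriptions; no task-specific training needed.
prompts = [
    "an advertisement for toothpaste",
    "a painting of a religious scene",
    "a photograph of a car on a bumpy road",
]
image = Image.open("stomatol_ad_1956.jpg")  # hypothetical file name

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
for prompt, p in zip(prompts, probs[0]):
    print(f"{p:.2f}  {prompt}")
```

The output is a ranking of the given captions, not an interpretation: precisely the identification-without-understanding that the Chinese room argument above describes.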
Thus, the understanding of the visual stories in pictures, as described previously in this paper, is still well beyond the reach of current technology. Against this background, it is of course also relevant to ask whether it will be possible for a machine learning agent not only to perceive the primary objects in an image, but also to create a representation of the visual story being told. How much of the interpretation of an image and of its visual story relies on an actual understanding of the context and the symbolic meaning? This is where our project takes its starting point.

3. Method and Material

What kind of methods and models do we need when annotating the visual material? We have already mentioned our choice of starting the description with the notion of agency and the beginning of a story (the “narrative kernel”) often found in visual metaphors. One feature of visual metaphors is the one we have already mentioned as “the Jack-in-the-box quality”: images, or rather their interpretations, tend to start from a latent quality, a mere potential, and emerge as qualified meaning-carriers.

Let us present one example of the drift into a level of denser signification, taken from our experiments with exemplary image annotations. In many advertisements, a wide-angle, one-point perspective view is often used as a trope for “modernity and progress” (see Figure 3). This ad for a German car brand (“Build a Bridge over Bad Roads!”) shows a bumpy sand road and a conceptual grey bridge in curvilinear orthogonals, converging towards a vanishing point on the left-hand side. Speed lines also converge at the vanishing point, hinting that the car is quickly closing in from a great distance. The car is depicted in a worm’s-eye view. The sketch of the axle track is inspired by technical drafting and adds to the technological impression. This kind of scenery, with its steep linear perspective, serves as a trope for societal advances.

Figure 3. A second advertisement (published 1958) from the material collected in the pilot study.

In Figure 4 two other examples are shown. In the 1950s, the Shell company presented a series of ads about different collaborations with other companies, here a cable and wire company. The ads are high-quality, with a weird fusion between Surrealism and the depiction of technology as a road into the future. The undated tie ad (Raxon Fabrics) is not as skilfully devised as the cable ad, but is nonetheless interesting in its use of spatial cues. The forced perspective has no apparent role in suggesting space for the ties hanging in a conceptual void. The orthogonals, with the pole as a kind of vanishing point, constitute more of a sign in its own right, a sign of progressive times and the future.

Through the interpretation of images, meaning seems to slide from the uncategorized formal feature into a region where storytelling can begin. Thus, we do not need a rigid, hierarchical model as the basis for our methodology, but a model that allows for movement, opening, and emergence between levels of meaning. Until now, we have gathered a smaller corpus of randomly selected issues of richly illustrated monthly and weekly magazines, for example Veckojournalen and Bonniers Månadstidning, from the 1920s to the 1950s: a dataset of approximately 12,600 discrete images.

4. A Combined Approach

Currently, AI-based search engines are less knowledgeable about image content and context than is desirable for meaningful performance in ranking.
Therefore, in the first stage of our pilot project, we want to apply multi-modal machine learning models that can connect text and images, like CLIP, to an image database. The data consists of visual ephemera like advertisements and photo journalism from Swedish weekly and monthly journals from the 1920s to the 1950s, as well as assorted images of fine art. In a later stage, we want to test different kinds of specialized image annotations in natural language. Thus, our objective is to investigate the multi-modal models’ capacity for recognizing such high-level image content as, for example, context, agency, visual narration, and metaphors. We are also interested in how a space built from such pair-wise distances between texts and images would look, and in its properties in relation to qualitative theory.

The first problem we want to tackle is finding the most effective method of handling a large number of annotated and unannotated images. Many current approaches face major obstacles: typically, visual datasets, often based on crowdsourcing, are labour-intensive and costly to build. Furthermore, the visual concepts and classifiers such datasets make it possible to learn are narrow and difficult to supplement. Connected to this first problem is also the extraction of our image-and-text datasets.

Figure 4. Two ads (the one on the left from 1958, the one on the right undated) that display a typical spatial solution of “modernity and progress”.

The second problem we will deal with is how the annotation of an image has to be prepared so that the AI understands not only what is shown, but what the image is about. We ask ourselves what formations in the picture would be optimal for the learning process of the neural networks, and believe that visual narratives, agency, and visual metaphorics are highly relevant. This is a methodology for teaching the AI relevant modes of visual literacy and cultural competence. This competence is coded through historically developed visual conventions. Thus, specialized art-historical, semiotic, and narratological concepts describing pictorial conventions will be utilized in the annotations (a hypothetical sketch of such an annotation record is given in the concluding discussion below). We will also test the optimal amount of additional textual information paired with each image.

5. Concluding Discussion

As we have seen, the development within artificial intelligence has been very rapid, not least within image and natural language analysis. However, most of the methods used today are based on statistical predictions over the material. As such, it is difficult to talk about cognitive abilities when it comes to the understanding of external information. This also remains, to some extent, a sparsely researched area. There are some systems that show a remarkable skill in using multi-modal text and image pairing, for example CLIP, as we have described previously. However, as mentioned above, these systems are still not able to transcend the border between identification and understanding. Identification is very important for many applications, but for understanding we assume that a different approach may be necessary.
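One thing such a different approach could build on is the layered annotation practice outlined in Section 4. As a purely hypothetical sketch, using the 1956 Stomatol advertisement from Section 2.1 as the example (all field names are our illustrative assumptions, not a finalized schema):

```python
from dataclasses import dataclass, field

@dataclass
class ImageAnnotation:
    """Hypothetical layered annotation record (illustrative only)."""
    source: str               # publication context and year
    denotation: list[str]     # shallow content: detectable objects
    attributes: list[str]     # reinforcing attributes [5]
    agency: str               # how the image addresses the viewer
    metaphor: str             # visual metaphor or simile
    context: list[str] = field(default_factory=list)  # specialized knowledge

stomatol_1956 = ImageAnnotation(
    source="Swedish weekly magazine, 1956",
    denotation=["smiling woman", "toothpaste", "brand name 'Stomatol'"],
    attributes=["lustrous, shiny teeth"],
    agency="the smiling lady engages the viewer directly",
    metaphor="the lady's teeth are like pearls",
    context=["'Stomatol' was the text of the first electric sign "
             "advertisement in Stockholm in 1909"],
)
```

Records of this kind keep the general and the specialized levels of meaning distinct while preserving both, which leads directly to the architectural wish list below.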
Our wish is for a system architecture for specialized image retrieval where, firstly, the richness of natural language descriptions is preserved; secondly, art-historical terms describing pictorial conventions are utilized; thirdly, a certain amount of contextual information is included; and lastly, the semantic gap can be traversed, that is, the discrepancy between the information that can be derived from the low-level image data (colour, shapes) and the interpretation that human viewers of an image base on their visual literacy and cultural competence.

Acknowledgements

This project is partly funded by grants from CIRCUS (Support for cross-cutting research projects), which is gratefully acknowledged. Thanks also go to the University Library at Uppsala University, which has given access to the journals from which the pictures for analysis are taken.

References

[1] G. Boehm, Wie Bilder Sinn erzeugen. Die Macht des Zeigens. Berlin: Berlin University Press, 2007.
[2] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny, “Large-Scale Visual Relationship Understanding,” AAAI, 2018. Available: https://www.semanticscholar.org/paper/Large-Scale-Visual-Relationship-Understanding-Zhang-Kalantidis/847d0b91b60d8b4082d32bcbd898185c831af1d7
[3] M. Marinescu, A. Reshetnikov, and J. Moore, “Improving object detection in paintings based on time contexts,” in 2020 International Conference on Data Mining Workshops (ICDMW), virtual proceedings, pp. 926–932. doi: 10.1109/ICDMW51313.2020.00133.
[4] M. Springstein, S. Schneider, J. Rahnama, E. Hüllermeier, H. Kohle, and R. Ewerth, “iART: A Search Engine for Art-Historical Images to Support Research in the Humanities,” in Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, Oct. 2021, pp. 2801–2803. doi: 10.1145/3474085.3478564.
[5] J. von Bonsdorff, “Visual Metaphors, Reinforcing Attributes, and Panofsky’s Primary Level of Interpretation,” in The Locus of Meaning in Medieval Art: Iconography, Iconology and Interpreting the Visual Imagery of the Middle Ages, L. Liepe, Ed. Medieval Institute Publications, 2019, pp. 110–127.
[6] G. Lakoff and M. Johnson, Metaphors We Live By. Chicago: University of Chicago Press, 1980.
[7] G. J. Holzmann, Beyond Photography: The Digital Darkroom. Prentice Hall, 1988.
[8] A. M. Spalter, The Computer in the Visual Arts. Reading, Massachusetts: Addison-Wesley, 1999.
[9] J. Webber, “The Ethics/Skills Interface in Image Manipulation,” Australasian Journal of Information Systems, vol. 7, no. 2, 2000. doi: 10.3127/ajis.v7i2.265.
[10] A. Elgammal, M. Mazzone, B. Liu, and D. Kim, “The Shape of Art History in the Eyes of the Machine,” presented at AAAI, New Orleans, USA, Feb. 2018.
[11] R. C. Schank, “Conceptual Dependency: A Theory of Natural Language Understanding,” Cognitive Psychology, vol. 3, pp. 552–631, 1972.
[12] S. L. Lytinen, “Conceptual dependency and its descendants,” Computers & Mathematics with Applications, vol. 23, no. 2–5, pp. 51–73, Jan. 1992. doi: 10.1016/0898-1221(92)90136-6.
[13] C. J. Fillmore, “The Case for Case,” 1967. Available: https://eric.ed.gov/?id=ED019631
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], May 2019. Available: http://arxiv.org/abs/1810.04805
[15] W. Su et al., “VL-BERT: Pre-training of Generic Visual-Linguistic Representations,” arXiv:1908.08530 [cs], Feb. 2020. Available: http://arxiv.org/abs/1908.08530
[16] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning, Jul. 2021, pp. 8748–8763. Available: https://proceedings.mlr.press/v139/radford21a.html
[17] J. R. Searle, “Minds, brains, and programs,” Behavioral and Brain Sciences, vol. 3, no. 3, pp. 417–424, Sep. 1980. doi: 10.1017/S0140525X00005756.