Knowledge Extraction for Art History: the Case of Vasari’s The Lives of The Artists (1568) Cristian Santini1,2 , Mary Ann Tan1,2 , Oleksandra Bruns1,2 , Tabea Tietz1,2 , Etienne Posthumus1 and Harald Sack1,2 1 FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, Karlsruhe, Germany 2 Karlsruhe Institute of Technology, Institute AIFB, Karlsruhe, Germany Abstract Knowledge Extraction (KE) techniques are used to convert unstructured information present in texts to Knowledge Graphs (KGs) which can be queried and explored. Despite their potential for cultural heritage domains, such as Art History, these techniques often encounter limitations if applied to domain-specific data. In this paper we present the main challenges that KE has to face on art-historical texts, by using as case study Giorgio Vasari’s The Lives of The Artists. This paper discusses the following NLP tasks for art-historical texts, namely entity recognition and linking, coreference resolution, time extraction, motif extraction and artwork extraction. Several strategies to annotate art-historical data for these tasks and evaluate NLP models are also proposed. Keywords Knowledge Extraction, Art History, Cultural Heritage, NLP 1. Introduction To understand the meaning of works of art, art historians investigate and analyze not only the physical properties and subject matter, but the historical context of their creation. A rich source of historical knowledge is evident in textual documents that contain artifact descriptions, historical memoirs, letters, and biographies, such as Giorgio Vasari’s The Lives of The Artists (1568). Nowadays, the information contained in these historical texts is not only of interest for art researchers, but also for computer scientists who want to fully unlock the knowledge contained therein with the help of Natural Language Processing (NLP). NLP techniques for Knowledge Extraction (KE), like Entity Recognition (ER), Entity Linking (EL) and Relation Extraction (RE), are used to convert unstructured information present in texts to Knowledge Graphs (KGs) which can be queried and explored. However, domain specific resources, such as art-historical texts, pose a number of challenges when these tasks are conducted. For example, most NLP techniques don’t take into account the problems specific to historical texts: the complexity of historic languages, the variety of the types of documents (press articles, letters, diaries, etc.), the specific pragmatics of the authors, and the possible noise Qurator 2022: 3rd Conference on Digital Curation Technologies, September 19-23, 2022, Berlin, Germany $ cristian.santini@fiz-karlsruhe.de (C. Santini); ann.tan@fiz-karlsruhe.de (M. A. Tan); oleksandra.bruns@fiz-karlsruhe.de (O. Bruns); tabea.tietz@fiz-karlsruhe.de (T. Tietz); etienne.posthumus@partners.fiz-karlsruhe.de (E. Posthumus); harald.sack@fiz-karlsruhe.de (H. Sack) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) present in the input due to data processing techniques such as Optical Character Recognition (OCR). Moreover, there is a lack of domain-specific resources, especially for art-historical data, to train Machine Learning (ML) models to extract or classify specific information for art-historians. For this reason, an identification of challenges specific to KE on art-historical texts is needed. This paper discusses a series of challenges which emerged when dealing with KE on The Lives of The Artists, a collection of artists’ biographies written by the Italian painter Giorgio Vasari in the XVIth century. This book is still one of most authoritative bibliographic references for Art History from the Gothic to the Mannerist era and to extract a knowledge graph from this work would entail the possibility for researchers to explore and query information related to European artists from the XIIIth to the XVIth century, their creative endeavours and the dynamics of their conservation, which is still not available in general-purpose KGs, such as Wikidata. Moreover, this digitization process would allow, from an historiographical perspective, to compare statements made by Vasari with those availed by contemporary art historians. For this reason, a well-designed evaluation dataset which enables a fair comparison of different NLP tools on this text is needed. Scope of this work is the analysis of the issues involved in the curation of art-historical datasets and the presentation of potential solutions for their annotation. The challenges of KE for Art History presented in this paper are related to the following NLP tasks: A) the recognition and resolution of entities associated with various concepts, i.e. ER and EL; B) the resolution of coreferences between entities and pronouns in long texts; C) the extraction and normalization of temporal information; D) the extraction of artwork- related information; E) the possibility of considering, during the evaluation, multiple correct annotations in the ground truth. More specifically, D) does not only concern the identification of contextual information, such as used materials, locations or dates, but also to the extraction or classification of artworks’ content. In this context, the identification of so-called motifs, i.e. iconographic themes which involve one or multiple subjects, is of course of interest. The contribution of this paper is the discussion of these aforementioned challenges along with the delineation of the possible solutions that NLP practitioners can adopt to annotate art-historical data and create KE resources for this domain. In order to provide examples to motivate our strategies, we used as case-study excerpt extracted from The Lives of The Artists by Vasari (1568). The remainder of the paper is structured as follows. Section 2 presents related work on KE and the Digital Humanities. Section 3 describes the main challenges for KE on art-historical texts by using as case study The Lives of The Artists. Section 4 concludes the paper. 2. Related Work A general overview of the variety of knowledge extraction tasks which have been carried out for cultural heritage domains is provided in [1]. This work describes the scenarios in which NLP techniques are applied (data processing, knowledge extraction, metadata extraction) and a series of general challenges for cultural heritage data. However, an overview of the specific challenges for specific domains is lacking. A recent survey on the challenges of ER and EL for historical documents is provided in [2], where the authors list four main issues: the variety of documents (newspapers, letters, memoirs, books), the presence of noise in the input data due to data processing techniques (such as OCR), the problem of old historical language varieties and the lack of standardized resources. In fact, as motivated by this study, the lack of standardized benchmarks for historical data pushes researchers to focus their effort either on specific tasks, such as entity recognition and linking, or on specific document types, such as old historical press data or literary texts. One of the main KE tasks carried out in the Digital Humanities is EL, due to the importance of correctly resolving mentions to entities to a KG. [3] utilized DBpedia Spotlight to disambiguate named entities on the Bentham Corpus. In [4], three third-party entity extraction tools were evaluated on a case study based on the descriptive fields of the Smithsonian Cooper-Hewitt National Design Museum in New York. With respect to these studies, common limitation is the fact that third-party tools are used on English-only texts and the creation of publicly available datasets to train and evaluate domain-specific models is not considered. [5] evaluated an EL architecture specifically designed for recognizing and linking mentions to persons in French literary criticism texts and scientific essays from the 19th and early 20th centuries. In [6] the problem of entity linking on multilingual historical press articles with OCR errors was addressed by evaluating and End-to-End EL model on a domain-specific corpus: the NewsEye Benchmark Dataset1 . However, a main limitation of these studies is the fact that the proposed approaches consider only a restricted set of entities, either persons [5] or proper nouns [6], so these systems may not work well in the art-history domain, where significant variations in vocabulary and lexical and syntactic structures are present. Named Entity Recognition (NER), Coreference Resolution and Event Detection was addressed in [7] by creating a benchmark from literary works extracted from project Gutenberg, containing 100 different English-language novels. However, this resource reuses the guidelines of a general- domain dataset, i.e. ACE 2 . Specific to this dataset is the inclusion of common nouns (boy, kitchen) and nested structures (such as [[the cook]’s sister]). The annotation and normalization of temporal information, i.e. Temporal Tagging, for historical documents was proposed in [8], where a multilingual corpus of historical documents extracted from Wikipedia with annotated temporal information is presented. As main challenges for temporal tagging on these documents, the authors mention the importance of context, underspecificity, the portability of domain- specific guidelines to contemporary texts and the consideration of language specific differences. However, since the corpus consists of Wikipedia articles, this study does not take into account the problem of historical lexical variations used to express time. Few studies have investigated EL and, more generally, knowledge extraction techniques for the Art History domain. [9] proposed an EL corpus for creative works extracted from contemporary texts. In their study, the authors focused on the annotation of nested entities and their impact on the evaluation of three general-purpose EL tools: Dbpedia Spotlight, AIDA and Recognyze. Despite its focus on the challenges of textual data containing mentions to creative works, this work does not take into consideration further challenges of the annotation, such as the definition of domain-specific tagsets or the inclusion of multiple correct annotations for the same entity. 1 https://zenodo.org/record/4573313#.YtUp8HZByn4 2 https://catalog.ldc.upenn.edu/LDC2006T06 3. Challenges of Knowledge Extraction in The Lives of The Artists In this section, challenges of the following KE tasks on art-historical texts will be discussed: entity recognition, entity linking, coreference resolution, time extraction, motif extraction and artwork extraction. These challenges emerged as part of a preliminary research on the creation of a KE pipeline for Vasari’s The Lives of The Artists. The goal of the aforementioned research is to apply a series of NLP techniques to art-historical texts in order to verify their potential in complementing the information present in existing KGs. However, a number of challenges specific to art-historical texts were found. In each of the following subsections, potential strategies to overcome these challenges and create task-specific annotations are presented. 3.1. Entity Recognition One of the first steps in a Knowledge Extraction pipeline is the identification of entity mentions inside a text, namely Entity Recognition (ER). A problem to be faced when dealing with art- historical texts is related to the fact that entities of interest are also expressed as common nouns: materials from which artifacts are made, techniques, styles, subjects, etc. For this challenge, fine-grained tagsets, such as the one from the recently released MultiNERD [10] can be reused. This tagset contains entity types applicable to both named entities and common nouns which occur in art-historical data, such as Persons (PER), Organizations (ORG), Locations (LOC), Event (EVE), Animal (ANIM), Plant (PLANT) and Mythological entity (MYTH). To extend the annotation to cultural products and related information, this tagset can be extended with the following entity types: • Artifact (ART): both tangible, such as books (Old Testament), artworks, instruments; and intangible, such as subject matters (philosophy), languages, iconographies (Mary with the Child). • Material (MAT): materials from which artifacts can be made (wax) • Technique (TECH): technique with which an artifact is made (fresco) But meanwhile the [Florentines]𝑃 𝐸𝑅 , hearing in the year 1515 that [[Pope]𝑃 𝐸𝑅 Leo X]𝑃 𝐸𝑅 wished to grace his native [city]𝐿𝑂𝐶 with his presence, ordained for his [reception]𝐸𝑉 𝐸 extraordinary [festivities]𝐸𝑉 𝐸 and a sumptuous and magnificent [spectacle]𝐸𝑉 𝐸 , with so many [arches]𝐴𝑅𝑇 , [façades]𝐴𝑅𝑇 , [temples]𝐴𝑅𝑇 , [colossal figures]𝐴𝑅𝑇 , and other [statues]𝐴𝑅𝑇 and [ornaments]𝐴𝑅𝑇 , that there had never been seen up to that time anything richer, more gorgeous, or more beautiful; for there was then flourishing in that [city]𝐿𝑂𝐶 a greater abundance of fine and exalted [intellects]𝑃 𝐸𝑅 than had ever been known at any other period. E 1: Annotation considering common nouns, proper nouns, noun phrases and nested entities. 3.2. Entity Linking A further step in KE is the resolution of identified entity mentions by using a KG, i.e. EL. One of the first challenges that should be solved is to find a KG with extensive coverage of the entity types mentioned in Section 3.1 . Wikidata is a strong candidate due to its wide coverage. However, by using a general-purpose KG, entities from historical texts, such as those mentioned in Vasari’s text, may not be found, and therefore they are to be labelled as NIL. Moreover, another challenge is related to historical languages and the presence of lexical variations to define known entities. For example, what Vasari often mentions as Greek manner refers to the Byzantine art. In addition, figures of speech such as metonymy are to be considered. E 2 shows how the surface form Fointainebleau is resolved by a link to the implicitly mentioned palace and not to the geographical place. In E 2 each entity is resolved by using the corresponding title of the English Wikipedia page. The [work]𝐶𝑟𝑒𝑎𝑡𝑖𝑣𝑒_𝑊 𝑜𝑟𝑘 is now in the [collection]𝐶𝑜𝑙𝑙𝑒𝑐𝑡𝑖𝑜𝑛_(𝑚𝑢𝑠𝑒𝑢𝑚) of [[King]𝐾𝑖𝑛𝑔 Francis of [France]𝐹 𝑟𝑎𝑛𝑐𝑒 ]𝐹 𝑟𝑎𝑛𝑐𝑖𝑠_𝐼_𝑜𝑓 _𝐹 𝑟𝑎𝑛𝑐𝑒 , at [Fontainebleau]𝑃 𝑎𝑙𝑎𝑐𝑒_𝑜𝑓 _𝐹 𝑜𝑛𝑡𝑎𝑖𝑛𝑒𝑏𝑙𝑒𝑎𝑢 . E 2: Annotation showing how entities are resolved. 3.3. Coreference Resolution Art-historical texts may often be long documents, such as biographies or correspondence. Therefore, coreferences between previously explicitly mentioned entities and personal pronouns should be taken into account. E 3 shows a sentence in Vasari’s text where the first clause does not have as subject an explicit mention to an entity but only a personal pronoun. For coreference resolution, a possible strategy is to annotate the surface forms of the entities from the entity recognition dataset as well as personal pronouns, as previously performed in [7] for non- contemporary novels. Following the guidelines proposed in the same work, nested honorifics (Mr.) and titles (Pope) are excluded from the coreference resolution task. [He]1 also restored the [right arm]2 of the ancient [Laocoon]3 , which had been broken off and never found, and [Baccio]1 made one of the full size in [wax]4 , which so resembled the ancient [work]3 that [it]2 showed how [Baccio]1 understood [his]1 art; and this [model]5 served [him]1 as a [pattern]6 for making the whole [arm]7 of [his]1 own [Laocoon]8 . E 3: Annotation considering coreference. 3.4. Time Resolution The extraction of temporal information is also relevant to capture further information about the contexts of artworks and artists’ lives. One of the challenges of time extraction in historical documents is presented by non-specific time expressions, such as references to centuries (6th century), historical periods (The Baroque era) and fuzzy time expressions (By the year 1420). For general time expressions, annotation frameworks such as TimeML [11] have been proposed to normalize them; moreover, the use of information inside KG, such as Wikidata, allow to extract dates of historical periods using start date and end date properties. E 4 shows how information related to historical periods can be annotated by using TIMEX3INTERVAL [12]. In the time of Pope Paul II E 4: Example of annotated and normalized time expression. Figure 1: Example of annotation with nested motifs. However, a further complexity found in Vasari is posed by references to calendar dates which use mentions of popular events, such as festivities. For example, in this book the date of birth of Raffaello is mentioned as the Good Friday of the year 1483. To normalize these expressions, we can rely on the most consolidated opinion among art-historians, which are using the Gregorian Calendar to indicate the Good Friday of 1483 as March 28. 3.5. Motif Extraction Analysis of artistic themes or motifs in artworks is a vital endeavor for art historians. Motifs can be either named entities (S. Sebastian), long noun phrases (The Feast in the House of Simon the Pharisee) or common nouns (vase). It is clear that these types of entities cannot be resolved with an encyclopedic KG such as Wikidata, since it is not designed for iconographic subjects. Moreover, for the extraction of this information two tasks should be carried out: a) the recog- nition of sequences of words which describe iconographic motifs, and b) the identification of these motifs by using an appropriate classification system. This can be addressed by modelling the problem as a specific sequence labelling task where motifs are identified inside a text and annotated through the ICONCLASS3 classification system [13]. This task is distinguished from ER and EL since the relations between multiple entities and specific events should be considered. Moreover, another challenge is related to the fact that the same motif can be described with nested ICONCLASS notation, as shown in Figure 1. 3.6. Artwork Extraction The identification of artworks is crucial when dealing with art-historic data. However, artistic works are not always mentioned explicitly in historical texts. In fact, complex lexical vari- ations and long descriptions are often used by non-contemporary art-historians to refer to artists’endeavours. Figure 2 shows an example of two paintings referenced by Vasari: despite the fact that the explicit title is missing, the artwork references can still be resolved by a human annotator to a unique entry in a Knowledge Base (KB) (in this case Wikipedia). For this reason, further KE tasks should be considered, besides those previously mentioned: a) recognition of explicit as well as implicit artwork references; and b) resolution of these descriptions by using a unique identifier present in an external resource, such as a KB or a KG. 3 www.iconclass.org Figure 2: Example of recognition and linking of artworks’ references: the second sentence (from the top) shows a long description linked to a specific painting. Sequences of words referring to artworks are often triggered by verb phrases which are referred to the activity of an artist (painted). Moreover, for artwork resolution, the previously described steps of ER and EL, coreference resolution, time extraction and motif extraction should come into play. The second example (from the top) in Figure 2 shows how to correctly interpret the creator of a work a personal pronoun has to be resolved. Finally, artwork descriptions allow us to extract further contextual information related to an artifact, such as its location and commissioner, or information related to its depicted motifs, which might complement the already existing knowledge about an artwork. 3.7. Evaluation With the KE tasks identified in the preceding sections, a comprehensive evaluation strategy is necessary. In this context, the pragmatics behind the evaluation of NLP tools should take into consideration the potential ambiguity of several linguistic expressions related to cultural concepts. As an example, nested entities are relevant when dealing with ER and iconographies, since these often consists of sets of elements, as in [[Mary] with the [Child]]. Another problem is related to the possibility of multiple correct answers for entity resolution systems. As an example, artwork descriptions in Vasari’s text which mention the deity of Love may be resolved by two different annotators by using the Wikidata entities Eros (Q121973) or Cupid (Q5011), being both alias of the same entity. With respect to this issue, leveraging annotators’ disagreement can be beneficial, as discussed in [14]. A further challenge applies to motifs: for example, the subject of the Last Supper can be classified in ICONCLASS using 2 entries: 73D2, which identifies the biblical episode, and 73D24, which is a subclass of the first and identifies the specific scene. While a human annotator can try to use always the most precise class in the ICONCLASS hierarchy, an automatic system may classify a motif with a superclass or subclass of the ICONCLASS code in the ground-truth. For this, evaluation measures for hierarchical classification, such as those proposed in [15], can be considered. 4. Conclusion and Future Work Art-historical texts are part of our cultural heritage and as such can offer additional insights on how ideas developed around artistic subjects and which conditions determined the evolution of artistic tendencies. In this paper a series of challenges of KE for Art History are presented. The takeaway lesson is that, to model a specific domain, the concepts introduced and the limitations of existing general approaches should be carefully considered. Moreover, complex linguistic features, ambiguity and historical ideas that are not anymore part of our cultural frame pose challenges to NLP models. In future work, a fully annotated sample dataset extracted from Lives of The Artists will be provided to enable a valid comparison of KE tools for art-historical texts, as well as complete and detailed guidelines for the annotation and the evaluation of the proposed KE tasks. References [1] C. Sporleder, Natural Language Processing for Cultural Heritage Domains, Language and Linguistics Compass 4 (2010) 750–768. URL: https://onlinelibrary.wiley.com/doi/abs/ 10.1111/j.1749-818X.2010.00230.x. doi:10.1111/j.1749-818X.2010.00230.x, _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1749-818X.2010.00230.x. [2] M. Ehrmann, A. Hamdi, E. L. Pontes, M. Romanello, A. Doucet, Named Entity Recognition and Classification on Historical Documents: A Survey, 2021. URL: http://arxiv.org/abs/ 2109.11406, arXiv:2109.11406 [cs]. [3] P. Ruiz, T. Poibeau, Mapping the Bentham Corpus: Concept-based Navigation, Jour- nal of Data Mining and Digital Humanities Atelier Digit_Hum (2019). URL: https: //hal.archives-ouvertes.fr/hal-01915730. doi:10.46298/jdmdh.5044, publisher: Epi- sciences.org. [4] S. van Hooland, M. De Wilde, R. Verborgh, T. Steiner, R. Van de Walle, Exploring entity recognition and disambiguation for cultural heritage collections, Digital Scholarship in the Humanities 30 (2015) 262–279. URL: https://doi.org/10.1093/llc/fqt067. doi:10.1093/ llc/fqt067. [5] C. Brando, F. Frontini, J.-G. Ganascia, REDEN: Named Entity Linking in Digital Lit- erary Editions Using Linked Data Sets, Complex Systems Informatics and Modeling Quarterly (2016) 60–80. URL: https://csimq-journals.rtu.lv/article/view/csimq.2016-7.04. doi:10.7250/csimq.2016-7.04, number: 7. [6] E. Linhares Pontes, L. A. Cabrera-Diego, J. G. Moreno, E. Boros, A. Hamdi, A. Doucet, N. Sidere, M. Coustaty, MELHISSA: a multilingual entity linking architecture for historical press articles, International Journal on Digital Libraries 23 (2022) 133–160. URL: https: //doi.org/10.1007/s00799-021-00319-6. doi:10.1007/s00799-021-00319-6. [7] D. Bamman, S. Popat, S. Shen, An annotated dataset of literary entities, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 2138–2144. URL: https: //aclanthology.org/N19-1220. doi:10.18653/v1/N19-1220. [8] J. Strötgen, T. Bögel, J. Zell, A. Armiti, T. V. Canh, M. Gertz, Extending HeidelTime for Temporal Expressions Referring to Historic Dates, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), European Language Resources Association (ELRA), Reykjavik, Iceland, 2014, pp. 2390–2397. URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/849_Paper.pdf. [9] A. M. Brasoveanu, A. Weichselbraun, L. Nixon, In Media Res: A Corpus for Evaluating Named Entity Linking with Creative Works, in: Proceedings of the 24th Conference on Computational Natural Language Learning, Association for Computational Linguistics, Online, 2020, pp. 355–364. URL: https://aclanthology.org/2020.conll-1.28. doi:10.18653/ v1/2020.conll-1.28. [10] S. Tedeschi, R. Navigli, MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation), in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 801–812. URL: https://aclanthology.org/2022.findings-naacl. 60. [11] R. Saurí, J. Moszkowicz, B. Knippen, R. Gaizauskas, A. Setzer, J. Pustejovsky, TimeML Annotation Guidelines Version 1.2.1 (2006). [12] E. Kuzey, J. Strötgen, V. Setty, G. Weikum, Temponym Tagging: Temporal Scopes for Textual Phrases, in: Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2016, pp. 841–842. URL: https://doi.org/10.1145/2872518.2889289. doi:10.1145/2872518.2889289. [13] L. D. Couprie, Iconclass: an iconographic classification system, Art Li- braries Journal 8 (1983) 32–49. URL: https://www.cambridge.org/core/journals/ art-libraries-journal/article/abs/iconclass-an-iconographic-classification-system/ DD119669E055893AB632E5C7CE6FF417. doi:10.1017/S0307472200003436, publisher: Cambridge University Press. [14] L. Aroyo, C. Welty, Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation, AI Magazine 36 (2015) 15–24. URL: https://ojs.aaai.org/index.php/aimagazine/article/view/ 2564. doi:10.1609/aimag.v36i1.2564, number: 1. [15] A. Kosmopoulos, I. Partalas, E. Gaussier, G. Paliouras, I. Androutsopoulos, Evaluation mea- sures for hierarchical classification: a unified view and novel approaches, Data Mining and Knowledge Discovery 29 (2015) 820–865. URL: https://doi.org/10.1007/s10618-014-0382-x. doi:10.1007/s10618-014-0382-x.