ARTchives: a Linked Open Data native catalogue of art historians' archives*

Marilena Daquino1[0000-0002-1113-7550], Lucia Giagnolini1[0000-0002-4876-2691], and Francesca Tomasi1[0000-0002-6631-8607]

Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Italy
{marilena.daquino2, francesca.tomasi}@unibo.it
lucia.giagnolini@studio.unibo.it

Abstract. Art historians' personal archives include a variety of sources documenting creators' work, opinions, and methodologies. Such a wealth of information is fundamental to trace the trajectories of art history through the lens of historiographical research. However, the potential of such collections remains largely untapped, and performing cross-collection research is not possible via online catalogues. The ARTchives project aims at crowdsourcing curated information on notable art historians' archives and providing scholars with a centralised access point to this heritage. In this paper we present the agile cataloguing process developed to support ARTchives contributors. ARTchives is based on a Linked Open Data native cataloguing system that leverages Semantic Web technologies and Natural Language Processing to facilitate data entry, the editorial process, and data quality.1

Keywords: Linked Open Data · Art History · Archives.

1 Introduction

Art historians' personal archives include a variety of sources (papers, expertises, correspondences, photographs, etc.) documenting creators' work, opinions, primary sources, and scientific methodologies. Such a wealth of information is fundamental to trace the trajectories of art history through the lens of historiographical research. However, such a vast heritage is only partially available online, and the extent and scope of such collections are still largely unknown.

The objective of ARTchives2 is to create a knowledge graph of art historians' archives for historiographical research purposes. Scholars can identify and retrieve archival fonds relevant to their studies, gather bibliographic sources, and answer research questions related to historiographical topics with quantitative analysis methods, such as historians' network analysis, topic analysis of debates, and collection interlinking.

Nonetheless, crowdsourcing curated information is a hard task. Several issues may affect data quality, such as data duplication, incompleteness, and vagueness. In order to efficiently support curators contributing to ARTchives, we developed an agile cataloguing process that leverages Semantic Web technologies and Natural Language Processing techniques, allowing archivists to save time while providing high-quality content.

The remainder of the paper is organised as follows. In Section 2 we give a brief overview of archival and cataloguing systems leveraging Semantic Web technologies. In Section 3 we describe the cataloguing system developed for ARTchives. In Section 4 we briefly address the benefits arising from the use of such technologies in pursuing quantitative art historical analysis, and in Section 5 we conclude and outline future work.

1 M. Daquino is responsible for Sections 2 and 3; L. Giagnolini is responsible for Section 4. All authors are responsible for Sections 1 and 5.
2 http://artchives.fondazionezeri.unibo.it/
* Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Related Work

Galleries, Libraries, Archives, and Museums (GLAM) have been leveraging Semantic Web technologies for over a decade. Consortia of museums and archives [11, 12, 8] foster the adoption of Linked Open Data (LOD) as a lingua franca to develop aggregators and serve high-quality data to scholars and developers.

Nevertheless, only a few pioneers have abandoned legacy cataloguing and archiving systems to fully embrace the LOD paradigm and manage their catalogues through LOD native management systems [14]. Institutions seem to prefer to maintain legacy systems for managing the data life-cycle (addressing aspects such as data entry, review, validation, and publication) and to provide dedicated services to access their 5-star data, whether these represent complete collections [10], subsets [7], or project-related data [6].

Along with official releases of cultural heritage data, crowdsourcing campaigns have been launched by institutions to enrich their data with experts' knowledge [9]. Likewise, scholarly projects leverage cultural heritage Linked Data to collaboratively develop new resources and data aggregators (see [5] for an updated overview of projects). To the best of our knowledge, among the latter only the Listening Experience Database (LED) [1] adopts Semantic Web technologies to support data management, from data collection to publication. Currently, LED relies on an application developed to serve project-related goals, and its reusability in new projects is not immediate.

In recent years, a few content management systems have been introduced to facilitate LOD publication via reusable platforms. Omeka S3 is a popular platform for collaborative data collection and the creation of virtual exhibitions. Data are served as JSON-LD via API, but they cannot be accessed in other syntaxes or queried via a SPARQL endpoint. Moreover, while user groups (roles) can be defined, editors do not have any means to supervise changes to the records. Another popular tool is Semantic MediaWiki4, used in well-known projects like Wikidata. The system allows fine-grained editorial control and serves data as LOD. However, the integration and reuse of external data sources is a time-consuming activity that can only be performed manually.

In ARTchives we rely on the experience of LED to develop an agile, efficient, reusable Linked Data native cataloguing system that tackles all aspects involved in collaborative data collection.

3 https://omeka.org/s/
4 https://www.semantic-mediawiki.org/wiki/Semantic_MediaWiki

3 Linked Open Data native cataloguing in ARTchives

Data management system. ARTchives is an open catalogue of archival descriptions of notable art historians' personal archives. It is based on an open-source data management system initially developed to answer ARTchives' purposes and principles, namely:

– REUSE. Terms belonging to selected data sources are suggested while filling in the form for creating a new record (a minimal sketch of such a lookup follows this list). Reused sources include Wikidata, the Open Library (Internet Archive), the Getty ULAN, and the Getty Art and Architecture Thesaurus. Only terms missing from the aforementioned sources are given a bespoke identifier.
– ENHANCEMENT. Long free-text descriptions entered by users are parsed to extract machine-readable data, so as to avoid contributors repeating information (i.e. as both free text and selected keywords).
– ACCURACY. Cataloguers can accept or reject the aforementioned suggestions from the system and ensure contents comply with editorial standards.
– COLLABORATIVE. Contributors can access and modify all records, including the ones created by other institutions.
– CONSISTENCY. Records are peer-reviewed by the editorial board before publishing.
– CONTINUOUS PUBLISHING. Records can be published on a rolling basis and can be temporarily unpublished for review purposes.
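As an illustration of the REUSE principle, the sketch below shows how an autocomplete lookup against Wikidata might work. The wbsearchentities action is Wikidata's standard entity search API; the helper name suggest_entities and the returned structure are illustrative and not part of the ARTchives code base.

```python
# Minimal sketch of the REUSE principle: as a cataloguer types a term,
# the form queries Wikidata for candidate entities to reuse.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def suggest_entities(query: str, limit: int = 5) -> list[dict]:
    """Return candidate Wikidata entities for an autocomplete dropdown."""
    params = {
        "action": "wbsearchentities",
        "search": query,
        "language": "en",
        "format": "json",
        "limit": limit,
    }
    response = requests.get(WIKIDATA_API, params=params, timeout=10)
    response.raise_for_status()
    return [
        {
            "label": hit.get("label", ""),
            # The short description helps cataloguers disambiguate homonyms.
            "description": hit.get("description", ""),
            "uri": hit.get("concepturi", ""),
        }
        for hit in response.json().get("search", [])
    ]

if __name__ == "__main__":
    for candidate in suggest_entities("Federico Zeri"):
        print(candidate["label"], "-", candidate["description"])
```

If no suggestion fits, the record form falls back to minting a bespoke identifier, as noted in the REUSE principle above.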
The system leverages Linked Open Data from the creation of data and throughout all the curatorial and editorial management phases, thereby differentiating itself from the systems described in Section 2. Moreover, the original data management system5 has recently been adapted to be customisable and reusable as-is in other crowdsourcing projects6. In detail:

– a configuration file allows adopters to select information relevant to their dataset, e.g. URI base, prefix, endpoint API;
– a JSON mapping document allows adopters to specify data entry requirements, such as form field types (e.g. text box or dropdown), expected values, services to be called (e.g. autocomplete based on Wikidata), and the mapping between fields, ontology terms, and custom controlled vocabularies;
– HTML templates are available and can be easily customised to serve browsing and search interfaces over the catalogue;
– dereferencing mechanisms are up to the adopter, who can choose and set up redirection rules by means of their persistent URI provider (e.g. w3id).

5 ARTchives source code is available at: https://github.com/marilenadaquino/ARTchives
6 Code available at: https://github.com/marilenadaquino/crowdsourcing under CC-BY license.

Fig. 1. ARTchives overview

Fig. 1 presents an overview of the ARTchives data management system. The form for data entry is created according to settings specified by the user in a JSON document. While editing (creating, modifying, or reviewing) a record, both the ARTchives triplestore (Blazegraph) and external services such as the DBpedia Spotlight7 and Wikidata APIs are called to provide suggestions. Every time a record is created or modified, data are sent to the ingestion module, developed as a Python framework (based on web.py). The latter relies on the mapping module, which is in charge of transforming data into RDF according to the ontology terms specified in the JSON mapping document, and of updating the graph created for collecting data of the record.

7 https://www.dbpedia-spotlight.org/
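As a rough illustration of this mapping step, the sketch below converts submitted form values into RDF triples driven by a JSON mapping document. The field names, mapping entries, and base URI are hypothetical placeholders; the actual ARTchives mapping document and ontology terms are defined in the project repository.

```python
# Hedged sketch of a JSON-mapping-driven conversion of form data to RDF.
import json
from rdflib import Graph, Literal, URIRef

BASE = "https://example.org/artchives/"  # hypothetical URI base

# Fragment of a hypothetical mapping document: form field -> RDF property.
MAPPING = json.loads("""
{
  "historian_name": {"property": "http://www.w3.org/2000/01/rdf-schema#label",
                     "type": "literal"},
  "historian_wikidata": {"property": "http://www.w3.org/2002/07/owl#sameAs",
                         "type": "uri"}
}
""")

def record_to_graph(record_uri: str, form_data: dict) -> Graph:
    """Convert one submitted record into the RDF graph for that record."""
    g = Graph()
    subject = URIRef(record_uri)
    for field, value in form_data.items():
        rule = MAPPING.get(field)
        if rule is None:
            continue  # unmapped fields are ignored in this sketch
        obj = URIRef(value) if rule["type"] == "uri" else Literal(value)
        g.add((subject, URIRef(rule["property"]), obj))
    return g

g = record_to_graph(BASE + "historian/1", {
    "historian_name": "Federico Zeri",
    "historian_wikidata": "http://www.wikidata.org/entity/Q1",  # placeholder QID
})
print(g.serialize(format="turtle"))
```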
The data management system is under continuous development to become a flexible tool for collaborative scholarly projects. A beta version of the system has been tested with cataloguers of the six institutions sponsoring the project, namely: Federico Zeri Foundation (Bologna), Bibliotheca Hertziana (Rome), Getty Research Institute (Los Angeles), Kunsthistorisches Institut in Florenz (Florence), Scuola Normale Superiore (Pisa), and Università Roma Tre (Rome). Beyond ARTchives, other projects [4] actively provide new requirements to foster development and research.

Editorial process. In ARTchives an archival record includes around 26 fields - compliant with the archival content standards ISAD(G) and ISAAR - describing respectively the keeper of the archival collection, the creator of the collection, and the collection itself8. Every archival record is formally represented as a named graph [2]. Named graphs enable us to add RDF statements describing those graphs, including for instance statements on their provenance (such as activities, dates, and agents involved in the creation and modification of a record). Provenance information is described by means of the well-known W3C-endorsed PROV Ontology [13]. Moreover, named graphs allow us to prevent inconsistencies between competing descriptions of the same entities, for instance when different cataloguers describe the same creator of multiple collections.

8 ARTchives documentation: http://artchives.fondazionezeri.unibo.it/documentation
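The named-graph pattern just described can be sketched with rdflib as follows: content triples live inside the record's graph, while provenance triples describe the graph itself. All URIs below are hypothetical placeholders, not actual ARTchives identifiers.

```python
# Sketch: one record as a named graph, with PROV-O provenance about the graph.
from datetime import datetime, timezone
from rdflib import Dataset, Literal, URIRef
from rdflib.namespace import PROV, RDF, RDFS, XSD

BASE = "https://example.org/artchives/"  # hypothetical URI base

ds = Dataset()
record_graph = URIRef(BASE + "record/1/")  # identifier of the record's graph

# Content triples are added to the record's named graph.
content = ds.graph(record_graph)
content.add((URIRef(BASE + "historian/1"), RDFS.label, Literal("Federico Zeri")))

# Provenance triples describe the named graph itself (who edited it, when),
# keeping competing descriptions of the same entity separate and traceable.
prov = ds.graph(URIRef(BASE + "record/1/prov/"))
prov.add((record_graph, RDF.type, PROV.Entity))
prov.add((record_graph, PROV.wasAttributedTo, URIRef(BASE + "agent/cataloguer-1")))
prov.add((record_graph, PROV.generatedAtTime,
          Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)))

print(ds.serialize(format="trig"))
```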
The editorial process in ARTchives addresses four phases: record creation, record modification, review, and publication. Records can be created and modified by any accredited user (so far, these include mainly archivists and professionals of cultural heritage institutions). Members of the ARTchives editorial board peer-review contributions and decide when to publish a record. A published record can be searched and browsed from the website and can be retrieved as Linked Data from the SPARQL endpoint9. Every time a change is made to a record, both content data and provenance information are updated in the triplestore and on the file system.

9 http://artchives.fondazionezeri.unibo.it/sparql

Data collection support. When creating or modifying a record, contributors are supported in a few tasks, namely: (1) data reconciliation, (2) duplicate avoidance, (3) keyword extraction, and (4) data integration. In detail, when field values address real-world entities or concepts that are shared in the art history community, autocomplete suggestions are provided by live querying of selected external sources and of the ARTchives knowledge base. Suggestions appear as lists of terms, each term including a label, a short description (to disambiguate homonyms), and a link to the external record (e.g. the Wikidata entity). If no matches are found, users can add a new entity, which is stored in the ARTchives knowledge base. When filling in specific fields (i.e. keepers' and art historians' names), the system alerts users when they are entering information about an entity that already exists in ARTchives, preventing duplicates.

Several fields require contributors to enter long free-text descriptions (e.g. historians' biographies, scope and content of collections), which include a wealth of information that cannot otherwise be processed as machine-readable data. To prevent such a loss, two concurrent Named Entity Recognition (NER) tools (i.e. the DBpedia Spotlight API and compromise.js) extract entities (e.g. people, places, subjects). The latter are reconciled to Wikidata, and keywords are shown to users for approval or discard. Approved terms are included in the cataloguing data as subjects associated respectively with people and collections, so that users do not have to input them again in the section of the record dedicated to subjects.

Whenever Wikidata terms are reused - either via autocomplete or via NER - the system queries the Wikidata SPARQL endpoint to retrieve relevant context information and stores it in the ARTchives knowledge base for analysis purposes. For instance, subjects of collections such as artists, artworks, and artistic periods are enriched with time spans; historians' biographical information is enriched with birth and death places. Finally, it is worth noting that collections and keepers are geo-localised via the OpenStreetMap APIs10.

10 https://www.openstreetmap.org/

Data sustainability and data modelling choices. Long-term availability of scholarly projects is often hampered by time and resource constraints. Therefore, the wealth of data produced by noble initiatives often becomes unavailable in the mid/long term. To prevent that, ARTchives reuses Wikidata as much as possible, both at schema level (using its classes and properties) and at instance level (reusing individuals as suggested field values), with the idea of directly contributing selected, curated metadata to Wikidata in the near future. Moreover, leveraging only external ontologies facilitates small/medium crowdsourcing projects, which do not have to develop and maintain bespoke ontologies. To pursue this objective, ARTchives data are released under a CC0 waiver. An analysis and estimate of ARTchives' potential contribution to Wikidata is ongoing.

4 ARTchives Linked Open Data for quantitative art history

As mentioned above, one of the objectives of ARTchives is to adopt quantitative methods to answer art history and historiographical research questions. Since the crowdsourcing phase is still at an early stage, reliable large-scale analyses cannot be performed yet. However, a number of exploratory data analyses (EDA) performed over ARTchives actively contribute to refining project requirements in terms of data completeness, interlinking, and bias.11 In particular, we investigated historians' networks and the types of relations that are relevant in the art history community. Through data visualisation techniques we were able to show well-known geographical and relational patterns, such as historians' communities based on provenance and places of activity, highlighting for instance Italian and German clusters. Less obvious patterns include institutional networks, highlighted by the correlation of their relevance in historians' biographies.

11 https://mybinder.org/v2/gh/LuciaGiagnolini12/Tesi/main

Results of the analysis drew our attention to some recurrent patterns, such as the closeness of art historians due to shared institutions and research topics, and the relevance of art historians' documents in other historians' archival collections based on the aforementioned closeness. While a few obvious patterns are immediately recognisable, the lack of extensive data and the incompleteness of some records prevent us from identifying other known relations and, possibly, from opening up new research paths. We believe this aspect should be further investigated, since the lack of knowledge may turn into an opportunity. In particular, we envisage the definition of inference rules based on heuristics (recurrent patterns) to associate similar collections, and supervised classification methods to predict relations between art historians, institutions, and the contents of the collections. In so doing we aim to unveil patterns that can be generalised as peculiar to the art history domain, to improve ARTchives data completeness, and to further develop methods to support experts in retrieving archival collections relevant to their studies.

Lastly, it is worth noting that a few experiments leveraging both ARTchives and Wikidata have been performed by independent scholars to address biases in the scope of art historical Linked Open Data. A notable example is the project Martrioska12, which highlights the gender bias in art history, how this affects the completeness of data aggregators, and how this gap can be filled with computational methods.

12 https://martrioska.github.io/
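To make the kind of analysis described in this section concrete, the sketch below links historians who share an institution and inspects the connected components, which approximate communities such as the Italian and German clusters mentioned above. The SPARQL query is illustrative only: the property used (wdt:P108, "employer") and the exact shape of the data are assumptions to be adjusted to the actual ARTchives schema.

```python
# Exploratory sketch: a historians' network from shared institutions.
from itertools import combinations
import networkx as nx
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://artchives.fondazionezeri.unibo.it/sparql"

QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?historian ?institution WHERE {
  ?historian wdt:P108 ?institution .  # assumed property; adjust to the schema
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()["results"]["bindings"]

# Group historians by institution, then connect every pair that shares one.
by_institution: dict[str, set[str]] = {}
for row in rows:
    by_institution.setdefault(row["institution"]["value"], set()).add(
        row["historian"]["value"])

g = nx.Graph()
for historians in by_institution.values():
    for a, b in combinations(sorted(historians), 2):
        g.add_edge(a, b)

# Connected components give a first approximation of historians' communities.
for component in nx.connected_components(g):
    print(len(component), "historians, e.g.", sorted(component)[:3])
```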
5 Conclusion

In this paper we presented the data management system of ARTchives, an ongoing crowdsourcing project to aggregate curated information on art historians' personal archives. Both specific and generic project requirements stimulated the development of a Linked Open Data native cataloguing system that can effectively support consistent, accurate cataloguing and editorial processes. Future work includes the alignment of terms to the RiC Ontology [3], allowing archives to reuse ARTchives data seamlessly.

The ARTchives data management system fully embraces the Linked Open Data paradigm, fostering data reuse and efficient cataloguing, and ensuring data quality and consistency across information systems. Future developments include the extension of the code base to support small/medium projects in producing 5-star data that leverage user-friendly repositories (e.g. GitHub) instead of, or along with, a triplestore for data storage and update.

Lastly, preliminary results of the EDA require us to further investigate the well-known issue of incompleteness of crowdsourced data. Lack of complete data may turn into an opportunity to develop computational methods tailored to the domain at hand for data enrichment and recommendation. Future work will address heuristics for archival collection interlinking and recommendation.

6 Acknowledgements

This work is partially supported by a project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101004746 (Polifonia: a digital harmoniser for musical heritage knowledge, H2020-SC6-TRANSFORMATIONS).

References

1. Adamou, A., Brown, S., Barlow, H., Allocca, C., d'Aquin, M.: Crowdsourcing linked data on listening experiences through reuse and enhancement of library data. International Journal on Digital Libraries 20(1), 61–79 (2019)
2. Carroll, J.J., Bizer, C., Hayes, P., Stickler, P.: Named graphs. Journal of Web Semantics 3(4), 247–267 (2005)
3. Clavaud, F., ICA EGAD: International Council on Archives Records in Contexts Ontology (ICA RiC-O), version 0.1 (2019)
4. Daquino, M., Daga, E., d'Aquin, M., Gangemi, A., Holland, S., Laney, R., Penuela, A.M., Mulholland, P.: Characterizing the landscape of musical data on the web: State of the art and challenges (2017)
5. Davis, E., Heravi, B.: Linked data and cultural heritage: A systematic review of participation, collaboration, and motivation. Journal on Computing and Cultural Heritage (JOCCH) 14(2), 1–18 (2021)
6. Davis, K.: Old metadata in a new world: Standardizing the Getty Provenance Index for linked data. Art Libraries Journal 44(4), 162–166 (2019)
7. Deliot, C.: Publishing the British National Bibliography as linked open data. Catalogue & Index 174, 13–18 (2014)
8. Delmas-Glass, E., Sanderson, R.: Fostering a community of PHAROS scholars through the adoption of open standards. Art Libraries Journal 45(1), 19–23 (2020)
9. Dijkshoorn, C., De Boer, V., Aroyo, L., Schreiber, G.: Accurator: Nichesourcing for cultural heritage. arXiv preprint arXiv:1709.09249 (2017)
10. Dijkshoorn, C., Jongma, L., Aroyo, L., Van Ossenbruggen, J., Schreiber, G., Ter Weele, W., Wielemaker, J.: The Rijksmuseum collection as linked data. Semantic Web 9(2), 221–230 (2018)
11. Doerr, M., Gradmann, S., Hennicke, S., Isaac, A., Meghini, C., Van de Sompel, H.: The Europeana Data Model (EDM). In: World Library and Information Congress: 76th IFLA General Conference and Assembly. vol. 10, p. 15 (2010)
12. Knoblock, C.A., Szekely, P., Fink, E., Degler, D., Newbury, D., Sanderson, R., Blanch, K., Snyder, S., Chheda, N., Jain, N., et al.: Lessons learned in building linked data for the American Art Collaborative. In: International Semantic Web Conference. pp. 263–279. Springer (2017)
13. Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao, J.: PROV-O: The PROV Ontology (2013)
14. Malmsten, M.: Exposing library data as linked data. In: IFLA Satellite Preconference sponsored by the Information Technology Section, "Emerging Trends in Technology: Libraries between Web 2.0, Semantic Web and Search Technology" (2009)