I.PaC: the National Data Space for Cultural Heritage

Margherita Porena1,2,*, Antonella Negri1 and Luigi Cerullo1

1 Istituto centrale per la digitalizzazione del patrimonio culturale - Digital Library, Via di San Michele, 18, Rome, 00153, Italy
2 Alma Mater Studiorum - Università di Bologna, Via Zamboni, 33, Bologna, 40126, Italy

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
Email: margherita.porena@cultura.gov.it (M. Porena); antonella.negri@cultura.gov.it (A. Negri); luigi.cerullo@cultura.gov.it (L. Cerullo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
The article describes I.PaC (Infrastructure and Digital Services for Cultural Heritage), the digital framework designed as a central hub for managing descriptive data and digital objects from cultural institutions at a national level. The paper investigates the use of Artificial Intelligence (AI) within the I.PaC infrastructure to enhance the quality of descriptive data, to add value to digital objects, and to assist users in navigating cultural portals.

Keywords
Cultural Heritage, National data space, Generative AI

1. Introduction

The great number of Italian cultural properties presents numerous challenges in terms of accessibility, conservation, and enhancement of cultural heritage. To address these challenges, a dedicated digital infrastructure for cultural heritage has been developed with the aim of:

• making cultural heritage accessible to a global audience, enabling the discovery of artworks, monuments, and historical documents from anywhere in the world and improving their accessibility and use;
• encouraging the digitization of cultural properties, ensuring their preservation for future generations;
• promoting education and scientific research, by providing students and researchers with simplified access to valuable materials and information on cultural heritage, which might otherwise be difficult to obtain;
• acting as a catalyst for cultural tourism, stimulating the local economy, and further enhancing cultural heritage.

One of the core components of this ecosystem is I.PaC - Infrastructure and Digital Services for Cultural Heritage [1], which serves as a hub for the conservation, management, and enrichment of Italian digital cultural heritage. The platform strives to eliminate barriers to accessing cultural information and to solve the problems of managing data that are heterogeneous in format, category, and domain.

2. I.PaC: The National Data Space for Cultural Heritage

I.PaC - Infrastructure and Digital Services for Cultural Heritage [2] - is the data space dedicated to the preservation, management, and valorization of the Italian digital cultural heritage.

This digital space collects descriptive data and digital objects related to Italian cultural properties from archives, libraries, museums, and cultural sites across the country. The comprehensive repository ensures that valuable cultural artifacts and their associated metadata are preserved for future generations and made accessible to researchers, educators, and the general public.

The services provided by I.PaC are organized into four main areas: (1) digital assets management and processing: this area offers the necessary tools to preserve, process, and present digital objects linked to cultural heritage. It includes functionalities for the digitization, cataloging, and long-term storage of cultural assets, ensuring their integrity and accessibility over time; (2) domain and (3) cross-domain graphs: these services support the representation, querying, and retrieval of information about cultural entities and their semantic relationships. By constructing detailed graphs, I.PaC makes it possible to reconstruct the context and history of cultural objects, providing deeper insights and facilitating complex research queries that span multiple domains; (4) Teca multimediale: this user interface, offered as Software-as-a-Service (SaaS), allows users to create, modify, search, and delete digital resources within I.PaC. It supports advanced searches, making it easier for users to find and interact with the cultural data they need. The Teca multimediale also integrates multimedia capabilities, enabling the seamless presentation of various digital formats.

Figure 1: Four main I.PaC service areas

For the first three areas, I.PaC is exploring the use of artificial intelligence models to improve, enrich, and extract data. These AI models are designed to enhance the accuracy and depth of cultural data, promoting continuous evolution in the way cultural information is managed and shared [3]. By leveraging AI, I.PaC aims to facilitate more efficient data processing, uncover hidden connections between cultural entities, and provide users with richer, more contextualized information about Italy's cultural heritage.

3. AI applied to descriptive data

In the graphs area, one of the main problems is that the data managed by the I.PaC graph comes from various sources, which may assign different identifiers to otherwise identical entities. This can lead to an overabundance of entities that, in reality, refer to the same object. A typical example is an "Agent" entity (such as Leonardo da Vinci) that is registered in multiple systems under different identifiers and therefore appears as several distinct entities.

To solve this problem, AI algorithms have been employed. These algorithms analyze the context of each entity, taking into account details such as dates and places of birth, qualifications, and biographical information. By doing so, they can group entities that are nominally different but semantically identical, effectively reducing duplication.

In the context of agent reconciliation, AI faces a particularly challenging task due to the often limited descriptive data available. Frequently, the only information provided is the agent's full name, with no chronological references or additional identifying details. In such cases, the AI must analyze the works associated with the agents. For artworks, the AI can attempt to identify stylistic similarities by examining features such as brushwork, technique, and recurring motifs. For bibliographic works, it can focus on similarities in subject matter, comparing the themes of the works. These methods enable the AI to suggest potential matches, overcoming the limitations imposed by the lack of detailed data.

Plans are in place to expand this approach to encompass other types of entities, such as events and literary works. This is essential for ensuring that different records referring to the same object are accurately reconciled, maintaining the integrity and efficiency of the I.PaC graph.

Currently, two additional AI applications within I.PaC are being tested:

• the development of models aimed at linking identified entities to controlled vocabularies and domain-specific terminological tools that describe cultural properties;
• the enrichment of the graph with information extracted from textual contexts.

For the first aspect, many cultural properties are described using unstructured texts that do not refer to standardized vocabularies or thesauri, making access to information less immediate. The project aims to create models that link these descriptions to standard categories from controlled vocabularies, despite the challenge posed by the highly specialized and domain-specific nature of such terminologies.

For the second aspect, the team is working on AI models that extract data from unstructured texts to integrate it into the I.PaC data model in a structured form, thus simplifying the search process and increasing the informative value of the graph.

4. AI applied to digital objects

One of the primary goals of I.PaC is to manage a great number of digital objects that come from various cultural institutions and organizations across the country.

Among the various functionalities offered, I.PaC is experimenting with a content processing system using a range of artificial intelligence techniques, from Machine Learning to generative models. This initiative aims to achieve two primary objectives: on one hand, to generate new digital content or media; on the other, to extract meaningful information from existing content.

In this initial phase, seven specific use cases have been selected to test how AI can enhance digital resources and enrich the graph. These use cases are:

• (1) Text extraction from ancient and modern monographs: this involves extracting text from the digitization of the monograph, creating an abstract, extracting named entities, identifying the subject, determining the table of contents, and identifying the physical structure of the resource, ensuring that images are arranged according to the correct pagination or foliation indicated in the resource. The challenge in this case lies in analyzing ancient monographs, which often have particularly deteriorated text, instances of bleed-through, and highly complex layouts where text is arranged on the page in various shapes.
• (2) Processing of digitized journals: unlike the previous case, this task involves identifying the articles within a periodical, associating each article with its corresponding text section, title and subtitle, and author. The challenge here is the vast variety of layouts that need to be recognized. Additional difficulties include identifying sections that are physically separate but logically part of the same article, handling articles that continue on different, often distant, pages of the resource, and dealing with advertisements that can physically and logically separate various parts of the same article.
• (3) Audio and video processing: in this context, the AI must be capable of extracting text from audio files that may be corrupted. For each resource, it will need to generate an abstract: if the resource is musical, the abstract should consider only the descriptive metadata; for spoken resources, the abstract should be based on the content of the extracted text.
• (4) Image processing: within the I.PaC ecosystem, millions of images related to cultural heritage will be hosted. This use case aims to process these images to identify the main subject and the entities they comprise, mapping this information to nationally recognized controlled vocabularies (such as the Thesaurus del Nuovo Soggettario di Firenze or the Iconclass classification). Each recognized entity must be associated with the coordinates of the section of the resource where it is located, making it easily representable in a IIIF manifest [4] (see footnote 1). The goal is to create a description of the image that can also be reproduced via audio files (to improve information accessibility) and to identify similar images, including a similarity score for each recognized similar image.
• (5) Metadata extraction from maps: this use case involves developing technologies capable of extracting data from digitized maps, such as place names, the scale used, and any symbols marked on the map along with their legend associations. Specifically, for cadastral maps, the AI must also recognize the cadastral parcels indicated in the image. The challenge in this task is that many ancient and modern maps have handwritten data, making it difficult to recognize different types of handwriting. Additionally, there is often no extended textual context available that could help the AI correct extraction errors using semantic context.
• (6) Extraction of musical notation from digitized sheet music: this use case focuses on developing technologies capable of extracting musical notation from digitized sheet music and enabling the playback of the extracted notation. The AI must accurately recognize and interpret various musical symbols, notes, and annotations present in the sheet music. This involves dealing with challenges such as varying quality of digitized images, handwritten annotations, and different musical notation styles. The goal is to create a digital representation of the music that can be easily read, edited, and played back, preserving the integrity and accuracy of the original sheet music.
• (7) Extraction of information from catalog records: over time, numerous paper catalog records have been created to describe cultural heritage items, representing a valuable informational resource that needs to be recovered. In many cases, the only information available about certain cultural heritage items is contained in these paper records. This use case involves extracting information from these digitized catalog records to map the extracted metadata to the current national information representation models. The challenge here lies in the significant variation in the layouts used in these catalog records and the differing information each type of record requires. It is not possible to identify a specific layout or consistently recurring data (except for some basic information, such as the catalog number or the classification of the item). Therefore, the technology must be capable of extracting the information, recognizing its semantics, and mapping it to the relevant descriptive data model.

Figure 2: Example of metadata extraction from maps

Footnote 1: The International Image Interoperability Framework (IIIF) is a standard developed to facilitate the access and sharing of digital images by libraries, archives, museums, and other institutions with image collections. IIIF enables interoperability between different platforms and viewing systems, allowing users to access, view, and annotate high-resolution images uniformly and consistently, regardless of their origin. A manifest in this context is a JSON document that provides detailed information about a digital resource, such as an image or a collection of images. The manifest contains metadata that define various aspects of the resource, such as bibliographic information, structure (e.g., pages of a manuscript), and coordinates for annotating specific sections of the image. Through the manifest, IIIF-compatible applications can present and manage images in a standardized way, supporting advanced functionalities like zooming, magnification, page navigation, and collaborative annotations.

Technologies for the last three use cases have already been successfully tested, demonstrating the feasibility and effectiveness of the proposed solutions.
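Use case (4) requires each recognized entity to carry the coordinates of the image region where it appears, so that it can be represented in a IIIF manifest. The sketch below shows one possible shape for such a record, following the W3C Web Annotation model that IIIF builds on, with the region expressed as an `xywh` media fragment on the canvas. The function name, the example URIs, and the choice of a `tagging` motivation are illustrative assumptions, not a description of the I.PaC implementation.

```python
import json


def region_annotation(canvas_id, anno_id, label, vocab_uri, box):
    """Build a Web Annotation dict tagging a rectangular region of a canvas.

    box is (x, y, width, height) in canvas pixel coordinates.
    All identifiers passed in are hypothetical placeholders.
    """
    x, y, w, h = box
    if w <= 0 or h <= 0:
        raise ValueError("region width and height must be positive")
    return {
        "id": anno_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": [
            # Human-readable label of the recognized entity.
            {"type": "TextualBody", "value": label, "format": "text/plain"},
            # Link to the controlled-vocabulary term the entity was mapped to.
            {"type": "SpecificResource", "source": vocab_uri},
        ],
        # The media fragment restricts the annotation to the detected region.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


if __name__ == "__main__":
    anno = region_annotation(
        canvas_id="https://example.org/iiif/item-1/canvas/1",
        anno_id="https://example.org/iiif/item-1/annotation/1",
        label="horse",
        vocab_uri="https://example.org/vocabulary/horse",
        box=(120, 80, 640, 480),
    )
    print(json.dumps(anno, indent=2))
```

Keeping the vocabulary link and the region in the same annotation means a viewer can highlight the region and resolve the term without a second lookup; a similarity score, if needed, would have to be carried in a separate, project-specific field.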
In the coming months, the technologies already tested will require fine-tuning to improve performance and achieve increasingly precise results. For the other use cases, a proof of concept (PoC) is currently being carried out by two competing companies. Upon completion of this phase, the best results will be evaluated, and the most suitable solution will be selected. The final choice will consider both the technologies used and the developed pipeline, which must be capable of processing the resource automatically and producing all required outputs. Human intervention will only be necessary for result validation, thus ensuring an efficient and scalable process for managing cultural heritage resources.

5. Generative AI to enhance information retrieval

I.PaC also provides services to enhance information retrieval in the form of chatbots that use generative artificial intelligence to assist users in navigating portals dedicated to cultural heritage.

The first project to have been realized, still in public experimentation, is Alphy, designed with the goal of assisting users in navigating and accessing information in Alphabetica [5], the portal of Italian libraries created by the Istituto Centrale per il Catalogo Unico delle Biblioteche Italiane (ICCU). The application of generative artificial intelligence is crucial in three key phases of the interaction process between the chatbot and the user:

• user intent interpretation: during this phase, the AI analyzes the user's input to accurately identify their intentions;
• mapping intentions to three search templates: in this phase, the system guides the user's intentions towards three key templates: Works, Protagonists, and Themes;
• analysis and enrichment of results: in the third phase, the chatbot reviews the results obtained from the Alphabetica indexes, enriching the response with additional information from its knowledge base, making the user experience more informative and engaging. All information generated by the AI is highlighted in the chat, ensuring compliance with current regulations.

Figure 3: Alphy, the AI-powered generative chatbot for the Alphabetica portal navigation

Currently, another chatbot is being developed for navigating the General Catalog of Cultural Heritage [6], which contains data on cultural properties from Italian museums and other cultural institutions. Unlike the first case, this experiment aims to use generative AI to process RDF data, organized according to the ArCo ontology network [7] and accessible through SPARQL queries. The goal is to convert natural language questions into SPARQL queries, thus facilitating access to information. In this context, the generative AI must use Retrieval-Augmented Generation (RAG) [8] because it needs to comprehend the semantics of the ontology and suggest research paths. This approach allows the AI to provide more accurate and contextually relevant responses by dynamically retrieving and integrating pertinent information from the knowledge graph, thereby enhancing the overall user experience in accessing and exploring the vast cultural heritage data.

6. Conclusions

In conclusion, the development and implementation of I.PaC - Infrastructure and Digital Services for Cultural Heritage - represents an important advancement in the management and valorization of Italian cultural heritage. By leveraging artificial intelligence technologies, from machine learning to generative models, I.PaC aims not only to preserve cultural properties and make them accessible but also to innovate the way these treasures are studied and known. The exploration of AI-driven enhancements, including descriptive data analysis and digital object processing, can bridge the gap between historical legacy and modern accessibility. The introduction of AI-powered chatbots like Alphy for navigating cultural portals underscores the commitment to enhancing user experience and information retrieval. Thanks to the continuous refinement of AI applications and the extension of digital services, I.PaC is a powerful example of how culture, technology, and education come together, ensuring that cultural heritage is not only preserved but made accessible in new ways for generations to come.

References

[1] I.PaC - Infrastruttura e servizi digitali per il patrimonio culturale, 2024. URL: https://ipac.cultura.gov.it/.
[2] L. Cerullo, A. Negri, L'infrastruttura software per il patrimonio culturale (ISPC) come abilitatore di un ecosistema digitale nazionale del patrimonio culturale, DigItalia 18 (2023).
[3] R. Parry, Recoding the Museum: Digital Heritage and the Technologies of Change, Routledge, 2007.
[4] S. Snydman, R. Sanderson, T. Cramer, The International Image Interoperability Framework (IIIF): A community technology approach for web-based images, Archiving Conference 12 (2015).
[5] Alphabetica, 2021. URL: https://alphabetica.it/.
[6] Catalogo generale dei beni culturali, 2021. URL: https://catalogo.beniculturali.it/.
[7] V. A. Carriero, A. Gangemi, M. L. Mancinelli, L. Marinucci, A. G. Nuzzolese, V. Presutti, C. Veninata, ArCo: The Italian cultural heritage knowledge graph, in: Proc. of ISWC, 2019, pp. 36-52.
[8] H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky, Retrieval-enhanced machine learning, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), Madrid, Spain, Association for Computing Machinery, New York, NY, USA, 2022, pp. 2875-2886. doi:10.1145/3477495.3531722.
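Illustrative note on Section 5. The retrieval step that precedes SPARQL generation for the General Catalog chatbot can be sketched as follows: snippets describing ontology terms are ranked against the user's question, and the best matches are placed in the prompt handed to the generative model. This is a minimal sketch under stated assumptions: the `ex:` terms are hypothetical placeholders rather than actual ArCo identifiers, plain token overlap stands in for whatever retrieval method the project uses, and the call to the language model is deliberately omitted.

```python
import re

# Hypothetical ontology snippets; in practice these would be drawn from
# the ontology documentation or from the knowledge graph itself.
ONTOLOGY_SNIPPETS = [
    "ex:CulturalProperty is the class of catalogued cultural properties.",
    "ex:hasAuthor links a cultural property to the agent who created it.",
    "ex:hasLocation links a cultural property to the site where it is kept.",
    "ex:hasDating links a cultural property to its dating information.",
]


def tokenize(text):
    """Return the set of lowercase alphabetic tokens in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, snippets, k=2):
    """Return the k snippets sharing the most tokens with the question.

    Token overlap is a stand-in for a real retriever (e.g. embeddings).
    Ties keep the original snippet order because the sort is stable.
    """
    q = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]


def build_prompt(question, snippets, k=2):
    """Assemble the prompt that would be sent to the generative model."""
    context = "\n".join(f"- {s}" for s in retrieve(question, snippets, k))
    return (
        "Ontology context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Write a SPARQL query that answers the question "
        "using only the terms above."
    )


if __name__ == "__main__":
    print(build_prompt("Which agent created this property?", ONTOLOGY_SNIPPETS))
```

Restricting the model to retrieved terms is one way to keep generated queries aligned with the ontology; any query produced this way would still need validation before being run against the SPARQL endpoint.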