I.PaC: the National Data Space for Cultural Heritage
Margherita Porena 1,2,*, Antonella Negri 1 and Luigi Cerullo 1
1 Istituto centrale per la digitalizzazione del patrimonio culturale - Digital Library, Via di San Michele, 18, Rome, 00153, Italy
2 Alma Mater Studiorum - Università di Bologna, Via Zamboni, 33, Bologna, 40126, Italy


                                                   Abstract
                                                   The article describes I.PaC (Infrastructure and Digital Services for Cultural Heritage), the digital framework designed as
                                                   a central hub for managing descriptive data and digital objects from cultural institutions at a national level. The paper
                                                   investigates the use of Artificial Intelligence (AI) within the I.PaC infrastructure to enhance the quality of descriptive data, to
                                                   add value to digital objects, and to assist users in navigating cultural portals.

                                                   Keywords
                                                   Cultural Heritage, National data space, Generative AI


Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
Email: margherita.porena@cultura.gov.it (M. Porena); antonella.negri@cultura.gov.it (A. Negri); luigi.cerullo@cultura.gov.it (L. Cerullo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

The great number of Italian cultural properties presents numerous challenges in terms of accessibility, conservation, and enhancement of cultural heritage. To address these challenges, a dedicated digital infrastructure for cultural heritage has been developed with the aim of:

  • making cultural heritage accessible to a global audience, enabling the discovery of artworks, monuments, and historical documents from anywhere in the world and improving their accessibility and enjoyment;
  • encouraging the digitization of cultural properties, ensuring their preservation for future generations;
  • promoting education and scientific research, by providing students and researchers with simplified access to valuable materials and information on cultural heritage, which might otherwise be difficult to obtain;
  • acting as a catalyst for cultural tourism, stimulating the local economy, and further enhancing cultural heritage.

One of the core components of this ecosystem is I.PaC - Infrastructure and Digital Services for Cultural Heritage [1], which serves as a hub for the conservation, management, and enrichment of Italian digital cultural heritage. The platform strives to eliminate barriers to accessing cultural information and to solve issues related to the management of data that are heterogeneous in format, category, and domain.

2. I.PaC: The National Data Space for Cultural Heritage

I.PaC - Infrastructure and Digital Services for Cultural Heritage [2] - is the data space dedicated to the preservation, management, and valorization of the Italian digital cultural heritage.

This digital space collects descriptive data and digital objects related to Italian cultural properties from archives, libraries, museums, and cultural sites across the country. The comprehensive repository ensures that valuable cultural artifacts and their associated metadata are preserved for future generations and made accessible to researchers, educators, and the general public.

The services provided by I.PaC are organized into four main areas:

  • (1) digital assets management and processing: this area offers the necessary tools to preserve, process, and present digital objects linked to cultural heritage. It includes functionalities for the digitization, cataloging, and long-term storage of cultural assets, ensuring their integrity and accessibility over time;
  • (2) domain and (3) cross-domain graphs: these services support the representation, querying, and retrieval of information about cultural entities and their semantic relationships. By constructing detailed graphs, I.PaC enables the recreation of the context and history of cultural objects, providing deeper insights and facilitating complex research queries that span multiple domains;
  • (4) Teca multimediale: this user interface, offered as Software-as-a-Service (SaaS), allows users to create, modify, search, and delete digital resources within I.PaC. It supports advanced searches, making it easier for users to find and interact with the cultural data they need. The Teca multimediale also integrates multimedia capabilities, enabling the seamless presentation of various digital formats.

Figure 1: Four main I.PaC service areas

For the first three areas, I.PaC is exploring the use of artificial intelligence models to improve, enrich, and extract data. These AI models are designed to enhance the accuracy and depth of cultural data, promoting continuous evolution in the way cultural information is managed and shared [3]. By leveraging AI, I.PaC aims to facilitate more efficient data processing, uncover hidden connections between cultural entities, and provide users with richer, more contextualized information about Italy’s cultural heritage.

3. AI applied to descriptive data

In the graphs area, one of the main problems is that the data managed by the I.PaC graph comes from various sources, which may assign different identifiers to otherwise identical entities. This can lead to an overabundance of entities that, in reality, refer to the same object. A typical example is an "Agent" entity such as Leonardo da Vinci, which is registered in multiple systems under different identifiers and therefore appears as several distinct entities.

To solve this problem, AI algorithms have been employed that analyze the context of each entity, taking into account details such as dates and places of birth, qualifications, and biographical information. By doing so, they can group entities that are nominally different but semantically identical, effectively reducing duplication.

In the context of agent reconciliation, AI faces a particularly challenging task due to the often limited descriptive data available. Frequently, the only information provided is the agent’s full name, with no chronological references or additional identifying details. In such cases, the AI must analyze the works associated with the agents. For artworks, the AI can attempt to identify stylistic similarities by examining features such as brushwork, technique, and recurring motifs. For bibliographic works, it can focus on similarities in subject matter, comparing the themes of the works. These methods enable the AI to suggest potential matches, overcoming the limitations imposed by the lack of detailed data.

Plans are in place to expand this approach to encompass other types of entities, such as events and literary works. This is essential for ensuring that different records referring to the same object are accurately reconciled, maintaining the integrity and efficiency of the I.PaC graph.

Currently, two additional AI applications within I.PaC are being tested:

  • the development of models aimed at linking identified entities to controlled vocabularies and domain-specific terminological tools that describe cultural properties,
  • the enrichment of the graph with information extracted from textual contexts.

For the first aspect, many cultural properties are described using unstructured texts that do not refer to standardized vocabularies or thesauri, making access to information less immediate. The project aims to create models that link these descriptions to standard categories from controlled vocabularies, despite the challenge posed by the highly specialized and domain-specific nature of such terminologies.

For the second aspect, the team is working on AI models that extract data from unstructured texts and integrate it into the I.PaC data model in a structured form, thus simplifying the search process and increasing the informative value of the graph.

4. AI applied to digital objects

One of the primary goals of I.PaC is to manage a great number of digital objects that come from various cultural institutions and organizations across the country.

Among the various functionalities offered, I.PaC is experimenting with a content processing system using a range of artificial intelligence techniques, from Machine Learning to generative models. This initiative aims to achieve two primary objectives: on one hand, to generate new digital content or media; on the other, to extract meaningful information from existing content.

In this initial phase, seven specific use cases have been selected to test how AI can enhance digital resources and enrich the graph. These use cases are:

  • (1) Text extraction from ancient and modern monographs: this involves extracting text from the digitized monograph, creating an abstract, extracting named entities, identifying the subject, determining the table of contents, and identifying the physical structure of the resource, ensuring that images are arranged according to the correct pagination or foliation indicated in the resource. The challenge in this case lies in analyzing ancient monographs, which often have particularly deteriorated text, instances of bleed-through, and highly complex layouts where text is arranged on the page in various shapes.
  • (2) Processing of digitized journals: unlike the previous case, this task involves identifying the articles within a periodical and associating each article with its corresponding text section, title and subtitle, and author. The challenge here is the vast variety of layouts that need to be recognized. Additional difficulties include identifying sections that are physically separate but logically part of the same article, handling articles that continue on different, often distant, pages of the resource, and dealing with advertisements that can physically and logically separate various parts of the same article.
  • (3) Audio and video processing: in this context, the AI must be capable of extracting text from audio files that may be corrupted. For each resource, it will need to generate an abstract: if the resource is musical, the abstract should consider only the descriptive metadata; for spoken resources, the abstract should be based on the content of the extracted text.
  • (4) Image processing: within the I.PaC ecosystem, millions of images related to cultural heritage will be hosted. This use case aims to process these images to identify the main subject and the entities they comprise, mapping this information to nationally recognized controlled vocabularies (such as the Thesaurus del Nuovo Soggettario di Firenze or the Iconclass classification). Each recognized entity must be associated with the coordinates of the section of the resource where it is located, making it easily representable in an IIIF manifest [4] (see footnote 1). The goal is to create a description of the image that can also be reproduced via audio files (to improve information accessibility) and to identify similar images, including a similarity score for each recognized similar image.
  • (5) Metadata extraction from maps: this use case involves developing technologies capable of extracting data from digitized maps, such as place names, the scale used, and any symbols marked on the map along with their legend associations. Specifically, for cadastral maps, the AI must also recognize the cadastral parcels indicated in the image. The challenge in this task is that many ancient and modern maps have handwritten data, making it difficult to recognize different types of handwriting. Additionally, there is often no extended textual context available that could help the AI correct extraction errors using semantic context.
  • (6) Extraction of musical notation from digitized sheet music: this use case focuses on developing technologies capable of extracting musical notation from digitized sheet music and enabling the playback of the extracted notation. The AI must accurately recognize and interpret various musical symbols, notes, and annotations present in the sheet music. This involves dealing with challenges such as varying quality of digitized images, handwritten annotations, and different musical notation styles. The goal is to create a digital representation of the music that can be easily read, edited, and played back, preserving the integrity and accuracy of the original sheet music.
  • (7) Extraction of information from catalog records: over time, numerous paper catalog records have been created to describe cultural heritage items, representing a valuable informational resource that needs to be recovered. In many cases, the only information available about certain cultural heritage items is contained in these paper records. This use case involves extracting information from these digitized catalog records and mapping the extracted metadata to the current national information representation models. The challenge here lies in the significant variation in the layouts used in these catalog records and the differing information each type of record requires. It is not possible to identify a specific layout or consistently recurring data (except for some basic information, such as the catalog number or the classification of the item). Therefore, the technology must be capable of extracting the information, recognizing its semantics, and mapping it to the relevant descriptive data model.

Figure 2: Example of metadata extraction from maps

Footnote 1: The International Image Interoperability Framework (IIIF) is a standard developed to facilitate the access and sharing of digital images by libraries, archives, museums, and other institutions with image collections. IIIF enables interoperability between different platforms and viewing systems, allowing users to access, view, and annotate high-resolution images uniformly and consistently, regardless of their origin. A manifest in this context is a JSON document that provides detailed information about a digital resource, such as an image or a collection of images. The manifest contains metadata that define various aspects of the resource, such as bibliographic information, structure (e.g., pages of a manuscript), and coordinates for annotating specific sections of the image. Through the manifest, IIIF-compatible applications can present and manage images in a standardized way, supporting advanced functionalities like zooming, magnification, page navigation, and collaborative annotations.
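
Use case (4) asks that every recognized entity be tied to the coordinates of the image region where it appears, so that it can be represented in an IIIF manifest. A minimal sketch of what one such record could look like is given here, assuming the W3C Web Annotation shape used by the IIIF Presentation API 3.0 and an `xywh` media fragment for the region. The function name, the example.org URIs, and the choice of the `tagging` and `identifying` motivations are illustrative assumptions, not the I.PaC data model.

```python
import json


def entity_annotation(annotation_id, canvas_id, label, concept_uri, x, y, w, h):
    """Describe one recognized entity as a Web Annotation on a canvas region.

    The region is given in pixels as a top-left corner (x, y) plus width and height.
    """
    if min(x, y) < 0 or w <= 0 or h <= 0:
        raise ValueError("region must have a non-negative origin and a positive size")
    return {
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": [
            # Human-readable label of the recognized entity.
            {"type": "TextualBody", "value": label, "format": "text/plain"},
            # Link to a controlled-vocabulary concept (hypothetical URI).
            {"type": "SpecificResource", "source": concept_uri, "purpose": "identifying"},
        ],
        # Media-fragment selector: the rectangle of the canvas that shows the entity.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


if __name__ == "__main__":
    annotation = entity_annotation(
        annotation_id="https://example.org/iiif/annotation/1",
        canvas_id="https://example.org/iiif/canvas/1",
        label="horse",
        concept_uri="https://example.org/vocabulary/horse",
        x=120, y=80, w=300, h=200,
    )
    print(json.dumps(annotation, indent=2))
```

Keeping the region in the `target` fragment rather than in a custom field means that a IIIF-compatible viewer could, in principle, highlight the entity without knowing anything about the extraction pipeline that produced it.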

Technologies for the last three use cases have already been successfully tested, demonstrating the feasibility and effectiveness of the proposed solutions. However, in the coming months, these technologies will require fine-tuning to improve performance and achieve increasingly precise results. For the other use cases, a proof of concept (PoC) is currently being carried out by two competing companies. Upon completion of this phase, the results will be evaluated and the most suitable solution will be selected. The final choice will consider both the technologies used and the developed pipeline, which must be capable of processing the resource automatically and producing all required outputs. Human intervention will only be necessary for result validation, thus ensuring an efficient and scalable process for managing cultural heritage resources.

5. Generative AI to enhance information retrieval

I.PaC also provides services to enhance information retrieval in the form of chatbots that use generative artificial intelligence to assist users in navigating portals dedicated to cultural heritage.

The first project to be implemented, still in public testing, is Alphy, designed with the goal of assisting users in navigating and accessing information in Alphabetica [5], the portal of Italian libraries created by the Istituto Centrale per il Catalogo Unico delle Biblioteche Italiane (ICCU). The application of generative artificial intelligence is crucial in three key phases of the interaction process between the chatbot and the user:

  • user intent interpretation: during this phase, the AI analyzes the user’s input to accurately identify their intentions;
  • mapping intentions to three search templates: in this phase, the system guides the user’s intentions towards three key templates: Works, Protagonists, and Themes;
  • analysis and enrichment of results: in the third phase, the chatbot reviews the results obtained from the Alphabetica indexes, enriching the response with additional information from its knowledge base, making the user experience more informative and engaging. All information generated by the AI is highlighted in the chat, ensuring compliance with current regulations.

Figure 3: Alphy, the AI-powered generative chatbot for the Alphabetica portal navigation

Currently, another chatbot is being developed for navigating the General Catalog of Cultural Heritage [6], which contains data on cultural properties from Italian museums and other cultural institutions. Unlike the first case, this experiment aims to use generative AI to process RDF data, organized according to the ArCo ontology network [7] and accessible through SPARQL queries. The goal is to convert natural language questions into SPARQL queries, thus facilitating access to information. In this context, the generative AI must use Retrieval-Augmented Generation (RAG) [8] because it needs to comprehend the semantics of the ontology and suggest research paths. This approach allows the AI to provide more accurate and contextually relevant responses by dynamically retrieving and integrating pertinent information from the knowledge graph, thereby enhancing the overall user experience in accessing and exploring the vast cultural heritage data.

6. Conclusions

In conclusion, the development and implementation of I.PaC (Infrastructure and Digital Services for Cultural Heritage) represent an important advancement in the management and valorization of Italian cultural heritage. By leveraging artificial intelligence technologies, from machine learning to generative models, I.PaC aims not only to preserve cultural properties and make them accessible but also to innovate the way these treasures are studied and understood. The exploration of AI-driven enhancements, including descriptive data analysis and digital object processing, can bridge the gap between historical legacy and modern accessibility. The introduction of AI-powered chatbots like Alphy for navigating cultural portals underscores the commitment to enhancing user experience and information retrieval. Thanks to the continuous refinement of AI applications and the extension of digital services, I.PaC is a powerful example of how culture, technology, and education come together, ensuring that cultural heritage is not only preserved but made accessible in new ways for generations to come.


References
[1] I.PaC - Infrastruttura e servizi digitali per il patrimonio culturale, 2024. URL: https://ipac.cultura.gov.it/.
[2] L. Cerullo, A. Negri, L’infrastruttura software per il patrimonio culturale (ISPC) come abilitatore di un ecosistema digitale nazionale del patrimonio culturale, DigItalia 18 (2023).
[3] R. Parry, Recoding the Museum: Digital Heritage and the Technologies of Change, Routledge, 2007.
[4] S. Snydman, R. Sanderson, T. Cramer, The International Image Interoperability Framework (IIIF): A community & technology approach for web-based images, Archiving Conference 12 (2015).
[5] Alphabetica, 2021. URL: https://alphabetica.it/.
[6] Catalogo generale dei beni culturali, 2021. URL: https://catalogo.beniculturali.it/.
[7] V. A. Carriero, A. Gangemi, M. L. Mancinelli, L. Marinucci, A. G. Nuzzolese, V. Presutti, C. Veninata, ArCo: The Italian cultural heritage knowledge graph, in: Proc. of ISWC, 2019, pp. 36–52.
[8] H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky, Retrieval-enhanced machine learning, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22), Association for Computing Machinery, New York, NY, USA, 2022, pp. 2875–2886. doi:10.1145/3477495.3531722.