=Paper=
{{Paper
|id=Vol-3762/484
|storemode=property
|title=I.PaC: the National Data Space for Cultural Heritage
|pdfUrl=https://ceur-ws.org/Vol-3762/484.pdf
|volume=Vol-3762
|authors=Margherita Porena,Antonella Negri,Luigi Cerullo
|dblpUrl=https://dblp.org/rec/conf/ital-ia/PorenaNC24
}}
==I.PaC: the National Data Space for Cultural Heritage==
Margherita Porena1,2,*, Antonella Negri1 and Luigi Cerullo1
1 Istituto centrale per la digitalizzazione del patrimonio culturale - Digital Library, Via di San Michele, 18, Rome, 00153, Italy
2 Alma Mater Studiorum - Università di Bologna, Via Zamboni, 33, Bologna, 40126, Italy
Abstract
The article describes I.PaC (Infrastructure and Digital Services for Cultural Heritage), the digital framework designed as
a central hub for managing descriptive data and digital objects from cultural institutions at a national level. The paper
investigates the use of Artificial Intelligence (AI) within the I.PaC infrastructure to enhance the quality of descriptive data, to
add value to digital objects, and to assist users in navigating cultural portals.
Keywords
Cultural Heritage, National data space, Generative AI
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
margherita.porena@cultura.gov.it (M. Porena); antonella.negri@cultura.gov.it (A. Negri); luigi.cerullo@cultura.gov.it (L. Cerullo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction
The large number of Italian cultural properties poses numerous challenges in terms of accessibility, conservation, and enhancement of cultural heritage. To address these challenges, a dedicated digital infrastructure for cultural heritage has been developed with the aim of:
• making cultural heritage accessible to a global audience, enabling the discovery of artworks, monuments, and historical documents from anywhere in the world and improving their accessibility and use;
• encouraging the digitization of cultural properties, ensuring their preservation for future generations;
• promoting education and scientific research, by providing students and researchers with simplified access to valuable materials and information on cultural heritage, which might otherwise be difficult to obtain;
• acting as a catalyst for cultural tourism, stimulating the local economy, and further enhancing cultural heritage.
One of the core components of this ecosystem is I.PaC - Infrastructure and Digital Services for Cultural Heritage [1], which serves as a hub for the conservation, management, and enrichment of Italian digital cultural heritage. The platform strives to remove barriers to accessing cultural information and to solve issues related to the management of data that are heterogeneous in format, category, and domain.
2. I.PaC: The National Data Space for Cultural Heritage
I.PaC - Infrastructure and Digital Services for Cultural Heritage [2] - is the data space dedicated to the preservation, management, and valorization of the Italian digital cultural heritage.
This digital space collects descriptive data and digital objects related to Italian cultural properties from archives, libraries, museums, and cultural sites across the country. The comprehensive repository ensures that valuable cultural artifacts and their associated metadata are preserved for future generations and made accessible to researchers, educators, and the general public.
The services provided by I.PaC are organized into four main areas: (1) digital assets management and processing: this area offers the necessary tools to preserve, process, and present digital objects linked to cultural heritage. It includes functionalities for the digitization, cataloging, and long-term storage of cultural assets, ensuring their integrity and accessibility over time; (2) domain and (3) cross-domain graphs: these services support the representation, querying, and retrieval of information about cultural entities and their semantic relationships. By constructing detailed graphs, I.PaC enables the recreation of the context and history of cultural objects, providing deeper insights and facilitating complex research queries that span multiple domains; (4) Teca multimediale: this user interface, offered as a Software-as-a-Service (SaaS), allows users to create, modify, search, and delete digital resources within I.PaC. It supports advanced searches, making it easier for users to find and interact with the cultural data they need. The Teca multimediale also integrates multimedia capabilities, enabling the seamless presentation of various digital formats.
Figure 1: Four main I.PaC service areas
For the first three areas, I.PaC is exploring the use of artificial intelligence models to improve, enrich, and extract data. These AI models are designed to enhance the accuracy and depth of cultural data, promoting continuous evolution in the way cultural information is managed and shared [3]. By leveraging AI, I.PaC aims to facilitate more efficient data processing, uncover hidden connections between cultural entities, and provide users with richer, more contextualized information about Italy’s cultural heritage.
3. AI applied to descriptive data
In the graphs area, one of the main problems is that the data managed by the I.PaC graph comes from various sources, which may assign different identifiers to otherwise identical entities. This can lead to an overabundance of entities that, in reality, refer to the same object. A typical example is an "Agent" entity (such as Leonardo da Vinci) that is registered in multiple systems with different identifiers, thus producing several distinct entities.
To solve this problem, AI algorithms have been employed that analyze the context of each entity, taking into account details such as dates and places of birth, qualifications, and biographical information. By doing so, they can group entities that are nominally different but semantically identical, effectively reducing duplication.
In the context of agent reconciliation, AI faces a particularly challenging task due to the often limited descriptive data available. Frequently, the only information provided is the agent’s full name, with no chronological references or additional identifying details. In such cases, the AI must analyze the works associated with the agents. For artworks, the AI can attempt to identify stylistic similarities by examining features such as brushwork, technique, and recurring motifs. For bibliographic works, it can focus on similarities in subject matter, comparing the themes of the works. These methods enable the AI to suggest potential matches, overcoming the limitations imposed by the lack of detailed data.
Plans are in place to expand this approach to encompass other types of entities, such as events and literary works. This is essential for ensuring that different records referring to the same object are accurately reconciled, maintaining the integrity and efficiency of the I.PaC graph.
Currently, two additional AI applications within I.PaC are being tested:
• the development of models aimed at linking identified entities to controlled vocabularies and domain-specific terminological tools that describe cultural properties;
• the enrichment of the graph with information extracted from textual contexts.
For the first aspect, many cultural properties are described using unstructured texts that do not refer to standardized vocabularies or thesauri, making access to information less immediate. The project aims to create models that link these descriptions to standard categories from controlled vocabularies, despite the challenge posed by the highly specialized and domain-specific nature of such terminologies.
For the second aspect, the team is working on AI models that extract data from unstructured texts and integrate it into the I.PaC data model in a structured form, thus simplifying the search process and increasing the informative value of the graph.
4. AI applied to digital objects
One of the primary goals of I.PaC is to manage a large number of digital objects that come from various cultural institutions and organizations across the country.
Among the various functionalities offered, I.PaC is experimenting with a content processing system using a range of artificial intelligence techniques, from Machine Learning to generative models. This initiative aims to achieve two primary objectives: on the one hand, to generate new digital content or media; on the other, to extract meaningful information from existing content.
In this initial phase, seven specific use cases have been selected to test how AI can enhance digital resources and enrich the graph. These use cases are:
• (1) Text extraction from ancient and modern monographs: this involves extracting text from the digitization of the monograph, creating an abstract, extracting named entities, identifying the subject, determining the table of contents, and identifying the physical structure of the resource, ensuring that images are arranged according to the correct pagination or foliation indicated in the resource. The challenge in this case lies in analyzing ancient monographs, which often have particularly deteriorated text, instances of bleed-through, and highly complex layouts where text is arranged on the page in various shapes.
• (2) Processing of digitized journals: unlike the previous case, this task involves identifying the articles within a periodical, associating each article with its corresponding text section, title and subtitle, and author. The challenge here is the vast variety of layouts that need to be recognized. Additional difficulties include identifying sections that are physically separate but logically part of the same article, handling articles that continue on different, often distant, pages of the resource, and dealing with advertisements that can physically and logically separate various parts of the same article.
• (3) Audio and video processing: in this context, the AI must be capable of extracting text from audio files that may be corrupted. For each resource, it will need to generate an abstract: if the resource is musical, the abstract should consider only the descriptive metadata; for spoken resources, the abstract should be based on the content of the extracted text.
• (4) Image processing: within the I.PaC ecosystem, millions of images related to cultural heritage will be hosted. This use case aims to process these images to identify the main subject and the entities they comprise, mapping this information to nationally recognized controlled vocabularies (such as the Thesaurus del Nuovo Soggettario di Firenze or the Iconclass classification). Each recognized entity must be associated with the coordinates of the section of the resource where it is located, making it easily representable in a IIIF manifest [4] (see footnote 1). The goal is to create a description of the image that can also be reproduced via audio files (to improve information accessibility) and to identify similar images, including a similarity score for each recognized similar image.
• (5) Metadata extraction from maps: this use case involves developing technologies capable of extracting data from digitized maps, such as place names, the scale used, and any symbols marked on the map along with their legend associations. Specifically, for cadastral maps, the AI must also recognize the cadastral parcels indicated in the image. The challenge in this task is that many ancient and modern maps have handwritten data, making it difficult to recognize different types of handwriting. Additionally, there is often no extended textual context available that could help the AI correct extraction errors using semantic context.
• (6) Extraction of musical notation from digitized sheet music: this use case focuses on developing technologies capable of extracting musical notation from digitized sheet music and enabling the playback of the extracted notation. The AI must accurately recognize and interpret various musical symbols, notes, and annotations present in the sheet music. This involves dealing with challenges such as varying quality of digitized images, handwritten annotations, and different musical notation styles. The goal is to create a digital representation of the music that can be easily read, edited, and played back, preserving the integrity and accuracy of the original sheet music.
• (7) Extraction of information from catalog records: over time, numerous paper catalog records have been created to describe cultural heritage items, representing a valuable informational resource that needs to be recovered. In many cases, the only information available about certain cultural heritage items is contained in these paper records. This use case involves extracting information from these digitized catalog records and mapping the extracted metadata to the current national information representation models. The challenge here lies in the significant variation in the layouts used in these catalog records and the differing information each type of record requires. It is not possible to identify a specific layout or consistently recurring data (except for some basic information, such as the catalog number or the classification of the item). Therefore, the technology must be capable of extracting the information, recognizing its semantics, and mapping it to the relevant descriptive data model.
Figure 2: Example of metadata extraction from maps
1 The International Image Interoperability Framework (IIIF) is a standard developed to facilitate the access and sharing of digital images by libraries, archives, museums, and other institutions with image collections. IIIF enables interoperability between different platforms and viewing systems, allowing users to access, view, and annotate high-resolution images uniformly and consistently, regardless of their origin. A manifest in this context is a JSON document that provides detailed information about a digital resource, such as an image or a collection of images. The manifest contains metadata that define various aspects of the resource, such as bibliographic information, structure (e.g., pages of a manuscript), and coordinates for annotating specific sections of the image. Through the manifest, IIIF-compatible applications can present and manage images in a standardized way, supporting advanced functionalities like zooming, magnification, page navigation, and collaborative annotations.
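To make the coordinate requirement of use case (4) more concrete, the following Python sketch builds a W3C Web Annotation that tags a rectangular region of a IIIF canvas and wraps it in a IIIF Presentation API 3.0 annotation page. It is an illustration only and not the I.PaC implementation: the function names, the example.org URIs, the label, and the pixel coordinates are all hypothetical.

```python
import json


def entity_annotation(annotation_id, canvas_id, label, x, y, w, h):
    """Return a Web Annotation tagging the region (x, y, w, h) of a canvas.

    All identifiers are supplied by the caller; the values used in the
    usage example below are hypothetical placeholders, not real I.PaC URIs.
    """
    return {
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": label, "format": "text/plain"},
        # Media-fragment selector: top-left corner, then width and height in pixels.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


def annotation_page(page_id, annotations):
    """Wrap annotations in a IIIF Presentation API 3.0 AnnotationPage."""
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": page_id,
        "type": "AnnotationPage",
        "items": list(annotations),
    }


if __name__ == "__main__":
    # Hypothetical entity recognized in a digitized image.
    anno = entity_annotation(
        "https://example.org/iiif/anno/1",
        "https://example.org/iiif/canvas/1",
        "horse",
        120, 80, 300, 200,
    )
    page = annotation_page("https://example.org/iiif/page/1", [anno])
    print(json.dumps(page, indent=2))
```

Because the region is expressed as a standard `#xywh=` fragment on the canvas URI, a IIIF-compatible viewer should be able to highlight the recognized entity without any custom logic. A controlled-vocabulary URI (for example an Iconclass notation) could replace or accompany the plain-text label, but that mapping is left out of this sketch.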
Technologies for the last three use cases have already been successfully tested, demonstrating the feasibility and effectiveness of the proposed solutions. However, in the coming months, these technologies will require fine-tuning to improve performance and achieve increasingly precise results. For the other use cases, a proof of concept (PoC) is currently being carried out by two competing companies. Upon completion of this phase, the results will be evaluated, and the most suitable solution will be selected. The final choice will consider both the technologies used and the developed pipeline, which must be capable of processing the resource automatically and producing all required outputs. Human intervention will only be necessary for result validation, thus ensuring an efficient and scalable process for managing cultural heritage resources.
5. Generative AI to enhance information retrieval
I.PaC also provides services to enhance information retrieval in the form of chatbots that use generative artificial intelligence to assist users in navigating portals dedicated to cultural heritage.
The first project to have been realized, still in public experimentation, is Alphy, designed with the goal of assisting users in navigating and accessing information in Alphabetica [5], the portal of Italian libraries created by the Istituto Centrale per il Catalogo Unico delle Biblioteche Italiane (ICCU). The application of generative artificial intelligence is crucial in three key phases of the interaction process between the chatbot and the user:
• user intent interpretation: during this phase, the AI analyzes the user’s input to accurately identify their intentions;
• mapping intentions to three search templates: in this phase, the system guides the user’s intentions towards three key templates: Works, Protagonists, and Themes;
• analysis and enrichment of results: in the third phase, the chatbot reviews the results obtained from the Alphabetica indexes, enriching the response with additional information from its knowledge base, making the user experience more informative and engaging. All information generated by the AI is highlighted in the chat, ensuring compliance with current regulations.
Figure 3: Alphy, the AI-powered generative chatbot for the Alphabetica portal navigation
Currently, another chatbot is being developed for navigating the General Catalog of Cultural Heritage [6], which contains data on cultural properties from Italian museums and other cultural institutions. Unlike the first case, this experiment aims to use generative AI to process RDF data, organized according to the ArCo ontology network [7] and accessible through SPARQL queries. The goal is to convert natural language questions into SPARQL queries, thus facilitating access to information. In this context, the generative AI must use Retrieval-Augmented Generation (RAG) [8] because it needs to comprehend the semantics of the ontology and suggest research paths. This approach allows the AI to provide more accurate and contextually relevant responses by dynamically retrieving and integrating pertinent information from the knowledge graph, thereby enhancing the overall user experience in accessing and exploring the vast cultural heritage data.
6. Conclusions
In conclusion, the development and implementation of I.PaC - Infrastructure and Digital Services for Cultural Heritage - represent an important advancement in the management and valorization of Italian cultural heritage. By leveraging artificial intelligence technologies, from machine learning to generative models, I.PaC aims not only to preserve cultural properties and make them accessible but also to innovate the way these treasures are studied and known. The exploration of AI-driven enhancements, including descriptive data analysis and digital object processing, can bridge the gap between historical legacy and modern accessibility. The introduction of AI-powered chatbots like Alphy for navigating cultural portals underscores the commitment to enhancing user experience and information retrieval. Thanks to the continuous refinement of AI applications and the extension of digital services, I.PaC is an example of how culture, technology, and education come together, ensuring that cultural heritage is not only preserved but made accessible in new ways for generations to come.
References
[1] I.PaC - Infrastruttura e servizi digitali per il patrimonio culturale, 2024. URL: https://ipac.cultura.gov.it/.
[2] L. Cerullo, A. Negri, L’infrastruttura software per il patrimonio culturale (ISPC) come abilitatore di un ecosistema digitale nazionale del patrimonio culturale, DigItalia 18 (2023).
[3] R. Parry, Recoding the Museum: Digital Heritage and the Technologies of Change, Routledge, 2007.
[4] S. Snydman, R. Sanderson, T. Cramer, The International Image Interoperability Framework (IIIF): A community technology approach for web-based images, Archiving Conference 12 (2015).
[5] Alphabetica, 2021. URL: https://alphabetica.it/.
[6] Catalogo generale dei beni culturali, 2021. URL: https://catalogo.beniculturali.it/.
[7] V. A. Carriero, A. Gangemi, M. L. Mancinelli, L. Marinucci, A. G. Nuzzolese, V. Presutti, C. Veninata, ArCo: The Italian cultural heritage knowledge graph, in: Proc. of ISWC, 2019, pp. 36–52.
[8] H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky, Retrieval-enhanced machine learning, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22), Madrid, Spain, Association for Computing Machinery, New York, NY, USA, 2022, pp. 2875–2886. doi:10.1145/3477495.3531722.