<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>I.PaC: the National Data Space for Cultural Heritage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Margherita Porena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonella Negri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Cerullo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum - Università di Bologna</institution>
          ,
          <addr-line>Via Zamboni, 33, Bologna, 40126</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Istituto centrale per la digitalizzazione del patrimonio culturale - Digital Library</institution>
          ,
          <addr-line>Via di San Michele, 18, Rome, 00153</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The article describes I.PaC (Infrastructure and Digital Services for Cultural Heritage), the digital framework designed as a central hub for managing descriptive data and digital objects from cultural institutions at a national level. The paper investigates the use of Artificial Intelligence (AI) within the I.PaC infrastructure to enhance the quality of descriptive data, to add value to digital objects, and to assist users in navigating cultural portals.</p>
      </abstract>
      <kwd-group>
        <kwd>Cultural Heritage</kwd>
        <kwd>National data space</kwd>
        <kwd>Generative AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The great number of Italian cultural properties presents numerous challenges in terms of accessibility, conservation, and enhancement of cultural heritage. To address these challenges, a dedicated digital infrastructure for cultural heritage has been developed with the aim of:</p>
      <p>
        • making cultural heritage accessible to a global audience, enabling the discovery of artworks, monuments, and historical documents from anywhere in the world and improving their accessibility and fruition;
        • encouraging the digitization of cultural properties, ensuring their preservation for future generations;
        • promoting education and scientific research, by providing students and researchers with simplified access to valuable materials and information on cultural heritage, which might otherwise be difficult to obtain;
        • acting as a catalyst for cultural tourism, stimulating the local economy, and further enhancing cultural heritage.
      </p>
      <p>One of the core components of this ecosystem is I.PaC - Infrastructure and Digital Services for Cultural Heritage [1], which serves as a hub for the conservation, management, and enrichment of Italian digital cultural heritage. The platform strives to eliminate barriers to accessing cultural information and to solve issues related to the management of data that are heterogeneous in format, category, and domain.</p>
    </sec>
    <sec id="sec-2">
      <title>2. I.PaC: The National Data Space for Cultural Heritage</title>
      <p>I.PaC - Infrastructure and Digital Services for Cultural Heritage [2] - is the data space dedicated to the preservation, management, and valorization of the Italian digital cultural heritage.</p>
      <p>This digital space collects descriptive data and digital objects related to Italian cultural properties from archives, libraries, museums, and cultural sites across the country. The comprehensive repository ensures that valuable cultural artifacts and their associated metadata are preserved for future generations and made accessible to researchers, educators, and the general public.</p>
      <p>The services provided by I.PaC are organized into four main areas: (1) digital assets management and processing: this area offers the necessary tools to preserve, process, and present digital objects linked to cultural heritage. It includes functionalities for the digitization, cataloging, and long-term storage of cultural assets, ensuring their integrity and accessibility over time; (2) domain and (3) cross-domain graphs: these services support the representation, querying, and retrieval of information about cultural entities and their semantic relationships. By constructing detailed graphs, I.PaC enables the recreation of the context and history of cultural objects, providing deeper insights and facilitating complex research queries that span multiple domains; (4) Teca multimediale: this user interface, offered as Software-as-a-Service (SaaS), allows users to create, modify, search, and delete digital resources within I.PaC. It supports advanced searches, making it easier for users to find and interact with the cultural data they need. The Teca multimediale also integrates multimedia capabilities, enabling the seamless presentation of various digital formats.</p>
      <p>For the first three areas, I.PaC is exploring the use of artificial intelligence models to improve, enrich, and extract data. These AI models are designed to enhance the accuracy and depth of cultural data, promoting continuous evolution in the way cultural information is managed and shared [3]. By leveraging AI, I.PaC aims to facilitate more efficient data processing, uncover hidden connections between cultural entities, and provide users with richer, more contextualized information about Italy’s cultural heritage.</p>
      <p>Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy.
* Corresponding author.
margherita.porena@cultura.gov.it (M. Porena); antonella.negri@cultura.gov.it (A. Negri); luigi.cerullo@cultura.gov.it (L. Cerullo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-ai-descriptive-data">
      <title>3. AI applied to descriptive data</title>
      <p>In the graphs area, one of the main problems is that the data managed by the I.PaC graph comes from various sources, which may assign different identifiers to otherwise identical entities. This can lead to an overabundance of entities that, in reality, refer to the same object. A typical example is an "Agent" entity (such as Leonardo da Vinci) that is registered in multiple systems with different identifiers and therefore appears as several distinct entities.</p>
      <p>To solve this problem, AI algorithms have been employed that analyze the context of each entity, taking into account details such as dates and places of birth, qualifications, and biographical information. By doing so, they can group entities that are nominally different but semantically identical, effectively reducing duplication.</p>
      <p>In the context of agent reconciliation, AI faces a particularly challenging task because the descriptive data available are often limited. Frequently, the only information provided is the agent’s full name, with no chronological references or additional identifying details. In such cases, the AI must analyze the works associated with the agents. For artworks, the AI can attempt to identify stylistic similarities by examining features such as brushwork, technique, and recurring motifs. For bibliographic works, it can focus on similarities in subject matter, comparing the themes of the works. These methods enable the AI to suggest potential matches, overcoming the limitations imposed by the lack of detailed data.</p>
      <p>Plans are in place to expand this approach to encompass other types of entities, such as events and literary works. This is essential for ensuring that different records referring to the same object are accurately reconciled, maintaining the integrity and efficiency of the I.PaC graph.</p>
      <p>Currently, two additional AI applications within I.PaC are being tested:
• the development of models aimed at linking identified entities to controlled vocabularies and domain-specific terminological tools that describe cultural properties;
• the enrichment of the graph with information extracted from textual contexts.</p>
      <p>For the first aspect, many cultural properties are described using unstructured texts that do not refer to standardized vocabularies or thesauri, making access to information less immediate. The project aims to create models that link these descriptions to standard categories from controlled vocabularies, despite the challenge posed by the highly specialized and domain-specific nature of such terminologies.</p>
      <p>For the second aspect, the team is working on AI models that extract data from unstructured texts to integrate it into the I.PaC data model in a structured form, thus simplifying the search process and increasing the informative value of the graph.</p>
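      <p>The agent reconciliation described in this section can be illustrated with a minimal, rule-based sketch. It is not the method used in I.PaC, whose models are not detailed here: plain string similarity from the Python standard library stands in for the AI component. The Agent record fields, the 0.9 threshold, and the sample identifiers are all assumptions made for the example.</p>
      <preformat>
```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional
import unicodedata


@dataclass(frozen=True)
class Agent:
    source_id: str                      # identifier assigned by the source system
    name: str
    birth_year: Optional[int] = None    # hypothetical field, often missing
    birth_place: Optional[str] = None   # hypothetical field, often missing


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the tokens."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in plain.lower())
    return " ".join(sorted(cleaned.split()))


def same_agent(a: Agent, b: Agent, threshold: float = 0.9) -> bool:
    """Heuristic match: similar names and no contradicting birth data."""
    if a.birth_year is not None and b.birth_year is not None:
        if a.birth_year != b.birth_year:
            return False
    if a.birth_place and b.birth_place:
        if normalize(a.birth_place) != normalize(b.birth_place):
            return False
    ratio = SequenceMatcher(None, normalize(a.name), normalize(b.name)).ratio()
    return ratio >= threshold


def reconcile(agents: list[Agent]) -> list[set[str]]:
    """Group source identifiers that appear to denote the same agent."""
    parent = list(range(len(agents)))   # union-find forest over record indexes

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(agents)):
        for j in range(i + 1, len(agents)):
            if same_agent(agents[i], agents[j]):
                parent[find(j)] = find(i)

    clusters: dict[int, set[str]] = {}
    for i, agent in enumerate(agents):
        clusters.setdefault(find(i), set()).add(agent.source_id)
    return list(clusters.values())


# Hypothetical records for the same person coming from three systems.
records = [
    Agent("lib:0001", "Leonardo da Vinci", 1452, "Vinci"),
    Agent("mus:A-77", "Da Vinci, Leonardo", 1452),
    Agent("arc:9f3", "Leonardo Da Vinci"),
    Agent("lib:0420", "Leonardo Sciascia", 1921, "Racalmuto"),
]
groups = reconcile(records)
```
      </preformat>
      <p>With these sample records, the three Leonardo da Vinci entries fall into one group and the remaining record stays on its own. Because groups are merged transitively, records that carry only a name would still need the kind of validation step, or work-based comparison, discussed above before being merged.</p>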
    </sec>
    <sec id="sec-3">
      <title>4. AI applied to digital objects</title>
      <p>One of the primary goals of I.PaC is to manage a large number of digital objects that come from various cultural institutions and organizations across the country.</p>
      <p>Among the various functionalities offered, I.PaC is experimenting with a content processing system using a range of artificial intelligence techniques, from Machine Learning to generative models. This initiative aims to achieve two primary objectives: on one hand, to generate new digital content or media; on the other, to extract meaningful information from existing content.</p>
      <p>In this initial phase, seven specific use cases have been selected to test how AI can enhance digital resources and enrich the graph. These use cases are:</p>
      <p>• (1) Text extraction from ancient and modern monographs: this involves extracting text from the digitized monograph, creating an abstract, extracting named entities, identifying the subject, determining the table of contents, and identifying the physical structure of the resource, ensuring that images are arranged according to the correct pagination or foliation indicated in the resource. The challenge in this case lies in analyzing ancient monographs, which often have particularly deteriorated text, instances of bleed-through, and highly complex layouts where text is arranged on the page in various shapes.</p>
      <p>• (2) Processing of digitized journals: unlike the previous case, this task involves identifying the articles within a periodical, associating each article with its corresponding text section, title and subtitle, and author. The challenge here is the vast variety of layouts that need to be recognized. Additional difficulties include identifying sections that are physically separate but logically part of the same article, handling articles that continue on different, often distant, pages of the resource, and dealing with advertisements that can physically and logically separate various parts of the same article.</p>
      <p>• (3) Audio and video processing: in this context, the AI must be capable of extracting text from audio files that may be corrupted. For each resource, it will need to generate an abstract: if the resource is musical, the abstract should consider only the descriptive metadata; for spoken resources, the abstract should be based on the content of the extracted text.</p>
      <p>• (4) Image processing: within the I.PaC ecosystem, millions of images related to cultural heritage will be hosted. This use case aims to process these images to identify the main subject and the entities they contain, mapping this information to nationally recognized controlled vocabularies (such as the Thesaurus del Nuovo Soggettario di Firenze or the Iconclass classification). Each recognized entity must be associated with the coordinates of the section of the resource where it is located, making it easily representable in a IIIF manifest [4] (see footnote 1). The goal is to create a description of the image that can also be reproduced via audio files (to improve information accessibility) and to identify similar images, including a similarity score for each recognized similar image.</p>
      <p>• (5) Metadata extraction from maps: this use case involves developing technologies capable of extracting data from digitized maps, such as place names, the scale used, and any symbols marked on the map along with their legend associations. Specifically, for cadastral maps, the AI must also recognize the cadastral parcels indicated in the image. The challenge in this task is that many ancient and modern maps contain handwritten data, and the different types of handwriting are difficult to recognize. Additionally, there is often no extended textual context available that could help the AI correct extraction errors using semantic context.</p>
      <p>• (6) Extraction of musical notation from digitized sheet music: this use case focuses on developing technologies capable of extracting musical notation from digitized sheet music and enabling the playback of the extracted notation. The AI must accurately recognize and interpret the various musical symbols, notes, and annotations present in the sheet music. This involves dealing with challenges such as varying quality of digitized images, handwritten annotations, and different musical notation styles. The goal is to create a digital representation of the music that can be easily read, edited, and played back, preserving the integrity and accuracy of the original sheet music.</p>
      <p>• (7) Extraction of information from catalog records: over time, numerous paper catalog records have been created to describe cultural heritage items, representing a valuable informational resource that needs to be recovered. In many cases, the only information available about certain cultural heritage items is contained in these paper records. This use case involves extracting information from these digitized catalog records and mapping the extracted metadata to the current national information representation models. The challenge here lies in the significant variation in the layouts used in these catalog records and in the differing information each type of record requires. It is not possible to identify a specific layout or consistently recurring data (except for some basic information, such as the catalog number or the classification of the item). Therefore, the technology must be capable of extracting the information, recognizing its semantics, and mapping it to the relevant descriptive data model.</p>
      <p>Footnote 1: The International Image Interoperability Framework (IIIF) is a standard developed to facilitate the access and sharing of digital images by libraries, archives, museums, and other institutions with image collections. IIIF enables interoperability between different platforms and viewing systems, allowing users to access, view, and annotate high-resolution images uniformly and consistently, regardless of their origin. A manifest in this context is a JSON document that provides detailed information about a digital resource, such as an image or a collection of images. The manifest contains metadata that define various aspects of the resource, such as bibliographic information, structure (e.g., pages of a manuscript), and coordinates for annotating specific sections of the image. Through the manifest, IIIF-compatible applications can present and manage images in a standardized way, supporting advanced functionalities like zooming, page navigation, and collaborative annotation.</p>
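      <p>Use case (4) requires each recognized entity to carry the coordinates of its image region so that it can be represented in a IIIF manifest. The following is a minimal sketch of one way to express that, assuming the conventions of the IIIF Presentation API 3.0 and the W3C Web Annotation model, where a region is addressed as canvas#xywh=x,y,width,height. The example.org URIs, the function names, and the sample label are hypothetical, and nothing here is taken from the I.PaC implementation.</p>
      <preformat>
```python
import json


def region_annotation(canvas_id: str, anno_id: str, label: str,
                      concept_uri: str, x: int, y: int, w: int, h: int) -> dict:
    """Build a Web Annotation that tags a rectangular region of a canvas."""
    if not (w > 0 and h > 0 and x >= 0 and y >= 0):
        raise ValueError("region must have a positive size and a non-negative origin")
    return {
        "id": anno_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": [
            # Human-readable label of the recognized entity.
            {"type": "TextualBody", "value": label, "format": "text/plain"},
            # Link to a controlled-vocabulary concept (hypothetical URI).
            {"type": "SpecificResource", "source": concept_uri},
        ],
        # Media-fragment selector: left, top, width, height in canvas pixels.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


def annotation_page(page_id: str, annotations: list[dict]) -> dict:
    """Wrap annotations in an AnnotationPage that a manifest canvas can reference."""
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": page_id,
        "type": "AnnotationPage",
        "items": annotations,
    }


canvas = "https://example.org/iiif/item1/canvas/p1"   # hypothetical canvas URI
anno = region_annotation(
    canvas,
    "https://example.org/iiif/item1/annotation/1",
    "lion",
    "https://example.org/vocabulary/lion",
    120, 80, 300, 200,
)
page = annotation_page("https://example.org/iiif/item1/page/p1/tags", [anno])
print(json.dumps(page, indent=2))
```
      </preformat>
      <p>Keeping the region in the annotation target, rather than in a separate coordinate table, lets any IIIF-compatible viewer highlight the recognized entity without custom logic.</p>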
      <p>Technologies for the last three use cases have already been successfully tested, demonstrating the feasibility and effectiveness of the proposed solutions. However, in the coming months, these technologies will require fine-tuning to improve performance and achieve increasingly precise results. For the other use cases, a proof of concept (PoC) is currently being carried out by two competing companies. Upon completion of this phase, the results will be evaluated and the most suitable solution will be selected. The final choice will consider both the technologies used and the developed pipeline, which must be capable of processing the resource automatically and producing all required outputs. Human intervention will only be necessary for result validation, thus ensuring an efficient and scalable process for managing cultural heritage resources.</p>
    </sec>
    <sec id="sec-generative-ai">
      <title>5. Generative AI to enhance information retrieval</title>
      <p>I.PaC also provides services to enhance information retrieval in the form of chatbots that use generative artificial intelligence to assist users in navigating portals dedicated to cultural heritage.</p>
      <p>The first project to have been realized, still in public experimentation, is Alphy, designed to assist users in navigating and accessing information in Alphabetica [5], the portal of Italian libraries created by the Istituto Centrale per il Catalogo Unico delle Biblioteche Italiane (ICCU). The application of generative artificial intelligence is crucial in three key phases of the interaction process between the chatbot and the user:
• user intent interpretation: during this phase, the AI analyzes the user’s input to accurately identify their intentions;
• mapping intentions to three search templates: in this phase, the system guides the user’s intentions towards three key templates: Works, Protagonists, and Themes;
• analysis and enrichment of results: in the third phase, the chatbot reviews the results obtained from the Alphabetica indexes, enriching the response with additional information from its knowledge base, making the user experience more informative and engaging. All information generated by the AI is highlighted in the chat, ensuring compliance with current regulations.</p>
      <p>Figure 3: Alphy, the AI-powered generative chatbot for the Alphabetica portal navigation</p>
      <p>Currently, another chatbot is being developed for navigating the General Catalog of Cultural Heritage [6], which contains data on cultural properties from Italian museums and other cultural institutions. Unlike the first case, this experiment aims to use generative AI to process RDF data, organized according to the ArCo ontology network [7] and accessible through SPARQL queries. The goal is to convert natural language questions into SPARQL queries, thus facilitating access to information. In this context, the generative AI must use Retrieval-Augmented Generation (RAG) [8] because it needs to comprehend the semantics of the ontology and suggest research paths. This approach allows the AI to provide more accurate and contextually relevant responses by dynamically integrating and retrieving pertinent information from the knowledge graph, thereby enhancing the overall user experience in accessing and exploring the vast cultural heritage data.</p>
    </sec>
    <sec id="sec-conclusions">
      <title>6. Conclusions</title>
      <p>In conclusion, the development and implementation of I.PaC - Infrastructure and Digital Services for Cultural Heritage - represents an important advancement in the management and valorization of Italian cultural heritage. By leveraging artificial intelligence technologies, from machine learning to generative models, I.PaC aims not only to preserve cultural properties and make them accessible but also to innovate the way these treasures are studied and known. The exploration of AI-driven enhancements, including descriptive data analysis and digital object processing, can bridge the gap between historical legacy and modern accessibility. The introduction of AI-powered chatbots like Alphy for navigating cultural portals underlines the commitment to enhancing user experience and information retrieval. Thanks to the continuous refinement of AI applications and the extension of digital services, I.PaC is an example of how culture, technology, and education come together, ensuring that cultural heritage is not only preserved but made accessible in new ways for generations to come.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>I.PaC - Infrastruttura e servizi digitali per il patrimonio culturale</article-title>
          ,
          <year>2024</year>
          . URL: https://ipac.cultura.gov.it/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cerullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <article-title>L'infrastruttura software per il patrimonio culturale (ISPC) come abilitatore di un ecosistema digitale nazionale del patrimonio culturale</article-title>
          ,
          <source>DigItalia</source>
          <volume>18</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Parry</surname>
          </string-name>
          ,
          <article-title>Recoding the Museum: Digital Heritage and the Technologies of Change</article-title>
          , Routledge,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
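          <string-name>
            <given-names>S.</given-names>
            <surname>Snydman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cramer</surname>
          </string-name>
          ,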
          <article-title>The international image interoperability framework (iiif): A community technology approach for web-based images</article-title>
          ,
          <source>Archiving conference 12</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Alphabetica</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: https://alphabetica.it/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>Catalogo generale dei beni culturali</article-title>
          ,
          <year>2021</year>
          . URL: https://catalogo.beniculturali.it/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Carriero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Mancinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marinucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Nuzzolese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Veninata</surname>
          </string-name>
          ,
          <article-title>ArCo: The Italian cultural heritage knowledge graph</article-title>
          ,
          <source>in: Proc of ISWC</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
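          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>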
          ,
          <article-title>Retrieval-enhanced machine learning</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (Madrid, Spain) (
          <source>SIGIR '22</source>
          )
          , Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>2875</fpage>
          -
          <lpage>2886</lpage>
          . doi:https://doi.org/10.1145/3477495.3531722.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>