<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1344/BID2022.48.05</article-id>
      <title-group>
        <article-title>CIDOC-CRM and the First Prototype of a Semantic Portal for the CHExRISH Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luiz do Valle Miranda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Kutt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grzegorz J. Nalepa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jagiellonian Human-Centered AI Lab, Mark Kac Center for Complex Systems Research, Institute of Applied Computer Science, Faculty of Physics, Astronomy and Applied Computer Science</institution>
          ,
          <institution>Jagiellonian University</institution>
          ,
          <addr-line>prof. Stanisława Łojasiewicza 11, 30-348 Kraków</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>5</volume>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>In this paper we present work in progress towards the integration of cultural heritage resources from different units of the Jagiellonian University as linked data and the prototyping of their presentation in a semantic portal. CIDOC-CRM has been chosen as the data model behind this interoperability given its wide use and its flexibility. Challenges arose when converting bibliographical authorship relations into CRM's event-centered structure and when migrating instances' hierarchical classification into CRM's “E55 Type” and “P127 has broader term” conversion standard. Despite following CRM's data modeling best practices, these challenges reappeared while publishing the data in an Omeka S-powered website, thus showing some lack of compatibility between these two frameworks.</p>
      </abstract>
      <kwd-group>
        <kwd>CIDOC-CRM</kwd>
        <kwd>Cultural Heritage</kwd>
        <kwd>Semantic Portal</kwd>
        <kwd>Linked Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A major motivation for using Linked Data (LD) in Cultural Heritage (CH) projects is the
aspiration towards interoperability among differently collected data within an institution or between
different institutions. In line with this interoperability, modeling and publishing CH data as LD aims
at aggregating resources with data published within the Semantic Web (SW) framework that are
accessible via their unique URIs. Furthermore, using Knowledge Graphs (KG) and ontologies as the base
technology for such a CH-LD endeavor provides a rich description of the spatial, temporal, and personal
relations of the included collections, which allows for further processing by algorithms that
enrich the collection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Taking into account such benefits of LD in CH, the ongoing CHExRISH project at the Jagiellonian
University (JU) aims to create a prototype for the Jagiellonian University Heritage Metadata Portal
(JUHMP). JUHMP is envisioned as a semantic portal that first and foremost provides browsing and
searching capabilities over JU heritage data currently being stored and analyzed separately in different
units, including the Archive (AUJ), the Museum (MUJ), and the Library (BJ). JUHMP is also conceived
to incorporate tools for assisting researchers in solving Digital Humanities (DH) problems, such as
Network Analysis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and VR visualization [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], as well as the use of Artificial Intelligence (AI) tools
for recommendation and knowledge discovery [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], thus bringing it closer to what [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] calls a third-generation
semantic portal. For the first prototype of JUHMP (JUHMP_v1.1), Omeka S<sup>1</sup> was chosen
given its purported ease of installation and its support for linked data (see [6, 7]).
      </p>
      <p>A significant challenge in this project is the integration of diverse data from every unit, especially
since each unit has collections modeled (e.g., Dublin Core, MARC-21), stored (e.g., SQL, CSV, XML, RDF)
and accessed (e.g., OAI-PMH, custom REST API) in different ways. This challenge is addressed by
selecting a flexible and extensible ontology as the data model for the unified base. After considering
different ontologies, such as the Europeana Data Model<sup>2</sup> (EDM) [8], the Records in Contexts Ontology<sup>3</sup>
(RiC-O) [9] and the Gemeinsame Normdatei (GND) Ontology<sup>4</sup>, CIDOC-CRM<sup>5</sup> [10] was chosen.
CIDOC-CRM is a widely used ontology in which cultural heritage data from different institutions, including
libraries, museums, and archives, can be connected by means of events relating records to, among
others, people, places, types, concepts, and institutions. CIDOC-CRM is an international standard (ISO
21127:2014) designed to serve as a shared language among CH institutions. Given its status and its
adoption, many tools for enriching semantic portals are being developed on top of this data model
(see [11, 12]).</p>
      <p>The purpose of this paper is to present and discuss the current workflow and the challenges
encountered in the process of data integration and the development of the semantic portal based on
CIDOC-CRM v7.1.2<sup>6</sup>. While the CHExRISH project aims to integrate a variety of collections related to
JU heritage, JUHMP_v1.1 currently only includes selected data from the AUJ’s Corpus Academicum
Cracoviense<sup>7</sup> (CAC) and BJ’s Jagiellonian Digital Library<sup>8</sup> (JDL).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Integrating data of the CAC and JDL</title>
      <p>CAC is an electronic database with around 67,000 records on students and graduates of the University of
Kraków during the period 1364–1780<sup>9</sup>. The Web application for accessing CAC data displays records on
academic, occupational, personal, and editorial events in a given person’s life. Fig. 1 partially reproduces
the CAC record for Nicolaus Copernicus. Among the records, we can see Copernicus’s birth and death
dates, some institutions where he worked and his positions there, two of his publications, and the places
where he studied and where he completed his doctoral studies.</p>
      <p>The CAC interface is backed by a PostgreSQL database. For the initial prototyping of JUHMP_v1.1,
a dump of the database together with the database diagram was used. A first challenge arose when academic
and occupational events were modeled as instances of CRM’s “E7 Activity” class. As shown in Fig. 1, academic
events have several properties, including event type, degree type, and scientific discipline. The standard
CRM solution for attaching such details to an activity is to create an instance of class “E55 Type”
and connect the activity to it with the property “P2 has type”. Even though this solution captured the
academic events’ properties, it could not represent each kind of property as a related group. This
grouping was achieved by creating further instances of the class “E55 Type”, namely “Educational
Activity”, “Academic Degree”, and “Field of Study”, which are linked to the former types with CRM’s “P2 has type”
property. A similar structure was adopted for occupational activities. The problem of attaching properties
to instances of a class for further specification also arose for names and surnames. They
were modeled as instances of “E41 Appellation” and linked to instances of “E55 Type”, namely
“First Name” and “Surname”, to differentiate between them. Fig. 2 presents a visualization of the current
solution of using CIDOC-CRM to model Copernicus’ personal record and bibliography. At present,
publication activities as described in CAC are not included in JUHMP_v1.1, given the difficulty
of resolving naming differences with the works described in JDL.</p>
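      <p>The typing pattern described above can be sketched as follows. This is only an illustration, not the project's conversion code: it uses plain Python tuples instead of rdflib to stay dependency-free, the namespace <monospace>https://example.org/juhmp/</monospace> and all instance identifiers are hypothetical, the degree and discipline values are illustrative rather than taken from CAC, and it assumes that each specific type points to its grouping type via “P2 has type”.</p>
      <preformat>
```python
# CRM base namespace as published; EX is a hypothetical project namespace.
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
EX = "https://example.org/juhmp/"


def typed_instance(uri, crm_class, type_uri):
    """Triples declaring `uri` as an instance of `crm_class`, classified
    by the E55 Type `type_uri` through P2 has type."""
    return [
        (uri, "rdf:type", CRM + crm_class),
        (uri, CRM + "P2_has_type", type_uri),
    ]


def type_in_group(type_uri, label, group_uri):
    """Triples for an E55 Type that is itself grouped under another E55 Type."""
    return [
        (type_uri, "rdf:type", CRM + "E55_Type"),
        (type_uri, "rdfs:label", label),
        (type_uri, CRM + "P2_has_type", group_uri),
    ]


triples = []

# The three grouping types named in the text.
for group in ("Educational_Activity", "Academic_Degree", "Field_of_Study"):
    triples.append((EX + "type/" + group, "rdf:type", CRM + "E55_Type"))

# A hypothetical academic event with an illustrative degree and discipline.
event = EX + "activity/1"
degree = EX + "type/doctor"
field = EX + "type/canon_law"
triples += typed_instance(event, "E7_Activity", degree)
triples.append((event, CRM + "P2_has_type", field))
triples += type_in_group(degree, "doctor", EX + "type/Academic_Degree")
triples += type_in_group(field, "canon law", EX + "type/Field_of_Study")

# A first name modeled as an E41 Appellation typed as "First Name".
name = EX + "appellation/1"
triples += typed_instance(name, "E41_Appellation", EX + "type/First_Name")

for s, p, o in triples:
    print(s, p, o)
```
      </preformat>
      <p>In this sketch, the grouping a value belongs to can only be recovered by following two “P2 has type” links from the activity, which is the nesting that later complicates presentation.</p>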
      <p>JDL, BJ’s digital library, is a platform for preserving and disseminating the collections of the
Jagiellonian Library, antique and otherwise, as digitized resources. Among the digitized resources are the
collection of manuscripts, the collection of old prints, and the collection of prints from the 19th and 20th
centuries, among others. Data from JDL follows the Dublin Core<sup>10</sup> (DC) data model and can be harvested
using the OAI-PMH protocol. Listing 1 contains the metadata for Copernicus’ “De Revolvtionibvs
Orbium Coelestium” as shared by JDL.</p>
      <p><sup>2</sup> https://pro.europeana.eu/page/edm-documentation</p>
      <p><sup>3</sup> https://www.ica.org/resource/records-in-contexts-ontology/</p>
      <p><sup>4</sup> https://d-nb.info/standards/elementset/gnd</p>
      <p><sup>5</sup> https://cidoc-crm.org/</p>
      <p><sup>6</sup> For a comprehensive overview of the CIDOC-CRM v7.1.2 vocabulary, refer to https://cidoc-crm.org/Version/version-7.1.2.
This vocabulary is consistently applied throughout the article.</p>
      <p><sup>7</sup> https://cac.historia.uj.edu.pl/</p>
      <p><sup>8</sup> https://jbc.bj.uj.edu.pl/dlibra</p>
      <p><sup>9</sup> See [13] for more details on the history of CAC and the provenance of its data.</p>
      <p><sup>10</sup> https://www.dublincore.org/</p>
      <p>Since the digital objects associated with CAC entities are not distinguished in any way in
the JDL, it was necessary to harvest all data accessible via OAI-PMH before processing it into the JUHMP
Ontology. Around 700,000 bibliographic records were retrieved, of which only those pertaining
to CAC entities were taken into consideration. However, the mapping of authors from CAC to JDL is
not obvious, since in Poland there is no authority file for personal names shared by
all GLAM institutions (there are standards for Polish libraries [14], followed by BJ, but they are not
followed by archives and museums, including AUJ and MUJ). For JUHMP_v1.1, a mapping file for 10 records
was created manually. Additionally, a simple conversion script written in Python was able to
map 99 entities from CAC to JDL with certainty. These records were also included in JUHMP_v1.1. With
such a mapping of personal names from CAC to JDL, it was possible to filter the
bibliographic records down to those referring to CAC entities in the fields “dc:creator”, “dc:contributor” and
“dc:subject”.</p>
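      <p>The filtering step can be illustrated with a minimal sketch. It assumes that the mapping file has been loaded into a dictionary from CAC identifiers to the name strings used in JDL and that matching is by exact string equality. The identifiers, the name string, and the three sample records are invented for illustration, and the project's actual script may work differently.</p>
      <preformat>
```python
# Hypothetical mapping from CAC person identifiers to JDL name strings.
cac_to_jdl = {
    "cac:person/0001": ["Kopernik, Mikołaj (1473-1543)"],
}

# Hypothetical harvested records, reduced to the three fields named in the text.
records = [
    {"id": "jdl:1", "dc:creator": ["Kopernik, Mikołaj (1473-1543)"],
     "dc:contributor": [], "dc:subject": []},
    {"id": "jdl:2", "dc:creator": ["Anonymous"],
     "dc:contributor": [], "dc:subject": ["Astronomy"]},
    {"id": "jdl:3", "dc:creator": [],
     "dc:contributor": [], "dc:subject": ["Kopernik, Mikołaj (1473-1543)"]},
]

FIELDS = ("dc:creator", "dc:contributor", "dc:subject")


def link_records(records, mapping):
    """Return (record id, field, CAC id) for every exact name match."""
    # Invert the mapping so each JDL name string points to its CAC entity.
    name_to_cac = {
        name: cac_id for cac_id, names in mapping.items() for name in names
    }
    links = []
    for record in records:
        for field in FIELDS:
            for value in record.get(field, []):
                if value in name_to_cac:
                    links.append((record["id"], field, name_to_cac[value]))
    return links


links = link_records(records, cac_to_jdl)
print(links)
```
      </preformat>
      <p>Keeping the matched field in the result preserves the distinction between a person as author, contributor, or subject of a work, which matters for the role modeling discussed next.</p>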
      <p>Listing 1: DC description of Copernicus’ “De Revolvtionibvs Orbium Coelestium, Libri VI” from JDL.</p>
      <p>Modeling the relationship between a bibliographic record and the people related to it is another
challenge. In CRM, a bibliographic record can be modeled with both the “E33 Linguistic Object” and
the “E22 Human-Made Object” classes, representing respectively the linguistic expression and the physical
inscription of that expression. There is, however, no property in CRM that directly connects an author
or a contributor to instances of these classes. This link is achieved through events that represent
the generation of these objects and are linked to an author via the property “P14 carried out by”. For linguistic
objects, “E65 Creation” is used, while for human-made objects, “E12 Production” is used. As discussed in [15],
CIDOC-CRM supports the reification of the property “P14 carried out by” and the use of the property
of properties “P14.1 in the role of” to specify the role by connecting the reified statement to an instance of CRM’s
“E55 Type”. Fig. 2 presents an example of such modeling practices. Finally, it is worth mentioning
that the conversion of both JDL and CAC data from their respective original formats into
instances of a CIDOC-CRM-based ontology was achieved via custom Python scripts using the rdflib<sup>11</sup>
library.</p>
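      <p>A minimal sketch of the event-mediated authorship pattern follows. It uses plain tuples rather than rdflib, and it assumes the property-class encoding of reified properties (“PC14 carried out by” with “P01 has domain” and “P02 has range”) found in the RDFS serialization of CRM's properties of properties. Whether JUHMP uses exactly this encoding is not stated here, and the namespace and identifiers are hypothetical.</p>
      <preformat>
```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
EX = "https://example.org/juhmp/"  # hypothetical project namespace


def authorship(work, person, role, n):
    """Triples linking a linguistic object to a person through an
    E65 Creation, with the role attached to a reified P14 statement."""
    creation = EX + "creation/" + str(n)
    statement = EX + "pc14/" + str(n)
    return [
        (work, "rdf:type", CRM + "E33_Linguistic_Object"),
        (creation, "rdf:type", CRM + "E65_Creation"),
        (creation, CRM + "P94_has_created", work),
        # Direct link, kept so that role-agnostic queries still work.
        (creation, CRM + "P14_carried_out_by", person),
        # Reified statement that carries the role (assumed encoding).
        (statement, "rdf:type", CRM + "PC14_carried_out_by"),
        (statement, CRM + "P01_has_domain", creation),
        (statement, CRM + "P02_has_range", person),
        (statement, CRM + "P14.1_in_the_role_of", role),
    ]


work_triples = authorship(
    work=EX + "work/1",
    person=EX + "person/1",
    role=EX + "type/Author",  # an instance of E55 Type
    n=1,
)
for triple in work_triples:
    print(triple)
```
      </preformat>
      <p>For a physical copy, the same shape would apply with “E22 Human-Made Object”, “E12 Production”, and “P108 has produced” in place of the linguistic-object terms.</p>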
    </sec>
    <sec id="sec-3">
      <title>3. Prototyping with Omeka S</title>
      <p>While the target JUHMP is planned as a complex portal including several modules to assist digital
humanities research and intelligent search and exploration, the first prototype, JUHMP_v1.1, has
significantly simpler requirements. JUHMP_v1.1 aims to be a web application that showcases the integration
of two different data models originating from different JU units over a manually selected set of records.
Furthermore, JUHMP_v1.1 is supposed to be based on existing semantic cultural heritage platforms,
given the limited resources for developing such a prototype. In JUHMP_v1.1, users are meant to interact with
these records in two ways. First, they should be able to browse a list of
the included people’s records and easily access linked resources, such as bibliographic records or
connected themes. Second, they should be able to query the KG via a SPARQL endpoint.</p>
      <p>Based on the comparison table presented by [6] and the analysis in [16], Omeka S appeared to be the
most adequate platform for JUHMP_v1.1<sup>12</sup>. Omeka S is presented as an easy-to-use platform compatible
with the principles of linked open data, with extensive documentation, a growing community of users,
and support for modules that extend its functionality. The presentation of records
in Omeka S is based on the creation of digital collections with items added manually, imported via
CSV, or added via an API. Furthermore, a SPARQL module for Omeka S<sup>13</sup> is already widely available, allowing
easy integration in the web application of an interface for writing SPARQL queries and executing them
against an Apache Jena Fuseki<sup>14</sup> server.</p>
      <p>As CIDOC-CRM is not one of the ontologies bundled with Omeka S, importing it is one of the preliminary steps
before any records are added. CRM can be easily imported, as Omeka S provides a page with forms
to add new ontologies, which only requires an RDF file containing the ontology. This process ensures
the availability of the CRM vocabulary in the Omeka S platform enabling its use for both importing
data and presenting resources in a semantically enriched manner. However, the addition of records to
Omeka S as linked resources is not so straightforward.</p>
      <p>While the structure of the payload of a successful request to the “/api/items” endpoint is based on the
JSON-LD format<sup>15</sup>, merely converting an RDF file to its equivalent JSON-LD is not enough to add a
linked resource to Omeka S. The “/api/items” endpoint’s payload requires fields such as “property_id”
and “type” for each item, which need to be added with additional scripting. Furthermore, to draw on the
LD browsing capabilities of Omeka S, the resources cannot be linked via their URIs, but only via
Omeka S’ internal “value_resource_id”. Thus, adding resources to Omeka S programmatically requires
further steps: creating empty items for each resource, getting their “value_resource_id”,
replacing the URI references with resource ID references, and finally patching the empty items with
their respective linked data. Although Omeka S does not contain an integrated solution for a seamless
integration of linked items from an RDF file, with the help of additional scripting it is possible to import
a set of linked data points into the Omeka S repository while preserving a structure that closely mirrors the
RDF description. Fig. 3 presents an excerpt from Copernicus’ personal record from CAC as displayed in
Omeka S.</p>
      <p><sup>12</sup> One alternative to Omeka S that was considered was ResearchSpace (https://researchspace.org/). Although ResearchSpace
offers native compatibility with CIDOC-CRM, configuring a user-friendly interface requires significantly more development
effort compared to Omeka S. Consequently, Omeka S was selected for this initial prototype.</p>
      <p><sup>13</sup> https://omeka.org/s/modules/Sparql/</p>
      <p><sup>14</sup> https://jena.apache.org/documentation/fuseki2/</p>
      <p><sup>15</sup> https://omeka.org/s/docs/developer/api/rest_api_reference/</p>
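      <p>The URI-to-identifier substitution can be sketched as a pure function, leaving out the HTTP requests that create the empty items and later patch them. The property identifiers, resource identifiers, and prefixed terms below are hypothetical and depend on the particular Omeka S installation. The value shape (“type”, “property_id”, “@value”, “value_resource_id”) follows our reading of the Omeka S REST API and should be checked against the installed version, in which the resource value type may be named differently.</p>
      <preformat>
```python
def to_omeka_payload(subject, triples, property_ids, resource_ids):
    """Build an item payload for `subject` from (s, term, o) triples.

    Objects that are known resources become internal references via
    value_resource_id; all other objects are emitted as literals.
    """
    payload = {}
    for s, term, obj in triples:
        if s != subject:
            continue
        value = {"property_id": property_ids[term]}
        if obj in resource_ids:
            value["type"] = "resource"
            value["value_resource_id"] = resource_ids[obj]
        else:
            value["type"] = "literal"
            value["@value"] = obj
        payload.setdefault(term, []).append(value)
    return payload


# Hypothetical data: two resources and the identifiers Omeka S assigned
# to the empty items created for them in a first pass.
triples = [
    ("ex:person/1", "crm:P1_is_identified_by", "ex:appellation/1"),
    ("ex:appellation/1", "crm:P190_has_symbolic_content", "Nicolaus"),
]
property_ids = {
    "crm:P1_is_identified_by": 201,
    "crm:P190_has_symbolic_content": 202,
}
resource_ids = {"ex:person/1": 10, "ex:appellation/1": 11}

# One payload per item, keyed by the item id it would be patched onto.
payloads = {
    item_id: to_omeka_payload(uri, triples, property_ids, resource_ids)
    for uri, item_id in resource_ids.items()
}
print(payloads)
```
      </preformat>
      <p>Creating all items before patching any of them guarantees that every internal reference resolves, regardless of the order in which resources appear in the RDF file.</p>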
      <p>Presenting CIDOC-CRM data in Omeka S poses another challenge. After an item is added
for display in a collection, all the triples that have that item as a subject are
displayed on its page, with each property named according to the respective ontology. One way to
customize the presentation of an item is by associating a resource template with it. Among other things,
resource templates can assign an alternate label to some properties or make some properties invisible.
However, two shortcomings were identified with item presentation. First, Omeka S does not allow
fields to be displayed with values nested in another item. A widespread practice in CIDOC-CRM (and in CH as
a whole) is modeling dates as time-spans. In CRM, for example, the class “E52 Time-Span” is used, with
its instances being linked to date data via the properties “P82a begin of the begin” and “P82b end of the
end”. Taking Copernicus’ personal record as an example, it is not possible to display Copernicus’
birthdate on his page, since it is nested in an instance of class “E67 Birth” and further in an instance
of “E52 Time-Span”. The same goes for the death date and for the publication date of a bibliographic record. The
issue affects not only dates, but also the listing of books an author wrote or of institutions where a
person studied or worked.</p>
      <p>The second shortcoming relates to the challenge mentioned in Sect. 2 of further classifying
instances of CIDOC-CRM classes in hierarchies that are not in the ontology by using the “E55 Type” class.
As visible in Fig. 3, both name and surname are indistinguishably presented under the header “is
identified by”. The same is the case for academic events and occupational events under “participated
in”. This limitation is due to the fact that Omeka S does not allow customization or filtering of fields
according to a nested value.</p>
      <p>Since CIDOC-CRM is a high-level ontology and intrinsically flexible, both nested data types and
vocabulary-based classification are essential aspects of modeling according to this standard. Given the
lack of these customization possibilities, the LD-based browsing and presentation capabilities of Omeka S
are partially incompatible with data represented in CIDOC-CRM<sup>16</sup>. One way to create a customized
presentation in Omeka S is to expand the CIDOC-CRM ontology with properties explicitly linking
the data points to be presented, for example, by creating a property called “death date” that is a shortcut
between an instance of “E21 Person” and a date value. The drawbacks of this approach are the need
for further ontology development and maintenance, and the divergence between the data model
present in the Omeka S repository and the one in the SPARQL endpoint.</p>
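      <p>To make the shortcut idea concrete, the sketch below derives a flat “death date” value by following the nested CRM path from a person through the death event and its time-span. The property <monospace>ex:death_date</monospace> and the instance identifiers are hypothetical. This is an illustration of the approach, not something implemented in JUHMP_v1.1.</p>
      <preformat>
```python
def follow_path(triples, start, path):
    """Return all values reached from `start` by following the
    properties in `path`, one hop per property."""
    nodes = [start]
    for prop in path:
        nodes = [o for s, p, o in triples if p == prop and s in nodes]
    return nodes


# Hypothetical instance data using CRM property names.
triples = [
    ("ex:person/1", "rdf:type", "crm:E21_Person"),
    ("ex:person/1", "crm:P100i_died_in", "ex:death/1"),
    ("ex:death/1", "rdf:type", "crm:E69_Death"),
    ("ex:death/1", "crm:P4_has_time-span", "ex:timespan/1"),
    ("ex:timespan/1", "rdf:type", "crm:E52_Time-Span"),
    ("ex:timespan/1", "crm:P82b_end_of_the_end", "1543-05-24"),
]

DEATH_DATE_PATH = [
    "crm:P100i_died_in",
    "crm:P4_has_time-span",
    "crm:P82b_end_of_the_end",
]

# Materialize the shortcut as extra triples for the presentation layer only.
shortcuts = [
    ("ex:person/1", "ex:death_date", value)
    for value in follow_path(triples, "ex:person/1", DEATH_DATE_PATH)
]
print(shortcuts)
```
      </preformat>
      <p>Generating such shortcut triples from the full graph, rather than maintaining them by hand, would limit the divergence between the two data models to a reproducible derivation step.</p>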
      <p>Despite the challenges encountered during the implementation of Omeka S, it remains, in our case,
the most viable option for an early-stage prototype of a semantic portal, especially given the constraints
in available time and resources for portal design and development.</p>
      <p><sup>16</sup> The NestedDataType module (https://github.com/sinanatra/NestedDataType) addresses a similar challenge and presents
a solution to some of the issues presented here. However, it is not yet suitable for production use without additional
configuration and bug fixes. Further experimentation is required to fully assess the value of adopting such a solution for this
particular context.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and further works</title>
      <p>Implementing a prototype for aggregating data based on different
controlled vocabularies and data standards has revealed many challenges. While some
issues faced while prototyping the first version of JUHMP were solved, such as modeling CAC and
JDL data according to the CIDOC-CRM ontology or importing CIDOC-CRM-based data into Omeka S,
the issue of presenting CIDOC-CRM-based data in Omeka S remains open. Furthermore, the
most pressing issues for data integration in the CHExRISH project were sidestepped through manual mapping or
by exclusion from the first prototype. Among these issues are the mapping of personal data and
bibliographical records between units, and the mapping of these data to external platforms, such as
GeoNames<sup>17</sup> and Wikidata<sup>18</sup>.</p>
      <p>The next step in the CHExRISH project is integrating metadata of MUJ’s objects and the complete
bibliographic dataset from BJ, exported in MARC21<sup>19</sup>. Despite the complexity of this task, the initial
prototype demonstrated that CIDOC-CRM is flexible enough to accommodate CH data from diverse
formats and models. The challenge now is to select and develop systems that not only preserve this
flexibility but actively leverage it to enhance interoperability and usability.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>This publication benefited from the use of language models—gpt-4o, gpt-4o-mini and DeepSeek-V3—to
support proofreading and enhance readability. All generated text was reviewed and edited, and the
authors take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This publication was funded by a flagship project “CHExRISH: Cultural Heritage Exploration and
Retrieval with Intelligent Systems at Jagiellonian University” under the Strategic Programme Excellence
Initiative at Jagiellonian University.</p>
      <p>The research for this publication has been supported by a grant from the Priority Research Area
DigiWorld under the Strategic Programme Excellence Initiative at Jagiellonian University.</p>
      <p><sup>17</sup> https://www.geonames.org/</p>
      <p><sup>18</sup> https://www.wikidata.org/</p>
      <p><sup>19</sup> https://www.loc.gov/marc/bibliographic/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hyvönen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rantala</surname>
          </string-name>
          ,
          <article-title>Knowledge-based relation discovery in cultural heritage knowledge graphs</article-title>
          ,
          in:
          <source>Proceedings of the Digital Humanities in the Nordic Countries 4th Conference (DHN 2019)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Warren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Otis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Finegold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Shalizi</surname>
          </string-name>
          ,
          <article-title>Six degrees of Francis Bacon: A statistical method for reconstructing large historical social networks</article-title>
          ,
          <source>Digit. Humanit. Q</source>
          .
          <volume>10</volume>
          (
          <year>2016</year>
          ). URL: https://api.semanticscholar.org/CorpusID:40536368.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Development of a virtual interactive system for Dahua Lou loom based on knowledge ontology-driven technology</article-title>
          ,
          <source>Heritage Science</source>
          <volume>11</volume>
          (
          <year>2023</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1186/s40494-023-01027-x</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. P.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Varagnolo</surname>
          </string-name>
          ,
          <article-title>A strategy for archives metadata representation on CIDOC-CRM and knowledge discovery</article-title>
          ,
          <source>Semantic Web</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>553</fpage>
          -
          <lpage>584</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.3233/SW-222798</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hyvönen</surname>
          </string-name>
          ,
          <article-title>Using the semantic web in digital humanities: Shift from data publishing to data-analysis and serendipitous knowledge discovery</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>187</fpage>
          -
          <lpage>193</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.3233/SW-190386</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>