=Paper= {{Paper |id=Vol-3019/p25 |storemode=property |title=RiC-O Converter: a Software to Convert EAC-CPF and EAD 2002 XML Files to RDF Datasets Conforming to Records in Contexts Ontology |pdfUrl=https://ceur-ws.org/Vol-3019/LinkedArchives_2021_paper_14.pdf |volume=Vol-3019 |authors=Thomas Francart,Florence Clavaud,Pauline Charbonnier }} ==RiC-O Converter: a Software to Convert EAC-CPF and EAD 2002 XML Files to RDF Datasets Conforming to Records in Contexts Ontology== https://ceur-ws.org/Vol-3019/LinkedArchives_2021_paper_14.pdf
    RiC-O Converter: a Software to Convert EAC-CPF and EAD 2002 XML Files to RDF Datasets
    Conforming to Records in Contexts Ontology*

               Thomas Francart1, Florence Clavaud2, and Pauline Charbonnier2
                             1 Sparna, Tours, France
                   2 Archives nationales, Pierrefitte-sur-Seine, France
    thomas.francart@sparna.fr, florence.clavaud@culture.gouv.fr,
                pauline.charbonnier@culture.gouv.fr



             Abstract. RiC-O Converter is an open-source command-line tool that robustly
          converts EAD finding aids and EAC-CPF authority records to RDF files conforming
          to the Records in Contexts ontology. It was developed for the Archives nationales
          of France (ANF) but is intended for reuse by other archival institutions, and to
          this end is fully documented in English. It is based on XSLT stylesheets that
          take into account the variability of EAD content. It enabled the ANF to convert
          15000 EAC-CPF files and 29000 EAD files into a homogeneous knowledge graph. Such
          a graph opens new perspectives for navigating archival metadata and for linking
          from and to them.


          Keywords: Records in Contexts (RiC), RiC Ontology (RiC-O), RDF, XML
          EAD, XML EAC-CPF, open source software.


1         Introduction

RiC-O Converter is an open-source command-line tool that converts EAD finding aids
and EAC-CPF authority records to RDF files conforming to the ICA Records in Contexts
ontology.1 The tool, commissioned by the Archives nationales of France (ANF), was
developed by Sparna, a French company specialized in Semantic Web and knowledge graph
engineering. The Department of digital innovation of the French Ministry of Culture
sponsored and funded the project as part of the semantic roadmap that the ministry is
implementing. The tool was released on GitHub in April 2020.2




1
    ICA Records in Contexts Ontology (RiC-O) [1] is presented in another article authored by
    Florence Clavaud and Tobias Wildi.
2
    RiC-O Converter source code: https://github.com/ArchivesNationalesFR/rico-converter (last
    accessed 2021/07/03).
* Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
     International (CC BY 4.0).


2       Project History

Many archival institutions and projects around the world (including portals such as
Archives Portal Europe and FranceArchives) use XML/EAD and XML/EAC-CPF files to
describe their collections and the agents related to them. XML/EAD (EAD) [2] is based
on the ISAD(G) standard and facilitated that standard's adoption and dissemination.
EAD is either the production and storage format of finding aids or the output format
of databases. It is used to preserve finding aids and to exchange them with external
applications. XML/EAC-CPF (EAC) [3], based on the ISAAR(CPF) standard, is quite often
used to describe authorities such as corporate bodies, persons and families that
created or accumulated the fonds held by archival institutions.
    The RiC-O Converter project is based on the premise that transforming EAD and EAC
files to RDF, thus creating knowledge graphs about archives and their contextual
entities, results in a homogeneous and interoperable data structure that is compliant
with the FAIR principles,3 and opens new perspectives for querying, browsing, reusing
and publishing archival metadata, and for linking from and to them.
    The ANF4 have been interested in entity-relationship models and graph technologies
since 2013, one reason being that they had already authored a significant, and growing,
number of authority records that were linked to each other and to the descriptions of
the archives themselves. This in essence constitutes a very dense directed graph, whose
relations are not really displayed and cannot be queried or processed in the current
ANF information system [4, 5]. The ANF also wanted to connect these metadata with
metadata sets created by other institutions. Linked Data technologies thus seemed a
possible solution to meet these needs. RiC Ontology, an OWL domain ontology for
archives, was at last available; it is based on a recent entity-relationship conceptual
model [6], and it is rich and fully documented. This made it possible for the ANF to
produce RDF datasets.
    The ANF first built a qualitative proof of concept (PIAAF)5 to show that it was
possible to convert existing archival metadata to RDF datasets conforming to RiC-O, to
interconnect datasets from different institutions, and to visualize and explore them in
a new way. But the PIAAF prototype neither included a large quantity of metadata nor
took into account the variety of their structure and content. Therefore, the ANF needed
to move from this qualitative proof of concept to a large-scale project. Indeed, the ANF
hold a significant amount of metadata to be converted, which required developing a
reliable, efficient and configurable tool. The tool was designed to process finding aids
and authority records only, even though the ANF also hold several controlled vocabularies.
Building RDF/RiC-O vocabularies is a different process and is still a work in progress
3
    The FAIR Guiding Principles for scientific data management and stewardship (2016) are
    recommendations to improve “the Findability, the Accessibility, the Interoperability and
    the Reuse of digital assets”, around which a community and various initiatives have
    developed. See the website: https://www.go-fair.org/ (last accessed 2021/08/23).
4
    Homepage of the ANF website: http://www.archives-nationales.culture.gouv.fr/, last accessed
    2021/08/23.
5
    Homepage of the PIAAF project website: https://piaaf.demo.logilab.fr/, last accessed
    2021/07/03.


in the institution, since the internal data structure of these controlled vocabularies,
which currently conform to a very limited, locally defined model, should change soon.6
   The conversion tool was required to be industrial (meaning performant and capable
of processing tens of thousands of input files in a reasonable amount of time); tested
(to guarantee the coverage of all possible situations encountered in source files);
verbose (to produce log files that allow its execution to be followed); easy to install
(so that it can run on a typical desktop machine); configurable and adaptable (to suit
the needs of the ANF as well as the different needs of potential reusers); and well
documented.


3       Design and Main Characteristics

The proposed solution relies on XSLT stylesheets, encapsulated in a Java program. The
Java wrapper around the XSLT ensures a convenient command-line interface, the proper
sequencing of the conversion steps, and portability to all operating systems.
   The stylesheets convert EAD and EAC into RDF/XML containing instances of RiC-O
classes and properties. They live in a separate directory from the Java command itself,
so that the conversion logic can easily be modified without recompiling the tool.
   The development methodology relied on unit tests to cover all possible situations
that could be found in input files. Each of the 90 unit tests is specified as an input
EAD or EAC file with a corresponding expected RDF/XML file.7 When a test is run, the
output of the converter is compared to the expected file, and the test fails if
differences are found. The tests can be run directly from the command line, so that any
user can verify them, and add their own if the stylesheets are modified. The tests
ensure that no regressions are introduced when the software evolves.
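
   The comparison of an actual output with an expected file can be illustrated by the
following minimal Python sketch. It is not the code of RiC-O Converter (whose tests run in
Java, and whose exact comparison method is not described here); it only assumes that two
RDF/XML documents are considered equal when their canonical XML forms are identical. The
function names and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def canonical(xml_text: str) -> str:
    """Return the canonical (C14N 2.0) form, ignoring whitespace-only text."""
    return ET.canonicalize(xml_data=xml_text, strip_text=True)


def outputs_match(actual_xml: str, expected_xml: str) -> bool:
    """A test passes when the converter output equals the expected file."""
    return canonical(actual_xml) == canonical(expected_xml)


RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical expected file, pretty-printed.
EXPECTED = f"""<rdf:RDF xmlns:rdf="{RDF_NS}">
  <rdf:Description rdf:about="http://example.org/agent/1"/>
</rdf:RDF>"""

# Same content on a single line: formatting differences should not fail the test.
ACTUAL_OK = (
    f'<rdf:RDF xmlns:rdf="{RDF_NS}">'
    '<rdf:Description rdf:about="http://example.org/agent/1"/>'
    "</rdf:RDF>"
)

# A different URI is a real difference and should fail the test.
ACTUAL_BAD = ACTUAL_OK.replace("agent/1", "agent/2")

print(outputs_match(ACTUAL_OK, EXPECTED))
print(outputs_match(ACTUAL_BAD, EXPECTED))
```

Note that this compares XML documents, not RDF graphs: two files that serialize the same
triples in a different element order would be reported as different.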
   Running the tool is as simple as running a bat or sh script. The script asks for the
action to be executed (EAC or EAD file conversion) and for the properties file that
holds the options to use.
   The EAC to RiC-O conversion process is summarized in the following diagram:



6
    About the context and the ongoing ANF projects, see the presentations from the study day
    held on January 28, 2020, on Les métadonnées archivistiques en transition vers des graphes
    de données: https://labarchiv.hypotheses.org/1495, especially
    https://labarchiv.hypotheses.org/files/2020/02/20200128_3_RiCauxAN_EnjeuxPremieresRealisations.pdf
    (in French). See also a more recent presentation by Florence Clavaud, “Implementing ICA
    Records in Contexts-Ontology at the National Archives of France: first assessment and
    prospects”, for the Study Day on The Semantic Web and Cultural Heritage: From Data
    Convergence to Knowledge Crossing (Lille, France, February 3, 2021); slides in English and
    audio recording in French:
    https://geriico.univ-lille.fr/detail-event/le-web-semantique-et-le-patrimoine-culturel-de-la-convergence-des-donnees-au-croisement-des-connai/
    (last accessed 2021/07/03).
7
    The unit tests are available at
    https://github.com/ArchivesNationalesFR/rico-converter/tree/master/ricoconverter/ricoconverter-convert/src/test/resources
    (in two subfolders named eac2rico and ead2rico). A single input/output file pair can
    encode more than one test.




                      Fig. 1. EAC-CPF to RiC-O conversion process

1. Input EAC files are converted into RiC-O RDF/XML. Each input file yields a
   corresponding RDF/XML file. An option allows processing to stop here so that the raw
   output of the conversion can be examined.
2. The content of the raw RDF/XML files is reorganized to split the output in folders
   corresponding to agents, places, and relations. Relations are grouped into large files,
   each corresponding to a high-level relation class in RiC-O: Agent Hierarchical Re-
   lations, Agent Origination Relations, Agent Temporal Relations, Agent To Agent
   Relations, Family Relations, Membership Relations, Work Relations.
3. The relations are deduplicated to remove those that appear more than once. As the
   original relation is expressed in the source files for both related entities, the same
   relation expressed in RiC-O was generated twice in step 1.
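
Steps 2 and 3 can be illustrated by the following minimal Python sketch. It is not the
implementation used in RiC-O Converter; it assumes, for illustration only, that a relation
can be reduced to a (kind, source, target) triple, and the identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class Relation(NamedTuple):
    kind: str    # high-level relation class, e.g. "AgentHierarchicalRelation"
    source: str  # identifier of the first agent (hypothetical format)
    target: str  # identifier of the second agent (hypothetical format)


def group_by_kind(relations: Iterable[Relation]) -> Dict[str, List[Relation]]:
    """Step 2: group relations by their high-level relation class."""
    groups: Dict[str, List[Relation]] = {}
    for relation in relations:
        groups.setdefault(relation.kind, []).append(relation)
    return groups


def deduplicate(relations: Iterable[Relation]) -> List[Relation]:
    """Step 3: keep the first occurrence of each relation, preserving order."""
    return list(dict.fromkeys(relations))


# The same hierarchical relation is generated once from each of the two
# authority records that describe it.
raw = [
    Relation("AgentHierarchicalRelation", "agent/A", "agent/B"),  # from A's record
    Relation("FamilyRelation", "agent/C", "agent/D"),
    Relation("AgentHierarchicalRelation", "agent/A", "agent/B"),  # from B's record
]

unique = {kind: deduplicate(rels) for kind, rels in group_by_kind(raw).items()}
for kind, rels in unique.items():
    print(kind, len(rels))
```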
The EAD to RiC-O conversion process is summarized in the following diagram:




                         Fig. 2. EAD to RiC-O conversion process

1. Input EAD files are filtered according to the “@audience” EAD attribute, so that
   non-public files or components of files can be excluded from the conversion process.
2. The filtered EAD files are then converted into RiC-O RDF/XML. Each input file
   yields a corresponding RDF/XML file.
3. If requested, the output files can be split into smaller files, with the top Record
   Resource in one file and each “branch” of the finding aid in a separate file.
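
The filtering step can be illustrated by the following minimal Python sketch. It is not
the tool's implementation, which is written in XSLT; it assumes that non-public components
carry audience="internal" (one of the two values that EAD 2002 defines for this attribute),
and the sample finding aid is hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_internal(root: ET.Element) -> ET.Element:
    """Remove, in place, every descendant element marked audience="internal"."""
    # Materialize the traversal first so that removals do not disturb it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("audience") == "internal":
                parent.remove(child)
    return root


# Hypothetical finding aid with one public and one non-public component.
SAMPLE = """<ead>
  <archdesc>
    <dsc>
      <c id="c-1"><did><unittitle>Public series</unittitle></did></c>
      <c id="c-2" audience="internal"><did><unittitle>Restricted</unittitle></did></c>
    </dsc>
  </archdesc>
</ead>"""

filtered = drop_internal(ET.fromstring(SAMPLE))
remaining_ids = [c.get("id") for c in filtered.iter("c")]
print(remaining_ids)
```

This sketch only removes components inside a file; excluding a whole file whose root
element is non-public would need a separate check before conversion.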
It should be noted that the conversion step to RiC-O RDF/XML assumes that each “c”
element in the input EAD file has an “@id” XML attribute, from which the corresponding
Record Resource URI is derived.
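
A minimal Python sketch of such a derivation is given below. The URI pattern, the root URI
and the function name are hypothetical and chosen for illustration only; the patterns
actually produced by RiC-O Converter are defined in its stylesheets and options.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical root URI; in the tool, the root URI is a configurable option.
ROOT_URI = "http://example.org/"


def record_resource_uri(c_element: ET.Element, root_uri: str = ROOT_URI) -> str:
    """Build a Record Resource URI from the id attribute of a <c> element."""
    c_id = c_element.get("id")
    if not c_id:
        # Without an id, no stable URI can be derived for the component.
        raise ValueError("<c> element without an id attribute")
    # Percent-encode the identifier so that it is safe inside a URI path.
    return f"{root_uri}recordResource/{quote(c_id, safe='')}"


component = ET.fromstring('<c id="c-42"><did><unittitle>File</unittitle></did></c>')
print(record_resource_uri(component))
```

Raising an error on a missing identifier makes the assumption explicit instead of silently
generating an unstable URI.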
   The EAD conversion takes into account some of the variability found in EAD files,
where the same element is allowed to contain different content. For example, a
“physdesc” element with only text, a “physdesc” element with mixed content including
“dimensions” and “extent” child elements, and a “physdesc” element with “extent” and
“physfacet” child elements containing a reference to a controlled vocabulary result in
different outputs, as shown in the corresponding unit test.8
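
The three “physdesc” cases can be told apart as in the following minimal Python sketch.
It only classifies the shape of the element; the labels are hypothetical, and the RiC-O
output produced for each case is defined by the stylesheets and the unit test, not by this
code.

```python
import xml.etree.ElementTree as ET


def physdesc_shape(physdesc: ET.Element) -> str:
    """Classify a <physdesc> element by the kind of content it holds."""
    child_tags = {child.tag for child in physdesc}
    if not child_tags:
        return "text-only"
    if {"extent", "physfacet"} <= child_tags:
        return "extent-and-physfacet"
    if child_tags & {"dimensions", "extent"}:
        return "mixed-content"
    return "other"


# Hypothetical samples, one per case described in the text.
samples = [
    "<physdesc>3 boxes</physdesc>",
    "<physdesc>Register, <extent>120 leaves</extent>, "
    "<dimensions>30 x 20 cm</dimensions></physdesc>",
    "<physdesc><extent>1 item</extent><physfacet>parchment</physfacet></physdesc>",
]

shapes = [physdesc_shape(ET.fromstring(sample)) for sample in samples]
print(shapes)
```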

   The performance of the tool is very good: 15200 EAC files are processed in approx-
imately 15 minutes, yielding 0.7 million triples. 29000 EAD files are processed in ap-
proximately 30 minutes, yielding approximately 155 million triples.


4       A Tool for the Archival Community

The software is fully documented in English to reach the international community. The
project team produced mappings from EAC-CPF and EAD to RiC-O, which should be useful
to the archival community, especially to institutions considering converting their
metadata to RDF. The mappings are available in the documentation section9 of the tool.
The conversion process and the ways to customize it are documented too. The source code
is open and freely accessible on GitHub. The software is licensed under the terms of
the CeCILL-B license.10
   The converter was developed first to address the needs of the ANF, but we kept in
mind its potential use by any other archival institution; the code is therefore easily
configurable. Typically, the root URI for the URIs to be generated is an option that can
easily be changed. However, archival institutions would probably need to adapt the
software to their own systems. For example, we did not take some elements into account
in the conversion process, either because these elements are not used in the ANF EAD
finding aids (“abstract”, “dao”, “bioghist”, etc.) or because they are not relevant in
RDF (“frontmatter”, “titlepage”, etc.).11


5       Results and Prospects

The ANF have converted nearly 29000 finding aids and 15200 authority records using
RiC-O Converter. We can convert them again when needed, for example when major
updates occur in our metadata.
   Quality issues surfaced during development (lack of precision or, worse, incorrect
use of the EAD format). More generally speaking, improving the quality of archival
metadata is a key issue for the ANF, and quality management and data governance also


8
   Physdesc unit test:
   https://github.com/ArchivesNationalesFR/rico-converter/tree/master/ricoconverter/ricoconverter-convert/src/test/resources/ead2rico/_32_physdesc
   (last accessed 2021/07/03).
9
   EAD to RiC-O and EAC-CPF to RiC-O mappings can be found in
   https://github.com/ArchivesNationalesFR/rico-converter/tree/master/ricoconverter/ricoconverter-doc/src/main/resources
   (EAC_to_Ric-O_0.1_documentation.xlsx and EAD_to_Ric-O_0.1_documentation.xlsx files).
10
   https://github.com/ArchivesNationalesFR/rico-converter/blob/master/license.txt.
11
   On these aspects, and on all the topics presented in this article, more information is available
   (in French) in [7].


have to be enhanced. In a way, processing the generated RDF datasets can help assess
this problem and solve it. Examples include aligning the data on agents with other RDF
datasets, e.g. those of the French National Library (BnF)12 or of Wikidata,13 in order
to enrich them, or (not investigated yet, but an important need) linking or merging
distinct descriptions of the same archival resources, authored over time in distinct
finding aids.
   Also related to quality would be the use of SHACL14 rules to assess the conformance
of the generated graph structure; these rules could be derived directly from the RiC-O
ontology15 (typically cardinality, domain and range checks), or they could be
hand-written to validate business-oriented patterns in the graph. This remains to be
done in the future.
   In terms of challenges, the RDF datasets resulting from the conversion are not yet
published or searchable, because the ANF information system lacks the necessary
infrastructure: the ANF do not have any triplestore available online. However, the ANF
have published dumps of the RDF datasets.16 Besides, the data are already used in
research projects such as ALEGORIA,17 a project that aims at facilitating the promotion
of iconographic institutional collections describing the French territory in various
periods, from the interwar period to the present day. A triplestore accessible through
a SPARQL endpoint will soon be released. It will be connected to a web application
demonstrating 3D immersive navigation through geolocalised photographs.
   Moreover, the ANF should release a fairly large-scale prototype by the beginning of
2022, including an easy-to-use, visual SPARQL query interface.18
   More generally speaking, RiC-O Converter needs to evolve for several reasons. RiC-O
Converter is based on RiC-O 0.1 (dated December 2019), but Records in Contexts has
evolved since that date: RiC-O 0.2 was released in February 2021, and introduces new
components (like the Extent class) as well as updates (particularly concerning the
names of several object properties). The corresponding changes will be made in RiC-O
Converter before the end of 2021.



12
   The BnF provides a web interface, including a SPARQL endpoint, for its RDF datasets, at
   https://data.bnf.fr/. The datasets can be downloaded from the following page:
   https://api.bnf.fr/dumps-de-databnffr (pages last accessed 2021/08/23).
13
   About the well-known Wikidata knowledge base, see https://www.wikidata.org/ (last
   accessed 2021/08/23).
14
   Shapes Constraint Language, https://www.w3.org/TR/shacl/ (last accessed 2021/07/03).
15
   Using, for example, SHACL Play! See https://shacl-play.sparna.fr/play/rules-catalog (last
   accessed 2021/07/03).
16
   A small subset of the EAD and EAC files of the ANF, and their RDF/RiC-O 0.2 version, is
   available in the public RiC-O Git repository:
   https://github.com/ICA-EGAD/RiC-O/tree/master/examples/examples_v0-2/NationalArchivesOfFrance.
   The ANF have also started to publish their authority records and vocabularies (RDF version,
   using mainly RiC-O 0.2 and SKOS, available at
   https://github.com/ArchivesNationalesFR/Referentiels).
17
   https://www.alegoria-project.fr/en (last accessed 2021/07/03).
18
   This interface will be built with Sparnatural; see https://github.com/sparna-git/Sparnatural
   (last accessed 2021/07/03).


   RiC-O Converter does not convert XML/SEDA19 files, which are used in French digital
archives management systems to describe digital archives. These files include technical
and preservation metadata, whose definition is inspired by the PREMIS data dictionary,20
which is more widely known in archives. Mapping SEDA to RiC-O and implementing the
corresponding transformation are tasks for a future version of RiC-O Converter.


6      Conclusion

The transition from existing formats to novel graph-based and web-oriented conceptual
models represents a challenge that can hinder the adoption of such new models. We
especially think of FRBR and LRM in the library world, or CIDOC CRM for museums.
By providing RiC-O Converter, a robust, adaptable and off-the-shelf tool to transition
from EAD and EAC to RiC-O, the archival community aims at easing and encouraging this
transition, in order to make archival data part of the Web of data.


References
1. International Council on Archives (ICA) Records in Contexts-Ontology (RiC-O) latest
   official release: https://www.ica.org/standards/RiC/ontology, last accessed 2021/07/03.
2. Encoded Archival Description (EAD): https://www.loc.gov/ead/, last accessed 2021/07/03.
3. Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF) XML
   schema: https://eac.staatsbibliothek-berlin.de/, last accessed 2021/07/03.
4. Clavaud, F.: Building a knowledge base on archival creators at the National Archives of
   France: issues, methods, and prospects. In: Journal of Archival Organization, vol. 12, 1-2
   (2015), pp. 118-142. Doi:10.1080/15332748.2015.1001642.
5. Clavaud, F.: Transformer les métadonnées des Archives nationales en graphe de données :
   enjeux et premières réalisations, in: Les Archives nationales, une refondation pour le XXIe
   siècle, La Gazette des Archives, n°254 (2019-2), pp. 59-88.
6. International Council on Archives (ICA): Records in Contexts-Conceptual model (RiC-CM)
   0.2 (July 2021), https://www.ica.org/sites/default/files/ric-cm-02_july2021_0.pdf, last
   accessed 2021/08/23.
7. Francart, T., Charbonnier, P.: RiC-O Converter, un logiciel libre de conversion de
   métadonnées archivistiques (en EAD et EAC-CPF) en jeux de données conformes à RiC-O
   (2020/01/28), https://labarchiv.hypotheses.org/files/2020/02/20200128_4_RiCOConverter.pdf,
   last accessed 2021/07/03.


19
   The SEDA (‘standard d’échange de données pour l’archivage’) models the various
   transactions that may occur between the agents involved in digital archiving. This French standard
   conforms to ISO 20614:2017 (Information and documentation - Data exchange protocol for
   interoperability and preservation). The standard includes an XML schema. See
   https://francearchives.fr/seda/index.html (last accessed 2021/07/03).
20
   PREMIS (PREservation Metadata Implementation Strategies) is a data dictionary that was
   created in 2005, and is now expressed, among other formats, through an OWL ontology. It is
   hosted by the Library of Congress and maintained by the PREMIS Editorial Committee. See
   http://www.loc.gov/standards/premis/ (last accessed 2021/07/03).