Linked Data at the Swiss Federal Archives: Status Report

Jean-Luc Cochard
Swiss Federal Archives, Archivstrasse 24, 3003 Bern, Switzerland

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Linked Data is attracting increasing interest from the Swiss public administration. The Swiss Federal Archives are playing a leading role in this respect by investing significantly in the deployment of an infrastructure for publishing data as Linked Data. This approach has enabled the institution to acquire in-depth knowledge of the subject and to consider integrating Linked Data into its core applications and into the services it offers to the public.

Keywords: Linked Data, RDF, triplestore, Archival Information System, Database

1 Historical background

The Swiss Federal Archives (SFA) have been interested in Linked Data (LD) technology for almost 10 years. Initially, a few studies were commissioned from academic institutions to give the leadership of the SFA an overall idea of this technology, which was already of great interest in the field of libraries [1]. Interest in LD increased further when the SFA took over a project for the publication of open government data (OGD) in the Swiss federal administration in 2013. In Sir Tim Berners-Lee's 5-star model [2], LD is considered the optimal format for publishing data.

In 2014, a pilot infrastructure for hosting Linked Data was set up within the federal administration. The solution, called LINDAS (Linked Data Service), has enabled various administrations to experiment with both data conversion and access to LD from web applications.

LINDAS also favoured the setting up in 2014 of a collaboration among Swiss archival institutions under the name aLOD (archival Linked Open Data) [3]. Its ambition was to experiment concretely with the conversion into LD of descriptive metadata managed by the Archival Information Systems (AIS) of several Swiss institutions, in order to gain experience in this field.

Between 2017 and 2020, the focus was on improving LINDAS in order to transform the prototype solution into a productive, reliable infrastructure capable of hosting large volumes of data. In parallel, additional studies were conducted to determine whether LD and triplestore databases were able to meet the technical requirements of an AIS.

2 aLOD related activities

The archives participating in aLOD activities have set themselves the following goals:

1. To examine the opportunities for LD to achieve the mission of archival institutions.
2. Based on real descriptive metadata sets from their respective AIS, to transform and unify these metadata into LD datasets.
3. In doing so, to formulate "best practices" for transforming existing inventories (metadata) into LD.
4. To communicate and disseminate the achievements of the project within the archival community and among internal users, but also beyond, for example within the community of researchers in digital humanities, and to exchange with the actors who contribute to the implementation of LD technologies (GLAM and beyond).
5. To demonstrate the potential for third-party reuse of descriptive metadata in LD when made freely available (OGD), for example in the context of hackathons on cultural data.
The different archives exported data from their AIS as CSV files, which were then converted into LD using an ad hoc data model, since the RiC-O data model [4] was not yet available when this work was undertaken. Several particularities had to be taken into account in order to bring these data onto a common footing:

- The contents of the inventories had very different levels of detail from one institution to another. The data model therefore had to be enriched as new datasets were integrated.
- The language of the contents differed: French or German in this case. Fortunately, this aspect is well handled with language tags in RDF (Resource Description Framework, a formal model to define graph structures).
- The dates had variable numerical formats or were even in textual form. This is one of the aspects that took the longest to deal with, without resulting in a clean and reusable solution.
- For each institution, a data export procedure had to be put in place. Even for institutions that used the same AIS, a generic solution was not possible, as the content structures were quite different.

The data conversion produces triples like those associated with the SFA record with the signature "B0#1000/1483#3792*" (see Fig. 1). Thanks to this uniform representation of the data from the different archives, and to the fact that these Linked Data are directly accessible on the web via LINDAS and its SPARQL interface, it has been possible to build an experimental prototype for the visualisation of all these data (see Fig. 2). This representation includes a histogram of the number of records per date, which is unusual in archival web portals but could be useful to identify the density of information over time on a specific subject.

Fig. 1. Example of an entity from the AIS of the SFA, of type "File", converted to LD with the ad hoc data model used in 2015.

Fig. 2. Screenshot of the experimental application representing descriptive metadata of the institutions participating in aLOD.

3 LINDAS

As a Linked Data hosting infrastructure, LINDAS has been enhanced since its first release to become a productive infrastructure. This enhancement was carried out between 2017 and 2020. Its general structure is described schematically below (see Fig. 3). At the centre, there are several triplestores to allow testing, integration and finally production of new datasets. Data conversion can be a recurring or a one-off process. In either case, an ETL pipeline is implemented, the execution of which can be scheduled according to the updating of the source data.

Fig. 3. Structure of LINDAS with its satellite solutions.

To ease the definition of these conversion processes, the Data Cube Creator tool, specialised in the conversion of OLAP cubes [5], has been implemented. This tool allows the conversion of this type of data to be configured without in-depth knowledge of the W3C cube model [6], which is used for this purpose. This auxiliary solution allows many administrations to publish data in LD as linked open government data (LOGD). In addition, to enrich data documentation, the Schema Manager tool allows the modelling and publication of schemas and ontologies [7]. This is central to the long-term archiving of LD, as the description of the modelling schemas is as important as the data itself in defining the semantics of a dataset. To complete the infrastructure, the graphical visualisation of data hosted in LINDAS can be parameterised using the Visualize tool [8]. This solution works as an accelerator for the adoption of LD, as it facilitates the production of interactive graphical representations in web pages or digital reports, provided the data is first converted to LD and published in LINDAS.
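To give a rough idea of what a single step of such a conversion pipeline can look like, the following sketch turns a CSV export of descriptive metadata into RDF triples with language tags, using the Python rdflib library. It is a minimal illustration only: the namespace, column names, property names and the title value are hypothetical and do not correspond to the actual aLOD data model or to the LINDAS ETL tooling; only the record signature is taken from the example in Fig. 1.

# Minimal sketch of a CSV-to-RDF conversion step.
# The namespace and vocabulary below are hypothetical, not the aLOD model.
import csv
import io

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/archive/")  # placeholder namespace

# Stand-in for an AIS export: one record per row, language given per row.
# The signature is the one shown in Fig. 1; the title is invented.
CSV_EXPORT = """signature;title;lang;date
B0#1000/1483#3792*;Beispieltitel eines Dossiers;de;1950-1960
"""

def convert(csv_text):
    g = Graph()
    g.bind("ex", EX)
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=";"):
        # Derive a URI for the record from its signature (simplified slug).
        record = EX["record/" + row["signature"].replace("#", "-").replace("*", "")]
        g.add((record, RDF.type, EX.File))
        g.add((record, EX.signature, Literal(row["signature"])))
        # Language tags keep multilingual titles apart, as noted above.
        g.add((record, RDFS.label, Literal(row["title"], lang=row["lang"])))
        # Dates are kept as plain literals here; normalising them was one of
        # the hardest parts of the real conversion work.
        g.add((record, EX.date, Literal(row["date"])))
    return g

print(convert(CSV_EXPORT).serialize(format="turtle"))  # rdflib 6+ returns a string

In LINDAS, a step of this kind would typically be wrapped in a scheduled ETL pipeline rather than run by hand, and the resulting graph would be loaded into one of the triplestores mentioned above.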
4 Linked Data as core technology of an AIS

We believe that LD is the optimal solution for publishing data and making it accessible on the web. The question we asked ourselves in relation to our core activities as an archive is whether LD, and more specifically the RDF model, could be used as the central database of an AIS. In 2018 and 2019, two studies were conducted by research institutes (the reports have not been published, but the author can provide a copy on request) to verify certain aspects of this technology in relation to our own issues.

It emerged from this work that there are triplestore suppliers able to deliver solutions that are perfectly suited to our needs. Stardog version 5.2, for example, allowed us to build a graph of 10 billion triples by loading files of 100 million triples each, with an average and stable execution time of 20 minutes per file. This amount of data is much more than what we estimate we will eventually have to manage in our AIS: 100-500 million triples. Updates are crucial operations and are implemented with Delete and Insert operations; in our test, 1 million updates were performed in 12 seconds on average. Finally, SPARQL queries of varying complexity, combined with inserts and executed at different frequencies, all had sub-second response times. We are therefore confident that, if the triplestore is installed on suitable servers, this technology will perform well as the core database of an AIS.

Another issue that has been studied is whether the RDF model is as expressive as property graphs (e.g. Neo4j [9]). Fortunately, the evolution of RDF to RDF-star [10], and of its counterpart SPARQL to SPARQL-star, considerably reduces the expressive advantage of property graphs while retaining the advantage of RDF as a W3C open standard. The RiC-O standard itself is written in RDF but is designed to evolve quickly to RDF-star once this new standard is approved.

In our opinion, there is no reason why an AIS should not be developed with a triplestore at its core as a central database.

5 Future developments

If LD can be implemented at the core of an AIS, it can also play other roles. Here are two areas we are considering working on in the coming years.

5.1 Publication of database content

By publishing datasets in LD, public administrations take a first step towards publishing entire databases for public reuse. However, archives also hold databases among their archival holdings, ideally in SIARD format [11]. Unfortunately, this format is not designed for web publication of the data and its structure. A conversion from SIARD to LD seems to be a promising and feasible way to fill this gap.

5.2 Testing RiC-O

RiC-O in its current version 0.2 is a very promising proposal that still needs to be tested in the very different contexts of Swiss archives. To this end, LINDAS and its data conversion environment will allow us to test the conversion of the descriptive metadata of our inventories according to the RiC-O standard. Only then will it be possible to identify possible gaps in the model and to establish best practices for proceeding with this conversion task.
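As a first, exploratory step in this direction, such a conversion experiment could look roughly like the sketch below, which maps one descriptive record to RiC-O terms with rdflib. The namespace and the class and property names (rico:Record, rico:RecordSet, rico:identifier, rico:title, rico:isOrWasIncludedIn) reflect one reading of RiC-O 0.2 and should be verified against the published ontology; the base URI and the title value are invented, while the signature is again the example from Fig. 1.

# Exploratory sketch: mapping one descriptive record to RiC-O terms.
# RiC-O term names are assumptions based on version 0.2 and must be checked
# against the published ontology; the base URI and title are fictitious.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

RICO = Namespace("https://www.ica.org/standards/RiC/ontology#")
BASE = Namespace("https://example.org/sfa/")  # hypothetical base URI

g = Graph()
g.bind("rico", RICO)

fonds = BASE["recordset/1000-1483"]
record = BASE["record/3792"]

g.add((fonds, RDF.type, RICO.RecordSet))
g.add((record, RDF.type, RICO.Record))
g.add((record, RICO.identifier, Literal("B0#1000/1483#3792*")))
g.add((record, RICO.title, Literal("Beispieltitel eines Dossiers", lang="de")))
# Membership of the record in its record set; the exact property name
# (assumed here to be isOrWasIncludedIn) needs to be confirmed in RiC-O 0.2.
g.add((record, RICO.isOrWasIncludedIn, fonds))

print(g.serialize(format="turtle"))

Running such conversions over real inventory exports in LINDAS, and reviewing the resulting graphs, is what should reveal where the current version of the model still falls short.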
References

1. Godby, Carol Jean: The Relationship between BIBFRAME and OCLC's Linked-Data Model of Bibliographic Description: A Working Paper. Dublin, Ohio: OCLC Research (2013), https://www.oclc.org/content/dam/research/publications/library/2013/2013-05.pdf, last accessed 2021/07/02.
2. 5-Star Open Data, https://5stardata.info/en, last accessed 2021/07/02.
3. aLOD homepage, http://www.alod.ch, last accessed 2021/07/02.
4. RiC-O Version 0.2 homepage, https://www.ica.org/standards/RiC/RiC-O_v0-2.html, last accessed 2021/07/02.
5. OLAP cube, https://en.wikipedia.org/wiki/OLAP_cube, last accessed 2021/07/04.
6. The RDF Data Cube Vocabulary, https://www.w3.org/TR/vocab-data-cube/, last accessed 2021/07/04.
7. Zazuko Ontology Manager, https://zazuko.com/products/ontology-manager/, last accessed 2021/07/04.
8. Visualize homepage, https://www.visualize.admin.ch/en, last accessed 2021/07/04.
9. Neo4j homepage, https://neo4j.com/, last accessed 2021/07/04.
10. RDF-star and SPARQL-star Community Group Report, https://w3c.github.io/rdf-star/cg-spec/editors_draft.html, last accessed 2021/07/04.
11. SIARD Suite homepage, https://github.com/sfa-siard, last accessed 2021/07/04.