Developing a Pan-Archival Linked Data Catalogue

Jone Garmendia1[0000-0002-6532-2823] and Adam Retter2[0000-0001-9361-2126]

1 The National Archives of the United Kingdom, jone.garmendia@nationalarchives.gov.uk
2 Evolved Binary, adam@evolvedbinary.com

Abstract. The UK National Archives has a large archival catalogue, available online since 2000. Our data is largely aligned to ISAD(G) and ISAAR(CPF) and is stored in a legacy relational database. We are now in a situation where the supporting infrastructure and the intrinsic data model are hindering our digital goals. We want to re-imagine archival practice by pioneering new approaches to description and access and by building a linked data catalogue. In this paper, we introduce Project Omega, which has evaluated standards, conceptual models and ontologies to identify those most fit for purpose to underpin a new Pan-Archival Catalogue. This paper describes the early project findings and the current implementation stage, providing an update on our five work streams: Data Modelling, Extract, Transform and Load of Data, API, Catalogue Management System and Infrastructure.

Keywords: Archives, Catalogues, Data Models, Linked Data, Metadata, Ontologies.

1 Context

The National Archives of the United Kingdom (TNA) has a large and diverse archival catalogue. Our catalogue is itself an archival record as well as a crucial business asset. The catalogue comprises several discrete database systems of varying technologies, although the data is largely aligned to ISAD(G) and ISAAR(CPF). The largest catalogue system (PROCat), which holds details of born-physical records, has been available online since 2000, and is backed by a relational database. Over the last 21 years, the infrastructure supporting our catalogue has expanded and diverged into an ecosystem of over 10 database systems. Separate systems were built, for example, to manage the legal conditions governing access to records, and to preserve digitised and born-digital archives.

In 2020, Project Omega started to explore the idea of replacing the aging archival catalogue systems. The project identified that a single system could be built to unify born-physical, digitised and born-digital catalogues by adopting a non-rigid (or schema-less) data model. It also recommended the use of a graph-based data model built with RDF (Resource Description Framework) technologies.

Our proposition is to move towards a single pan-archival linked data catalogue, taking a holistic view of an archive's assets (including all media, digital surrogates, and other record manifestations). To achieve this we need a sustainable data model and ontology that is flexible enough to support a second generation of complex born-digital accumulations as well as historical archives. We will gradually consolidate existing systems to introduce better workflows for accessions, data enhancement, enrichment, and for controlling access and publication of records. We intend to enhance confidence in the integrity of the data by introducing robust version management, provenance information, and audit trails. Through this project, we want to realise our ambition to reimagine archival practice and pioneer new approaches to description, data modelling and archival catalogue structures, delivering a new linked data catalogue.
This approach provides firm foundations for delivering our 'Archives for Everyone' strategy1, enabling us to free our data and to unleash the power of an archival catalogue in a way that can support new forms of user engagement, participation, data re-use, and research. This data infrastructure project (Project Omega) is running in parallel with another project (Etna) that envisages what public interface we would create if we were to start completely anew with our website and vast catalogue2.

2 Earlier Findings

In early 2020, Project Omega analysed the strengths and weaknesses of existing standards, to ascertain their suitability for expressing our conceptual data model. We focused our assessment on the following models:

● TNA-CS13 (aligned to ISAD(G))
● TNA-DRI (aligned to Dublin Core)
● EADv3
● DCATv2
● FRBR
● RDA
● BIBFRAME Lite + Archive
● Europeana Data Model
● RiC-CM v0.1 and draft RiC-O v0.2, including PIAAF project information
● The Matterhorn RDF Model approach.

In considering models for adoption, we have a strong preference for open standards. The National Archives is committed to the use of open standards and is an active member of several international standards organisations. We adopt standards that align with the UK government's open standards principles3 and, in particular, that are developed through fair and transparent processes.

One model of particular interest was the Records in Contexts Conceptual Model (RiC-CM) and its associated ontology RiC-O. These have been developed by the Expert Group on Archival Description under the auspices of the International Council on Archives (ICA). ICA provides a good home for archives and archivists to work together on standards. Unfortunately, we found it challenging to engage in the process for developing or contributing to RiC. We are looking forward to engaging with the Expert Group in an open and collaborative process. We would encourage both transparency and wider participation in RiC, to ensure RiC reflects the needs of ICA's membership and can benefit from the experience of prospective implementers like ourselves.

1 https://www.nationalarchives.gov.uk/about/our-role/plans-policies-performance-and-projects/our-plans/archives-for-everyone
2 The National Archives Discovery website presents over 35 million descriptions at https://discovery.nationalarchives.gov.uk/
3 https://www.gov.uk/government/publications/open-standards-principles/open-standards-principle

The outcome of our assessment was published in a Catalogue Model Proposal paper4, outlining findings and technology recommendations. The paper evaluated 35 test cases with sample catalogue data expressed using the various ontologies. We decided to adopt a graph linked data model adhering to the principles of the Records in Contexts Conceptual Model (RiC-CM) but using a combination of existing, mature vocabularies (inspired by the Matterhorn RDF Model5 approach), rather than adopting the RiC Ontology as available at the time. The key RiC-O challenges for us were:

● its limited set of properties to model our current born-digital material,
● a lack of comprehensive facilities for describing and controlling access and availability conditions for descriptive metadata and digital records,
● our need for the model to handle revisions, redactions, manifestations, associated provenance metadata, access rights and mappings to other vocabularies.

4 https://www.nationalarchives.gov.uk/documents/omega-catalogue-model-proposal.pdf
5 https://fedora.phaidra.univie.ac.at/fedora/objects/o:1079685/methods/bdef:Content/download
We had to make some difficult trade-offs, balancing the convenience of using a single archival ontology against our existing data proposition, our catalogue business rules and the legal context surrounding access to public records in the UK. Our hybrid ontology makes use of mature and well-tested W3C vocabularies such as PROV-O and ODRL. This allows us to fulfil our business needs while modelling concepts in a wider multidisciplinary context, reaching to and beyond the world of archives, and enhancing interoperability. We continue to review revisions to the RiC Conceptual Model and Ontology as they evolve, and believe that our approaches are similar, compatible, and travelling in the same direction.

Modelling metadata variation over time, in the context of increasing uncertainty, is a difficult challenge. Descriptive practice must be aware of temporal variation. To model metadata variation over time, we have separated the enduring form of a record from its transient descriptions. Therefore, in our new model, any changes to the description or arrangement of a record will generate a new description and/or arrangement. Any fact established in the past is immutable and fully transparent. We make use of a FRBR-like layering of entities to separate enduring concepts, temporal descriptions, and realisations. That, coupled with the W3C (World Wide Web Consortium) Provenance Vocabulary (PROV), enables us to record how our records evolve. Additional properties in the data model are used to describe relationships between versions and their temporal extent. PROV gives us the ability to store information about revisions, agents (people/organisations), and activities (the process of change).

In the UK public records system, early transfer of government files is encouraged. Records often reach The National Archives before they are 20 years old. Legal exemptions and instruments define the types of information that must remain closed for a particular period (under Data Protection and Freedom of Information legislation). In addition to providing intellectual control, describing our records and providing access, our catalogue manages closure metadata and the operational process of opening previously closed files (and vice versa). The W3C Open Digital Rights Language (ODRL) vocabulary has furnished us with an approach to model the legal conditions governing access to public records. In parallel, another project at TNA is investigating the concept of 'gradated access', which would allow different types of users (sometimes in different locations) varying degrees of access to records and their metadata depending on multiple conditions. Our research shows that we will likely be able to complement our ODRL policies for closure with additional policies for 'gradated' access and online publication.

3 Implementation

The second phase of Project Omega started in January 2021. We have a small but cross-disciplinary team working on five parallel work streams: data modelling, ETL (Extract, Transform and Load of data), API, management system, and infrastructure. There are many intricate dependencies between tasks under each of the work streams. We use Agile methods and tools to overcome these difficulties and obtain quick insights and feedback from archival and metadata experts.
During the first phase of the project, our data modelling focused on immutable record description and arrangement. In the second phase, while we are considering the detail of modelling authority files (corporate bodies, persons, etc.), we have learnt that it is more effective to run the data modelling and ETL work streams in parallel, as each informs a better design in the other.

3.1 Data Modelling Implementation

Our Project Omega Data Model6 is built atop ideas from many existing approaches and papers. We started modelling with the International Council on Archives' Records in Contexts Conceptual Model (RiC-CM) and integrated many of the ideas around reuse from the Matterhorn RDF approach. Fundamentally, we derived two key axioms that underpin our work:

● A Record is not just the paper or the digital file. Its descriptive metadata is part of the record. When a catalogue description is subject to change or archival intervention, it must also be subject to preservation in the same manner as the paper or digital file.
● How a record changes through time, in its physical form and description, provides valuable contextual information. Being able to understand the curation of the record through time provides valuable insight into record-keeping behaviours and their impact on records and users. Preserving and presenting this information allows the archive (and the government creator) to become fully transparent and accountable.

6 https://www.nationalarchives.gov.uk/documents/omega-catalogue-data-model.pdf

To satisfy both axioms, all descriptions of records must be preserved and become immutable. If an archivist wishes to change any element of the description or arrangement, a copy is made and the amended description becomes the live version. Our legacy system allows only for the live and one previous version. We will no longer replace a description with the amended one. Furthermore, we will record the provenance of each change, incorporating metadata about when, who, how, and why the description was amended.

To implement this we decided to sub-divide the notion of the 'Description of a Record' as an entity into four distinct entities: Concept, Description, Realisation, and Digital File (when a digital file exists).

The Record Concept contains properties that are known to be permanently immutable (i.e. will never be amended). The Record Concept is a single anchor for each archival record, only asserting 'we know we have a record'. In practice, it is little more than an identifier. Each Record Concept may have as many Record Descriptions or Record Realisations as are required. It is entirely possible to have concurrent competing descriptions (e.g. curated vs. machine learning vs. public user contribution) and realisations of a record. The metadata properties, which have historically been considered the description of the record, are now split between the Record Description and the Record Realisation.

By separating the description of a record into these four entities, we can easily create new descriptions, realisations, and arrangements without destroying any existing information. Serendipitously, the ability to have multiple competing descriptions and realisations of a record enables us to use these same constructs to manage redaction (when part of a record or description cannot be publicly accessed) and un-redaction (when the closed part can be reinstated).
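To make the entity split concrete, the following is a minimal sketch (not drawn from our actual data) of how one Record Concept with two competing Record Descriptions might be assembled with Apache Jena, anticipating the vocabularies listed later in this section (PREMIS 3, PROV and Dublin Core). The base URI, the identifiers, and the use of dct:isVersionOf to link a Description to its Concept are illustrative assumptions, not the published Omega model.

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.vocabulary.DCTerms;

public class RecordEntitySketch {

    // Namespaces; the base URI below is a placeholder, not a real TNA identifier scheme.
    static final String PREMIS = "http://www.loc.gov/premis/rdf/v3/";
    static final String PROV   = "http://www.w3.org/ns/prov#";
    static final String BASE   = "http://example.org/omega/";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("premis", PREMIS);
        m.setNsPrefix("prov", PROV);
        m.setNsPrefix("dct", DCTerms.NS);

        Resource intellectualEntity = m.createResource(PREMIS + "IntellectualEntity");
        Property wasRevisionOf = m.createProperty(PROV, "wasRevisionOf");

        // The Record Concept: a permanent, immutable anchor, little more than an identifier.
        Resource concept = m.createResource(BASE + "concept/C1", intellectualEntity);

        // Two descriptions of the same record; the amended one references, but never replaces, the original.
        Resource firstDescription = m.createResource(BASE + "description/C1.D1", intellectualEntity)
                .addProperty(DCTerms.isVersionOf, concept)
                .addProperty(DCTerms.title, "Original description");

        m.createResource(BASE + "description/C1.D2", intellectualEntity)
                .addProperty(DCTerms.isVersionOf, concept)
                .addProperty(DCTerms.title, "Amended description")
                .addProperty(wasRevisionOf, firstDescription);

        RDFDataMgr.write(System.out, m, Lang.TURTLE);
    }
}
```

In the full model a Record Realisation (a premis:Representation) and, where one exists, a Digital File would hang off the same Concept in the same fashion, with PROV activities and agents recording who made each change, when, and why.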
Figure 1. PROCat and Omega Record History

We have created a highly flexible data model (preserving changes and their purpose) from this FRBR-like separation of the entities of a record description.

Figure 2. Example of Omega Record Provenance

Our data model documentation shows examples, advises on what properties and relationships to use and when, and provides mappings from the current data models (roughly aligned to ISAD(G)). We are adopting the RiC-CM concepts of 'Record' and 'RecordSet' to enable non-hierarchical relationships, moving away from the classic ISAD(G) hierarchical levels of description. Our approach for authority files (corporate bodies, persons, places, concepts) uses the same division into FRBR-like entities.

In terms of RDF vocabularies, we specifically make use of the following:

● PREMISv3: for our basic entities and structure. RecordSets, Record Concepts, and Record Descriptions are all premis:IntellectualEntity. Our Record Realisation is a premis:Representation
● W3C Provenance: to record who, when, and why things were changed
● W3C Time Ontology: to express complex date information, e.g. Covering Dates of a Record or RecordSet
● Dublin Core: for many of our simpler data properties, e.g. identifier, title, creator, abstract, etc.
● RDA (Resource Description and Access) Ontology: used where we cannot find suitable properties or relationships in other preferred ontologies
● a very limited bespoke ontology when we absolutely need to add a property7.

7 See https://medium.com/the-national-archives-digital/reusing-standard-rdf-vocabularies-part-1-5a9bbfa58b85 and https://medium.com/the-national-archives-digital/reusing-standard-rdf-vocabularies-part-2-4e4a3ad0bbf5

As we have no single comprehensive ontology to ensure the quality of the RDF data that we produce, we are using SHACL to validate our RDF data. At the time of writing, we have exported over 8 million item-level records from the catalogue relational database into our new data model as RDF Turtle; the first step towards loading our data into a cloud-based graph database (Amazon Neptune).

Every entity (resource) in RDF must have a unique URI to identify it. During the first phase of Project Omega, we undertook a task to investigate and propose a new Catalogue Reference labelling scheme. The goal was to create a scheme which would be friendly to human communication (verbal and written), easy to generate computationally, and suitable for use in URIs. In addition, the scheme would have to be suitable for all media of records (born-physical, digitised, born-digital, etc.) and scale to describe versions of description and realisation. This scheme is not designed to replace existing record identifiers (catalogue references) but rather to augment them. The new scheme is known as OCI (Omega Catalogue Identifier) and has the potential to become the canonical identifier of a record at TNA. We have published several articles on our URI and Identifiers research: Archival Catalogue Record Identifiers8, Archival Identifiers for Digital Files9, and Extreme Identifiers (for use in URIs)10, and will fully document the scheme when it has been verified as fit for purpose.

8 https://medium.com/the-national-archives-digital/archival-catalogue-record-identifiers-29b0a1fac9ba
9 https://medium.com/the-national-archives-digital/archival-identifiers-for-digital-files-c448ff463c22
10 https://medium.com/the-national-archives-digital/extreme-identifiers-for-use-in-uris-cae773b98cf7

3.2 Extract, Transform, and Load of Data

To reach the goals of Project Omega and replace the existing legacy catalogue systems, we must populate our new graph database with the data drawn from source databases. Before importing, data must be transformed to fit the data model.
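Once transformed, the RDF output is checked against shapes that express the constraints of our data model, using SHACL as noted in Section 3.1. The following is a minimal sketch of such a check with Apache Jena's SHACL engine; the shape and data file names are illustrative placeholders, not our actual Omega shapes.

```java
import org.apache.jena.graph.Graph;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.shacl.ShaclValidator;
import org.apache.jena.shacl.Shapes;
import org.apache.jena.shacl.ValidationReport;
import org.apache.jena.shacl.lib.ShLib;

public class ValidateTransformedData {
    public static void main(String[] args) {
        // Hypothetical files: SHACL shapes describing the data model, and a batch of transformed records.
        Graph shapesGraph = RDFDataMgr.loadGraph("omega-shapes.ttl");
        Graph dataGraph   = RDFDataMgr.loadGraph("transformed-records.ttl");

        Shapes shapes = Shapes.parse(shapesGraph);
        ValidationReport report = ShaclValidator.get().validate(shapes, dataGraph);

        if (report.conforms()) {
            System.out.println("Batch conforms to the data model; safe to load.");
        } else {
            // Print each constraint violation so the offending records can be fixed before loading.
            ShLib.printReport(report);
        }
    }
}
```

A check of this kind can sit at the end of each ETL job, so that only conformant batches are loaded into the graph store.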
The existing systems are varied, and data must be drawn from SQL databases, JSON document databases (e.g. MongoDB), Microsoft Access, Excel spreadsheets, CSV files, XML files, RDF data stores (e.g. Apache Jena), etc. Each data source presents its own technical and data challenges.

Rather than developing custom software for each data source, we decided to make use of an existing framework for building and executing ETL (Extract, Transform, and Load) processes. For this purpose, we chose Hitachi Vantara's PDI (Pentaho Data Integration) suite. PDI has the advantages of being an established product with good documentation and community support; it is Open Source, written in Java (with which our team is familiar), and extensible through authoring custom plugins (in Java).

PDI ships with many built-in steps, each of which performs a customisable action, such as extracting data from a SQL database, parsing emails, or finding and replacing text in strings. The benefit of using PDI is that many of the steps that we must perform to transform data from one system to our Omega model are already available. PDI allows you to connect these steps visually into custom transformations or jobs. Transformations can be re-used within other transformations/jobs as steps themselves, which enables developers to build up their own library of reusable components.

Much of the time spent building transformations in PDI goes into ensuring that the output data meets the standards of our data model. Data is often messy and inconsistent due to the long operating life and limited constraints of the original systems, which have been in use for over 20 years. We have had to build many steps within our workflows to clean up the data. For example, the catalogue relational database uses EAD XML to provide structured textual metadata for some properties of the records (e.g. Scope and Content); however, there is little in the way of system constraints on how the EAD may be used, or even whether the XML is well formed! We have used existing Schema Validation and Regular Expression steps in PDI to clean up the data and ensure the validity of the EAD XML.

Figure 3. PDI Transformation for transforming Person(s) from PROCat/ILDB to Omega RDF

To date we have faced many challenges with our ETL work. We will not dwell on these in detail here; instead, we would like to share how we overcame some of the obstacles:

● Building custom plugins for PDI to produce RDF output. We released these as Open Source11.
● Ensuring that we create a unique URI for each entity (Record, Person, Corporate Body, etc.) and that the same URI is reused for the same entity throughout the system. We developed further Open Source plugins for PDI12 to ensure that a unique URI was created and stored only once.
● Processing and computing over historical dates from the (now) United Kingdom. This was ultimately related to how archival dates have been recorded, and the fact that the Julian/Gregorian Calendar switch-over was not undertaken universally at the same time. We investigated how to solve this programmatically13 and also contributed an Open Source enhancement to PDI to fix the issue14 (see the sketch after this list).

11 https://github.com/nationalarchives/kettle-jena-plugins and https://blog.adamretter.org.uk/rdf-plugins-for-pentaho-kettle
12 https://github.com/nationalarchives/kettle-atomic-plugins
13 https://medium.com/the-national-archives-digital/processing-historical-dates-d7ddb5814de8
14 https://github.com/pentaho/pentaho-kettle/pull/8006
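The calendar problem is easy to underestimate: a date recorded before the British switch-over in September 1752 names a day in the Julian calendar, which is not the same real day as the identical date label in the proleptic Gregorian (ISO) calendar used by most modern software. The sketch below is a hypothetical illustration in plain Java, not our PDI workflow, of one way to interpret a Julian date and view it as an ISO date.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class JulianDateSketch {
    public static void main(String[] args) {
        // A calendar that never switches to Gregorian rules, i.e. a pure Julian calendar.
        GregorianCalendar julian = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.UK);
        julian.setGregorianChange(new Date(Long.MAX_VALUE));
        julian.clear();

        // 29 February 1700 exists in the Julian calendar but not in the Gregorian one.
        julian.set(1700, Calendar.FEBRUARY, 29);

        // View the same instant through the ISO (proleptic Gregorian) calendar of java.time.
        LocalDate iso = julian.getTime().toInstant().atZone(ZoneOffset.UTC).toLocalDate();
        System.out.println(iso); // prints 1700-03-11: the Gregorian equivalent of Julian 29 February 1700
    }
}
```

The point is simply that pre-1752 British dates need to be interpreted against the Julian calendar before they can be compared with, or computed over as, ISO dates.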
3.3 API and Catalogue Management System

Building a new Data Model for our Pan-Archival Catalogue and importing the existing data is a first step. However, to deliver a successful and usable product, The National Archives must be able to use the new linked data catalogue to exert intellectual control over its holdings. The main legacy editorial system has over 100 active users who need to manage the accessioning of new records into the archive, curate individual descriptions, and ingest and enhance data in bulk, while also providing quality assurance for a constant flow of data generated by many cataloguing projects. To this end, we will be building a new web-based catalogue management and editorial application. So far, we have identified key technologies for our application that will allow users to interact with and manage our data graphs. We have also undertaken some preliminary research into user experience and user interface design.

As well as allowing our staff and volunteers to interact with our new catalogue, we want to open opportunities to discover new relationships within its content and to create new ways to present and contextualise our records. To this end, our catalogue management and editorial system will access an API (Application Programming Interface) to communicate with the database, instead of interacting directly with it. This API will enable the new catalogue management system and other applications run by other parts of TNA to deliver their services in a joined-up manner. We envision an ecosystem of applications that both consume and contribute to our graph to deliver, for example, catalogue, preservation, gradated access and public online services.

3.4 Infrastructure

Our Infrastructure work stream has barely started; so far, all development has occurred in local environments. To date, we have established resources and procured contracts to set up a Virtual Private Cloud within Amazon Web Services. We have just started to move all development into the Cloud, and to set up a Proof of Concept product utilising Amazon Neptune (Graph Store) and EC2 (Elastic Compute Cloud).

4 Conclusion

Devising and developing a linked data catalogue with the ambition to become a Pan-Archival model is hard. The National Archives is devoting a very significant amount of intellectual effort, technical expertise and financial investment to transform our archival catalogue infrastructure. The pan-archival approach has led us to collaborate with many teams and domain experts, as we have to consider much more than just our main catalogue. Project Omega is not a green-field project; instead, we are having to reverse-engineer existing systems, data, and legacy processes. Furthermore, the catalogue has to keep functioning while we migrate into our new model and processes.
This is a dynamic catalogue that, in spite of COVID restrictions, made available over 560,000 new or enhanced catalogue descriptions in the financial year that ended on 31 March 2021.

A crucial and challenging part of the first phase of the project was the need to secure buy-in and resources from our internal leadership. We worked tirelessly to sell the idea and communicate the advantages (and potential future cost savings) to the business; e.g. replacing existing legacy and unsupported software, reducing duplication, and creating new opportunities through unlocking the unrealised potential in TNA's data. It is also fair to acknowledge that this would not have been achieved without the vision and endorsement of our Digital Director, who championed the project from the outset.

Although we face technical, conceptual and data challenges every week, we keep iterating, making continuous improvements and learning. There is a sense of professional gratification each time we are able to tackle, document and share our approaches to the resolution of an issue. We are committed to open source development and the sharing of our work through blogs and The National Archives public GitHub. Our modelling and implementation experience should hopefully aid others embarking on linked data transformation projects.

For other archival institutions that are looking to develop or improve their catalogues, we hope that our research can help inform their own decisions. Our advice would be the following:

● consider carefully the scope of your project and any legacy, technical or human constraints
● think about what data models are most appropriate for your data
● tackle upfront your provenance and transparency requirements (e.g. do you wish to preserve all changes and versions?)
● re-use existing vocabularies to facilitate linking with the wider world
● agree an identifier scheme that strikes a balance between human communication and the ability of a machine to compute over it
● be ready to get your hands dirty fixing data to make progress; historic data is wonderfully inconsistent.

We must stress the benefits and long-lasting value of linked data initiatives. Being part of the Semantic Web, using the tools and knowledge developed by others, and collaborating to make them more usable for archives is a very worthy cause. We are excited by the possibilities that linked data will bring, for example, by using external relationships to enrich our own records via links to resources such as Legislation.gov.uk, the Office for National Statistics, government datasets, Wikidata, etc. Finally, we would like to make it easier for other institutions and individuals to reference and use our data, placing records and descriptions in the larger context, reaching beyond the archival community.