Linked Data and Microservices in Support of Customized Institutional Workflows *

Greg Jansen1, Mark Conrad2, Lyneise Williams3, and Richard Marciano1

1 University of Maryland, College Park MD 20742, USA
2 Advanced Information Collaboratory (AIC), Keyser WV 26726, USA
3 UNC Chapel Hill, Chapel Hill NC 27516, USA
jansen@umd.edu, conradsireland@gmail.com, lyneise@gmail.com, marciano@umd.edu

Abstract: This paper presents an innovative prototype infrastructure for linked archives. We propose a novel open-source approach to repository design, one that bridges the boundary between a disorderly world and an orderly inner sanctum. This design seeks to acquire data as early as possible from organizational workflows and active record systems, then employs microservices to extract, transform, and load (ETL) workflow-specific inputs into consistent and reconciled linked data graphs. We illustrate this approach using the digital assets of the National Park Service Mary McLeod Bethune Council House National Historic Site. We show the ability to create a flexible and reconfigurable interoperability layer that can bridge existing systems composed of a combination of independent proprietary, custom, and open-source components.

Keywords: Linked data, Infrastructure design, Mary McLeod Bethune Council House.

* Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Background and Goals

This research was funded through a three-year collaborative agreement between the U.S. National Park Service Mary McLeod Bethune Council House National Historic Site (MMBNHS) [1] and the research team at the School of Information Studies at the University of Maryland, College Park. Faced with an accumulation of digitized photographs and other media, the MMBNHS realized that current strategies for managing the digital versions of its archival collections could benefit from a rethinking of the underlying infrastructure, both in terms of long-term preservation and public access.

1.1 National Archives for Black Women's History Collections

Mary McLeod Bethune was a pioneering community organizer. The National Archives for Black Women's History (NABWH) collections document Mary McLeod Bethune's role as a power broker in the federal government [2]. Her success paved the way for Kamala Harris to become Vice President of the United States of America. No other collection tells the story of Bethune's rise to power and her role as the most powerful Black woman of the first half of the 20th century. When she formally accepted the vice presidential nomination at the Democratic Convention in August 2020, Harris named McLeod Bethune as one of the women who "inspired us to pick up the torch, and fight on."

Franklin D. Roosevelt and First Lady Eleanor Roosevelt appointed Bethune as a presidential advisor on African American affairs in 1934. Bethune's position within the Roosevelt Administration gave her the leverage to form the Federal Council of Negro Affairs, which would become known as The Black Cabinet. Bethune achieved hearings for Black concerns at the highest governmental levels, nurturing the principle that Black Americans were integral to the American body politic. The collections also document the National Council of Negro Women and other African American women's organizations. The collections include documentation of individuals, famous and not so famous, associated with these organizations.
1.2 Building Innovative Linked Data Infrastructure

Our goal is to preserve and manage the current and future digital assets of the Mary McLeod Bethune Council House National Historic Site, in order to enable increased access to these assets and better serve the African American community. The project has a public purpose of promoting greater public and private participation and of sharing information, products, and services to increase public awareness. The main objectives are to conduct a Digital Asset Management System needs analysis, create a model repository for the assets, and develop means of access for both NPS staff and members of the public. Our project leverages NCSA Brown Dog [3], DRAS-TIC Fedora [4], and Trellis LDP. Trellis is the software layer supporting Sir Tim Berners-Lee's Inrupt Solid technology.

2 System Design

In this paper we propose a different approach to repository design, one that bridges the boundary between a disorderly world and an orderly inner sanctum. This design seeks to acquire data as early as possible from organizational workflows and active record systems, then employs microservices to extract, transform, and load (ETL) workflow-specific inputs into consistent and reconciled linked data graphs. The results of the repeatable ETL processing are superimposed upon the source data, which is left in the user-supplied format and folder structure, fixed and undisturbed. This supports the archival principle of respect des fonds. We will demonstrate many benefits of this design through our prototype system.

2.1 Submission Workflows

When thinking about system design for cultural heritage, especially given the option of formal semantics afforded by linked data, we have a tendency to imagine an information architecture that is like a cathedral, an enclosure that expresses our desire for perfection in symmetry and elegant lines and that will stand the test of time. These virtues correspond in our minds to the need for fixity, valid metadata, content models, standard ontologies, and durability: in short, a space where everything inside meets well-chosen data standards that are rigorous and timeless. In pursuit of such a pure data space, we risk creating a kind of inner sanctum, one in which all data must be valid prior to entry. However, the problem with such a vision, and with the design mindset that may accompany it, is that the world outside of the software system is less than pure. It is more often messy, chaotic, and incomplete. Data presents itself in various forms at various times, and it contains varying levels of user error and data corruption.

Submissions as physical transfers: Members of the team have been involved in several previous repository projects that attempted to bar the gate to disorderly information. The inevitable result of enforcing a strict order inside the repository is that more complex and cumbersome ingest workflows are created outside the system to meet its demands for perfect consistency. We see this in baroque and failure-prone ingest pipelines that deliver highly processed submission packages to repositories. These pipelines process submission packages, such as ZIP files, that are much like miniature repositories themselves, with their own internal folder structures, naming conventions, and metadata, and they create a pristine package for physical transfer into the storage platform. The complexity involved in staging the ingest event, complete with boxing and unboxing the content, is punishing.
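The design sketched above replaces that package-building step with early, direct capture: a submission file can be pushed into the repository's Linked Data Platform (LDP) interface with a single HTTP request, leaving interpretation to later microservices. The following minimal sketch, written in Python with the requests library, shows what such an early capture could look like against a Trellis LDP container; the container URL and the example file name are hypothetical placeholders rather than the prototype's actual configuration.

```python
import requests

# Hypothetical LDP container that receives raw files for one submission batch.
CONTAINER = "https://repo.example.org/submissions/batch-042/"

def capture(path: str, mime_type: str) -> str:
    """POST one raw submission file into the LDP container as a NonRDFSource."""
    with open(path, "rb") as stream:
        resp = requests.post(
            CONTAINER,
            data=stream,
            headers={
                "Content-Type": mime_type,
                "Slug": path.rsplit("/", 1)[-1],  # suggest the original file name
                "Link": '<http://www.w3.org/ns/ldp#NonRDFSource>; rel="type"',
            },
        )
    resp.raise_for_status()
    return resp.headers["Location"]  # URI minted by the server for the new resource

if __name__ == "__main__":
    # Example file name only; real vendor batches follow the NPS naming convention.
    print(capture("scan_0001.tif", "image/tiff"))
```

Because the file is stored exactly as supplied, any later correction or reprocessing works from the same fixed source, in keeping with the respect des fonds principle described above.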
NPS Workflows: The digital collections at the MMBNHS are growing continually through several workflows. Archival description is created in the Interior Collection Management System (ICMS), which also provides archives and museum management functions. This information is periodically exported as XML files. Vendor-supplied digitization packages contain sets of TIFF image files that follow a file-naming convention mirroring collection structures. MD5 digests for each scanned page file and item-level Dublin Core descriptions are also supplied in Excel files. Authority records are kept in curated lists in the ICMS and may be exported as XML files.

Submissions as units of work: In the prototype software, uploading starts when the files and staff are available to begin the work of a submission, even if a batch of files is incomplete or has known issues. Such a submission is marked as a work in progress until errors have been detected and fixed and staff can mark it as complete. Early upload, even by drag and drop of files onto a web page, moves digital assets off of vulnerable storage as soon as possible. Instead of expecting every submission to conform to a limited set of platform content models, the submission structure mirrors the content stream produced by the organization. At the NPS these are vendor digitization batches and museum catalog exports, but other organizations might capture the outputs from scientific instruments, entire email accounts, or drive transfers. In most cases a microservice will be created to interpret the submission files, identify errors, and extract linked data. However, the raw data can be captured, preserved, and potentially accessed even before a more specific microservice has been created. For preservation and historical research, the ability to access the original file format may be desirable. The platform anticipates that there will be new streams of data that need to be captured first and only fully processed later.

Fig. 1. Uploaded images and Excel files with extracted objects.

2.2 Reproducible Linked Data Graphs

Having arranged submission files according to how they are produced or managed outside of the system, we run microservices that extract data from the workflow-specific file structures. Each microservice captures the knowledge required to transform a particular set of files into a linked data graph that follows a platform-wide content model and metadata structure. This linked data is stored in named graphs that are associated with the files from which they were derived, making it straightforward to trace any assertion back to a submission file, a necessary step towards trustworthy archives. We can regenerate the named graphs in an automated way any time submission files are added, replaced, corrected, destroyed, or enhanced. We can also regenerate a graph when the microservice that performs linked data extraction is improved. Because the extracted linked data is isolated in its own graph, it cannot pollute the rest of the linked data in the repository. Problems can be detected and reported to users for corrective measures. For example, a staff member can receive an email from a microservice that prompts them to add a missing Dublin Core record to a vendor-supplied Excel file. Another microservice, responsible for building paged documents conforming to the Portland Common Data Model (PCDM) [5], might inform staff that a page file is missing from the expected sequence of file names.
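As an illustration of what one of these extraction microservices might look like, the sketch below reads a vendor-style Excel workbook of item-level Dublin Core rows and MD5 digests and re-expresses each row as triples in a named graph keyed to the source file, reporting rows that are missing a required description. It uses openpyxl and rdflib; the column layout, file names, base URIs, and the simplified checksum predicate are assumptions made for this example, not the prototype's actual content model.

```python
from openpyxl import load_workbook                      # read vendor-supplied Excel metadata
from rdflib import Dataset, Graph, Literal, Namespace
from rdflib.namespace import DCTERMS

REPO = Namespace("https://repo.example.org/")            # hypothetical repository base URI

def extract_batch(xlsx_path: str, dataset: Dataset) -> Graph:
    """Build one named graph of descriptive and fixity assertions from a vendor workbook.

    The graph is named after the source file, so every assertion can be traced back
    to the submission file that produced it and regenerated when that file changes.
    """
    graph = dataset.graph(REPO["graphs/" + xlsx_path])
    sheet = load_workbook(xlsx_path, read_only=True).active
    # Assumed column order: A = page file name, B = title, C = date, D = MD5 digest.
    for filename, title, date, md5 in sheet.iter_rows(min_row=2, values_only=True):
        item = REPO["files/" + str(filename)]
        if not title:
            print(f"WARNING: {filename} has no Dublin Core title; notify staff")
            continue
        graph.add((item, DCTERMS.title, Literal(title)))
        if date:
            graph.add((item, DCTERMS.date, Literal(date)))
        if md5:
            # Placeholder predicate; the prototype records fixity in a PREMIS-based subgraph.
            graph.add((item, REPO["md5"], Literal(md5)))
    return graph

if __name__ == "__main__":
    ds = Dataset()
    print(extract_batch("batch-042-metadata.xlsx", ds).serialize(format="turtle"))
```

Re-running the same extraction after a corrected workbook is uploaded simply rebuilds the named graph, which is what makes the extracted linked data reproducible without ever touching the source file.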
Curating an enhanced archival context: The separation of file management from linked data processing holds great promise for enhancing historical context. It allows a single file to describe one or many archival items or topics. Submitted files can contribute facts about the holdings, such as EAD or Dublin Core metadata, but they can also provide facts about the archival context that surrounds the holdings, including authority lists and further descriptions of entities such as organizations and people. Workflows for managing archival context can be created in the same way as archival submissions, perhaps processing Excel files that staff create to describe historical individuals and organizations.

An example document: Figure 1 shows a view from the NPS prototype of a document produced through the existing digitization workflow and microservices. It shows three uploaded TIFF images and two Excel files containing batch MD5 checksums and Dublin Core metadata. Around these uploaded files several new objects have been created, including a Portland Common Data Model paged document representation (second line in the figure, shaded) that was implied in the names of the uploaded files. A PREMIS-based subgraph for each file records fixity, both before and after file upload. Additionally, there are two access copies for each image and a batch metadata object that contains the Dublin Core re-encoded as linked data (RDF Turtle format).

2.3 The Big Graph

As the graphs of linked data from various submissions, stemming from a variety of different workflows, are added, curated, and refined over time, they are also aggregated in a triple store index (technically an Apache Jena Fuseki quad store in this case). The triple store presents a unified linked data landscape that has many uses. Data that is curated in many different ways is united in the same big graph. For instance, the NPS descriptive hierarchy from the Interior Collection Management System (ICMS), a commercial off-the-shelf software system, is united in this structure with paged documents from the vendor digitization process. More than anywhere else in this system, this big graph represents the data according to a platform-wide content model. This consistent and conformant overview of all the platform data enables publishing, exports, reporting, and search indexing. For instance, we can leverage this unified graph to publish collections or certain archival series in the NPS NPGallery Digital Asset Management System. We can also use it to create a search index and a finding aid-based website for archival access by historians and community researchers.

2.4 Distributed Services Architecture

The microservices that support ingest processing are just one part of an ecosystem of microservices, storage clusters, and web servers that together create the functions of digital asset management [6]. In the prototype for the MMBNHS we have incorporated important architectural features from our previous repository research. In the DRAS-TIC Fedora project we demonstrated the capacity and performance benefits of horizontal scaling through stateless servers and distributed databases. In that research we showed that this approach can reliably keep pace with a performance load of around 500 upload requests per second [4]. Now we have added the industry-proven Kafka event streaming platform in order to run microservices and coordinate workflow actions.
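The event-driven coordination just described can be pictured as a small consumer loop: each microservice subscribes to repository change notifications and reacts only to the submissions it knows how to interpret. The sketch below uses the kafka-python client; the broker address, topic name, consumer group, and message fields are hypothetical stand-ins for whatever notification format the platform actually emits.

```python
import json
from kafka import KafkaConsumer   # kafka-python client

# Hypothetical topic carrying change notifications emitted by the linked data platform.
consumer = KafkaConsumer(
    "repository-events",
    bootstrap_servers=["kafka:9092"],
    group_id="excel-metadata-extractor",        # each microservice uses its own consumer group
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for event in consumer:
    resource = event.value.get("resource", "")
    action = event.value.get("type")
    # React only to newly created or updated Excel submissions; other microservices
    # subscribed to the same topic handle TIFF pages, catalog XML exports, and so on.
    if resource.endswith(".xlsx") and action in ("Create", "Update"):
        print(f"re-extracting named graph for {resource}")
        # ...fetch the file over HTTP, rebuild its named graph, and store it back...
```

Because Kafka tracks each consumer group's offset separately, a new microservice can be attached to the same event stream later without disturbing the services already running, which supports the extensibility goals described below.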
With these features the platform can maintain acceptable performance as user load grows and can scale to meet expanding data storage demands [7].

Fig. 2. Digital Asset Management functions as microservices (blue), orchestrated by lifecycle hooks (event streaming, light yellow) around the persistence services provided by the linked data platform (gold). The context of the U.S. National Park Service is shown in green.

The open-source technology in the prototype was chosen in order to make the system open to extension. In particular, Apache Kafka (a community-developed distributed event streaming platform capable of handling trillions of events a day [8]) and the HTTP application programming interface provided by Trellis LDP are programming-language neutral. Microservices for submission processing, format migration, or computational enhancement may be written in Python, Java, Ruby, or any other programming language. New microservices can support additional content streams from organizational workflows or computational treatments such as optical character recognition or other types of media and image processing.

Fig. 3. Horizontal-scaling services at work in the prototype, including Trellis LDP, Apache Cassandra, Apache Kafka, and some pluggable microservices.

3 Conclusion

This project represents the culmination of several threads of research in digital archives and linked data (based on prior funding from NSF and IMLS in particular). While not a fully functional digital asset management or preservation system, the open-source software prototype has demonstrated the versatility and potential of workflow processing for MMBNHS catalog data and digitization packages. We designed, implemented, and tested an innovative approach to linked data infrastructure that leverages open-source and massively scalable systems, including Trellis LDP, Apache Cassandra (NoSQL), Apache Jena Fuseki, and Apache Kafka. We demonstrated the ability to create a flexible and reconfigurable interoperability layer that can bridge existing systems composed of a combination of independent proprietary, custom, and open-source components.

References

1. Mary McLeod Bethune Council House National Historic Site, website. https://www.nps.gov/mamc/index.htm
2. National Archives for Black Women's History, website. https://www.nps.gov/mamc/learn/historyculture/nabwh_collections.htm
3. Padhy, S., Alameda, J., Kooper, R., Liu, R., Satheesan, S.P., Zharnitsky, I., Jansen, G., Dietze, M., Kumar, P., Lee, J., Marciano, R., Marini, L., Minsker, B., Navarro, C., Slavenas, M., Sullivan, W., McHenry, K.: An Architecture for Automatic Deployment of Brown Dog Services At Scale into Diverse Computing Infrastructures. In: XSEDE16: Proceedings of the ACM XSEDE16 Conference on Diversity, Big Data and Science at Scale (2016).
4. Jansen, G., Coburn, A., Soroka, A., Thomas, W., Marciano, R.: DRAS-TIC Fedora: Evenly Distributing the Past. Publications (MDPI), Special Issue "Selected Papers from Open Repositories 2018" (2019). https://www.mdpi.com/2304-6775/7/3/50
5. Wilcox, D.: A Linked Data Approach to Digital Newspapers with Fedora and PCDM (2016). http://origin-www.ifla.org/files/assets/newspapers/2017_Iceland/2017-wilcox-en.pdf
6. Villamizar, M., et al.: Evaluating the Monolithic and the Microservice Architecture Pattern to Deploy Web Applications in the Cloud. In: 2015 10th Computing Colombian Conference (10CCC), pp. 583-590 (2015). doi: 10.1109/ColumbianCC.2015.7333476
"Using Data Partitions and Stateless Servers to Scale Up Fedora Repositories." In IEEE BigData, pp. 3098-3102. 2019. 8. P. Le Noac'h, A. Costan and L. Bougé, "A performance evaluation of Apache Kafka in support of big data streaming applications," 2017 IEEE International Conference on Big Data (Big Data), 2017, pp. 4803-4806, doi: 10.1109/BigData.2017.8258548.