
Trusted Data Forever: Is AI the Answer?

Emanuele Frontoni

emanuele.frontoni@unimc.it 4

Marina Paolanti

marina.paolanti@unimc.it 4

Tracey P. Lauriault

Tracey.Lauriault@carleton.ca 0

Michael Stiber

stiber@uw.edu 1

Luciana Duranti

Luciana.Duranti@ubc.ca 2

Muhammad Abdul-Mageed

muhammad.mageed@ubc.ca 3

0 Critical Media and Big Data Lab, Carleton University, Ottawa, ON K1S 5B6, Canada
1 Intelligent Networks Lab, University of Washington Bothell, WA, USA
2 InterPARES Lab, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
3 NLP and ML Lab, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
4 VRAI Vision Robotics and Artificial Intelligence Lab, University of Macerata, Italy

Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, and as vehicles for holding the past accountable and informing the future. This commitment is guaranteed through the adoption of strategic and technical measures for the long-term preservation of digital assets in any medium and form: textual, visual, or aural. Public and private archives are the largest providers of data, big and small, in the world and collectively host yottabytes of trusted data, to be preserved forever. Several aspects of retention and preservation, arrangement and description, management and administration, and access and use are still open to improvement. In particular, recent advances in Artificial Intelligence (AI) open the discussion as to whether AI can support the ongoing availability and accessibility of trustworthy public records. This paper presents preliminary results of the InterPARES Trust AI (“I Trust AI”) international research partnership, which aims to (1) identify and develop specific AI technologies to address critical records and archives challenges; (2) determine the benefits and risks of employing AI technologies on records and archives; (3) ensure that archival concepts and principles inform the development of responsible AI; and (4) validate outcomes through a conglomerate of case studies and demonstrations.

Keywords: Artificial Intelligence, Machine Learning, Deep Learning, Archives, Trustworthiness

1. Introduction

Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, to hold the past accountable, and as evidence to inform future plans. A record, or archival document, is any document (i.e. information affixed to a medium, with stable content and fixed form) made or received in the course of an activity, and kept for further action or reference. Because of the circumstances of its creation, a record is a natural by-product of activity, is related to all the other records that participate in the same activity, is impartial with respect to the questions that future researchers will ask of it, and is authentic as an instrument of activity. This is why records are inherently trustworthy. Thus, their preservation must ensure that any activity carried out on the records to identify, select, organize, and describe them and to make them accessible to the people at large leaves them trustworthy, that is, reliable (i.e. their content can be trusted), accurate (i.e. the data in them are unchanged and unchangeable), and authentic (i.e. their identity and integrity are intact). This is particularly difficult in the digital environment because the content, structure and form of records are no longer inextricably linked as they used to be in the traditional records environment [1]1. The issue exists for both digital and digitized records and is rendered more serious by the sheer number and volume of records that have accumulated over time and are being created today in a large variety of systems.

The InterPARES (International research on Permanent Authentic Records in Electronic Systems) project has addressed these issues since 1990, focusing on current and emerging technologies as they evolve and developing theory, methods, and frameworks that allow for the ongoing preservation of the records resulting from the use of such technologies2. The latest iteration of InterPARES, I Trust AI, is funded, like the previous projects, by the Social Sciences and Humanities Research Council of Canada, but differs in that it is not concerned with the records produced by a specific technology. Its purpose is to use AI to carry out archival functions for the long-term control of all records, on any medium and from any age, and to do so in such a way that the trustworthiness of the records remains protected and verifiable, and that the tools and processes are transparent, unbiased, equitable, inclusive, responsible (i.e. protecting autonomy and privacy) and sustainable3. There have been several projects looking at AI in archives, but they typically look at a particular tool in a specific context or even a single set of records, and they tend to use off-the-shelf tools. The research question that I Trust AI asks is: “what would AI look like if archival concepts, principles and methods were to inform the development of AI tools?” What is lacking is comprehensive, systematic research into the use of AI to carry out the different archival functions in an integrated way and ensure the continuing availability of verifiable trustworthy records to prevent the erosion of accountability, evidence, history and cultural heritage.

Thus, we are addressing the technological issues from the perspective of archival theory, by integrating the technology with complex human-oriented tools. The objectives of I Trust AI are to (1) identify specific AI technologies that can address critical records and archives challenges; (2) determine the benefits and risks of using AI technologies on records and archives; (3) ensure that archival concepts and principles inform the development of responsible AI; and (4) validate outcomes from Objective 3 through case studies and demonstrations. Our approach is two-pronged, comprising the practical and immediate need to address large-scale existing problems, and the longer-term need to have AI-based tools that are reliably applicable to future problems. Our short-term approach focuses on identifying high-impact problems and limitations in records and archives functions, and applying AI to improve the situation. This will be achieved via collaboration between records and archival scientists and professionals and AI researchers and industry experts. Our long-term approach focuses on identifying the tools that records and archives specialists will need in the future to flexibly address their ever-changing needs. This includes decision support and, once decisions are made, rapid implementation of AI-based solutions to those needs.

The I Trust AI project is a multinational interdisciplinary endeavour, and this means that our first effort must be to understand each other, starting with the language we use. For example, archival professionals talk about records, while computer scientists and AI professionals talk about data. To the former, data are the smallest meaningful units of information in a record. To an AI specialist, data are arrangements of information (possibly in a database), be these facts or not, regardless of their size, nature and form. Thus, for the purposes of this paper, which is directed to data analytics specialists, we will use the term data.

Public and private archives are the largest providers of data, big and small, in the world, as they collectively host yottabytes of trusted data, to be preserved forever. Their creators are organizations and individuals from myriad sectors and disciplines, from public administration to academia and businesses of all kinds (e.g. banking, engineering, architecture, gaming), and from Indigenous communities, civil society organizations, associations, and virtual communities. Table 1 reports the data quantity. The Italian State Central Archives (ACS), for example, stores 67 TB of “digital objects” in the form of digitised heritage data (in TIFF). The National Archives of the US has 1,323 terabytes of electronic data. This paper presents some preliminary results of the I Trust AI international research partnership and is organized as follows: Section 2 provides a general discussion of AI and its subsets; Section 3 describes three of the about forty studies that are now under way; and Section 4 presents an overview of the kinds of studies that are being pursued at this time and a conclusion.

2. Artificial Intelligence and Deep Learning

There are various definitions of what AI is. For example, Russell and Norvig [2] define AI as a field focused on the study of intelligent agents that perceive their environment and take actions to maximize their chance of success at some goal. In recent times, however, AI has been used much more widely to refer to any technology where there is some level of automation, especially resulting from the application of deep learning in various domains. Deep learning (DL) is a sub-field of machine learning (ML), and both are sub-fields of AI. Similar to AI, ML is defined in several ways. A classical definition of ML comes from Mitchell [3], who provides a procedural definition maintaining that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. As to DL [4, 5], it is a class of ML methods inspired by information processing in the human brain. What makes DL powerful is its ability to automatically learn useful representations from data. DL algorithms, unlike classical ML methods, are able to learn not only the mapping from representation to output but also the representation itself [6], thus alleviating the need for costly human expertise in crafting features for models. DL has achieved success in recent years in a wide variety of applications in many domains involving various types of data modalities such as language, speech, image, and video.

As mentioned, DL mimics information processing in the brain. This is possible by designing artificial neural networks arranged in multiple layers (hence the term deep) that take input, attempt to learn a good representation of it, and map it to an output decision (e.g., is this text in Greek or Latin?). The way these networks are designed can vary; thus, various types of deep learning architectures have been proposed. Two main types of DL architectures have been quite successful. These are recurrent neural networks (RNNs) (e.g., [7]), a family of networks specializing in processing sequential data, and convolutional neural networks (CNNs) [8], an architecture specializing in data with a ‘grid-like’ topology (e.g., image data) [5]. More recent advances, however, abstract away from these two main types toward more dynamic networks such as the Transformer [9].

In general, ML and DL methods learn best when given large amounts of labeled data (e.g., for a model that detects sensitive information, labels can be from the set sensitive, not-sensitive). DL in particular is data-hungry. This type of learning with labeled data is called supervised learning. It is also possible to work with smaller labeled datasets. In these cases, training samples can be grown iteratively by exploiting unlabeled data based on decisions from an initial model (self-training) or using decisions from various initial models (co-training). This is called semi-supervised learning [10, 11, 12]. The third main type of ML method is unsupervised learning, where a model usually tries to cluster the data without access to any labels. There are other paradigms, such as distant supervision, where a model attempts to learn from surrogate cues in the data in the absence of high-quality labels (e.g., [13]). Self-supervised learning, where real-world data are turned into labeled data by masking certain regions (e.g., removing some words or parts of an image) and tasking a model to predict the identity of these masked regions, is currently a very successful approach. These various methods of supervision can also be combined to solve downstream tasks. For example, Zhang and Abdul-Mageed [14] combine self-supervised language models with classical self-training methods to solve text classification problems. The next section will introduce three I Trust AI studies.

3. Case Studies in I Trust AI

3.1. Data from Emergency Services Communications Systems

One of the cornerstones of public safety and societal wellbeing is a reliable and comprehensive emergency services communications system (ESCS, such as 9-1-1 in the US and Canada). Such systems can be considered to encompass the organizations, electronic infrastructure, and policies and procedures that enable answering and responding to emergency phone calls [15]. As might be expected of systems that originate in analog, switched telephony, ESCS evolution into a digital system has resulted in a haphazard conglomeration of subsystems and generally needs re-imagining as a modern technological solution. In the US, this change has been termed the “Next Generation 911” (NG911) project.

A transformation such as NG911 once again re-casts ESCS as keystone information and communication technology, subject to all of the concerns of such systems: cybersecurity, privacy, crisis preparedness, strategic and operational decisions, etc. At the same time, it opens up possibilities for data analytics to improve ESCS performance, inform funding decisions, monitor the health of societies and their infrastructures, and serve as early warnings for natural and human-made crises. This study connects large-scale simulations of ESCS to historical data from ESCS operations to develop and document an understanding of how to preserve authentic, reliable data that can be used for applications such as re-creation of past events (as might be done to support training or to explore the effects of changes in policies and procedures), testing system operation in one locale based on data from another, and examining how various data analytics techniques might be usefully applied, that is, understanding what characteristics of these data might be usable to researchers. Specifically, we are addressing the following research questions:

What real-world and simulation ESCS data are available to be preserved for access by researchers? The answer to this varies greatly from locale to locale, depending on technology in use, public policy, and controlling agency procedures. Moreover, for such data to be available, we must understand privacy and security risks associated with transferring them from their current owners to a research environment, along with the risk of misinterpreting them if they are decontextualized from whatever tacit knowledge might exist within the owning organizations. We are also considering pragmatic issues, such as building a knowledge base of legal restrictions on collection in various jurisdictions, formal processes for collecting these data (such as data sharing agreements), variation in the culture of practice surrounding such data, potential biases that might result from systematic differences in different areas’ capacity to collect and share data (such as might arise from regional funding differences), and understanding the metadata and other information (such as ESCS physical and operational structure and, generally, the policies and practices that determine what and how data are generated and collected).

What are the challenges and benefits of discovering knowledge patterns from historical ESCS data? These patterns will serve as clues to developing protocols for ESCS managers to follow regarding data collection, and as clues for how these data can be applied for reuse. We will consult with external stakeholders to seek advice and to run thought experiments using surveys, think-aloud exercises, retrospective first-hand accounts, etc. We will also examine historical records of disasters for which the preserved data are more complete than basic ESCS datasets.

What other data/metadata associated with emergency events are not part of the ESCS data stream? From our preliminary examination, typical ESCS data currently involve lists of individual calls and information directly associated with such calls (perhaps including full or partial phone numbers, call categorization, GPS coordinates, responder information, response times, etc.). What these datasets do not directly include are events and data that are external to the call stream but are the reason for such calls (traffic, weather, geopolitical events, and so on). Some of these additional data may be present in other sources of information in a format that can be reasonably collected in tandem with call data. On the other hand, it may be the case that other causal information must be inferred from available data (and, of course, it may be the case that a combination of inference and extraction from other data streams could be useful). For the inferencing task, we propose using an examination of simulation results and simulation artifact provenance information as exemplars to develop a set of specifications for what an AI-driven system would need to accomplish [16].

What are the roles of the disciplines of Archival Science and Artificial Intelligence in building a central repository for ESCS data? Both the individual fields of archival science and artificial intelligence, plus their overlap or combination that could be considered to fall within the realm of data science, have a number of roles in the organization and interpretation of ESCS data. We provide examples below from the application of real-world data to simulations:

• Generating requirements for simulator design so that simulation output matches real data in terms of format, metadata, etc.
• Analyzing and comparing simulation output with real-world data.
• Synthesizing ESCS data that match features of real-world data as part of an overall ESCS simulation.
• Using real-world call data to drive a simulation of an emergency response system, for example, to allow a “replay” of a previous disaster or to investigate how modifications to such a system might produce different outcomes.

3.1.1. Progress

This work-in-progress is in its initial stages, preparing for the point in time when we can begin collecting ESCS data. Specifically, we are:

• working with a small set of external partners to develop a general understanding of ESCS operations, policies, and procedures, as well as identifying which data exist;
• developing a process within our project for working with a selected ESCS management organization to build an understanding of their specific operations and data;
• fleshing out a model data sharing agreement to serve as a starting point for discussions surrounding transferring data to our research environment;
• consulting with our institutional review board regarding the use of this particular set of human data;
• and configuring a secure internal data storage system.

Subsequently, we will prepare an initial case study in which we will apply the above to a single locale: we will go through all of the steps of understanding the ESCS processes that produced the data, developing a data sharing agreement, and collecting data and metadata.

3.2. Learning from Parchments

The digitization of historical parchments is extraordinarily convenient, as it allows easy access to the documents from remote locations and removes the need for physical handling and access, with their possible adverse effects [17]. This arrangement is particularly suitable to archives and museums that preserve invaluable historical documents whose contents are unpublished and which, if damaged, cannot be fully restored by conventional tools, or which are difficult to read in the original because of heavy damage and the delicate nature of the material. Damaged parchments are notably prevalent in archives all over the world [18]. Their digital representations reduce both damage and access issues by providing users with the possibility of reading their contents at any moment, from remote locations, and without necessitating the potentially harmful physical handling of the document. Thus, the automatic analysis of digitized parchments has become an important research topic in the fields of image and pattern recognition. It has also been a considerable research issue for several years, gaining attention recently because of the value that may be unlocked by extracting the information stored in historical documents [19]. Interest in applying AI/ML to ancient image data analysis is becoming widespread, and scientists are increasingly using this method as a powerful and complex process for statistical inference. Computer-based image analysis provides an objective method of identifying visual content independently of subjective personal interpretation, while potentially being more sensitive, consistent and accurate than physical human analysis. Learned representations often result in much better performance than hand-designed representations when it comes to these types of texts. Until now, parchment analysis has required physical user interaction, which is very time consuming. Hence, the effective automatic feature extraction of Deep Neural Networks (DNNs) reduces the demand for manual physical extraction processes.

Considering the above, PergaNet, a lightweight DL-based system for the historical reconstruction of ancient parchments, is specifically designed and developed for this type of analysis. The aim of PergaNet is to automate the analysis and processing of large volumes of scanned parchments. This problem has not yet been deeply investigated by the computer vision community, as parchment scanning technology is still novel, but it has proven to be extremely effective for data recovery from historical documents whose content is inaccessible due to the deterioration of the medium. The proposed approach aims to reduce hand-operated analysis while using manual annotations as a form of continuous learning. The whole system, however, requires digital labour, such as the manual tagging of large training data. Large datasets remain necessary to boost the performance of DL models, and manually verified data will be used for continuous learning and maintained as training datasets. PergaNet comprises three important phases: the classification of parchments as recto/verso, the detection of text, and the detection and recognition of the “signum tabellionis” (i.e. the identifier of the author). PergaNet concerns not only the recognition and classification of the objects present in the images, but also their location. This I Trust AI study expands the implementation of AI guided by archival institutions and programs, as this method could be used by many other archives for different types of documents. The analysis is based on data about the ordinary use by researchers of this type of material and does not involve altering or manipulating techniques aimed to generate data. This provides actionable insights that are helpful to identify the text as documentary form and not as reading.

The DL pipeline is depicted in Figure 1. We chose the VGG16 network [20] for its suitability and effectiveness in image classification tasks, and the way in which PergaNet detects the text in the image was inspired by the work of Zhou et al. [21]. This phase allows for the exclusion of the text on the parchment in the phase of recognition of the signa. The DNN model chosen for word detection is EAST [21]. Finally, a Convolutional Neural Network has been employed for the signa detection. Our approach uses YOLOv3 [22], an algorithm that processes images in real time. We chose this algorithm because of its computational efficiency and its precision in detecting and classifying objects. The network is pre-trained using COCO4, a publicly available dataset; this choice was made to reduce the need for a large amount of training data, which would come with a high computational cost.

3.3. Digital Twin Study

Spatial media [23] and spatial data infrastructures (SDI) [24] have become normalized as complex, interconnected global, regional, national, and personalized social and technological systems of systems. Simply consider the monitoring of climate at a global scale to inform the logistics of production chains, or to predict, preempt and prevent disaster resulting from natural calamities on local physical infrastructure, or simply to report the temperature and humidity levels outside to your smart phone to help you plan your day. This occurs seamlessly in the background of our daily activities and involves a vast complex of sensors; databases; cloud computing centres; telecommunication networks and the internet; standards; code, software and platforms; people and institutions; laws, regulation and policies; communication systems and of course AI/ML [25].

An emerging subset of these spatial data infrastructures are digital twins (DTs). A DT is an ecosystem of multi-dimensional and interoperable subsystems made up of physical things in the real world, digital versions of those real things, synchronized data connections between them, and the people, organizations and institutions involved in creating, managing, and using these5. In terms of physical and real things, consider a building or a car manufacturing plant; a digital representation of those things in a digital platform or an interactive virtual reality game engine; an internet of things (IoT) system of sensors and databases that communicates between the buildings or manufacturing processes in the plant in real and near real time; and the people and institutions that own and operate these. Contemporary examples of DTs are the modeling and managing of the construction of Sweden’s new high speed rail systems6; Hyundai car and ship manufacturing plants7; and DTs developed as part of smart city strategies (see the submissions to the Infrastructure Canada Smart City Challenge8). DTs originate in the aerospace industry, first with NASA’s Apollo 13 in the 1970s, although in that case it was a physical replica to help troubleshoot issues of a ship in flight, and were predominantly used in manufacturing and logistics [26].

Increasingly, DTs involve building information modeling (BIM), using tools such as Revit, a proprietary platform, or the open-source BlenderBIM, whereby a building is conceptualized and rendered into a 3-dimensional drawing with attributes captured in a database, often replacing typical blueprints. The BIM informs the construction of the building and is a record of it once completed. BIM renderings are increasingly being submitted as part of the building permit approval process (BuildingSmart, 2020, e-submission common guidelines for introducing BIM to the building process9) and are updated into as-is BIMs for ongoing operations. BIMs are also used to estimate material costs, as they are interconnected with material vendor databases and electrical and heating systems. BIMs interrelate with smart asset management systems (AMS), which inform building maintenance and operations and monitor internal climate such as temperature, humidity, air flow and quality; these are inputs for the AI/ML systems that remotely manage heating and cooling systems, electricity use and consumption, and maintenance schedules, and that inform ongoing decision making.

In this I Trust AI Digital Twin case study we collaborate with researchers working on the Imagining the Canada Digital Twin (ICDT)10 project, funded by the Canadian New Frontiers in Research Fund11, which proposes a national, inclusive, and multidisciplinary research consortium for the creation of a technical, cultural, and ethical framework to build and govern the technology, data and institutional arrangements of Canada’s DT. ICDT focuses on the built environment, concentrating on the Architecture, Engineering, Construction, and Owner Operator (AECOO) industry. ICDT is led by the Carleton Immersive Media Studio (CIMS)12, which is developing a DT prototype of the Montréal-Ottawa-Toronto corridor using a simulated, distributed server network. The research study involves an interdisciplinary research team of architects, data scientists, engineers, building scientists, archival professionals and critical data studies scholars from Carleton University (CA), Luleå University of Technology (SE), the Swedish Transport Administration and the University of Florence (IT), who will develop a Use and Creation preservation case study. The study aims to preserve the DT of campus buildings and structures created as part of SUSTAIN (https://cims.carleton.ca/#/projects/Sustain) and the Carleton Digital Campus Innovation (DCI)13 project, integrating Building Performance Simulation (BPS) technologies with BIM on a campus scale; building information management systems (BIM); Asset Management Systems (AMS); visualizations of the digital structures in the Unreal Game Engine; VR and modelling; AI/ML; and real-time data for decision making.

The Carleton Campus DT data of seven buildings belong to the University, which must preserve them as official records. They are used for operations by Facilities Management and Planning (FMP) and are part of research and development for the CIMS and SUSTAIN projects, thus involving research data that must be managed and deposited in a trusted digital repository.

The implications of this research are important to the archival community, who will increasingly have to ingest complex record sets such as these, as well as create an archival package to maintain the integrity of these complex interlinked DT systems through time. This study will be one of the first globally to examine the preservation of a DT. Its research questions are: Can a digital twin be preserved, and what is required at the point of creation to ensure that it can be? Can information about the AI tools, automation and real-time data involved in this complex data, social and technological system be preserved, and how? And what might be the role of AI/ML in terms of creating an archival package to ingest a digital twin? The outputs of this research will provide empirical data to meet the objectives of the I Trust AI Project, and also provide Carleton University with the opportunity to test the preservation of Campus DT records in its institutional archives. In the process, it will inform the technology sectors involved in the creation of DTs, so that they may build in, at the point of creation, the necessary breadcrumbs for long-term preservation.

4 https://cocodataset.org/#home
5 CIMS, 2021, About page, what is a digital twin, https://canadasdigitaltwin.ca/about-2-2/
6 Pimental, K., 2019, Visualizing Sweden’s first high-speed railway with real-time technology, https://www.unrealengine.com/enUS/spotlights/visualizing-sweden-s-first-high-speed-railway-withreal-time-technology
7 Chang-Won, L., 2022, Hyundai Motor works with Unity to build digital-twin of factory supported by metaverse platform, https://www.ajudaily.com/view/20220107083928529
8 https://www.infrastructure.gc.ca/sc-vi/map-applications.php
9 https://www.buildingsmart.org/wpcontent/uploads/2020/08/e-submission-guidelines-PublishedTechnical-Report-RR-2020-1015-TR-1.pdf
10 https://canadasdigitaltwin.ca/
11 https://www.sshrc-crsh.gc.ca/funding-financement/nfrffnfr/index-eng.aspx
12 https://cims.carleton.ca/#/home

4. Conclusions and Future Works

The studies presented above are only 3 of about 40 in-progress studies, which cover a wide range of subjects and issues, such as enterprise master data management, the preservation of AI techniques as paradata, modelling an AI-assisted digitization project, gamification of archival experience for users, declassification of personal information using AI tools, and user approaches and behaviours in accessing records and archives in the perspective of AI. The challenges we are addressing with this project have never before been systematically and globally dealt with; the undertaking is enormous and fraught, but critical. While the risks of using AI to solve the problems of managing the ever-growing, ever-more-diverse bodies of public and private records throughout their lifecycle, from creation to preservation and access, are unknown, the risks of not acting in concert to do so are unacceptable: loss of the ability to secure people’s rights, of evidence of past acts and facts to serve as a foundation for decision making, and of historical memory.

This project will significantly impact society in several areas. (1) Records-keeping in local and national government agencies is a vital part of our society’s ability to maintain oversight on and accountability of governance, but, with the inability to handle the vast quantities of digital records, public bodies risk undermining their own legitimacy as oversight if they cannot appropriately process information and make it accessible in a timely fashion. By helping address this crisis through the development, evaluation, and contextualization of AI techniques, we contribute to the ability of agencies and institutions to maintain their place in our democracies. (2) Automation techniques can potentially aid in the economic viability of many cash-starved records offices and archival institutions by ensuring that professional records management and archival expertise are used wisely, with classification tools and technology-assisted review (TAR) allowing a quick review and assessment of vast quantities of records. Similarly, with businesses depending on records agencies for routine activities, improved speed in responding to queries will bring a positive effect to the economy. (3) AI techniques have the potential to aid in the accessibility of records in archives by new audiences, for instance by translating and indexing historical materials written in indigenous languages, sensitising problematic archival descriptions, helping patrons find connected items, or captioning historical photographs. These techniques have both a cul-

have been numerous calls to action to systematically explore the application of AI techniques to the records and archives field, AI also currently faces major ethical challenges that will benefit from an archival theory perspective, for instance in dealing with bias and personal information. By exploring further the connections between AI and archives, this project is contributing, and will continue to contribute, to the intellectual progress of both fields. The I Trust AI project has generated a great amount of enthusiasm among participant researchers (about 200) and partner organizations (87), as well as organizations that do not have the capacity to participate but look forward to outcomes they can use, because it deals with issues that are already dramatically changing the way we act, behave and think. We have a unique and essential contribution to make, because we have the means of creating knowledge ensuring that digital data and records are controlled and made accessible in a trustworthy, authentic form wherever they are located; are promptly available when needed; duly destroyed when required; and accessed only by those who have a right to do so.

References

[1] L. Duranti, K. Thibodeau, The concept of record in interactive, experiential and dynamic environments: the view of InterPARES, Archival Science 6 (2006) 13–68.
[2] S. Russell, P. Norvig, Artificial intelligence: a modern approach, Pearson Education Limited, London, 2013.
[3] T. M. Mitchell, Machine learning, McGraw-Hill, New York (1997) 154–200.
[4] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[5] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT Press, 2016.
[6] E. Granell, E. Chammas, L. Likforman-Sulem, C.-D. Martínez-Hinarejos, C. Mokbel, B.-I. Cîrstea, Transcription of Spanish historical handwritten documents with deep neural networks, Journal of Imaging 4 (2018) 15.
[7] A. Graves, et al., Supervised sequence labelling with recurrent neural networks, volume 385, Springer, 2012.
[8] Y. LeCun, et al., Generalization and network design strategies, Connectionism in perspective (1989) 143–155.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I.
Polosukhin, Attural significance, by providing better access to historical tention is all you need, Advances in neural informaterial, and a social and scientific significance, by mak- mation processing systems 30 (2017). ing current records easier to organise, retrieve and use by [10] S. Abney, Semisupervised learning for computaboth their creators and the public at large. (4) While there tional linguistics, CRC Press, 2007. [11] A. Søgaard, Semi-supervised learning and domain (2016) 5–24.

adaptation in natural language processing, Synthe- [26] C. Miskinis, The history and creation of the digital sis Lectures on Human Language Technologies 6 twin concept, Challenge Advisory. March (2019). (2013) 1–103. [12] X. Zhu, A. B. Goldberg, Introduction to semisupervised learning, Synthesis lectures on artificial intelligence and machine learning 3 (2009) 1–130. [13] M. Abdul-Mageed, L. Ungar, Emonet: Fine-grained emotion detection with gated recurrent neural networks, in: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), 2017, pp. 718–728. [14] C. Zhang, M. Abdul-Mageed, No army, no navy:

Bert semi-supervised learning of arabic dialects, in: Proceedings of the Fourth Arabic Natural Language

Processing Workshop, 2019, pp. 279–284. [15] J. M. Jordan, V. Salvatore, B. Endicott-Popovsky,

V. Gandhi, C. O’Keefe, M. S. Sotebeer, M. Stiber, Graph-based simulation of emergency services communications systems, in: Proc. 2022 Annual Modeling and Simulation Conference, San Diego, CA, submitted, 2022. [16] J. Conquest, M. Stiber, Software and data provenance as a basis for escience workflow, in: IEEE eScience, IEEE, online, 2021. [17] E. C. Francomano, H. Bamford, Whose digital middle ages? accessibility in digital medieval manuscript culture, Journal of Medieval Iberian

Studies (2022) 1–13. [18] K. Pal, M. Terras, T. Weyrich, 3d reconstruction for damaged documents: imaging of the great parchment book, in: Proceedings of the 2nd International Workshop on Historical Document Imaging and

Processing, 2013, pp. 14–21. [19] V. Frinken, A. Fischer, C.-D. Martínez-Hinarejos,

Handwriting recognition in historical documents using very large vocabularies, in: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, 2013, pp. 67–72. [20] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014). [21] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He,

J. Liang, East: an eficient and accurate scene text detector, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 5551–5560. [22] J. Redmon, A. Farhadi, Yolov3: An incremental im

provement, arXiv preprint arXiv:1804.02767 (2018). [23] R. Kitchin, T. P. Lauriault, M. W. Wilson, Under

standing spatial media, Sage, 2017. [24] C. G. D. Infrastructure, Natural resources canada,

https://doi.org/10.4095/328060 (2020). [25] S. Arctic, Spatial data infrastructure (sdi) manual for the arctic, Arctic Council: Ottawa, ON, Canada