Trusted Data Forever: Is AI the Answer?

Emanuele Frontoni¹, Marina Paolanti¹, Tracey P. Lauriault², Michael Stiber³, Luciana Duranti⁴ and Muhammad Abdul-Mageed⁵

¹ VRAI Vision Robotics and Artificial Intelligence Lab, University of Macerata, Italy
² Critical Media and Big Data Lab, Carleton University, Ottawa, ON K1S 5B6, Canada
³ Intelligent Networks Lab, University of Washington Bothell, WA, USA
⁴ InterPARES Lab, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
⁵ NLP and ML Lab, University of British Columbia, Vancouver, BC V6T 1Z4, Canada

Abstract
Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, and as vehicles for holding the past accountable and informing the future. This commitment is guaranteed through the adoption of strategic and technical measures for the long-term preservation of digital assets in any medium and form — textual, visual, or aural. Public and private archives are the largest providers of data big and small in the world and collectively host yottabytes of trusted data, to be preserved forever. Several aspects of retention and preservation, arrangement and description, management and administration, and access and use are still open to improvement. In particular, recent advances in Artificial Intelligence (AI) open the discussion as to whether AI can support the ongoing availability and accessibility of trustworthy public records. This paper presents preliminary results of the InterPARES Trust AI ("I Trust AI") international research partnership, which aims to (1) identify and develop specific AI technologies to address critical records and archives challenges; (2) determine the benefits and risks of employing AI technologies on records and archives; (3) ensure that archival concepts and principles inform the development of responsible AI; and (4) validate outcomes through a conglomerate of case studies and demonstrations.

Keywords
Artificial Intelligence, Machine Learning, Deep Learning, Archives, Trustworthiness

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
emanuele.frontoni@unimc.it (E. Frontoni); marina.paolanti@unimc.it (M. Paolanti); Tracey.Lauriault@carleton.ca (T. P. Lauriault); stiber@uw.edu (M. Stiber); Luciana.Duranti@ubc.ca (L. Duranti); muhammad.mageed@ubc.ca (M. Abdul-Mageed)
ORCID: 0000-0002-8893-9244 (E. Frontoni); 0000-0002-5523-7174 (M. Paolanti); 0000-0003-1847-2738 (T. P. Lauriault); 0000-0002-1061-7667 (M. Stiber); 0000-0001-7895-1066 (L. Duranti); 0000-0002-8590-2040 (M. Abdul-Mageed)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Introduction

Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, to hold the past accountable, and as evidence to inform future plans. A record – or archival document – is any document (i.e. information affixed to a medium, with stable content and fixed form) made or received in the course of an activity, and kept for further action or reference. Because of the circumstances of its creation, a record is a natural by-product of activity, is related to all the other records that participate in the same activity, is impartial with respect to the questions that future researchers will ask of it, and is authentic as an instrument of activity. This is why records are inherently trustworthy. Thus, their preservation must ensure that any activity carried out on the records to identify, select, organize, and describe them, and to make them accessible to the people at large, keeps them trustworthy, that is, reliable (i.e. their content can be trusted), accurate (i.e. the data in them are unchanged and unchangeable), and authentic (i.e. their identity and integrity are intact). This is particularly difficult in the digital environment because the content, structure and form of records are no longer inextricably linked as they used to be in the traditional records environment [1]. The issue exists for both digital and digitized records, and it is rendered more serious by the sheer number and volume of records that have accumulated over time and are being created today in a large variety of systems.

The InterPARES (International research on Permanent Authentic Records in Electronic Systems) project has addressed these issues since 1990, focusing on current and emerging technologies as they evolve and developing theory, methods, and frameworks that allow for the ongoing preservation of the records resulting from the use of such technologies (www.InterPARES.org). The latest iteration of InterPARES, I Trust AI, is funded, as were the previous projects, by the Social Sciences and Humanities Research Council of Canada, but it differs from them in that it is not concerned with the records produced by a specific technology. Its purpose is to use AI to carry out archival functions for the long-term control of all records, on any medium and from any age, and to do so in such a way that the trustworthiness of the records remains protected and verifiable, and that the tools and processes are transparent, unbiased, equitable, inclusive, responsible (i.e. protecting autonomy and privacy) and sustainable (www.interparestrustai.org).

There have been several projects looking at AI in archives, but they typically look at a particular tool in a specific context, or even at a single set of records, and they tend to use off-the-shelf tools. The research question I Trust AI asks is: "what would AI look like if archival concepts, principles and methods were to inform the development of AI tools?" What is lacking is comprehensive, systematic research into the use of AI to carry out the different archival functions in an integrated way and to ensure the continuing availability of verifiable trustworthy records, so as to prevent the erosion of accountability, evidence, history and cultural heritage. Thus, we are addressing the technological issues from the perspective of archival theory, by integrating the technology with complex human-oriented tools. The objectives of I Trust AI are to: 1. identify specific AI technologies that can address critical records and archives challenges; 2. determine the benefits and risks of using AI technologies on records and archives; 3. ensure that archival concepts and principles inform the development of responsible AI; and 4. validate outcomes from Objective 3 through case studies and demonstrations. Our approach is two-pronged, comprising the practical and immediate need to address large-scale existing problems, and the longer-term need to have AI-based tools that are reliably applicable to future problems.
Our short-term approach focuses on identifying high-impact problems and limitations in records and archives functions, and applying AI to improve the situation. This will be achieved via collaboration between records and archival scientists and professionals on the one hand, and AI researchers and industry experts on the other. Our long-term approach focuses on identifying the tools that records and archives specialists will need in the future to flexibly address their ever-changing needs. This includes decision support and, once decisions are made, rapid implementation of AI-based solutions to those needs.

The I Trust AI project is a multinational interdisciplinary endeavour, and this means that our first effort must be to understand each other, starting with the language we use. For example, archival professionals talk about records, while computer scientists and AI professionals talk about data. To the former, data are the smallest meaningful unit of information in a record. To an AI specialist, data are arrangements of information (possibly in a database), be these facts or not, regardless of their size, nature and form. Thus, for the purposes of this paper, which is directed to data analytics specialists, we will use the term data.

Public and private archives are the largest providers of data big and small in the world, as they collectively host yottabytes of trusted data, to be preserved forever. Their creators are organizations and individuals from myriad sectors and disciplines, from public administration to academia and businesses of all kinds (e.g. banking, engineering, architecture, gaming), and from Indigenous communities, civil society organizations, associations, and virtual communities. Table 1 reports an example of the quantities involved: the Italian State Central Archives (ACS) stores 67 TB of digital objects, broken down by typology of digitised heritage data (in TIFF), while the National Archives of the US holds 1,323 terabytes of electronic data.

Table 1: Digitalised Heritage Data and Size

Fondo Ufficio italiano brevetti e marchi, Trademarks series: volumes with trademark registrations (30 TB)
Official collection of laws and decrees (15 TB)
Fund A5G (First World War): files with various documents (reports, correspondence) (1 TB)
Special collections (documents declassified under the Renzi and Prodi Directives): reports, circulars (2 TB)
Judgments of military courts (3 TB)
Various photographic funds (2 TB)
Digitised study room inventories (15 TB)
National Archives of the US (1,323 TB)

This paper presents some preliminary results of the I Trust AI international research partnership and is organized as follows: Section 2 provides a general discussion of AI and its subsets; Section 3 describes three of the roughly forty studies now in progress; and Section 4 presents an overview of the kinds of studies being pursued at this time, and a conclusion.

2. Artificial Intelligence and Deep Learning

There are various definitions of what AI is. For example, Russell and Norvig [2] define AI as a field focused on the study of intelligent agents that perceive their environment and take actions to maximize their chance of success at some goal. In recent times, however, AI has been used much more widely to refer to any technology with some level of automation, especially resulting from the application of deep learning in various domains. Deep learning (DL) is a sub-field of machine learning (ML), and both are sub-fields of AI. Like AI, ML is defined in several ways. A classical definition comes from Mitchell [3], who provides a procedural definition maintaining that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. As to DL [4, 5], it is a class of ML methods inspired by information processing in the human brain. What makes DL powerful is its ability to automatically learn useful representations from data. DL algorithms, unlike classical ML methods, are able not only to learn the mapping from representation to output, but also to learn the representation itself [6], thus alleviating the need for costly human expertise in crafting features for models. DL has achieved success in recent years in a wide variety of applications in many domains involving various types of data modalities, such as language, speech, image, and video.
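Mitchell's E/T/P formulation can be made concrete with a short sketch. The following is a minimal illustration, assuming scikit-learn and synthetic stand-in data rather than real archival records: performance P (accuracy) at a task T (binary classification) typically improves as experience E (the number of labelled training examples) grows.

# A minimal illustration of Mitchell's definition of learning, assuming
# scikit-learn and synthetic stand-in data (not real archival records):
# performance P (accuracy) at a task T (binary classification) typically
# improves with experience E (the number of labelled training examples).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 200, 1000):                      # growing experience E
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    p = accuracy_score(y_test, model.predict(X_test))
    print(f"E = {n:4d} labelled examples -> P = {p:.3f}")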
As mentioned, DL mimics information processing in the brain. This is done by designing artificial neural networks arranged in multiple layers (hence the term deep) that take input, attempt to learn a good representation of it, and map it to an output decision (e.g., is this text in Greek or Latin?). The way these networks are designed can vary; thus, various types of deep learning architectures have been proposed. Two main types of DL architectures have been quite successful: recurrent neural networks (RNNs) (e.g., [7]), a family of networks specializing in processing sequential data, and convolutional neural networks (CNNs) [8], an architecture specializing in data with a 'grid-like' topology (e.g., image data) [5]. More recent advances, however, abstract away from these two main types toward more dynamic networks such as the Transformer [9].

In general, ML and DL methods learn best when given large amounts of labeled data (e.g., for a model that detects sensitive information, labels can be from the set {sensitive, not-sensitive}). DL in particular is data-hungry and tends to learn best given large amounts of labeled data. This type of learning with labeled data is called supervised learning. It is also possible to work with smaller labeled datasets. In these cases, training samples can be grown iteratively by exploiting unlabeled data, either based on decisions from an initial model (self-training) or using decisions from various initial models (co-training). This is called semi-supervised learning [10, 11, 12]; a minimal self-training loop is sketched below.
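# A minimal sketch of the self-training loop described above, assuming
# scikit-learn-style NumPy arrays; the classifier, confidence threshold and
# number of rounds are illustrative choices, not a prescription.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, rounds=3, threshold=0.95):
    """Grow the labelled set iteratively from confident pseudo-labels."""
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        proba = model.predict_proba(X_unlab)
        keep = proba.max(axis=1) >= threshold   # high-confidence predictions
        if not keep.any():
            break
        # Add pseudo-labelled samples and remove them from the unlabelled pool.
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[keep].argmax(axis=1)]])
        X_unlab = X_unlab[~keep]
    return model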
The third main type of ML method is unsupervised learning, where a model usually tries to cluster the data without access to any labels. There are other paradigms, such as distant supervision, where a model attempts to learn from surrogate cues in the data in the absence of high-quality labels (e.g., [13]). Self-supervised learning, where real-world data are turned into labeled data by masking certain regions (e.g., removing some words or parts of an image) and tasking a model with predicting the identity of these masked regions, is currently a very successful approach.
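The following minimal sketch shows how such self-supervision turns unlabelled text into (input, target) training pairs by masking; the predictor itself (e.g. a Transformer) is out of scope here, and all names are illustrative.

# Self-supervised label creation by masking: unlabelled text becomes
# (input, target) training pairs. A model such as a Transformer would then
# be trained to predict the masked word; that part is not shown here.
import random

def mask_words(sentence, mask_token="[MASK]", p=0.15, seed=0):
    """Turn one unlabelled sentence into masked-word prediction pairs."""
    rng = random.Random(seed)
    words = sentence.split()
    pairs = []
    for i, w in enumerate(words):
        if rng.random() < p:
            masked = words[:i] + [mask_token] + words[i + 1:]
            pairs.append((" ".join(masked), w))    # (input, label)
    return pairs

for x, t in mask_words("records must remain reliable accurate and authentic", p=0.4):
    print(f"input: {x!r} -> target: {t!r}")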
These various methods of supervision can also be combined to solve downstream tasks. For example, Zhang and Abdul-Mageed [14] combine self-supervised language models with classical self-training methods to solve text classification problems. The next section will introduce three I Trust AI studies.

3. Case Studies in I Trust AI

3.1. Data from Emergency Services Communications Systems

One of the cornerstones of public safety and societal wellbeing is a reliable and comprehensive emergency services communications system (ESCS, such as 9-1-1 in the US and Canada). Such systems can be considered to encompass the organizations, electronic infrastructure, and policies and procedures that enable answering and responding to emergency phone calls [15]. As might be expected of systems that originate in analog, switched telephony, ESCS evolution into a digital system has resulted in a haphazard conglomeration of subsystems, and it generally needs re-imagining as a modern technological solution. In the US, this change has been termed the "Next Generation 911" (NG911) project.

A transformation such as NG911 once again re-casts ESCS as keystone information and communication technology, subject to all of the concerns of such systems: cybersecurity, privacy, crisis preparedness, strategic and operational decisions, etc. At the same time, it opens up possibilities for data analytics to improve ESCS performance, inform funding decisions, monitor the health of societies and their infrastructures, and serve as early warning for natural and human-made crises. This study connects large-scale simulations of ESCS to historical data from ESCS operations to develop and document an understanding of how to preserve authentic, reliable data that can be used for applications such as the re-creation of past events (as might be done to support training or to explore the effects of changes in policies and procedures), testing system operation in one locale based on data from another, and examining how various data analytics techniques might be usefully applied — understanding what characteristics of these data might be usable to researchers. Specifically, we are addressing the following research questions:

What real-world and simulation ESCS data are available to be preserved for access by researchers? The answer to this varies greatly from locale to locale, depending on the technology in use, public policy, and controlling agency procedures. Moreover, for such data to be available, we must understand the privacy and security risks associated with transferring them from their current owners to a research environment, along with the risk of misinterpreting them if they are decontextualized from whatever tacit knowledge might exist within the owning organizations. We are also considering pragmatic issues, such as building a knowledge base of legal restrictions on collection in various jurisdictions, formal processes for collecting these data (such as data sharing agreements), variation in the culture of practice surrounding such data, potential biases that might result from systematic differences in different areas' capacity to collect and share data (such as might arise from regional funding differences), and understanding the metadata and other information (such as ESCS physical and operational structure and, generally, the policies and practices that determine what data are generated and collected, and how).

What are the challenges and benefits of discovering knowledge patterns from historical ESCS data? These patterns will serve as clues for developing protocols for ESCS managers to follow regarding data collection, and as clues for how these data can be applied for reuse. We will consult with external stakeholders to seek advice and to run thought experiments using surveys, think-aloud exercises, retrospective first-hand accounts, etc. We will also examine historical records of disasters for which the preserved data are more complete than basic ESCS datasets.

What other data/metadata associated with emergency events are not part of the ESCS data stream? From our preliminary examination, typical ESCS data currently involve lists of individual calls and information directly associated with such calls (perhaps including full or partial phone numbers, call categorization, GPS coordinates, responder information, response times, etc.). What these datasets do not directly include are events and data that are external to the call stream but are the reason for such calls (traffic, weather, geopolitical events, and so on). Some of these additional data may be present in other sources of information in a format that can be reasonably collected in tandem with call data. On the other hand, it may be the case that other causal information must be inferred from available data (and, of course, it may be the case that a combination of inference and extraction from other data streams could be useful). For the inferencing task, we propose using an examination of simulation results and simulation artifact provenance information as exemplars to develop a set of specifications for what an AI-driven system would need to accomplish [16].
What are the roles of the disciplines of Archival Science and Artificial Intelligence in building a central repository for ESCS data? Both the individual fields of archival science and artificial intelligence, plus their overlap or combination that could be considered to fall within the realm of data science, have a number of roles in the organization and interpretation of ESCS data. We provide examples below from the application of real-world data to simulations:

• Generating requirements for simulator design so that simulation output matches real data in terms of format, metadata, etc.
• Analyzing and comparing simulation output with real-world data.
• Synthesizing ESCS data that match features of real-world data as part of an overall ESCS simulation.
• Using real-world call data to drive a simulation of an emergency response system, for example, to allow a "replay" of a previous disaster or to investigate how modifications to such a system might produce different outcomes (a sketch of such a replay follows this list).
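# A hedged sketch of "replaying" historical call records through a toy
# dispatch simulation. The record fields (timestamp, category, location),
# the fixed service time and the nearest-free-responder policy are
# illustrative assumptions, not the project's actual schema or simulator.
from dataclasses import dataclass, field

@dataclass(order=True)
class Call:
    timestamp: float                      # seconds from start of the replay
    category: str = field(compare=False)  # e.g. "fire", "medical"
    location: tuple = field(compare=False)

def replay(calls, responders=2, service_time=600.0):
    """Drive the simulation from a time-ordered stream of historical calls."""
    free_at = [0.0] * responders          # when each responder is next free
    waits = []
    for call in sorted(calls):
        i = min(range(responders), key=lambda k: free_at[k])
        start = max(call.timestamp, free_at[i])
        waits.append(start - call.timestamp)
        free_at[i] = start + service_time
    return waits                          # per-call waiting times

calls = [Call(0.0, "fire", (0, 0)), Call(120.0, "medical", (1, 3)),
         Call(180.0, "medical", (5, 2))]
print(replay(calls))                      # -> [0.0, 0.0, 420.0]

Under a modified policy or responder count, the same historical stream yields different waiting times, which is the sense in which replay supports "what if" investigation of system changes.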
3.1.1. Progress

This work-in-progress is in its initial stages, preparing for the point in time when we can begin collecting ESCS data. Specifically, we are:

• working with a small set of external partners to develop a general understanding of ESCS operations, policies, and procedures, as well as identifying which data exist;
• developing a process within our project for working with a selected ESCS management organization to build an understanding of their specific operations and data;
• fleshing out a model data sharing agreement to serve as a starting point for discussions surrounding transferring data to our research environment;
• consulting with our institutional review board regarding the use of this particular set of human data;
• and configuring a secure internal data storage system.

Subsequently, we will prepare an initial case study in which we will apply the above to a single locale: we will go through all of the steps of understanding the ESCS processes that produced the data, developing a data sharing agreement, and collecting data and metadata.

3.2. Learning from Parchments

The digitization of historical parchments is extraordinarily convenient, as it allows easy access to the documents from remote locations and avoids the possible adverse effects of their physical management and access [17]. This is particularly valuable for the archives and museums that preserve such invaluable historical documents, whose contents are unpublished, which cannot be fully restored by conventional tools if damaged, and which are difficult to read in the original due to high levels of damage and the delicate nature of the material. Damaged parchments are notably prevalent in archives all over the world [18]. Their digital representations reduce both damage and access issues by providing users with the possibility of reading their contents at any moment, from remote locations, and without necessitating the potentially harmful physical handling of the document. Thus, the automatic analysis of digitized parchments has become an important research topic in the fields of image and pattern recognition. It has been a considerable research issue for several years, gaining attention recently because of the value that may be unlocked by extracting the information stored in historical documents [19]. Interest in applying AI/ML to ancient image data analysis is becoming widespread, and scientists are increasingly using this method as a powerful and complex process for statistical inference. Computer-based image analysis provides an objective method of identifying visual content independently of subjective personal interpretation, while potentially being more sensitive, consistent and accurate than physical human analysis. Learned representations often result in much better performance than hand-designed representations when it comes to these types of texts. Until now, parchment analysis has required physical user interaction, which is very time consuming. Hence, the effective automatic feature extraction capability of Deep Neural Networks (DNNs) decreases the demand for manual extraction processes.

Considering the above, PergaNet, a lightweight DL-based system for the historical reconstruction of ancient parchments, has been specifically designed and developed for this type of analysis. The aim of PergaNet is to automate the analysis and processing of large volumes of scanned parchments. This problem has not yet been deeply investigated by the computer vision community, as parchment scanning technology is still novel, but it has proven to be extremely effective for data recovery from historical documents whose content is inaccessible due to the deterioration of the medium. The proposed approach aims to reduce hand-operated analysis while using manual annotations as a form of continuous learning. The whole system, however, requires digital labour, such as the manual tagging of large training datasets. Large datasets remain necessary to boost the performance of DL models, and manually verified data will be used for continuous learning and maintained as training datasets. PergaNet comprises three important phases: the classification of parchments as recto/verso, the detection of text, and the detection and recognition of the "signum tabellionis" (i.e. the identifier of the author). PergaNet concerns not only the recognition and classification of the objects present in the images, but also their location. This I Trust AI study expands the implementation of AI guided by archival institutions and programs, as this method could be used by many other archives for different types of documents. The analysis is based on data about the ordinary use by researchers of this type of material, and does not involve altering or manipulating techniques aimed at generating data. This provides actionable insights that help identify text as documentary form rather than as reading content.

The DL pipeline is depicted in Figure 1. We chose the VGG16 network [20] for its suitability and effectiveness in image classification tasks, and we were inspired by the work of Zhou et al. [21] in the way in which PergaNet detects the text in the image. This phase allows for the exclusion of the text on the parchment during the recognition of the signa; the DNN model chosen for word detection is EAST [21]. Finally, a convolutional neural network has been employed for signa detection. Our approach uses YOLOv3 [22], an algorithm that processes images in real time, predicting bounding box locations and classifying these locations in one pass. We chose this algorithm because of its computational efficiency and its precision in detecting and classifying objects. The network is pre-trained on COCO (https://cocodataset.org/#home), a publicly available dataset; this choice was made to reduce the need for a large amount of training data, which would come with a high computational cost.

[Figure 1: PergaNet DL pipeline. The pipeline consists of three stages: classification of parchments as recto/verso, the detection of text, then the detection and recognition of the "signum tabellionis". First, a VGG16 network trained on a dataset of scanned parchments solves the recto/verso classification task. Next, the text in the image is detected. Then, YOLOv3 is used to predict bounding box locations and classify these locations in one pass.]
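The three-stage pipeline can be outlined in code. In the sketch below, only the VGG16 head replacement uses a real torchvision API; the EAST [21] and YOLOv3 [22] stages are stubs to be filled with actual detector implementations, and the box filtering is deliberately naive.

# A sketch of the three-stage PergaNet pipeline described above; stage 2
# and stage 3 are stubs, and the exclusion of text regions is simplified.
import torch
import torch.nn as nn
from torchvision import models

# Stage 1: recto/verso classifier, a pretrained VGG16 with a 2-class head.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier[6] = nn.Linear(4096, 2)   # replace the 1000-class output layer

def detect_text_boxes(image):
    """Stub for an EAST-style scene-text detector returning bounding boxes."""
    raise NotImplementedError("plug in an EAST implementation here")

def detect_signa(image):
    """Stub for a YOLOv3 detector (COCO-pretrained, fine-tuned on signa)."""
    raise NotImplementedError("plug in a YOLOv3 implementation here")

def perganet(image: torch.Tensor):
    """image: a normalized (3, 224, 224) tensor of a scanned parchment."""
    side = vgg(image.unsqueeze(0)).argmax(dim=1).item()  # 0 = recto, 1 = verso
    text_boxes = detect_text_boxes(image)
    # Exclude detections that coincide with text regions (a real system
    # would test geometric overlap rather than simple membership).
    signa = [b for b in detect_signa(image) if b not in text_boxes]
    return side, signa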
3.3. Digital Twin Study

Spatial media [23] and spatial data infrastructures (SDI) [24] have become normalized as complex, interconnected global, regional, national, and personalized social and technological systems of systems. Simply consider the monitoring of climate at a global scale to inform the logistics of production chains, to predict, preempt and prevent disaster resulting from natural calamities on local physical infrastructure, or simply to report the temperature and humidity levels outside to your smart phone to help you plan your day. This occurs seamlessly in the background of our daily activities and involves a vast complex of sensors; databases; cloud computing centres; telecommunication networks and the internet; standards; code, software and platforms; people and institutions; laws, regulation and policies; communication systems; and, of course, AI/ML [25].

An emerging subset of these spatial data infrastructures are digital twins (DTs). A DT is an ecosystem of multi-dimensional and interoperable subsystems made up of physical things in the real world, digital versions of those real things, synchronized data connections between them, and the people, organizations and institutions involved in creating, managing, and using these (CIMS, 2021, https://canadasdigitaltwin.ca/about-2-2/). In terms of physical and real things, consider a building or a car manufacturing plant; a digital representation of those things in a digital platform or an interactive virtual reality game engine; and an internet of things (IoT) system of sensors and databases that communicates, in real and near-real time, between the buildings or the manufacturing processes in the plant and the people and institutions that own and operate these. Contemporary examples of DTs are the modeling and managing of the construction of Sweden's new high-speed rail system (https://www.unrealengine.com/en-US/spotlights/visualizing-sweden-s-first-high-speed-railway-with-real-time-technology); Hyundai car and ship manufacturing plants (https://www.ajudaily.com/view/20220107083928529); and DTs built as part of smart city strategies (see the submissions to the Infrastructure Canada Smart City Challenge, https://www.infrastructure.gc.ca/sc-vi/map-applications.php). DTs originate in the aerospace industry, first with NASA's Apollo 13 in the 1970s (although in that case it was a physical replica to help troubleshoot issues of a ship in flight), and were predominantly used in manufacturing and logistics [26].

Increasingly, DTs involve building information modeling (BIM), such as the proprietary Revit platform or the open-source BlenderBIM, whereby a building is conceptualized and rendered into a 3-dimensional drawing with attributes captured in a database, often replacing typical blueprints. The BIM informs the construction of the building, and it is a record of the building once completed. BIM renderings are increasingly being submitted as part of the building permit approval process (BuildingSmart, 2020, e-submission common guidelines for introducing BIM to the building process, https://www.buildingsmart.org/wp-content/uploads/2020/08/e-submission-guidelines-Published-Technical-Report-RR-2020-1015-TR-1.pdf) and are updated into as-is BIMs for ongoing operations. BIMs are also used to estimate material costs, as they are interconnected with material vendor databases and electrical and heating systems. BIMs interrelate with smart asset management systems (AMS), which inform building maintenance and operations and monitor internal climate such as temperature, humidity, air flow and quality; these are inputs for the AI/ML systems that remotely manage heating and cooling systems, electricity use and consumption, as well as maintenance schedules, and that inform ongoing decision making.

[Figure 2: Integrating diverse databases into BIM (image created by Nico Arellano, CIMS 2021).]
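The synchronized connection between a BIM asset record and its IoT sensor readings is the kind of linkage an archival package for a DT would need to capture. The following hedged sketch illustrates that linkage only; all field names are illustrative assumptions, not an actual BIM, AMS or Revit schema.

# A toy link between one BIM asset and its timestamped sensor stream.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BimAsset:
    asset_id: str        # e.g. a BIM element identifier
    asset_type: str      # e.g. "air-handling unit"
    building: str
    readings: list = field(default_factory=list)

    def ingest(self, sensor: str, value: float, unit: str):
        """Record a timestamped sensor reading against the physical asset."""
        self.readings.append({
            "sensor": sensor, "value": value, "unit": unit,
            "observed_at": datetime.now(timezone.utc).isoformat(),
        })

ahu = BimAsset("AHU-03-0017", "air-handling unit", "campus building 7")
ahu.ingest("supply-air-temperature", 18.4, "degC")
ahu.ingest("relative-humidity", 41.0, "%")
print(len(ahu.readings), "readings linked to asset", ahu.asset_id)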
In this I Trust AI Digital Twin case study we collaborate with researchers working on the Imagining the Canada Digital Twin (ICDT) project (https://canadasdigitaltwin.ca/), funded by the Canadian New Frontiers in Research Fund (https://www.sshrc-crsh.gc.ca/funding-financement/nfrf-fnfr/index-eng.aspx), which proposes a national, inclusive, and multidisciplinary research consortium for the creation of a technical, cultural, and ethical framework to build and govern the technology, data and institutional arrangements of Canada's DT. ICDT focuses on the built environment, concentrating on the Architecture, Engineering, Construction, and Owner Operator (AECOO) industry. ICDT is led by the Carleton Immersive Media Studio (CIMS, https://cims.carleton.ca/#/home), which is developing a DT prototype of the Montréal-Ottawa-Toronto corridor using a simulated, distributed server network. The research study involves an interdisciplinary team of architects, data scientists, engineers, building scientists, archival professionals and critical data studies scholars from Carleton University (CA), Luleå University of Technology (SE), the Swedish Transport Administration, and the University of Florence (IT), who will develop a Use and Creation preservation case study. The study aims to preserve the DT of campus buildings and structures created as part of SUSTAIN (https://cims.carleton.ca/#/projects/Sustain) and the Carleton Digital Campus Innovation (DCI) project (https://www.cims.carleton.ca/#/projects/DigitalCampusInnovation), which integrates Building Performance Simulation (BPS) technologies with BIM on a campus scale: building information management systems (BIM), Asset Management Systems (AMS), visualizations of the digital structures in the Unreal game engine, VR and modelling, AI/ML, and real-time data for decision making.

The Carleton Campus DT data for seven buildings belong to the University, which must preserve them as official records; they are used for operations by Facilities, Management and Planning (FMP), and they are part of research and development for the CIMS and SUSTAIN projects, thus constituting research data that must be managed and deposited in a trusted digital repository.

The implications of this research are important to the archival community, which will increasingly have to ingest complex record sets such as these, as well as create archival packages that maintain the integrity of these complex interlinked DT systems through time. This study will be one of the first globally to examine the preservation of a DT. Its research questions are: Can a digital twin be preserved, and what is required at the point of creation to ensure that it can be? Can information about the AI tools, automation and real-time data involved in this complex data, social and technological system be preserved, and how? And what might be the role of AI/ML in creating an archival package to ingest a digital twin? The outputs of this research will provide empirical data to meet the objectives of the I Trust AI project, and they will also provide Carleton University with the opportunity to test the preservation of Campus DT records in its institutional archives. In the process, it will inform the technology sectors involved in the creation of DTs, so that they may build in, at the point of creation, the necessary bread crumbs for long-term preservation.

4. Conclusions and Future Works

The studies presented above are only 3 of about 40 in-progress studies, which cover a wide range of subjects and issues, such as enterprise master data management, the preservation of AI techniques as paradata, the modelling of an AI-assisted digitization project, the gamification of the archival experience for users, the declassification of personal information using AI tools, and user approaches and behaviours in accessing records and archives in the perspective of AI. The challenges we are addressing with this project have never before been systematically and globally dealt with; the undertaking is enormous and fraught, but critical. While the risks of using AI to solve the problems of managing the ever-growing, ever-more-diverse bodies of public and private records throughout their lifecycle, from creation to preservation and access, are unknown, the risks of not acting in concert to do so are unacceptable: loss of the ability to secure people's rights, of evidence of past acts and facts to serve as a foundation for decision making, and of historical memory.

This project will significantly impact society in several areas. (1) Record-keeping in local and national government agencies is a vital part of our society's ability to maintain oversight on, and accountability of, governance; but, given their inability to handle the vast quantities of digital records, public bodies risk undermining their own legitimacy as oversight bodies if they cannot appropriately process and make accessible information in a timely fashion. By helping address this crisis through the development, evaluation, and contextualization of AI techniques, we contribute to the ability of agencies and institutions to maintain their place in our democracies. (2) Automation techniques can potentially aid the economic viability of many cash-starved records offices and archival institutions by ensuring that professional records management and archival expertise are used wisely, with classification tools and technology-assisted review (TAR) allowing a quick review and assessment of vast quantities of records. Similarly, with businesses depending on records agencies for routine activities, improved speed in responding to queries will have a positive effect on the economy. (3) AI techniques have the potential to aid the accessibility of records in archives to new audiences, for instance by translating and indexing historical materials written in Indigenous languages, sensitising problematic archival descriptions, helping patrons find connected items, or captioning historical photographs. These techniques have both a cultural significance, by providing better access to historical material, and a social and scientific significance, by making current records easier to organise, retrieve and use by both their creators and the public at large. (4) While there have been numerous calls to action to systematically explore the application of AI techniques to the records and archives field, AI also currently faces major ethical challenges that will benefit from an archival theory perspective, for instance in dealing with bias and personal information. By exploring further the connections between AI and archives, this project is contributing, and will continue to contribute, to the intellectual progress of both fields. The I Trust AI project has generated a great amount of enthusiasm among participant researchers (about 200) and partner organizations (87), as well as among organizations that do not have the capacity to participate but look forward to outcomes they can use, because it deals with issues that are already dramatically changing the way we act, behave and think. We have a unique and essential contribution to make, because we have the means of creating knowledge ensuring that digital data and records are controlled and made accessible in a trustworthy, authentic form wherever they are located; are promptly available when needed; are duly destroyed when required; and are accessed only by those who have a right to do so.

References

[1] L. Duranti, K. Thibodeau, The concept of record in interactive, experiential and dynamic environments: the view of InterPARES, Archival Science 6 (2006) 13–68.
[2] S. Russell, P. Norvig, Artificial intelligence: a modern approach, Pearson Education Limited, London, 2013.
[3] T. M. Mitchell, Machine Learning, McGraw-Hill, New York (1997) 154–200.
[4] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[5] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT Press, 2016.
[6] E. Granell, E. Chammas, L. Likforman-Sulem, C.-D. Martínez-Hinarejos, C. Mokbel, B.-I. Cîrstea, Transcription of Spanish historical handwritten documents with deep neural networks, Journal of Imaging 4 (2018) 15.
[7] A. Graves, et al., Supervised sequence labelling with recurrent neural networks, volume 385, Springer, 2012.
[8] Y. LeCun, et al., Generalization and network design strategies, Connectionism in Perspective (1989) 143–155.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[10] S. Abney, Semisupervised learning for computational linguistics, CRC Press, 2007.
[11] A. Søgaard, Semi-supervised learning and domain adaptation in natural language processing, Synthesis Lectures on Human Language Technologies 6 (2013) 1–103.
[12] X. Zhu, A. B. Goldberg, Introduction to semi-supervised learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 3 (2009) 1–130.
[13] M. Abdul-Mageed, L. Ungar, EmoNet: Fine-grained emotion detection with gated recurrent neural networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 718–728.
[14] C. Zhang, M. Abdul-Mageed, No army, no navy: BERT semi-supervised learning of Arabic dialects, in: Proceedings of the Fourth Arabic Natural Language Processing Workshop, 2019, pp. 279–284.
[15] J. M. Jordan, V. Salvatore, B. Endicott-Popovsky, V. Gandhi, C. O'Keefe, M. S. Sotebeer, M. Stiber, Graph-based simulation of emergency services communications systems, in: Proc. 2022 Annual Modeling and Simulation Conference, San Diego, CA, submitted, 2022.
[16] J. Conquest, M. Stiber, Software and data provenance as a basis for eScience workflow, in: IEEE eScience, IEEE, online, 2021.
[17] E. C. Francomano, H. Bamford, Whose digital middle ages? Accessibility in digital medieval manuscript culture, Journal of Medieval Iberian Studies (2022) 1–13.
[18] K. Pal, M. Terras, T. Weyrich, 3D reconstruction for damaged documents: imaging of the Great Parchment Book, in: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, 2013, pp. 14–21.
[19] V. Frinken, A. Fischer, C.-D. Martínez-Hinarejos, Handwriting recognition in historical documents using very large vocabularies, in: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, 2013, pp. 67–72.
[20] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[21] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, EAST: an efficient and accurate scene text detector, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5551–5560.
[22] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[23] R. Kitchin, T. P. Lauriault, M. W. Wilson, Understanding spatial media, Sage, 2017.
[24] Canadian Geospatial Data Infrastructure, Natural Resources Canada, https://doi.org/10.4095/328060 (2020).
[25] Arctic SDI, Spatial data infrastructure (SDI) manual for the Arctic, Arctic Council: Ottawa, ON, Canada (2016) 5–24.
[26] C. Miskinis, The history and creation of the digital twin concept, Challenge Advisory, March (2019).