Trusted Data Forever: Is AI the Answer?

Emanuele Frontoni¹, Marina Paolanti¹, Tracey P. Lauriault², Michael Stiber³, Luciana Duranti⁴ and Muhammad Abdul-Mageed⁵

¹ VRAI Vision Robotics and Artificial Intelligence Lab, University of Macerata, Italy
² Critical Media and Big Data Lab, Carleton University, Ottawa, ON K1S 5B6, Canada
³ Intelligent Networks Lab, University of Washington Bothell, WA, USA
⁴ InterPARES Lab, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
⁵ NLP and ML Lab, University of British Columbia, Vancouver, BC V6T 1Z4, Canada

Abstract
Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, and as vehicles for holding the past accountable and informing the future. This commitment is guaranteed through the adoption of strategic and technical measures for the long-term preservation of digital assets in any medium and form — textual, visual, or aural. Public and private archives are the largest providers of data big and small in the world and collectively host yottabytes of trusted data, to be preserved forever. Several aspects of retention and preservation, arrangement and description, management and administration, and access and use are still open to improvement. In particular, recent advances in Artificial Intelligence (AI) open the discussion as to whether AI can support the ongoing availability and accessibility of trustworthy public records. This paper presents preliminary results of the InterPARES Trust AI ("I Trust AI") international research partnership, which aims to (1) identify and develop specific AI technologies to address critical records and archives challenges; (2) determine the benefits and risks of employing AI technologies on records and archives; (3) ensure that archival concepts and principles inform the development of responsible AI; and (4) validate outcomes through a conglomerate of case studies and demonstrations.

Keywords
Artificial Intelligence, Machine Learning, Deep Learning, Archives, Trustworthiness

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
emanuele.frontoni@unimc.it (E. Frontoni); marina.paolanti@unimc.it (M. Paolanti); Tracey.Lauriault@carleton.ca (T. P. Lauriault); stiber@uw.edu (M. Stiber); Luciana.Duranti@ubc.ca (L. Duranti); muhammad.mageed@ubc.ca (M. Abdul-Mageed)
ORCID: 0000-0002-8893-9244 (E. Frontoni); 0000-0002-5523-7174 (M. Paolanti); 0000-0003-1847-2738 (T. P. Lauriault); 0000-0002-1061-7667 (M. Stiber); 0000-0001-7895-1066 (L. Duranti); 0000-0002-8590-2040 (M. Abdul-Mageed)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Introduction

Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, to hold the past accountable, and as evidence to inform future plans. A record – or archival document – is any document (i.e. information affixed to a medium, with stable content and fixed form) made or received in the course of an activity, and kept for further action or reference. Because of the circumstances of its creation, a record is a natural by-product of activity, is related to all the other records that participate in the same activity, is impartial with respect to the questions that future researchers will ask of it, and is authentic as an instrument of activity. This is why records are inherently trustworthy. Thus, their preservation must ensure that any activity carried out on the records to identify, select, organize, and describe them, and to make them accessible to the people at large, keeps them trustworthy, that is, reliable (i.e. their content can be trusted), accurate (i.e. the data in them are unchanged and unchangeable), and authentic (i.e. their identity and integrity are intact). This is particularly difficult in the digital environment because the content, structure and form of records are no longer inextricably linked as they used to be in the traditional records environment [1]. The issue exists for both digital and digitized records, and it is rendered more serious by the sheer number and volume of records that have accumulated over time and are being created today in a large variety of systems.

The InterPARES (International research on Permanent Authentic Records in Electronic Systems) project has addressed these issues since 1990, focusing on current and emerging technologies as they evolve and developing theory, methods, and frameworks that allow for the ongoing preservation of the records resulting from the use of such technologies (www.InterPARES.org). The latest iteration of InterPARES, I Trust AI, is funded, as were the previous projects, by the Social Sciences and Humanities Research Council of Canada, but it differs from them in that it is not concerned with the records produced by a specific technology. Its purpose is to use AI to carry out archival functions for the long-term control of all records, on any medium and from any age, and to do so in such a way that the trustworthiness of the records remains protected and verifiable, and that the tools and processes are transparent, unbiased, equitable, inclusive, responsible (i.e. protecting autonomy and privacy) and sustainable (www.interparestrustai.org).

There have been several projects looking at AI in archives, but they typically look at a particular tool in a specific context, or even at a single set of records, and they tend to use off-the-shelf tools. The research question I Trust AI asks is: "what would AI look like if archival concepts, principles and methods were to inform the development of AI tools?" What is lacking is comprehensive, systematic research into the use of AI to carry out the different archival functions in an integrated way and to ensure the continuing availability of verifiable trustworthy records, so as to prevent the erosion of accountability, evidence, history and cultural heritage. Thus, we are addressing the technological issues from the perspective of archival theory, by integrating the technology with complex human-oriented tools. The objectives of I Trust AI are to: 1. identify specific AI technologies that can address critical records and archives challenges; 2. determine the benefits and risks of using AI technologies on records and archives; 3. ensure that archival concepts and principles inform the development of responsible AI; and 4. validate outcomes from Objective 3 through case studies and demonstrations. Our approach is two-pronged, comprising the practical and immediate need to address large-scale existing problems, and the longer-term need to have AI-based tools that are reliably applicable to future problems.
Our short-term approach focuses on identifying high-impact problems and limitations in records and archives functions, and applying AI to improve the situation. This will be achieved via collaboration between records and archival scientists and professionals on the one hand, and AI researchers and industry experts on the other. Our long-term approach focuses on identifying the tools that records and archives specialists will need in the future to flexibly address their ever-changing needs. This includes decision support and, once decisions are made, rapid implementation of AI-based solutions to those needs.

The I Trust AI project is a multinational interdisciplinary endeavour, and this means that our first effort must be to understand each other, starting with the language we use. For example, archival professionals talk about records, while computer scientists and AI professionals talk about data. To the former, data are the smallest meaningful unit of information in a record. To an AI specialist, data are arrangements of information (possibly in a database), be these facts or not, regardless of their size, nature and form. Thus, for the purposes of this paper, which is directed to data analytics specialists, we will use the term data.

Public and private archives are the largest providers of data big and small in the world, as they collectively host yottabytes of trusted data, to be preserved forever. Their creators are organizations and individuals from myriad sectors and disciplines, from public administration to academia and businesses of all kinds (e.g. banking, engineering, architecture, gaming), and from Indigenous communities, civil society organizations, associations, and virtual communities. Table 1 reports an example of the quantities involved: the Italian State Central Archives (ACS) stores 67 TB of digital objects, broken down by typology of digitised heritage data (in TIFF), while the National Archives of the US holds 1,323 terabytes of electronic data.

Table 1: Digitalised Heritage Data and Size

Fondo Ufficio italiano brevetti e marchi, Trademarks series: volumes with trademark registrations (30 TB)
Official collection of laws and decrees (15 TB)
Fund A5G (First World War): files with various documents (reports, correspondence) (1 TB)
Special collections (documents declassified under the Renzi and Prodi Directives): reports, circulars (2 TB)
Judgments of military courts (3 TB)
Various photographic funds (2 TB)
Digitised study room inventories (15 TB)
National Archives of the US (1,323 TB)

This paper presents some preliminary results of the I Trust AI international research partnership and is organized as follows: Section 2 provides a general discussion of AI and its subsets; Section 3 describes three of the roughly forty studies now in progress; and Section 4 presents an overview of the kinds of studies being pursued at this time, and a conclusion.

2. Artificial Intelligence and Deep Learning

There are various definitions of what AI is. For example, Russell and Norvig [2] define AI as a field focused on the study of intelligent agents that perceive their environment and take actions to maximize their chance of success at some goal. In recent times, however, AI has been used much more widely to refer to any technology with some level of automation, especially resulting from the application of deep learning in various domains. Deep learning (DL) is a sub-field of machine learning (ML), and both are sub-fields of AI. Like AI, ML is defined in several ways. A classical definition comes from Mitchell [3], who provides a procedural definition maintaining that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. As to DL [4, 5], it is a class of ML methods inspired by information processing in the human brain. What makes DL powerful is its ability to automatically learn useful representations from data. DL algorithms, unlike classical ML methods, are able not only to learn the mapping from representation to output, but also to learn the representation itself [6], thus alleviating the need for costly human expertise in crafting features for models. DL has achieved success in recent years in a wide variety of applications in many domains involving various types of data modalities, such as language, speech, image, and video.
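Mitchell's E/T/P formulation can be made concrete with a short sketch. The following is a minimal illustration, assuming scikit-learn and synthetic stand-in data rather than real archival records: performance P (accuracy) at a task T (binary classification) typically improves as experience E (the number of labelled training examples) grows.

# A minimal illustration of Mitchell's definition of learning, assuming
# scikit-learn and synthetic stand-in data (not real archival records):
# performance P (accuracy) at a task T (binary classification) typically
# improves with experience E (the number of labelled training examples).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 200, 1000):                      # growing experience E
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    p = accuracy_score(y_test, model.predict(X_test))
    print(f"E = {n:4d} labelled examples -> P = {p:.3f}")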
As mentioned, DL mimics information processing in the brain. This is done by designing artificial neural networks arranged in multiple layers (hence the term deep) that take input, attempt to learn a good representation of it, and map it to an output decision (e.g., is this text in Greek or Latin?). The way these networks are designed can vary; thus, various types of deep learning architectures have been proposed. Two main types of DL architectures have been quite successful: recurrent neural networks (RNNs) (e.g., [7]), a family of networks specializing in processing sequential data, and convolutional neural networks (CNNs) [8], an architecture specializing in data with a 'grid-like' topology (e.g., image data) [5]. More recent advances, however, abstract away from these two main types toward more dynamic networks such as the Transformer [9].

In general, ML and DL methods learn best when given large amounts of labeled data (e.g., for a model that detects sensitive information, labels can be from the set {sensitive, not-sensitive}). DL in particular is data-hungry and tends to learn best given large amounts of labeled data. This type of learning with labeled data is called supervised learning. It is also possible to work with smaller labeled datasets. In these cases, training samples can be grown iteratively by exploiting unlabeled data, either based on decisions from an initial model (self-training) or using decisions from various initial models (co-training). This is called semi-supervised learning [10, 11, 12]; a minimal self-training loop is sketched below.
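# A minimal sketch of the self-training loop described above, assuming
# scikit-learn-style NumPy arrays; the classifier, confidence threshold and
# number of rounds are illustrative choices, not a prescription.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, rounds=3, threshold=0.95):
    """Grow the labelled set iteratively from confident pseudo-labels."""
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        proba = model.predict_proba(X_unlab)
        keep = proba.max(axis=1) >= threshold   # high-confidence predictions
        if not keep.any():
            break
        # Add pseudo-labelled samples and remove them from the unlabelled pool.
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[keep].argmax(axis=1)]])
        X_unlab = X_unlab[~keep]
    return model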
The third main type of ML method is unsupervised learning, where a model usually tries to cluster the data without access to any labels. There are other paradigms, such as distant supervision, where a model attempts to learn from surrogate cues in the data in the absence of high-quality labels (e.g., [13]). Self-supervised learning, where real-world data are turned into labeled data by masking certain regions (e.g., removing some words or parts of an image) and tasking a model with predicting the identity of these masked regions, is currently a very successful approach.
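The following minimal sketch shows how such self-supervision turns unlabelled text into (input, target) training pairs by masking; the predictor itself (e.g. a Transformer) is out of scope here, and all names are illustrative.

# Self-supervised label creation by masking: unlabelled text becomes
# (input, target) training pairs. A model such as a Transformer would then
# be trained to predict the masked word; that part is not shown here.
import random

def mask_words(sentence, mask_token="[MASK]", p=0.15, seed=0):
    """Turn one unlabelled sentence into masked-word prediction pairs."""
    rng = random.Random(seed)
    words = sentence.split()
    pairs = []
    for i, w in enumerate(words):
        if rng.random() < p:
            masked = words[:i] + [mask_token] + words[i + 1:]
            pairs.append((" ".join(masked), w))    # (input, label)
    return pairs

for x, t in mask_words("records must remain reliable accurate and authentic", p=0.4):
    print(f"input: {x!r} -> target: {t!r}")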
These various methods of supervision can also be combined to solve downstream tasks. For example, Zhang and Abdul-Mageed [14] combine self-supervised language models with classical self-training methods to solve text classification problems. The next section will introduce three I Trust AI studies.

3. Case Studies in I Trust AI

3.1. Data from Emergency Services Communications Systems

One of the cornerstones of public safety and societal wellbeing is a reliable and comprehensive emergency services communications system (ESCS, such as 9-1-1 in the US and Canada). Such systems can be considered to encompass the organizations, electronic infrastructure, and policies and procedures that enable answering and responding to emergency phone calls [15]. As might be expected of systems that originate in analog, switched telephony, ESCS evolution into a digital system has resulted in a haphazard conglomeration of subsystems, and it generally needs re-imagining as a modern technological solution. In the US, this change has been termed the "Next Generation 911" (NG911) project.

A transformation such as NG911 once again re-casts ESCS as keystone information and communication technology, subject to all of the concerns of such systems: cybersecurity, privacy, crisis preparedness, strategic and operational decisions, etc. At the same time, it opens up possibilities for data analytics to improve ESCS performance, inform funding decisions, monitor the health of societies and their infrastructures, and serve as early warning for natural and human-made crises. This study connects large-scale simulations of ESCS to historical data from ESCS operations to develop and document an understanding of how to preserve authentic, reliable data that can be used for applications such as the re-creation of past events (as might be done to support training or to explore the effects of changes in policies and procedures), testing system operation in one locale based on data from another, and examining how various data analytics techniques might be usefully applied — understanding what characteristics of these data might be usable to researchers. Specifically, we are addressing the following research questions:

What real-world and simulation ESCS data are available to be preserved for access by researchers? The answer to this varies greatly from locale to locale, depending on the technology in use, public policy, and controlling agency procedures. Moreover, for such data to be available, we must understand the privacy and security risks associated with transferring them from their current owners to a research environment, along with the risk of misinterpreting them if they are decontextualized from whatever tacit knowledge might exist within the owning organizations. We are also considering pragmatic issues, such as building a knowledge base of legal restrictions on collection in various jurisdictions, formal processes for collecting these data (such as data sharing agreements), variation in the culture of practice surrounding such data, potential biases that might result from systematic differences in different areas' capacity to collect and share data (such as might arise from regional funding differences), and understanding the metadata and other information (such as ESCS physical and operational structure and, generally, the policies and practices that determine what data are generated and collected, and how).

What are the challenges and benefits of discovering knowledge patterns from historical ESCS data? These patterns will serve as clues for developing protocols for ESCS managers to follow regarding data collection, and as clues for how these data can be applied for reuse. We will consult with external stakeholders to seek advice and to run thought experiments using surveys, think-aloud exercises, retrospective first-hand accounts, etc. We will also examine historical records of disasters for which the preserved data are more complete than basic ESCS datasets.

What other data/metadata associated with emergency events are not part of the ESCS data stream? From our preliminary examination, typical ESCS data currently involve lists of individual calls and information directly associated with such calls (perhaps including full or partial phone numbers, call categorization, GPS coordinates, responder information, response times, etc.). What these datasets do not directly include are events and data that are external to the call stream but are the reason for such calls (traffic, weather, geopolitical events, and so on). Some of these additional data may be present in other sources of information in a format that can be reasonably collected in tandem with call data. On the other hand, it may be the case that other causal information must be inferred from available data (and, of course, it may be the case that a combination of inference and extraction from other data streams could be useful). For the inferencing task, we propose using an examination of simulation results and simulation artifact provenance information as exemplars to develop a set of specifications for what an AI-driven system would need to accomplish [16].
What are the roles of the disciplines of Archival Science and Artificial Intelligence in building a central repository for ESCS data? Both the individual fields of archival science and artificial intelligence, plus their overlap or combination that could be considered to fall within the realm of data science, have a number of roles in the organization and interpretation of ESCS data. We provide examples below from the application of real-world data to simulations:

• Generating requirements for simulator design so that simulation output matches real data in terms of format, metadata, etc.
• Analyzing and comparing simulation output with real-world data.
• Synthesizing ESCS data that match features of real-world data as part of an overall ESCS simulation.
• Using real-world call data to drive a simulation of an emergency response system, for example, to allow a "replay" of a previous disaster or to investigate how modifications to such a system might produce different outcomes (a sketch of such a replay follows this list).
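# A hedged sketch of "replaying" historical call records through a toy
# dispatch simulation. The record fields (timestamp, category, location),
# the fixed service time and the nearest-free-responder policy are
# illustrative assumptions, not the project's actual schema or simulator.
from dataclasses import dataclass, field

@dataclass(order=True)
class Call:
    timestamp: float                      # seconds from start of the replay
    category: str = field(compare=False)  # e.g. "fire", "medical"
    location: tuple = field(compare=False)

def replay(calls, responders=2, service_time=600.0):
    """Drive the simulation from a time-ordered stream of historical calls."""
    free_at = [0.0] * responders          # when each responder is next free
    waits = []
    for call in sorted(calls):
        i = min(range(responders), key=lambda k: free_at[k])
        start = max(call.timestamp, free_at[i])
        waits.append(start - call.timestamp)
        free_at[i] = start + service_time
    return waits                          # per-call waiting times

calls = [Call(0.0, "fire", (0, 0)), Call(120.0, "medical", (1, 3)),
         Call(180.0, "medical", (5, 2))]
print(replay(calls))                      # -> [0.0, 0.0, 420.0]

Under a modified policy or responder count, the same historical stream yields different waiting times, which is the sense in which replay supports "what if" investigation of system changes.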
3.1.1. Progress

This work-in-progress is in its initial stages, preparing for the point in time when we can begin collecting ESCS data. Specifically, we are:

• working with a small set of external partners to develop a general understanding of ESCS operations, policies, and procedures, as well as identifying which data exist;
• developing a process within our project for working with a selected ESCS management organization to build an understanding of their specific operations and data;
• fleshing out a model data sharing agreement to serve as a starting point for discussions surrounding transferring data to our research environment;
• consulting with our institutional review board regarding the use of this particular set of human data;
• and configuring a secure internal data storage system.

Subsequently, we will prepare an initial case study in which we will apply the above to a single locale: we will go through all of the steps of understanding the ESCS processes that produced the data, developing a data sharing agreement, and collecting data and metadata.

3.2. Learning from Parchments

The digitization of historical parchments is extraordinarily convenient, as it allows easy access to the documents from remote locations and avoids the possible adverse effects of their physical management and access [17]. This is particularly valuable for the archives and museums that preserve such invaluable historical documents, whose contents are unpublished, which cannot be fully restored by conventional tools if damaged, and which are difficult to read in the original due to high levels of damage and the delicate nature of the material. Damaged parchments are notably prevalent in archives all over the world [18]. Their digital representations reduce both damage and access issues by providing users with the possibility of reading their contents at any moment, from remote locations, and without necessitating the potentially harmful physical handling of the document. Thus, the automatic analysis of digitized parchments has become an important research topic in the fields of image and pattern recognition. It has been a considerable research issue for several years, gaining attention recently because of the value that may be unlocked by extracting the information stored in historical documents [19]. Interest in applying AI/ML to ancient image data analysis is becoming widespread, and scientists are increasingly using this method as a powerful and complex process for statistical inference. Computer-based image analysis provides an objective method of identifying visual content independently of subjective personal interpretation, while potentially being more sensitive, consistent and accurate than physical human analysis. Learned representations often result in much better performance than hand-designed representations when it comes to these types of texts. Until now, parchment analysis has required physical user interaction, which is very time consuming. Hence, the effective automatic feature extraction capability of Deep Neural Networks (DNNs) decreases the demand for manual extraction processes.

Considering the above, PergaNet, a lightweight DL-based system for the historical reconstruction of ancient parchments, has been specifically designed and developed for this type of analysis. The aim of PergaNet is to automate the analysis and processing of large volumes of scanned parchments. This problem has not yet been deeply investigated by the computer vision community, as parchment scanning technology is still novel, but it has proven to be extremely effective for data recovery from historical documents whose content is inaccessible due to the deterioration of the medium. The proposed approach aims to reduce hand-operated analysis while using manual annotations as a form of continuous learning. The whole system, however, requires digital labour, such as the manual tagging of large training datasets. Large datasets remain necessary to boost the performance of DL models, and manually verified data will be used for continuous learning and maintained as training datasets. PergaNet comprises three important phases: the classification of parchments as recto/verso, the detection of text, and the detection and recognition of the "signum tabellionis" (i.e. the identifier of the author). PergaNet concerns not only the recognition and classification of the objects present in the images, but also their location. This I Trust AI study expands the implementation of AI guided by archival institutions and programs, as this method could be used by many other archives for different types of documents. The analysis is based on data about the ordinary use by researchers of this type of material, and does not involve altering or manipulating techniques aimed at generating data. This provides actionable insights that help identify text as documentary form rather than as reading content.

The DL pipeline is depicted in Figure 1. We chose the VGG16 network [20] for its suitability and effectiveness in image classification tasks, and we were inspired by the work of Zhou et al. [21] in the way in which PergaNet detects the text in the image. This phase allows for the exclusion of the text on the parchment during the recognition of the signa; the DNN model chosen for word detection is EAST [21]. Finally, a convolutional neural network has been employed for signa detection. Our approach uses YOLOv3 [22], an algorithm that processes images in real time, predicting bounding box locations and classifying these locations in one pass. We chose this algorithm because of its computational efficiency and its precision in detecting and classifying objects. The network is pre-trained on COCO (https://cocodataset.org/#home), a publicly available dataset; this choice was made to reduce the need for a large amount of training data, which would come with a high computational cost.

[Figure 1: PergaNet DL pipeline. The pipeline consists of three stages: classification of parchments as recto/verso, the detection of text, then the detection and recognition of the "signum tabellionis". First, a VGG16 network trained on a dataset of scanned parchments solves the recto/verso classification task. Next, the text in the image is detected. Then, YOLOv3 is used to predict bounding box locations and classify these locations in one pass.]
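The three-stage pipeline can be outlined in code. In the sketch below, only the VGG16 head replacement uses a real torchvision API; the EAST [21] and YOLOv3 [22] stages are stubs to be filled with actual detector implementations, and the box filtering is deliberately naive.

# A sketch of the three-stage PergaNet pipeline described above; stage 2
# and stage 3 are stubs, and the exclusion of text regions is simplified.
import torch
import torch.nn as nn
from torchvision import models

# Stage 1: recto/verso classifier, a pretrained VGG16 with a 2-class head.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier[6] = nn.Linear(4096, 2)   # replace the 1000-class output layer

def detect_text_boxes(image):
    """Stub for an EAST-style scene-text detector returning bounding boxes."""
    raise NotImplementedError("plug in an EAST implementation here")

def detect_signa(image):
    """Stub for a YOLOv3 detector (COCO-pretrained, fine-tuned on signa)."""
    raise NotImplementedError("plug in a YOLOv3 implementation here")

def perganet(image: torch.Tensor):
    """image: a normalized (3, 224, 224) tensor of a scanned parchment."""
    side = vgg(image.unsqueeze(0)).argmax(dim=1).item()  # 0 = recto, 1 = verso
    text_boxes = detect_text_boxes(image)
    # Exclude detections that coincide with text regions (a real system
    # would test geometric overlap rather than simple membership).
    signa = [b for b in detect_signa(image) if b not in text_boxes]
    return side, signa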
3.3. Digital Twin Study

Spatial media [23] and spatial data infrastructures (SDI) [24] have become normalized as complex, interconnected global, regional, national, and personalized social and technological systems of systems. Simply consider the monitoring of climate at a global scale to inform the logistics of production chains, to predict, preempt and prevent disaster resulting from natural calamities on local physical infrastructure, or simply to report the temperature and humidity levels outside to your smart phone to help you plan your day. This occurs seamlessly in the background of our daily activities and involves a vast complex of sensors; databases; cloud computing centres; telecommunication networks and the internet; standards; code, software and platforms; people and institutions; laws, regulation and policies; communication systems; and, of course, AI/ML [25].

An emerging subset of these spatial data infrastructures are digital twins (DTs). A DT is an ecosystem of multi-dimensional and interoperable subsystems made up of physical things in the real world, digital versions of those real things, synchronized data connections between them, and the people, organizations and institutions involved in creating, managing, and using these (CIMS, 2021, https://canadasdigitaltwin.ca/about-2-2/). In terms of physical and real things, consider a building or a car manufacturing plant; a digital representation of those things in a digital platform or an interactive virtual reality game engine; and an internet of things (IoT) system of sensors and databases that communicates, in real and near-real time, between the buildings or the manufacturing processes in the plant and the people and institutions that own and operate these. Contemporary examples of DTs are the modeling and managing of the construction of Sweden's new high-speed rail system (https://www.unrealengine.com/en-US/spotlights/visualizing-sweden-s-first-high-speed-railway-with-real-time-technology); Hyundai car and ship manufacturing plants (https://www.ajudaily.com/view/20220107083928529); and DTs built as part of smart city strategies (see the submissions to the Infrastructure Canada Smart City Challenge, https://www.infrastructure.gc.ca/sc-vi/map-applications.php). DTs originate in the aerospace industry, first with NASA's Apollo 13 in the 1970s (although in that case it was a physical replica to help troubleshoot issues of a ship in flight), and were predominantly used in manufacturing and logistics [26].

Increasingly, DTs involve building information modeling (BIM), such as the proprietary Revit platform or the open-source BlenderBIM, whereby a building is conceptualized and rendered into a 3-dimensional drawing with attributes captured in a database, often replacing typical blueprints. The BIM informs the construction of the building, and it is a record of the building once completed. BIM renderings are increasingly being submitted as part of the building permit approval process (BuildingSmart, 2020, e-submission common guidelines for introducing BIM to the building process, https://www.buildingsmart.org/wp-content/uploads/2020/08/e-submission-guidelines-Published-Technical-Report-RR-2020-1015-TR-1.pdf) and are updated into as-is BIMs for ongoing operations. BIMs are also used to estimate material costs, as they are interconnected with material vendor databases and electrical and heating systems. BIMs interrelate with smart asset management systems (AMS), which inform building maintenance and operations and monitor internal climate such as temperature, humidity, air flow and quality; these are inputs for the AI/ML systems that remotely manage heating and cooling systems, electricity use and consumption, as well as maintenance schedules, and that inform ongoing decision making.

[Figure 2: Integrating diverse databases into BIM (image created by Nico Arellano, CIMS 2021).]
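The synchronized connection between a BIM asset record and its IoT sensor readings is the kind of linkage an archival package for a DT would need to capture. The following hedged sketch illustrates that linkage only; all field names are illustrative assumptions, not an actual BIM, AMS or Revit schema.

# A toy link between one BIM asset and its timestamped sensor stream.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BimAsset:
    asset_id: str        # e.g. a BIM element identifier
    asset_type: str      # e.g. "air-handling unit"
    building: str
    readings: list = field(default_factory=list)

    def ingest(self, sensor: str, value: float, unit: str):
        """Record a timestamped sensor reading against the physical asset."""
        self.readings.append({
            "sensor": sensor, "value": value, "unit": unit,
            "observed_at": datetime.now(timezone.utc).isoformat(),
        })

ahu = BimAsset("AHU-03-0017", "air-handling unit", "campus building 7")
ahu.ingest("supply-air-temperature", 18.4, "degC")
ahu.ingest("relative-humidity", 41.0, "%")
print(len(ahu.readings), "readings linked to asset", ahu.asset_id)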
In this I Trust AI Digital Twin case study we collaborate with researchers working on the Imagining the Canada Digital Twin (ICDT) project (https://canadasdigitaltwin.ca/), funded by the Canadian New Frontiers in Research Fund (https://www.sshrc-crsh.gc.ca/funding-financement/nfrf-fnfr/index-eng.aspx), which proposes a national, inclusive, and multidisciplinary research consortium for the creation of a technical, cultural, and ethical framework to build and govern the technology, data and institutional arrangements of Canada's DT. ICDT focuses on the built environment, concentrating on the Architecture, Engineering, Construction, and Owner Operator (AECOO) industry. ICDT is led by the Carleton Immersive Media Studio (CIMS, https://cims.carleton.ca/#/home), which is developing a DT prototype of the Montréal-Ottawa-Toronto corridor using a simulated, distributed server network. The research study involves an interdisciplinary team of architects, data scientists, engineers, building scientists, archival professionals and critical data studies scholars from Carleton University (CA), Luleå University of Technology (SE), the Swedish Transport Administration, and the University of Florence (IT), who will develop a Use and Creation preservation case study. The study aims to preserve the DT of campus buildings and structures created as part of SUSTAIN (https://cims.carleton.ca/#/projects/Sustain) and the Carleton Digital Campus Innovation (DCI) project (https://www.cims.carleton.ca/#/projects/DigitalCampusInnovation), which integrates Building Performance Simulation (BPS) technologies with BIM on a campus scale: building information management systems (BIM), Asset Management Systems (AMS), visualizations of the digital structures in the Unreal game engine, VR and modelling, AI/ML, and real-time data for decision making.

The Carleton Campus DT data for seven buildings belong to the University, which must preserve them as official records; they are used for operations by Facilities, Management and Planning (FMP), and they are part of research and development for the CIMS and SUSTAIN projects, thus constituting research data that must be managed and deposited in a trusted digital repository.

The implications of this research are important to the archival community, which will increasingly have to ingest complex record sets such as these, as well as create archival packages that maintain the integrity of these complex interlinked DT systems through time. This study will be one of the first globally to examine the preservation of a DT. Its research questions are: Can a digital twin be preserved, and what is required at the point of creation to ensure that it can be? Can information about the AI tools, automation and real-time data involved in this complex data, social and technological system be preserved, and how? And what might be the role of AI/ML in creating an archival package to ingest a digital twin? The outputs of this research will provide empirical data to meet the objectives of the I Trust AI project, and they will also provide Carleton University with the opportunity to test the preservation of Campus DT records in its institutional archives. In the process, it will inform the technology sectors involved in the creation of DTs, so that they may build in, at the point of creation, the necessary bread crumbs for long-term preservation.

4. Conclusions and Future Works

The studies presented above are only 3 of about 40 in-progress studies, which cover a wide range of subjects and issues, such as enterprise master data management, the preservation of AI techniques as paradata, the modelling of an AI-assisted digitization project, the gamification of the archival experience for users, the declassification of personal information using AI tools, and user approaches and behaviours in accessing records and archives in the perspective of AI. The challenges we are addressing with this project have never before been systematically and globally dealt with; the undertaking is enormous and fraught, but critical. While the risks of using AI to solve the problems of managing the ever-growing, ever-more-diverse bodies of public and private records throughout their lifecycle, from creation to preservation and access, are unknown, the risks of not acting in concert to do so are unacceptable: loss of the ability to secure people's rights, of evidence of past acts and facts to serve as a foundation for decision making, and of historical memory.

This project will significantly impact society in several areas. (1) Record-keeping in local and national government agencies is a vital part of our society's ability to maintain oversight on, and accountability of, governance; but, given their inability to handle the vast quantities of digital records, public bodies risk undermining their own legitimacy as oversight bodies if they cannot appropriately process and make accessible information in a timely fashion. By helping address this crisis through the development, evaluation, and contextualization of AI techniques, we contribute to the ability of agencies and institutions to maintain their place in our democracies. (2) Automation techniques can potentially aid the economic viability of many cash-starved records offices and archival institutions by ensuring that professional records management and archival expertise are used wisely, with classification tools and technology-assisted review (TAR) allowing a quick review and assessment of vast quantities of records. Similarly, with businesses depending on records agencies for routine activities, improved speed in responding to queries will have a positive effect on the economy. (3) AI techniques have the potential to aid the accessibility of records in archives to new audiences, for instance by translating and indexing historical materials written in Indigenous languages, sensitising problematic archival descriptions, helping patrons find connected items, or captioning historical photographs. These techniques have both a cultural significance, by providing better access to historical material, and a social and scientific significance, by making current records easier to organise, retrieve and use by both their creators and the public at large. (4) While there have been numerous calls to action to systematically explore the application of AI techniques to the records and archives field, AI also currently faces major ethical challenges that will benefit from an archival theory perspective, for instance in dealing with bias and personal information. By exploring further the connections between AI and archives, this project is contributing, and will continue to contribute, to the intellectual progress of both fields. The I Trust AI project has generated a great amount of enthusiasm among participant researchers (about 200) and partner organizations (87), as well as among organizations that do not have the capacity to participate but look forward to outcomes they can use, because it deals with issues that are already dramatically changing the way we act, behave and think. We have a unique and essential contribution to make, because we have the means of creating knowledge ensuring that digital data and records are controlled and made accessible in a trustworthy, authentic form wherever they are located; are promptly available when needed; are duly destroyed when required; and are accessed only by those who have a right to do so.

References

[1] L. Duranti, K. Thibodeau, The concept of record in interactive, experiential and dynamic environments: the view of InterPARES, Archival Science 6 (2006) 13–68.
[2] S. Russell, P. Norvig, Artificial intelligence: a modern approach, Pearson Education Limited, London, 2013.
[3] T. M. Mitchell, Machine Learning, McGraw-Hill, New York (1997) 154–200.
[4] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[5] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT Press, 2016.
[6] E. Granell, E. Chammas, L. Likforman-Sulem, C.-D. Martínez-Hinarejos, C. Mokbel, B.-I. Cîrstea, Transcription of Spanish historical handwritten documents with deep neural networks, Journal of Imaging 4 (2018) 15.
[7] A. Graves, et al., Supervised sequence labelling with recurrent neural networks, volume 385, Springer, 2012.
[8] Y. LeCun, et al., Generalization and network design strategies, Connectionism in Perspective (1989) 143–155.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[10] S. Abney, Semisupervised learning for computational linguistics, CRC Press, 2007.
[11] A. Søgaard, Semi-supervised learning and domain adaptation in natural language processing, Synthesis Lectures on Human Language Technologies 6 (2013) 1–103.
[12] X. Zhu, A. B. Goldberg, Introduction to semi-supervised learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 3 (2009) 1–130.
[13] M. Abdul-Mageed, L. Ungar, EmoNet: Fine-grained emotion detection with gated recurrent neural networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 718–728.
[14] C. Zhang, M. Abdul-Mageed, No army, no navy: BERT semi-supervised learning of Arabic dialects, in: Proceedings of the Fourth Arabic Natural Language Processing Workshop, 2019, pp. 279–284.
[15] J. M. Jordan, V. Salvatore, B. Endicott-Popovsky, V. Gandhi, C. O'Keefe, M. S. Sotebeer, M. Stiber, Graph-based simulation of emergency services communications systems, in: Proc. 2022 Annual Modeling and Simulation Conference, San Diego, CA, submitted, 2022.
[16] J. Conquest, M. Stiber, Software and data provenance as a basis for eScience workflow, in: IEEE eScience, IEEE, online, 2021.
[17] E. C. Francomano, H. Bamford, Whose digital middle ages? Accessibility in digital medieval manuscript culture, Journal of Medieval Iberian Studies (2022) 1–13.
[18] K. Pal, M. Terras, T. Weyrich, 3D reconstruction for damaged documents: imaging of the Great Parchment Book, in: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, 2013, pp. 14–21.
[19] V. Frinken, A. Fischer, C.-D. Martínez-Hinarejos, Handwriting recognition in historical documents using very large vocabularies, in: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, 2013, pp. 67–72.
[20] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[21] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, EAST: an efficient and accurate scene text detector, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5551–5560.
[22] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[23] R. Kitchin, T. P. Lauriault, M. W. Wilson, Understanding spatial media, Sage, 2017.
[24] Canadian Geospatial Data Infrastructure, Natural Resources Canada, https://doi.org/10.4095/328060 (2020).
[25] Arctic SDI, Spatial data infrastructure (SDI) manual for the Arctic, Arctic Council: Ottawa, ON, Canada (2016) 5–24.
[26] C. Miskinis, The history and creation of the digital twin concept, Challenge Advisory, March (2019).