<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Trusted Data Forever: Is AI the Answer?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Frontoni</string-name>
          <email>emanuele.frontoni@unimc.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marina Paolanti</string-name>
          <email>marina.paolanti@unimc.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tracey P. Lauriault</string-name>
          <email>Tracey.Lauriault@carleton.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Stiber</string-name>
          <email>stiber@uw.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luciana Duranti</string-name>
          <email>Luciana.Duranti@ubc.ca</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Abdul-Mageed</string-name>
          <email>muhammad.mageed@ubc.ca</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Critical Media and Big Data Lab, Carleton University</institution>
          ,
          <addr-line>Ottawa, ON K1S 5B6</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Intelligent Networks Lab, University of Washington Bothell</institution>
          ,
          <addr-line>WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>InterPARES Lab, University of British Columbia</institution>
          ,
          <addr-line>Vancouver, BC V6T 1Z4</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>NLP and ML Lab, University of British Columbia</institution>
          ,
          <addr-line>Vancouver, BC V6T 1Z4</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>VRAI Vision Robotics and Artificial Intelligence Lab, University of Macerata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, and as vehicles for holding the past accountable and informing the future. This commitment is guaranteed through the adoption of strategic and technical measures for the long-term preservation of digital assets in any medium and form, whether textual, visual, or aural. Public and private archives are the largest providers of data big and small in the world and collectively host yottabytes of trusted data, to be preserved forever. Several aspects of retention and preservation, arrangement and description, management and administration, and access and use are still open to improvement. In particular, recent advances in Artificial Intelligence (AI) open the discussion as to whether AI can support the ongoing availability and accessibility of trustworthy public records. This paper presents preliminary results of the InterPARES Trust AI (“I Trust AI”) international research partnership, which aims to (1) identify and develop specific AI technologies to address critical records and archives challenges; (2) determine the benefits and risks of employing AI technologies on records and archives; (3) ensure that archival concepts and principles inform the development of responsible AI; and (4) validate outcomes through a conglomerate of case studies and demonstrations.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Archives</kwd>
        <kwd>Trustworthiness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Archival institutions and programs worldwide work to ensure that the records of governments, organizations, communities, and individuals are preserved for future generations as cultural heritage, as sources of rights, to hold the past accountable, and as evidence to inform future plans. A record, or archival document, is any document (i.e. information affixed to a medium, with stable content and fixed form) made or received in the course of an activity, and kept for further action or reference. Because of the circumstances of its creation, a record is a natural by-product of activity, is related to all the other records that participate in the same activity, is impartial with respect to the questions that future researchers will ask of it, and is authentic as an instrument of activity. This is why records are inherently trustworthy. Thus, their preservation must ensure that any activity carried out on the records to identify, select, organize, and describe them and make them accessible to the people at large leaves them trustworthy, that is, reliable (i.e. their content can be trusted), accurate (i.e. the data in them are unchanged and unchangeable), and authentic (i.e. their identity and integrity are intact). This is particularly difficult in the digital environment because the content, structure and form of records are no longer inextricably linked as they used to be in the traditional records environment [1]<sup>1</sup>. The issue exists for both digital and digitized records and is rendered more serious by the sheer number and volume of records that have accumulated over time and are being created today in a large variety of systems.</p>
      <p>The InterPARES (International research on Permanent Authentic Records in Electronic Systems) project has addressed these issues since 1990, focusing on current and emerging technologies as they evolve and developing theory, methods, and frameworks that allow for the ongoing preservation of the records resulting from the use of such technologies<sup>2</sup>. The latest iteration of InterPARES, I Trust AI, is funded, like previous projects, by the Social Sciences and Humanities Research Council of Canada, but differs in that it is not concerned with the records produced by a specific technology. Its purpose is instead to use AI to carry out archival functions for the long-term control of all records, on any medium and from any age, and to do so in such a way that the trustworthiness of the records remains protected and verifiable, and that the tools and processes are transparent, unbiased, equitable, inclusive, responsible (i.e. protecting autonomy and privacy) and sustainable<sup>3</sup>. There have been several projects looking at AI in archives, but they typically look at a particular tool in a specific context or even a single set of records, and they tend to use off-the-shelf tools. The research question that I Trust AI asks is: “what would AI look like if archival concepts, principles and methods were to inform the development of AI tools?” What is lacking is comprehensive, systematic research into the use of AI to carry out the different archival functions in an integrated way and ensure the continuing availability of verifiable trustworthy records to prevent the erosion of accountability, evidence, history and cultural heritage.</p>
      <p>Thus, we are addressing the technological issues from the perspective of archival theory, by integrating the technology with complex human-oriented tools. The objectives of I Trust AI are to 1. Identify specific AI technologies that can address critical records and archives challenges; 2. Determine the benefits and risks of using AI technologies on records and archives; 3. Ensure that archival concepts and principles inform the development of responsible AI; and 4. Validate outcomes from Objective 3 through case studies and demonstrations. Our approach is two-pronged, comprising the practical and immediate need to address large-scale existing problems, and the longer-term need to have AI-based tools that are reliably applicable to future problems. Our short-term approach focuses on identifying high-impact problems and limitations in records and archives functions, and applying AI to improve the situation. This will be achieved via collaboration between records and archival scientists and professionals and AI researchers and industry experts. Our long-term approach focuses on identifying the tools that records and archives specialists will need in the future to flexibly address their ever-changing needs. This includes decision support and, once decisions are made, rapid implementation of AI-based solutions to those needs.</p>
      <p>The I Trust AI project is a multinational interdisciplinary endeavour, and this means that our first effort must be to understand each other, starting with the language we use. For example, archival professionals talk about records, while computer scientists and AI professionals talk about data. To the former, data are the smallest meaningful units of information in a record. To an AI specialist, data are arrangements of information (possibly in a database), be these facts or not, regardless of their size, nature and form. Thus, for the purposes of this paper, which is directed to data analytics specialists, we will use the term data.</p>
      <p>Public and private archives are the largest providers of data big and small in the world as they collectively host yottabytes of trusted data, to be preserved forever. Their creators are organizations and individuals from myriad sectors and disciplines, from public administration, to academia and businesses of all kinds (e.g. banking, engineering, architecture, gaming), and from Indigenous communities, civil society organizations, associations, and virtual communities. Table 1 reports the quantity of data held. The Italian State Central Archives (ACS), for example, stores 67 TB of “digital objects” of the digitised heritage typology (in TIFF). The National Archives of the US has 1,323 terabytes of electronic data. This paper presents some preliminary results of the I Trust AI international research partnership and is organized as follows: Section 2 provides a general discussion of AI and its subsets; Section 3 describes three of the roughly forty studies that are now under way; and Section 4 presents an overview of the kind of studies that are being pursued at this time and a conclusion.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Artificial Intelligence and Deep Learning</title>
      <p>There are various definitions of what AI is. For example, Russell and Norvig [2] define AI as a field focused on the study of intelligent agents that perceive their environment and take actions to maximize their chance of success at some goal. In recent times, however, AI has been used much more widely to refer to any technology where there is some level of automation, especially resulting from the application of deep learning in various domains. Deep learning (DL) is a sub-field of machine learning (ML), and both are sub-fields of AI. Similar to AI, ML is defined in several ways. A classical definition of ML comes from Mitchell [3], who provides a procedural definition maintaining that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. As to DL [4, 5], it is a class of ML methods inspired by information processing in the human brain. What makes DL powerful is its ability to automatically learn useful representations from data. DL algorithms, unlike classical ML methods, are not only able to learn the mapping from representation to output but also to learn the representation itself [6], thus alleviating the need for costly human expertise in crafting features for models. DL has achieved success in recent years in a wide variety of applications in many domains involving various types of data modalities such as language, speech, image, and video.</p>
      <p>As mentioned, DL mimics information processing in the brain. This is possible by designing artificial neural networks arranged in multiple layers (hence the term deep) that take input, attempt to learn a good representation of it, and map it to an output decision (e.g., is this text in Greek or Latin?). The way these networks are designed can vary; thus, various types of deep learning architectures have been proposed. Two main types of DL architectures have been quite successful. These are recurrent neural networks (RNNs) (e.g., [7]), a family of networks specializing in processing sequential data, and convolutional neural networks (CNNs) [8], an architecture specializing in data with a ‘grid-like’ topology (e.g., image data) [5]. More recent advances, however, abstract away from these two main types toward more dynamic networks such as the Transformer [9].</p>
      <p>In general, ML and DL methods learn best when given large amounts of labeled data (e.g., for a model that detects sensitive information, labels can be from the set sensitive, not-sensitive), and DL in particular is data-hungry. This type of learning with labeled data is called supervised learning. It is also possible to work with smaller labeled datasets. In these cases, training samples can be grown iteratively by exploiting unlabeled data based on decisions from an initial model (self-training) or using decisions from various initial models (co-training). This is called semi-supervised learning [10, 11, 12]. The third main type of ML methods is unsupervised learning, where a model usually tries to cluster the data without access to any labels. There are other paradigms such as distant supervision, where a model attempts to learn from surrogate cues in the data in the absence of high-quality labels (e.g., [13]). Self-supervised learning, where real-world data are turned into labeled data by masking certain regions (e.g., removing some words or parts of an image) and tasking a model to predict the identity of these masked regions, is currently a very successful approach. These various methods of supervision can also be combined to solve downstream tasks. For example, Zhang and Abdul-Mageed [14] combine self-supervised language models with classical self-training methods to solve text classification problems. The next section will introduce three I Trust AI studies.</p>
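      <p>The self-training idea described above can be illustrated with a minimal sketch. It is not the method of any study cited here: the one-dimensional "sensitivity scores", the nearest-centroid model, and the names fit_centroids, predict, self_train and min_margin are all assumptions made for illustration. The sketch starts from two human-labeled examples and repeatedly adds unlabeled examples whose predicted label is confident enough, leaving ambiguous ones unlabeled.</p>
      <preformat>
```python
"""Self-training sketch: grow a small labeled set with confident pseudo-labels."""
from statistics import mean


def fit_centroids(samples):
    """Return the mean feature value per label (a 1-D nearest-centroid model)."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Return (label, margin); margin is how much closer the best centroid is."""
    ranked = sorted(centroids, key=lambda label: abs(value - centroids[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - centroids[runner_up]) - abs(value - centroids[best])
    return best, margin


def self_train(labeled, unlabeled, min_margin=2.0, max_rounds=10):
    """Iteratively move confidently pseudo-labeled samples into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, remaining = [], []
        for value in pool:
            label, margin = predict(centroids, value)
            if margin >= min_margin:
                confident.append((value, label))
            else:
                remaining.append(value)
        if not confident:  # no new pseudo-labels in this round, so stop
            break
        labeled.extend(confident)
        pool = remaining
    return fit_centroids(labeled), labeled, pool


# Hypothetical 1-D scores; only the two seed documents carry human labels.
seed = [(1.0, "not-sensitive"), (9.0, "sensitive")]
unlabeled_scores = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5, 5.1]
model, grown, leftover = self_train(seed, unlabeled_scores)
```
      </preformat>
      <p>In this toy run the six unambiguous scores are pseudo-labeled in the first round, while the borderline score 5.1 never clears the margin and stays unlabeled. In an archival setting such leftover items are the natural candidates for human review.</p>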
    </sec>
    <sec id="sec-2">
      <title>3. Case Studies in I Trust AI</title>
      <sec id="sec-2-1">
        <title>3.1. Data from Emergency Services</title>
      </sec>
      <sec id="sec-2-2">
        <title>Communications Systems</title>
        <p>from another, and examining how various data analytics
techniques might be usefully applied — understanding
what characteristics of these data might be usable to
researchers Specifically, we are addressing the following
research questions:
What real-world and simulation ESCS data are
available to be preserved for access by researchers?
The answer to this varies greatly from locale to locale,
depending on technology in use, public policy, and
controlling agency procedures. Moreover, for such data to
be available, we must understand privacy and security
risks associated with transferring them from their current
owners to a research environment, along with the risk of
misinterpreting them if they are decontextualized from
whatever tacit knowledge might exist within the owning
organizations. We are also considering pragmatic issues,
such as building a knowledge base of legal restrictions on
collection in various jurisdictions, formal processes for
collecting these data (such as data sharing agreements),
variation in the culture of practice surrounding such data,
potential biases that might result from systematic
diferences in diferent areas’ capacity to collect and share data
(such as might arise from regional funding diferences),
and understanding the metadata and other information
(such as ESCS physical and operational structure and,
generally, the policies and practices that determine what
and how data are generated and collected).</p>
        <p>One of the cornerstones of of public safety and societal
wellbeing is a reliable and comprehensive emergency
services communications system (ESCS, such as 9-1-1 in
the US and Canada). Such systems can be considered to
encompass the organizations, electronic infrastructure, What are the challenges and benefits of
discoverand policies and procedures that enable answering and ing knowledge patterns from historical ESCS data?
responding to emergency phone calls [15]. As might be These patterns will serve as clues to developing protocols
expected of systems that originate in analog, switched for ESCS managers to follow regarding data collection,
telephony, ESCS evolution into a digital system has re- and as clues for how these data can be applied for reuse.
sulted in a haphazard conglomeration of subsystems and We will consult with external stakeholders to seek advice
generally needs re-imagining as a modern technological and to run thought experiments using surveys,
thinksolution. In the US, this change has been termed the aloud exercises, retrospective first-hand accounts, etc.
“Next Generation 911” (NG911) project. We will also examine historical records of disasters for</p>
        <p>A transformation such as NG911 once again re-casts which the preserved data are more complete than basic
ESCS as keystone information and communication tech- ESCS datasets.
nology, subject to all of the concerns of such systems:
cybersecurity, privacy, crisis preparedness, strategic and What other data/metadata associated with
emeroperational decisions, etc. At the same time, it opens gency events are not part of the ESCS data stream?
up possibilities for data analytics to improve ESCS per- From our preliminary examination, typical ESCS data
formance, inform funding decisions, monitor the health currently involve lists of individual calls and information
of societies and their infrastructures, and serve as early directly associated with such calls (perhaps including full
warnings for natural and human made crises. This study or partial phone numbers, call categorization, GPS
coordiconnects large- scale simulations of ESCS to historical nates, responder information, response times, etc.). What
data from ESCS operations to develop and document an these datasets do not directly include are events and data
understanding of how to preserve authentic, reliable data that are external to the call stream but are the reason for
that can be used for applications such as re-creation of such calls (trafic, weather, geopolitical events, and so on).
past events (as might be done to support training or to Some of these additional data may be present in other
explore the efects of changes in policies and procedures), sources of information in a format that can be reasonably
testing system operation in one locale based on data collected in tandem with call data. On the other hand, it
may be the case that other causal information must be Subsequently, we will prepare an initial case study in
inferred from available data (and, of course, it may be the which we will apply the above to a single locale: we
case that a combination of inference and extraction from will go through all of the steps of understanding the
other data streams could be useful). For the inferencing ESCS processes that produced the data, developing a data
task, we propose using an examination of simulation re- sharing agreement, and collecting data and metadata.
sults and simulation artifact provenance information as
exemplars to develop a set of specifications for what an 3.2. Learning from Parchments
AI-driven system would need to accomplish [16].</p>
        <sec id="sec-2-2-1">
          <title>The digitization of historical parchments is extraordinar</title>
          <p>What are the roles of the disciplines of Archival ily convenient, as it allows easy access to the documents
Science and Artificial Intelligence in building a cen- from remote locations and removes the need for the
postral repository for ESCS data? Both the individual sible adverse efects of their physical management and
ifelds of archival science and artificial intelligence, plus access [17]. This arrangement is particularly suitable
their overlap or combination that could be considered to archives and museums who preserve such invaluable
to fall within the realm of data science, have a number historical documents whose contents are unpublished
of roles in the organization and interpretation of ESCS and which, if damaged, cannot be fully restored by
condata. We provide examples below from the application ventional tools, are dificult to read on the original, due
of real-world data to simulations: to high levels of damage and the delicate nature of the
material. Damaged parchments are notably prevalent
• Generating requirements for simulator design so in archives all over the world [18]. Their digital
repthat simulation output matches real data in terms resentations reduce both damage and access issues by
of format, metadata, etc. providing users with the possibility of reading their
con• Analyzing and comparing simulation output with tents at any moment, from remote locations, and without
real- world data. necessitating the potentially harmful physical handling
• Synthesizing ESCS data that match features of of the document. Thus, the automatic analysis of
digreal-world data as part of an overall ESCS simu- itized parchments has become an important research
lation. topic in the fields of image and pattern recognition. It
• Using real-world call data to drive a simulation has also been a considerable research issue for several
of an emergency response system, for example, years, gaining attention recently because of the value that
to allow a “replay” of a previous disaster or to maybe be unlocked by extracting the information stored
investigate how modifications to such a system in historical documents [19]. Interest in applying AI/ML
might produce diferent outcomes. to ancient image data analysis is becoming widespread,
and scientists are increasingly using this method as a
3.1.1. Progress powerful and complex process for statistical inference.
Computer-based image analysis provides an objective
This work-in-progress is in its initial stages, preparing method of identifying visual content independently of
for the point in time when we can begin collecting ESCS subjective personal interpretation, while potentially
bedata. Specifically, we are: ing more sensitive, consistent and accurate than physical
• working with a small set of external partners to human analysis. Learned representations often result in
develop a general understanding of ESCS opera- much better performance than hand-designed
representions, policies, and procedures, as well as identi- tations when it comes to these types of texts. Until now,
fying which data exist; parchment analysis has required physical user
interac• developing a process within our project for work- tion, which is very time consuming. Hence, the efective
ing with a selected ESCS management organiza- automatic feature extraction competence of Deep Neural
tion to build an understanding of their specific Networks (DNNs) decreases the demand for a personal
operations and data; physical extraction processes.
• fleshing out a model data sharing agreement to Considering the above, PergaNet is a lightweight
DLserve as a starting point for discussions surround- based system for the historical reconstructions of ancient
ing transferring data to our research environ- parchments is specifically designed and developed for
ment; this type of analysis. The aim of PergaNet is to
auto• consulting with our institutional review board mate the analysis and processing of large volumes of
regarding the use of this particular set of human scanned parchments. This problem has not yet been
data; deeply investigated by the computer vision community
• and configuring a secure internal data storage as parchment scanning technology is still novel, but it
system. has proven to be extremely efective for data recovery
from historical documents whose content is inaccessible to help you plan your day. This occurs seamlessly in the
due to the deterioration of the medium. The proposed background of our daily activities and involves a vast
approach aims to reduce hand-operated analysis while complex of sensors; databases; cloud computing centres;
using manual annotations as a form of continuous learn- telecommunication networks and the internet; standards;
ing. The whole system however requires digital labour, code, software and platforms; people and institutions;
such as the manual tagging of large training data. Up laws, regulation and policies; communication systems
until now, large datasets remain necessary to boost the and of course AI/ML [25].
performance of DL models, and manually verified data An emerging subset of these spatial data
infrastrucwill be used as continuous learning and maintained as tures are digital twins (DTs). A DT is an ecosystem of
training datasets. PergaNet comprises three important multi-dimensional and interoperable subsystems made
phases: the classification of parchments recto/verso, the up of physical things in the real-world, digital versions
detection of text, and the detection and recognition of of those real things, synchronized data connections
bethe “sig,num tabellionis”. (i.e. the identifier of the au- tween them and the people, organizations and
instituthor). PergaNet concerns not only the recognition and tions involved in creating, managing, and using these5.
classification of the objects present in the images, but In terms of physical and real things consider a building
also their location. This I Trust AI study expands the or a car manufacturing plant; a digital representation
implementation of AI guided by archival institutions and of those things in a digital platform or an interactive
programs as this method could be used by many other virtual reality game engine; with an internet of things
archives for diferent types of documents. The analysis (IoT) system of sensors and databases that communicates
is based on data about the ordinary use by researchers between the buildings or manufacturing processes in the
of this type of material and does not involve altering or plant in real and near real time and the people and
inmanipulating techniques aimed to generate data. This stitutions that own and operate these. Contemporary
provides actionable insights that are helpful to identify examples of DTs are the modeling and managing of the
text as documentary form and not as reading. construction of Sweden’s new high speed rail systems6;</p>
          <p>The DL pipeline is depicted in Figure 1. We chose Hyundai car and ship manufacturing plants7; and as part
VGG16 Network [20] for its suitability and efectiveness of smart city strategies (see the submissions to the
Infrasin image classification tasks and were inspired by the tructure Canada Smart City Challenge8). DTs originate
work of Zhou et al. [21], in the way in which PergaNet in the aerospace industry, first with NASA’s Apollo 13 in
detects the text in the image. This phase allows for the the 1970s, although in that case it was a physical replica
exclusion of the text on the parchment in the phase of to help troubleshoot issues of a ship in flight; and were
recognition of the signa. The DNN model chosen is EAST predominantly used in manufacturing and logistics [26]
for word detection [21]. Finally, a Convolutional Neural Increasingly, DTs involve building information
modelNetwork has been employed for the signa detection. Our ing (BIM) such as REvit a proprietary platform or the
approach uses YOLOv3 [22], an algorithm that processes OS BlenderBIM; whereby a building will be
conceptuimages in real time. We chose this algorithm because of alized and rendered into a 3-dimensional drawing with
its eficiency in computational terms and for its precision attributes captured in a database; often replacing typical
to detect and classify objects. The network is pre-trained blue prints. The BIM informs the construction of the
using COCO4, a publicly available dataset; this was a building; and is a record of it once completed. BIM
renchoice made to reduce the need for a large amount of derings are increasingly being submitted as part of the
training data, that would come with a high computational building permit approval process (BuildingSmart, 2020,
cost. e-submission common guidelines for introducing BIM to
the building process9) and are updated into as is BIMs for
ongoing operations. BIMs are also used to estimate
mate</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Digital Twin Study</title>
        <p>Spatial media [23] and spatial data infrastructures
(SDI) [24] have normalized as complex interconnected
global, regional, national, and personalized social and
technological systems of systems. Simply consider the
monitoring of climate at a global scale to inform the
logistics of production chains, or to predict, preempt and
prevent disaster resulting from natural calamities on local
physical infrastructure or simply to report the
temperature and humidity levels outside to your smart phone</p>
        <sec id="sec-2-3-1">
          <title>4https://cocodataset.org/#home</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>5CIMS, 2021,About page, what is a digital twin</title>
          <p>https://canadasdigitaltwin.ca/about-2-2/</p>
          <p>6Pimental, K. 2019, Visualizing Sweden’s first high-speed
railway with real-time technology,
https://www.unrealengine.com/enUS/spotlights/visualizing-sweden-s-first-high-speed-railway-withreal-time-technology</p>
          <p>7 Chang-Won, L., 2022, Hyundai Motor works with Unity to
build digital-twin of factory supported by metaverse platform,
https://www.ajudaily.com/view/20220107083928529</p>
          <p>8 https://www.infrastructure.gc.ca/sc-vi/map-applications.php</p>
          <p>9 https://www.buildingsmart.org/wp-content/uploads/2020/08/e-submission-guidelines-Published-Technical-Report-RR-2020-1015-TR-1.pdf</p>
          <p>rial costs as they are interconnected with material vendor
databases and electrical and heating systems. BIMs interrelate
with smart asset management systems (AMS), which inform building
maintenance and operations and monitor internal climate
conditions such as temperature, humidity, air flow and air
quality. These are inputs for the AI/ML systems that remotely
manage heating and cooling systems, electricity use and
consumption, and maintenance schedules, and that inform ongoing
decision making.</p>
          <p>In this I Trust AI Digital Twin case study we collaborate
with researchers working on the Imagining the Canada Digital
Twin (ICDT)<sup>10</sup> project, funded by the Canadian New
Frontiers in Research Fund<sup>11</sup>, which proposes a
national, inclusive, and multidisciplinary research consortium
for the creation of a technical, cultural, and ethical framework
to build and govern the technology, data and institutional
arrangements of Canada’s DT. ICDT focuses on the built
environment, concentrating on the Architecture, Engineering,
Construction, and Owner Operator (AECOO) industry. ICDT is led
by the Carleton Immersive Media Studio (CIMS)<sup>12</sup>,
which is developing a DT prototype of the
Montréal-Ottawa-Toronto corridor using a simulated, distributed
server network. The research study involves an interdisciplinary
research team of architects, data scientists, engineers,
building scientists, archival professionals and critical data
studies scholars from Carleton University (Canada), Luleå
University of Technology (Sweden), the Swedish Transport
Administration and the University of Florence (Italy), who will
develop a Use and Creation preservation case study. The study
aims to preserve the DT of campus buildings and structures
created as part of SUSTAIN
(https://cims.carleton.ca/#/projects/Sustain) and the Carleton
Digital Campus Innovation (DCI)<sup>13</sup> project integrating
Building Performance Simulation (BPS) technologies with BIM on a
campus scale; building information management systems (BIM),
Asset Management Systems (AMS), visualizations of the digital
structures in the Unreal Game Engine, VR and modelling, AI/ML,
and real-time data for decision making.</p>
          <p>The Carleton Campus DT data for seven buildings belong to
the University, which must preserve them as official records.
They are used for operations by Facilities Management and
Planning (FMP) and are part of research and development for the
CIMS and SUSTAIN projects, and thus involve research data that
must be managed and deposited in a trusted digital
repository.</p>
          <p>The implications of this research are important to the
archival community, which will increasingly have to ingest
complex record sets such as these, as well as create an archival
package to maintain the integrity of these complex interlinked
DT systems through time. This study will be one of the first
globally to examine the preservation of a DT. Its research
questions are: Can a digital twin be preserved, and what is
required at the point of creation to ensure that it can be? Can
information about the AI tools, automation and real-time data
involved in this complex data, social and technological system
be preserved, and how? And what might the role of AI/ML be in
creating an archival package to ingest a digital twin? The
outputs of this research will provide empirical data to meet the
objectives of the I Trust AI Project, and will also give
Carleton University the opportunity to test the preservation of
Campus DT records in its institutional archives. In the process,
it will inform the technology sectors involved in the creation
of DTs, so that they may build in, at the point of creation, the
necessary breadcrumbs for long-term preservation.</p>
          <p>10 https://canadasdigitaltwin.ca/</p>
          <p>11 https://www.sshrc-crsh.gc.ca/funding-financement/nfrf-fnfr/index-eng.aspx</p>
          <p>12 https://cims.carleton.ca/#/home</p>
          <sec>
            <title>4. Conclusions and Future Works</title>
            <p>The studies presented above are only three of about 40
in-progress studies, which cover a wide range of subjects and
issues, such as enterprise master data management, the
preservation of AI techniques as paradata, modelling an
AI-assisted digitization project, gamification of archival
experience for users, declassification of personal information
using AI tools, and user approaches and behaviours in accessing
records and archives in the perspective of AI. The challenges we
are addressing with this project have never before been
systematically and globally dealt with; the task is enormous and
fraught, but critical. While the risks of using AI to solve the
problems of managing the ever-growing, ever-more-diverse bodies
of public and private records throughout their lifecycle, from
creation to preservation and access, are unknown, the risks of
not acting in concert to do so are unacceptable: loss of the
ability to secure people’s rights, of evidence of past acts and
facts to serve as a foundation for decision making, and of
historical memory.</p>
            <p>This project will significantly impact society in several
areas. (1) Record-keeping in local and national government
agencies is a vital part of our society’s ability to maintain
oversight on and accountability of governance, but, with the
inability to handle the vast quantities of digital records,
public bodies risk undermining their own legitimacy as oversight
if they cannot appropriately process and make accessible
information in a timely fashion. By helping address this crisis
through the development, evaluation, and contextualization of AI
techniques we contribute to the ability of agencies and
institutions to maintain their place in our democracies.
(2) Automation techniques can potentially aid in the economic
viability of many cash-starved records offices and archival
institutions by ensuring that professional records management
and archival expertise are used wisely, with classification
tools and TAR able to allow a quick review and assessment of
vast quantities of records. Similarly, with businesses depending
on records agencies for routine activities, improved speed in
responding to queries will bring a positive effect to the
economy. (3) AI techniques have the potential to aid in the
accessibility of records in archives by new audiences, for
instance by translating and indexing historical materials
written in indigenous languages, sensitising problematic
archival descriptions, helping patrons find connected items, or
captioning historical photographs. These techniques have both a
cultural significance, by providing better access to historical
material, and a social and scientific significance, by making
current records easier to organise, retrieve and use by both
their creators and the public at large. (4) While there have
been numerous calls to action to systematically explore the
application of AI techniques to the records and archives field,
AI also currently faces major ethical challenges that will
benefit from an archival theory perspective, for instance in
dealing with bias and personal information. By exploring further
the connections between AI and archives, this project is
contributing, and will continue to contribute, to the
intellectual progress of both fields.</p>
            <p>The I Trust AI project has generated a great amount of
enthusiasm among participant researchers (about 200) and partner
organizations (87), as well as organizations that do not have
the capacity to participate but look forward to outcomes they
can use, because it deals with issues that are already
dramatically changing the way we act, behave and think. We have
a unique and essential contribution to make, because we have the
means of creating knowledge ensuring that digital data and
records are controlled and made accessible in a trustworthy,
authentic form wherever they are located; are promptly available
when needed; duly destroyed when required; and accessed only by
those who have a right to do so.</p>
          </sec>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>