Walking in My Shoes: A Case Study from a Born-Digital
                           Archive


                                     Emmanuela Carbé
                                     University of Pavia
                                  emmanuela.carbe@unipv.it


Keywords: born-digital archive, digital curation, Italian contemporary literature,
private papers archiving


Abstract
The vulnerability of bits and the obsolescence of media raise new challenges with respect to the
preservation of cultural heritage produced in the last few decades. In 2009 a research team at
the University of Pavia decided to develop the PAD (Pavia Archivi Digitali) project, aimed at the
long-time preservation of digital papers from Italian writers and journalists and their accessibility
to the research community. PAD is intended to be as flexible as possible in terms of types of
material, numbers of authors and the dimensions of their archives: its main feature is an
integrated quality control system that manages each single phase of a deposit almost in real
time, allowing the ingestion, classification and validation of archives under strict and accurate
supervision. The archival system is based on five areas: "staging", "deposit", "permanent",
"work", and "info". The most difficult acquisition for PAD has been the archive of Francesco
Pecoraro. It has been the best test case for procedures and workflow, for instance the ingestion
of files from different media and of the materials published on his blog and on social networks.
With the help of Pecoraro’s archive, the PAD team designed software that facilitates cataloguing
and managing digital archives.


Introduction
When the Italian journalist Beppe Severgnini submitted the proposal in 2009 to build a Born-
Digital Archive of contemporary Italian authors, the University of Pavia did not have any idea
how much of a challenge this would be. Shortly afterwards, Severgnini made available the
developing PAD (Pavia Archivi Digitali) with more than 16,000 files from his own computer, and it
soon became evident that the prototype project of archiving files of contemporary writers was on
the point of becoming extremely complex and ambitious.
Nevertheless the University of Pavia had always been very attentive to new technologies and to
the collaboration of different disciplines and so it seemed to be an ideal location to build a Born-
Digital Archive – even more so because of its long-standing philological tradition. Back in 1969,
Maria Corti came up with the ground-breaking idea of collecting manuscripts of twentieth-century
Italian poets and novelists, and founded the Centre for Research in the Manuscript Tradition of
Modern and Contemporary Authors, also known as Centro Manoscritti.


Consequently, thanks to the efforts of Professor Umberto Anselmi Tamburini (coordinator), Dr
Primo Baldini (technical project and development) and Annalisa Doneda (responsible for
interactions with the authors), a first working group was established in 2009. Currently PAD is
chaired by Fabio Rugge, Chancellor of the University, and coordinated by Professor Paul
Gabriele Weston. The Academic Board (http://pad.unipv.it/comitato) benefits from the work of
many professors in various fields and areas. In addition, the staff of the University library system
offer archival assistance and experience, and the attorney Luigi Ubertazzi and the legal office of
the University of Pavia provide legal aid. The staff consist of two technical and scientific
supervisors: Primo Baldini, who is in charge of the technical project; and Emmanuela Carbé,
who supports development and testing of the software and liaises with authors. In the beginning,
PAD could rely on the support of Fondazione Alma Mater Ticinensis of the University of Pavia,
with the future objective of a profitable cooperation with the Centro Manoscritti.


PAD’s mission is to collect and preserve born-digital materials provided by Italian authors,
journalists and leading personalities in cultural fields: it consists of an archive of memories that
contributes to the present Italian cultural landscape and which is easily accessible to the
research community, complying with authors' privacy and copyright. After the original donation
by Severgnini, five more authors have donated their archives to PAD: Silvia Avallone, Franco
Buffoni, Gianrico Carofiglio, Paolo Di Paolo and Francesco Pecoraro. This has amounted to
almost 80,000 files thus far. These authors vary greatly in age, education, and literary and
journalistic approach and are highly diverse: this helps to build up a wide-ranging archive that is
useful for the type of research that goes beyond just the literary sphere and provides samples of
various methods of writing. The archives collected by the PAD project do not necessarily follow a
schema and do not have any particular form: they sometimes contain files of different types, for
example writing, graphics, media and documents generated using specific software.


Beyond paper
As we know, paper preservation is today only part of a wider problem. In February 2015, during
the annual meeting of the American Association for the Advancement of Science, Vint Cerf, the
Internet pioneer, addressed the vulnerability of memories that have been stored on digital
platforms, which arise from the obsolescence of both hardware and software: what will twenty-
first-century historians study? What strategies are in place to avoid the loss of the cultural
heritage that has been created over the last few decades? Although the issue has been
addressed previously (Kuny 1998), several problems remain unsolved to date: memory
institutions face new challenges in securing the collective and personal memories of the last
decades. The availability of large volumes of digital material raises questions about the role of
digital curators in physical preservation and access to documents (Kirschenbaum, Ovenden, &
Redwine 2010).


In 2010 Ricky Erway published a study which explained concisely and with extreme clarity a
range of scenarios and issues about the long-term preservation of digital materials. Following
that initial contribution, together with Barrera-Gomez in 2012 and 2013 she proposed certain
fundamental steps required for the preservation of born-digital content extracted from physical
media. They suggested the approach “walk before you can run”. This is valuable advice for
those who work in projects involving digital humanities, which rely on architectures based on
scalability and interoperability. In the beginning, PAD looked like it would be a long journey and
yet, despite the difficult experiences had with six authors and the improvement of all the
procedures, every acquirement is characterized by new problems which are always different and
unique.


The vulnerability of bits in fact has consequences in the field of literary archives: what kind of
“manuscripts” have been produced by the writers of the last few decades? How is it possible to
preserve today's “writers' desks” if nowadays everything is virtual, (only apparently) invisible,
consisting of sequences of bits and binary code?


Only a few institutions have been working on projects for the preservation of born-digital writers’
papers, including the Harry Ransom Center, which preserves collections such as that of Michael
Joyce (Stollar Peters 2006). Another significant example is the collection of the Salman Rushdie
digital archive, preserved by Emory University’s Manuscript, Archives and Rare Book Library
(Carroll, Farr, Hornsby, & Ranker 2011).


In the examples mentioned above, great effort has been put into ensuring the accessibility of the
collections, for example by providing ways to emulate the original archive, and by the integration
of paper and born-digital documents. Generally speaking, those few projects that emphasize
literary archives focus on the works of a single author. The main goal of the PAD project is to
implement a wider and more complex system dedicated to handling literary archives, with the
added aims of comparing archives from several authors and incorporating a facility to parse texts
with built-in textual analysis tools. PAD focuses on cataloguing the archives using adequate
archival standards in order to ensure interoperability with traditional archives and perhaps with
other born-digital archives that may, in the future, be more common than today. Along with some
legal issues that are still to be settled, this is one of the major challenges in the project: given the
amount of data that has to be handled, a fully manual cataloguing process would be
unreasonable because the investment in terms of time and human resources would be too high.
As a minimum, semi-automated and sustainable solutions are essential.


In these years of development of the project, PAD has amassed far more questions than
answers regarding the management of writers’ digital materials. We tried to explore different
methods and perspectives, combining techniques that are typically used in other areas such as
forensics, and applying them to the unique and specific features of a private literary archive with
the intention of providing the DH community new questions to work on within a field that has
received comparatively little substantive attention until now.


Methodologies and architecture
The aim of PAD is to be as flexible as possible in terms of types of content, number of authors
and the size of their archives. So how should we want others to “walk in our shoes”? From the
beginning, considerable effort has been put into the implementation of new technology and
processes aimed at achieving better performance. Following the evaluation of established DAM
solutions, the decision was taken in 2014 to develop an in-house software platform to
seamlessly integrate with PAD's complex architecture. The software, entirely designed by Dr
Primo Baldini, is called QUANDO (Quality control for Archiving and Networking of Digital
Objects). The main tool in its development is FileMaker: the first version was initially a stand-
alone and single-user program but, since 2015, it has been converted into a multi-user
application that can be accessed within a private network (intranet). At the moment only staff can
access the software using their personal credentials: users have different access levels, and can
modify the data according to their roles within the project.
                               Figure 1: Screen Summary of QUANDO


QUANDO manages all of the important aspects of the life cycle an archive and also acts as a
repository for administrative documentation. It integrates information that has been entered
manually with data that has been gathered automatically from other PAD software components
(for checksumming, virus checking, metadata extraction, synchronization, etc.). It also assists in
co-ordinating the efforts of all the various participants: the Academic Board, DAMS
Administrator, Repository Administrator, staff, and legal consultants. The workflow can be
coordinated through the software: from the first contact with the authors to the secure storage of
data files (Weston, Carbé, & Baldini 2016).


The architecture of PAD has been designed in accordance with the OAIS Reference Model
recommendations (CCSDS 2012; Lavoie 2014). The PAD archival system is based around six
areas: "staging", "deposit", "work", "permanent", "info" and "database". The workflow procedure
for every ingest requires the author to fill in an informational survey consisting of 15 main areas
of questions. This is a primary step in the deposit process: we ask, for example, for information
about the author’s computer and devices, how the archive is organized, and how the authorship
of the work and the technical tools relate to each other. This also helps provide more information
about the relationship between author and computer, which is deeper than a purely technical
relationship, and which can entail modifications to creative processes (Kirschenbaum, Farr,
Kraus et. al. 2009): the process of acquisition of an archive can be intricate not only from a
strictly technical point of view but also psychologically, since each author has a unique relation
with the tools that she or he uses to write.
Upon initial ingest, materials are stored in the temporary area, where they are kept while waiting
for the availability of an operator. In the deposit area, the integrity of the archive is checked,
along with the possible presence of viruses. If any malware is found, the author is notified
immediately and, if needed, assistance is offered. Viruses are usually quarantined within the
PAD archive and they are only removed if a file could be irredeemably compromised, as
described in the documentation: in such a situation, there are particular processes to be
activated to try to recover the contents of a file. SHA-1 hashes are then generated. The PAD
Print application generates a list of unique files that have been transferred, which is sent to the
author for validation. In case of any second thoughts, the author can decide to remove a file or a
set of files. Attached to the list of files is a summary indicating the total number of files
transferred, the number of unique files and the size of the entire archive.


The work area is where metadata are extracted, documents are converted to formats that allow
for better long-term accessibility and older computers may be emulated using virtualization
technology. Finally, all of the data related to the deposit and collected by the QUANDO system
are transferred to the "info" area. The database area has been created for the purposes of
facilitating PAD workshops for the students of our university and to ensure accessibility for the
research community. The permanent area is dedicated to preservation. An unencrypted copy of
the archive is burned onto Gold Preservation archival standard DVDs and transferred to a bank
vault. For every archive, two copies are stored in Pavia and another one in the University’s
facilities in Cremona, more than ninety kilometers away from the main site of the PAD project,
thus following the principles of Distributed Digital Preservation (Skinner, Mevenkamp 2010).


A case study: Francesco Pecoraro’s archive
The most difficult acquisition for PAD has been the archive of Francesco Pecoraro in 2015. It
has been the best test case so far for our procedures and workflow, and has helped us to re-
examine many aspects thereof, such as the ingestion of files from different media and of the
posts published on the blog and on social networks.


The author, who is also an architect, was very popular for his blog “Tash-tego”, which was active
from 2005 to 2011; he was subsequently quite active on Facebook until April 2015. He made his
début with the short stories of Dove credi di andare (Pecoraro 2007), a collection of writings from
his blog in Questa e altre preistorie (2009), the poems in Primordio Vertebrale (2011) and the
novel La vita in tempo di pace (2013).
During our first meeting, Pecoraro noted that he first used a PC for writing in the 80s. He used to
work with a workstation running Windows 7 at the time of the deposit and uses Dropbox for the
majority of his writing. He also makes use of two external hard disks for storage, one of which
also contains files that are not preserved on Dropbox: this hard disk is organized with directory
names that informally describe where the files were previously located (for example: “White
Thumb Drive”). The author backed up his work on a number of other occasions, especially (but
not limited to) upon changing his workstation. His archive presented PAD with a "Russian doll"
file structure and posed a number of problems in the first validation step.


Pecoraro provided us with 35 floppy disks and the recommendation to convert any CAD files of
his architectural work to either JPEG or PDF format. Of these 35 floppies, 10 could not be read
any more and 5 of them contained a spanned ZIP file which could not be extracted even using
an old version of WinZip95. The AutoCAD files have been converted to PDFs and, in the
process, many viruses were found. Subsequently, we returned the materials to the author
because PAD had refused to archive the obsolete media. He also gave us a DVD containing the
work carried out by the editors of the book Questa e altre preistorie, including the final print
version. We presented the author with a Kodak Preservation Gold DVD with all the files
converted into open document formats.


After the first meeting we proceeded with the deposit process. The files were copied to a hard
disk with hardware cryptography capabilities. We transferred files from Dropbox, the author's
personal computer and an external hard drive. During the deposit process, the operator took
notes and screenshots, taking especial note of the archival elements that would not be included
in the deposit. We are always extremely careful in creating different folders on our external hard
disk to represent the original source content and to consider them, from the archival point of
view, only a reconstruction of the original situation.


Pecoraro gave us more than 43,000 files and specified that he wished to keep part of the private
correspondence under embargo for 30 years. Obviously, in every ingestion, we always check in
the temporary area to see whether an author has given us any highly personal files by mistake.
One must face the problem, however, of how it is possible to check thousands of files to trace
items that should be under embargo or highly personal files that need to be removed or kept
unavailable.


From theory to practice: PADManager
This very difficult part of the project helped us focus on developing a system that could manage
the file from the deposit to the permanent storage area, which would add metadata that could be
helpful in the next steps of the process. Using Pecoraro’s archive we designed PADManager, a
piece of software that exploits the database functionality of PAD’s architecture for cataloguing
and managing archives. The future development of PADManager is projected to use techniques
normally positioned in the field of artificial intelligence, such as machine learning and natural
language processing. The ultimate aim is to allow the scientific community to access the archive
and to provide tools for text mining and statistical analysis that could assist in the study of textual
content.


                              Figure 2. PadManager: the Archiver Section


This test version is divided into three main sections: Archiver, Project and Biblio. Archiver, in
particular, is the part that was developed while we were contending with the complexity of
Pecoraro’s archive. Each element can be checked through the software, where it is possible to
select actions for all files and folders: checking the item with the author, identifying sensitive and
undisclosable files, and locating files with technical problems. Every file can be seen in a simple
preview or rendered to PDF format for ease of reading. There are functions to add temporary
bookmarks to the archive, check technical metadata, and add tags either to a single document or
to any instances of a document that is found replicated through the archive. Operators may also
add chronological references, which are particularly useful when those that can be derived from
the existing technical metadata appear to be incorrect with respect to the document contents.
Bibliographic data collection is added in the Biblio section, which has been designed, for the time
being, using a template that follows the Wikipedia guidelines in the hope of future publication as
linked open data. Bibliographic data are needed for the archival description of funding sources
(Project component of the platform), inspired by the FRBRoo model (Bekiari, Doerr, Le Bœuf, &
Riva 2015).


The experience of ingesting a complex archive such as Pecoraro's showed us that there is still a
long way to go in this work, not only in terms of the development and implementation of a data
management application such as PADManager, but also regarding the improvement of the
acquisition process, which does not only depend on expertise in Information Technology but also
on the individual expertise of the operators. Hopefully there will be an opportunity to cooperate
with other international institutions while we tread this path, with the common objective of
improving best practice in the still relatively little known field of born-digital literary archive
management.


REFERENCES
Barrera-Gomez, J., & Erway, R. (2013). Walk this Way: Detailed Steps for Transferring Born-
Digital Content from Media You can Read In-house. Dublin, Ohio: OCLC Research.
Bekiari, C., Doerr, M., Le Bœuf, P., & Riva, P. (2015). Definition of FRBRoo. A Conceptual
Model for Bibliographic Information in Object-Oriented Formalism. Den Haag: IFLA.
https://www.ifla.org/files/assets/cataloguing/FRBRoo/frbroo_v_2.4.pdf
Carroll, L., Farr, E, Hornsby, P., Ranker, B. (2011). A Comprehensive Approach to Born-Digital
Archivers, Archiviaria, 71, 61-92.
Consultative Committee for Space Data Systems (2012). Reference Model for an Open Archival
Information       System       (OAIS).    Washington     DC:     CCSDS         Secretariat.
https://public.ccsds.org/pubs/650x0m2.pdf
Erway, R. (2012). You’ve got to Walk Before You Can Run: First Steps for Managing Born-
Digital Content Received on Physical Media. Dublin, Ohio: OCLC Research.
http://www.oclc.org/research/publications/library/2012/2012-06.pdf
Kirschenbaum M. G., Farr, E. L, Kraus K. M. et al. (2009). Digital Materiality: preserving access
to computer as complete environments, iPRES 2009: The Sixth International Conference on
Preservation of Digital Objects. California Digital Library, 5-9 October, UC Office of the
President, 105-112. https://escholarship.org/uc/item/7d3465vg
Kirschenbaum M. G., Ovenden, R., & Redwine G. (2010). Digital Forensics and Born-Digital
Content in Cultural Heritage Collections, Washington DC: Council on Library and Information
Resources.
Kuny, T. (1997). A Digital Dark Ages? Challenges in the Preservation of Electronic Information,
Workshop: Audiovisual and Multimedia joint with Preservation and Conservation, Information,
Technology, Library Buldings and Equipment, and the PAC Core Programme, 63rd IFLA Council
and General Conference.
Lavoie, B. (2014). The Open Archival Information System (OAIS) Research Model: Introductory
guide (2nd Edition). Dublin, Ohio: OCLC Research. http://dx.doi.org/10.7207/TWR14-02.
Pecoraro, F. (2007). Dove credi di andare. Milano: Mondadori.
Pecoraro, F. (2009). Questa e altre preistorie. Firenze: Le Lettere.
Pecoraro, F. (2011). Primordio vertebrale. Roma: Ponte Sisto.
Pecoraro, F. (2013). La vita in tempo di pace. Roma: Ponte Alle Grazie.
Sample, I. (2015). Google boss warns of “forgotten century” with email and photos at risk. The
Guardian, 13th February. https://www.theguardian.com/technology/2015/feb/13/google-boss-
warns-forgottencentury-email-photos-vint-cerf
Skinner, K., Mevenkamp, M. (2010). DDP Architecture, in Skinner, K., Schultz, M. A Guide to
Distributed Digital Preservation. Atlanta: Educopia.
Stollar Peters C. (2006). When Not All Papers are Paper: A Case Study in Digital Archivy,
Journal of the Society of Georgia Archivists, 24, 22-34.
Weston, P. G., Carbé E., & Baldini, P. (2016). Hold it All Together: a Case Study in Quality
Control forn Born-Digital Archiving, Qualitative and Quantitative Methods in Libraries (QQML), 5,
695-710.
http://www.qqml.net/papers/September_2016_Issue/5313QQML_Journal_2016_Westonetal_69
5-710.pdf