Long-term Electronic Archives: Modernization and Integration

Aleksandr Marchuk, Sergey Troshkov, and Irina Krayneva

A.P. Ershov Institute of Informatics Systems (IIS), SB RAS, Acad. Lavrentjev pr., 6, Novosibirsk 630090, Russia

Abstract

Electronic archives are more than just “paper-based” funds transferred into digital form. In these projects, the task of forming a high-quality source base for historical research is solved through interdisciplinary interaction between computer science and the humanities. One of the long-term projects at the IIS SB RAS is the creation of a methodology and technologies for the formation, use and support of electronic archives. The following information systems of historical orientation were created on the basis of the solutions described in this paper: the Electronic Archive of Academician A.P. Ershov, the Photo Archive of the Siberian Branch of the Russian Academy of Sciences, the Archive of the newspaper “Science in Siberia”, the Open Archive of SB RAS, etc. Each of these resources has its own specifics, but in general their documentary and information fund rests on a common social and territorial basis: the scientific and social activities of the Siberian Branch of the Academy of Sciences of the USSR/RAS and of Novosibirsk Akademgorodok. In this report we look into the general problems of creating electronic archives, the relevant technologies, and architectural solutions. Some approaches to the integration/disintegration of isolated electronic resources are illustrated with examples from our electronic archives, using both generally accepted tools and tools we created. We also examine our experience of using proprietary and open source software, with their advantages and weaknesses.

Keywords: interdisciplinarity, high-quality information, integration of electronic resources, open archives, Open Archives Initiative, euroCRIS, proprietary software, Semantic Web, Drupal

SSI-2021: Scientific Services & Internet, September 20–23, 2021, Moscow (online)
EMAIL: mag@iis.nsk.su (A.G. Marchuk); kamronis@xtech.ru (S.N. Troshkov); cora@iis.nsk.su (I.A. Krayneva)
ORCID: 0000-0001-8455-725X (A.G. Marchuk); 0000-0003-2952-9509 (S.N. Troshkov); 0000-0002-0601-9795 (I.A. Krayneva)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Information technology (IT) has long been successfully integrated into humanities research. There are several scientific and educational centers in Russia that teach the use of ICT in the humanities and apply the acquired skills in practice. Examples include Moscow State University (Corresponding Member L.I. Borodkin, Doctor of Historical Sciences I.M. Garskova), Altai State University (Doctor of Historical Sciences V.N. Vladimirov), Tomsk State University (Doctor of Historical Sciences S.A. Nekrylov), Izhevsk State University (Doctor of Philological Sciences V.A. Baranov), Perm State University (Doctor of Historical Sciences S.P. Kornienko), and Krasnoyarsk State University (Candidate of Philological Sciences I.A. Kizhner) [1]. State archives, which are the main place of interest for researchers, are digitizing their scientific reference apparatus [2]. The development of approaches to the creation of a unified information system (IS) for the archival sphere is the area of interest of Doctor of Historical Sciences Yu.Yu. Yumasheva from the All-Russian Research Institute of Documentation and Archival Affairs (VNIIDAD) [3].

It has been empirically established that automation helps not only to create and save electronic copies of artifacts, but also to give the researcher convenient ways to systematize them. What previously took years of painstaking work in “paper-based” archives can be done in weeks or days with an electronic archive. The most significant property of an IS is the structuring of documents, which includes the creation of a database of related information. At the early stages of the creation of electronic bibliographic and archival systems, the database basically consisted of a collection of cards with descriptions of artifacts. This approach is embodied in the classical set of metadata fields known as Dublin Core [4]. In more modern approaches, Dublin Core is either expanded into a more advanced construction or contained, explicitly or implicitly, as part of the structuring system.

A significant problem in the accumulation and structuring of information is the integration of data. Electronic archives are formed around a certain local body of documents or a number of collections. The dispersion of electronic archives (EA) across the network is not an obstacle if there are simple means of combining document sets to give users uniform access to documents. The problem lies in the database: the intersections of entities were not taken into account when each EA was created. As a result, grouping information around entities (persons, organization systems, geographical systems) is difficult. The task of integrating EA data is to create convenient ways for an archive to use external information and for other IS to use this archive. We are not tasked with creating a “universal” archive, but the development of a unified toolkit can lead to one.

The purpose of the work on electronic archives conducted at the IIS SB RAS is to create methods and technologies for electronic archiving and to accumulate data on the history of science and the scientific community. Another purpose of the research presented in this article is to study the possibilities of integration and interoperability of various electronic archives and the modernization of their platform and software solutions.

2. International experience of using IT in archival business

The idea of using IT for legacy archiving turned out to be quite productive.
It has given rise to several international projects aimed at unifying approaches to archive design. One of them was the international Open Archives Initiative (OAI, 1999) [5]. The OAI “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. The fundamental technological framework and standards that are developing to support this work are, however, independent of the both the type of content offered and the economic mechanisms surrounding that content, and promise to have much broader relevance in opening up access to a range of digital materials. As we gain greater knowledge of the scope of applicability of the underlying technology and standards being developed, and begin to understand the structure and culture of the various adopter communities, we expect that we will have to make continued evolutionary changes to both the mission and organization of the Open Archives Initiative” [6]. The OAI group produced specifications such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and the Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting. The initiative proposed XML encoding as the packaging mechanism for harvested metadata.

euroCRIS, an international organization for research information, was launched at almost the same time. It provides the Common European Research Information Format (CERIF), a comprehensive information model for the domain of scientific research. CERIF is intended to support the interchange of research information between and with CRISs (current research information systems). Among other use cases, the OpenAIRE Guidelines for CRIS Managers are based upon it [7]. The Technical Committee for Interoperability and Standards (TCIS), established in September 2020, coexists with the CERIF Task Group (TG) and deals with the decision-making process for the evolution of CERIF and associated technical products. The TCIS will also set the roadmap and strategy towards the widespread adoption of CERIF and its alignment with other relevant technologies, models and standards.

The two projects differ in the scope of the scientific activities they cover: euroCRIS develops the CERIF data format for a comprehensive description of all scientific activities (Figure 1), while OAI focuses on presenting metadata of objects (artifacts) and exchanging them between information systems.

Figure 1: A grouping of CERIF entities (https://eurocris.org/services/main-features-cerif)

In any case, international efforts to develop an open information space for integrating resources, as well as attempts to take independent steps in this direction, are very welcome. These standards have found almost no application in their original form in Russia, so we will try to justify the relevance of our approach. In our case, it is due to several circumstances:
● like all other countries, Russia has a number of restrictions on the open publication of archives. We work directly with the owners of collections and receive their permission to disclose personal data for publication;
● conducting and developing research on the creation of information systems is dictated by the internal imperatives of the development of computer science and by the tools chosen for implementing ideas. In addition to developing our own (in-house) software, we also have experience in using open source software. In 2016 S.N. Troshkov migrated the A.P. Ershov Electronic Archive to the open source Drupal platform as part of his bachelor's thesis at the Department of Mathematics and Mechanics of Novosibirsk State University.
Following the migration of the Electronic Archive, the Library system, developed back in the early 1980s, was also migrated using client-server technology [8].

3. Electronic Archives of IIS SB RAS: the technology

Several information systems were created at the Institute of Informatics Systems SB RAS in the 2000s as a result of a large number of experiments. The research started with the task of converting the paper archive of Academician A.P. Ershov into electronic form, which at first seemed simple. The complexity of the work was due not only to the number of documents (over 75 thousand) but also to the need to preserve the author's idea of the thematic and systematic organization of the corpus of documents, as shown in Figure 2. In later work we tried to adhere to the principles of structuring, functionality and interface design developed here, although the basic technologies were changing and the capabilities of computers, networks and external equipment were growing.

Figure 2: Thematic and systematic catalogues in A.P. Ershov’s Electronic Archive

At first, the technology was created anew for each project; later projects used modified versions of the earlier technology. The technology defined ways to store document content and databases (information databases) and ways to add, describe and edit data. Two interfaces were created: one for information specialists creating or modifying an information database (the so-called backend) and another for EA users (the frontend). Our projects support different types of multimedia documents, multiple languages, and simple and advanced search. At first, we used classic client-server solutions with an SQL database. Later, we created our own platform based on the Semantic Web paradigm.
The properties of the Semantic Web that matter for EA problems are a basic structuring model in the form of a directed graph, standards for the representation of data and knowledge, a formalism for describing ontologies, and a query language; the corresponding standards are RDF, RDFS, OWL, and SPARQL.

The technological platform is divided into several layers, which helps both in the implementation and in the modification of individual parts of the EA. The information base is formed from sections called cassettes. A cassette groups documentary content in the form of files into a single entity that is easy to create, store, and process. One feature of the solution is that not only the original document file is stored, but also several “lightweight” versions of the document for quick use on the Internet. For example, a photo scanned in high resolution may produce a 50 MB file, while its compressed versions may be 100 KB or even 10 KB. In addition to the formats of primary documents, the technological platform includes procedures for processing and converting various kinds of multimedia.

The database is organized in a special way. It is created in the form of RDF documents and stored in cassettes along with the document files. As a result, cassettes are an important building block for EA configurations, as shown in Figure 3. The client-server architecture consists of a distributed system of cassettes and a set of configurations (depicted as servers). A configuration defines a set of working cassettes, so it is specialized for a given topic or a given fund.

There are also specifics in storing different types of multimedia (photo, video, audio), document forms (PDF, etc.), and assembled documents, e.g. DVD movies. We also conducted experiments with special forms of representation, such as Deep Zoom and vector graphics. There are two problems in supporting different document formats.
The first is providing prompt access to the contents of a document through a regular browser. The second is the evolution of formats and their applicability in the modern world. For example, Microsoft stopped supporting such unique technologies as Silverlight and Deep Zoom, which we used for processing the collection of the weekly newspaper “Science in Siberia”. As a result, access to the electronic collection was partially lost.

Figure 3: Principle of work of document electronic archives (diagram labels: distributed information database, original documents input, dynamic synchronization, operators, users)

The issue of data editing is solved in a special way at the architectural level. Editing is performed only by adding new database elements. Deletion and modification are therefore also expressed as added elements and are resolved using timestamps in the data. This allows us to keep the layer of changes in the database small: most cassettes are static and do not change over time, and changes are made only in a small number of cassettes and documents. The working procedure is as follows: each operator edits only his or her own RDF document. This protects shared data from unscrupulous or malicious actions by operators.

Another layer of the architecture is the engine for the RDF database. We have created our own version of the engine that works effectively with medium-sized databases. We do not use the SPARQL query language (so far), although we have experimented with it. Instead, we have created our own API that efficiently supports the required selections and actions. The current version of the main platform solutions is implemented in C# with ASP.NET and the PolarDB library [9].

4. The lifecycle and modernization of information systems

The oldest information system created and developed in our institute was the A.P. Ershov Library information system. It was created for the BESM-6 computer back in the era of punched cards.
Both its creation and its further processing were carried out by enthusiasts. The addition of funds and the correction of errors and inaccuracies were carried out by employees of the Department of Scientific and Technical Information (DSTI) of the Computing Center of the USSR AS and later of the IIS SB RAS. In the 1990s the A.P. Ershov Library was ported to MS DOS using the FoxPro 2 DBMS. This system kept a catalog of books and magazines and supported registering users and forming lists of new arrivals. It served the IIS for almost 30 years in this form. Although the application remained relevant and functional, the environment in which it was created became outdated, and further support and development proved impossible. In addition, the application did not support data structuring and dictionaries. All data was of string type and was filled in manually by the librarian, which caused errors and duplicates. In 2018, the A.P. Ershov Library application was reengineered [10] with the help of the freely distributed Drupal web platform. This work included semi-automatic correction of errors in the names of authors and funds, data migration with preservation of the data model, and the implementation of convenient modern interfaces.

The first Internet-oriented electronic archive created and maintained by the IIS SB RAS was the Archive of Academician A.P. Ershov [11]. It followed the classical scheme of a Web application built on a relational database, with a public interface (frontend) and an editing interface (backend). A significant amount of work was done on scanning and describing the documents contained in more than 500 folders of the “paper-based” archive stored at the IIS SB RAS. The technology for adding and editing information and for presenting documents and additional information to users turned out to be successful, so additional segments not related to the folders formed by A.P. Ershov were later added to the archive.
These were the archives of the “Start” collective, the IIS SB RAS archive (including the PSI conferences), and the archive of Corresponding Member of the USSR AS Svyatoslav Sergeevich Lavrov. The project was developed in 2000 with the support of Microsoft Research and ran into one of the frequent problems of applications that rely on proprietary software. Fifteen years after its development the project was still relevant, but further support and development were difficult because the licenses for the proprietary software had expired. We therefore decided in 2015 to migrate the application to freely distributed software. An important condition for the migration was the preservation of the original archive data model, since it has historical value as one of the first data models for electronic archives in Russia. In 2016 the migration to the freely distributed Drupal web platform was completed successfully. Aside from data migration, the user and archivist interfaces were improved and support for mass image uploading was implemented.

Another project is the Photo Archive of the SB RAS (Photoletopis) [12]. A technology based on the Semantic Web was prepared for it [13, 14]. We also formed a data structuring system, which was later formalized as the ontology of non-specific entities. The new problems to overcome in this project were related not only to the new approach to structuring but also to the processing and presentation of photo and video materials, data dispersion, and the protection of data from accidental and malicious distortion.

As the number of requests for storing archives in digital format began to grow, we formed an approach and a technology for the multi-archive system of the Open Archive of the USSR AS / SB RAS [15]. The technological basis remained the same; only the interfaces and the structure of document descriptions changed, becoming more detailed.
For example, a document was treated as a composition of its parts (sections, pages, page scans), and authorship was extended with the ability to add information on both the authors and the recipients of documents. We have created convenient tools for information operators to process large volumes of scanned pages. Currently, 25 archival funds have been loaded into the Open Archive, and work is underway on several more.

Another problem is the physical safety of data. Hard drives can be damaged as a result of physical wear and external influences. Although we have not encountered serious virus threats, they should also be taken into account. We cannot say that the safety of our data is currently ensured at a modern, reliable level, because such solutions are heavyweight and costly. The basis for ensuring the reliability of storage and functioning of the electronic archives is a server pool of machines and devices that provides virtual machines, maintained by the IIS SB RAS, for developers and operators. The use of specialized solutions or cloud infrastructure would be appropriate here, but for now we want to fully control our data while maintaining the necessary flexibility in solutions.

The described modernization of the A.P. Ershov Electronic Archive does not seem to solve all the problems of ensuring a long service life either. The experiments described below should be considered an attempt to determine the direction of the technological evolution of the archive.

A testing ground for new solutions was the electronic archive of the Summer School of Young Programmers (SSYP), which the Institute holds annually for talented schoolchildren. During the creation of this archive we moved from an XML database with interfaces generated by XSLT tools (at that time, the Semantic Web was still unknown) to quite modern Web applications with RDF, OWL, ASP.NET, etc. Schoolchildren who studied at SSYP contributed to the project.
The archive is constantly updated and used (http://mag.iis.nsk.su/syp).

5. Integration purpose and challenges

The purpose of electronic archives is to collect and store data and information and to provide convenient access to it for specialists and the public. Data preservation is also significant. So the question arises of whether different archives can be fully or partially integrated. The task of integration is relevant not only to the information collections mentioned above but is also of general scientific interest.

To understand the possible nature of archive integration, it is necessary to study the interests of archive users. Since access to archival information in electronic archives is much faster and more convenient than in classic “paper-based” ones, the number of possible users is expanding significantly. In the past the main users were historians who obtained information for research purposes in documentary archives, but now the archives can easily be used by any curious person on the Internet who is interested in the history of their family, city, country, etc.

We mainly use the ontology of non-specific entities to structure documents and data in our archives. Figure 4 shows, in the form of a tree, the system of classes and relations (properties) that form the basis of this ontology.

Figure 4: Basic classes and relations of non-specific entities

Starting from a document of a particular collection, the creators of the archive record the authors of the document and the persons, organizations and geographical places presented or described in it. The archive thus receives an additional structuring through people, organizations and geoinformation, which may be useful for the user. Integration can therefore be general (the maximum amount of information requested), individual (a specific problem) or problem-oriented (a special thematic request).
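
The structuring through persons, organizations and places described above can be pictured with a minimal sketch. It is not the archive's actual data model: the class names, field names and sample records below are assumptions made purely for illustration. It only shows how documents from different collections become reachable through a shared entity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A non-specific entity: it exists independently of any single archive."""
    id: str
    kind: str  # "person", "organization" or "place" (illustrative labels)
    name: str


@dataclass
class Document:
    id: str
    collection: str
    title: str
    authors: list = field(default_factory=list)    # Entity objects
    mentioned: list = field(default_factory=list)  # persons, organizations, places


def index_by_entity(documents):
    """Map each entity id to the ids of the documents that reference it."""
    index = defaultdict(list)
    for doc in documents:
        for entity in doc.authors + doc.mentioned:
            if doc.id not in index[entity.id]:
                index[entity.id].append(doc.id)
    return dict(index)


# Invented sample records, not real archive entries.
person_a = Entity("p1", "person", "Person A")
institute_x = Entity("o1", "organization", "Institute X")
city_y = Entity("g1", "place", "City Y")

docs = [
    Document("d1", "Collection 1", "Letter", authors=[person_a], mentioned=[institute_x]),
    Document("d2", "Collection 2", "Group photo", mentioned=[person_a, city_y]),
]
index = index_by_entity(docs)
```

Here `index["p1"]` lists documents from both collections, which is the kind of entity-centred grouping that becomes possible once archives share the same persons, organizations and places.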
At the same time, we have already encountered the need to split a single archive into parts. Breaking an archive into parts may be significantly more difficult than combining archives, because the combined archive has its multimedia, documents and database integrated. Although considerable efforts have been made to ensure the modularity of stored documents through cassette technology and the segmentation of the database through fog-segments, data intersection remains a problem. The splitting technology, like the merging technology, is still being created and formalized. The task of splitting an archive into parts also includes the requirement to ensure further development of the archive in terms of additions to and modifications of the document set and database. A copy of a part of the archive (a subcollection or a collection created on a thematic request) taken at a particular moment seems of little use when the archive itself continues to be extended and edited.

6. Integration experiments

The purpose of the described experiments was to find out whether document and information funds that are already in operation separately can be joined and can enrich one another. The method of performing the integration is also of scientific interest.

The first experiment was an attempt to expand the information provided to the user of the Electronic Archive of Academician A.P. Ershov (EAE) with insertions from another information system. To do this, it was necessary to add to the donor information system a web service that dynamically responds to requests from a web application or web page and generates a response in the form of HTML markup with the requested additional information. A similar approach is often used in other systems. The response message contains a photo, information about names, dates of life, major degrees and awards, and professional path.
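
A minimal sketch of such a donor-side service is given below. It is an assumption-laden illustration, not the code used in the experiment: the record fields, the sample data, the CSS class name and the function names are all hypothetical, and the HTTP layer is reduced to a function that returns a status code and an HTML fragment.

```python
from html import escape

# Hypothetical donor records; field names and values are invented.
DONOR_PERSONS = {
    "p-001": {
        "name": "Person A",
        "life_dates": "1900-1980",
        "photo_url": "/photos/p-001.jpg",
        "degrees": ["Doctor of Sciences"],
        "career": ["Institute X, researcher"],
    }
}


def render_person_fragment(person):
    """Build an HTML fragment that the requesting page can embed as-is."""
    name = escape(person["name"])
    photo = escape(person["photo_url"])
    dates = escape(person["life_dates"])
    degrees = "".join(f"<li>{escape(d)}</li>" for d in person["degrees"])
    career = "".join(f"<li>{escape(c)}</li>" for c in person["career"])
    return (
        '<div class="person-card">'
        f'<img src="{photo}" alt="{name}">'
        f"<h3>{name}</h3>"
        f"<p>{dates}</p>"
        f"<ul>{degrees}</ul>"
        f"<ul>{career}</ul>"
        "</div>"
    )


def handle_request(person_id):
    """Return (HTTP status, body) for a lookup by the donor's person id."""
    person = DONOR_PERSONS.get(person_id)
    if person is None:
        return 404, ""
    return 200, render_person_fragment(person)
```

Returning ready-made markup keeps the host page simple, since it only inserts the fragment, at the cost of tying the presentation to the donor system. All values are escaped so that text from the donor database cannot break the host page.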
In this experiment the information was taken from the Photo Archive of the SB RAS. What do we have as a result? The EAE as an archive is an isolated object. It has information about the persons who are the authors or addressees of documents, about organizations, and about other independent objects (those not belonging to a specific archive). The information about these entities was recorded at the time the archive was created, and it may change over time. The EAE also has almost no photo materials related to persons. We assumed that the information available in the EAE should be expanded through integration with other sources. An example of such an experiment can be seen in Figure 5.

Figure 5: Person related information from photo archive SB RAS in A.P. Ershov Electronic Archive

The object of another experiment was also the EAE. In this case, it was not the information system itself that was relevant to the research but its database. The goal of the experiment was to immerse the EAE in the Open Archive of the SB RAS as a separate fund. Once the archive has been reimplemented using the principles and technologies of the Semantic Web, the ontology of non-specific entities, cassettes and the fog files of the database, such an immersion is quite simple: you need to match the entities and bind the root of the composition to the “Funds” collection.
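
The two steps named above, matching entities and binding the fund root to the “Funds” collection, can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the experiment: the dictionary layout, the function names and the sample names are hypothetical, and entities are matched only by an exact, normalized full name, so abbreviated names or initials will not match.

```python
def normalize(name):
    """Lower-case and collapse whitespace; a deliberately naive matching key."""
    return " ".join(name.lower().split())


def immerse_fund(host, fund):
    """Merge `fund` into `host` and return the id mapping that was applied.

    host: {"entities": {id: full name}, "funds": [root ids]}
    fund: {"root": root id, "entities": {id: full name}}
    """
    known = {normalize(name): eid for eid, name in host["entities"].items()}
    mapping = {}
    for eid, name in fund["entities"].items():
        key = normalize(name)
        if key in known:
            # The same entity already exists in the host archive: reuse its id.
            mapping[eid] = known[key]
        else:
            # New entity: prefix the id with the fund root to avoid id clashes.
            new_id = f"{fund['root']}/{eid}"
            host["entities"][new_id] = name
            known[key] = new_id
            mapping[eid] = new_id
    # Bind the root of the immersed composition to the "Funds" collection.
    host["funds"].append(fund["root"])
    return mapping


# Invented sample data.
host = {"entities": {"h1": "Ivanov Ivan Ivanovich"}, "funds": ["fund-a"]}
fund = {
    "root": "fund-b",
    "entities": {"e1": "ivanov  ivan ivanovich", "e2": "Petrov Petr Petrovich"},
}
mapping = immerse_fund(host, fund)
```

The returned mapping would then be used to rewrite references inside the immersed fund's records so that they point at the host archive's identifiers.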
At this stage of the experiment, the extraction of the database from the EAE into the Open Archive was carried out manually, but this process can be automated in the future. The experiment showed that such an immersion of a third-party fund can be carried out successfully without significant labor costs. On the one hand, the EAE as a fund gains the navigation and search capabilities of the Open Archive; on the other hand, in some cases access to documents and data is enriched for other, already existing funds. For example, the fund of Corresponding Member Alexey Andreevich Lyapunov, structured within the framework of the Open Archive, is extended with documents from the archive of A.P. Ershov that were not previously in the Open Archive.

One of the most significant problems in the practical application of Semantic Web technology is that it does not provide a common system for identifying real-world objects, which means that there is no single identification system for database records. Each database assigns identification codes independently, effectively confining the identification space to its own framework. Integrating such databases requires matching the record identifiers that correspond to the same entities. The matching can be carried out experimentally by comparing the full names of entities of the same or related classes. This approach is effective to a certain degree, especially when full names are available in the database with name, surname and patronymic. Its effectiveness is quickly lost when incomplete variants of names, initials, different translation options, etc. are used. In such cases, it is necessary to apply more complex methods that analyze not only the name but also the context [16].

7. Conclusion

As a result of a long evolution of methods, we have developed an approach to the technological improvement of long-lived archival information systems.
The tools for creating and maintaining electronic archives created by the IIS SB RAS have been used and developed for more than two decades. A significant number of industrial and experimental information systems have been created with the help of those tools. The most active information systems have been modernized at various times and adapted to the changing technologies of the underlying platforms. The next step is to develop principles for the coexistence of different information systems and to deal with issues of their integration and disintegration. Substantial experiments were carried out on the partial or complete inclusion of the resources of one electronic archive into another. It is shown that such inclusion can be carried out without destroying the integrity of the systems representing the author's composition.

8. References

[1] Istoricheskie issledovaniya v kontekste nauki o dannyh: Informacionnye resursy, analiticheskie metody i cifrovye tekhnologii. Tezisy mezhdun. konf. Moscow: MAKS Press, 128 s., 2020 (Informacionnyj byulleten' Associacii “Istoriya i komp'yuter”, no. 48).
[2] Russian State Archive of Social and Political History (RGASPI). URL: http://rgaspi.info/fonds/ (the sample).
[3] Yu. Yu. Yumasheva, Edinaya avtomatizirovannaya informacionnaya sistema arhivnoj sfery: ot postanovki zadachi k tekhnicheskomu zadaniyu. In coll.: Documentation in the Information Society: Actual Problems of Electronic Document Management. Proc. of the XXIV Intern. scientific-practical conf., 2018. pp. 227–240.
[4] Dublin Core™ Metadata Initiative. URL: https://dublincore.org/
[5] Carl Lagoze and Herbert Van de Sompel, The Open Archives Initiative: Building a low-barrier interoperability framework. URL: https://www.openarchives.org/documents/jcdl2001-oai.pdf
[6] Open Archives Initiative Organization. URL: https://www.openarchives.org/organization/
[7] S. Parinov, Mezhdunarodnaya professional'naya associaciya razrabotchikov nauchnyh informacionnyh sistem euroCRIS i ee glavnyj produkt CERIF. URL: http://ceur-ws.org/Vol-12976-9_paper-2.pdf
[8] S. N. Troshkov, Ob opyte migracii prilozhenij na svobodno rasprostranyaemoe programmnoe obespechenie s otkrytym kodom. Vestnik NGU. Seriya: Informacionnye tekhnologii. 16 (2) (2018) 86–94.
[9] A. Marchuk, P. Marchuk, Platforma realizacii elektronnyh arhivov dannyh i dokumentov. V sb. materialov XIV Vseross. nauchnoj konferencii RSDL-2012. 2012. pp. 332–338.
[10] A.P. Ershov Memorial Library. URL: http://lib.iis.nsk.su/
[11] Electronic archive of Academician A.P. Ershov. URL: http://ershov.iis.nsk.su/en
[12] Photoarchive of the SB RAS. URL: http://soran1957.ru
[13] P. Hitzler, A. Gangemi, K. Janowicz et al., Engineering with Ontology Design Patterns: Foundations and Applications, IOS Press, 2016.
[14] O. Ataeva, V. Serebryakov and N. Tuchkova, Author's Identification within the Subject Area in the Semantic Library, Proc. of the 22nd Conf. on Scientific Services&Internet (SSI-20) (2020) 12–22.
[15] Open Archive of the SB RAS. URL: http://odasib.ru
[16] A. G. Marchuk, P. A. Marchuk, Bazovaya ontologiya nespecificheskih sushchnostej BONE i eyo ispol'zovanie dlya postroeniya informacionnyh sistem. Vestnik SibGUTI (4) (2014) 118–128.