Towards an Ontology for Describing Archival Resources Laura Pandolfo1 , Luca Pulina1 , and Marek Zieliński2 1 POLCOMING, Università di Sassari, Viale Mancini n. 5 – 07100 Sassari – Italy laura.pandolfo@uniss.it, lpulina@uniss.it 2 Pilsudski Institute of America, 138 Greenpoint Avenue, Brooklyn, NY 11222 – USA MZielinski@pilsudski.org Abstract. Several digital libraries and archives are emerging around the world due to the need to store, organize and make available on the Web a lot of resource collections. However, managing this information poses new challenges in order to overcome traditional data management and information browsing. Semantic Web technologies can improve dig- ital libraries and archives by facilitating metadata storage and adding semantic capabilities, which increase the quality of the information re- trieval process. In this paper we present arkivo, an ontology designed to model the archival description of historical document collections. 1 Introduction The Web changed the way people can search and discover information provid- ing them the opportunity to have direct access to millions of documents easily. Online repositories, such as digital libraries, support users’ exploration of large document collections and, as in the case of digital historical archives, also facil- itate access to original and rare documents. Recently, digital archives are facing new challenges in order to overcome traditional data management and informa- tion browsing. The Semantic Web (SW) [2] technologies provide ways to address these challenges by offering valuable solutions to represent, organize, and retrieve such kind of data. In particular, ontologies play a key role providing a common shared vocabulary that can be used to describe domains, annotate documents and promote interoperability and consistency between different sources [9, 10]. In the context of digital libraries and archives, some of the most used meta- data and ontologies include Dublin Core Metadata Initiative (DCMI) [17], CIDOC Conceptual Reference Model (CRM) [4], MAchine-Readable Cataloging (MARC), Metadata Object Description Schema (MODS) [8], and Encoded Archival De- scription (EAD) [14]. However, none of these can exhaustively support both the representation of the archival arrangement structure and the annotation of his- torical data embedded within the documents – the importance of which has been highlighted in, e.g., [1]. To address these needs, in this paper we introduce arkivo1 , an ontology de- signed to accommodate the description of historical archival documents, support- 1 Arkivo is the translation of “Archive” in Esperanto. 112 A. Adamou, E. Daga and L. Isaksen (eds.) ing archive workers by encompassing both the hierarchical structure of archival collections and rich metadata created during archive digitization, such as his- torical elements. The aim of arkivo is not only to provide a reference schema for publishing Linked Data [3] about historical archival documents, but also to describe the historical elements contained in these documents, e.g., giving the opportunity to represent useful relationships between people, places, and events. In this paper, we also describe the usage of arkivo in the context of the his- torical archive stored by the Józef Pilsudski Institute of America, which houses a rich collection of historical sources covering the period from the 1863 to the present day. The paper is organized as follows. In Section 2, we briefly the arkivo ontol- ogy and its design process, while in Section 3 we show the usage of arkivo in the context of the digitized collections of the Józef Pilsudski Institute of America. We conclude the paper in Section 4 with some final remarks and future work. 2 The ARKIVO Ontology The ontology development process can be characterized by different strategies and methodologies – see, e.g., [16, 7]. arkivo has been developed according to a top-down strategy, which consists first in identifying the most abstract concepts of the domain and then in specializing the specific concepts. In the following, we report the main phases of the development process of arkivo, which have been carried out with the support of the experts. Requirements Specification and Knowledge Acquisition. In this phase, we consid- ered different scenarios, use-cases and end-users, focusing on the archival man- agement practices and the most common methods used by archives for stor- ing and cataloging materials. Moreover, we analyzed the best practices used by archive workers in the metadata collection process. This phase allowed us to detect the main concepts useful to represent the domain of interest. Conceptualization and Formalization. In the light of the knowledge gained in the previous phase, we have drawn up a glossary of terms that identify the proper terminology used in the archival domain. The aim of the conceptualization result- ing from this activity was intended to structure the domain knowledge, in terms of concepts and relations, in order to meet the pre-established requirements. In particular, we compute a taxonomy for describing the archival arrangement lev- els, from the concept of collection, which can contain items or other collections as fonds, to the concept of single item, which typically is the smallest indivisible unit. Integration. Some of the concepts resulting from the conceptualization phase can be represented by reusing existing standard metadata and vocabularies. For this purpose, we integrated arkivo with the several core ontologies and vocab- ularies. In details, DCMI, FOAF2 , and schema.org [13] were used to model some 2 http://www.foaf-project.org/ 2nd Workshop on Humanities in the Semantic Web (WHiSe 2017) 113 general information related to documents, organizations, places and persons. We also referred to BIBO3 ontology in order to have a detailed classification of doc- uments. In order to link a place name to its current geographical location, we used Geonames4 . Finally, we integrated LODE [15] ontology to model events and their properties. Implementation. arkivo has been developed using the OWL2 language [6] with the protégé [5] editor. The ontology is composed of 43 classes, 24 object prop- erties, and 34 data properties. In the following, we pinpoint some of the main classes and properties of arkivo ontology. Notice that we include the core ontologies prefixes, namely dc for Dublin Core, foaf for FOAF, schema for schema.org, and bibo for BIBO ontology. Finally, the empty prefix is used for original classes and properties of arkivo. bibo:Collection is the class that represents set of documents or collections. This class has several sub-classes, including :File and :Fonds. The former is the class devoted to describe a file, namely an organized unit of items grouped together, while the latter relates the whole of the records organically created and/or accumulated by a particular person, family, or corporate body in the course of that creator’s activities and functions. :Date is the class containing dates mentioned in an item. foaf:Organization is used to describe an organization related to bibliographic items or to events. foaf:Person represents people related to a bibliographic item or to a specific event. :Item represents the archival item, in other words the smallest intellectually indivisible archival unit. This class contains several sub-classes, such as bibo:Article, :Document and bibo:Letter. dct:creator is the relationship that shows who has created a specific item, connecting individuals in :Item class to individuals in foaf:Agent class. dc:created indicates the date when it was created an individual of the class Item. schema:isPartOf indicates that an individual in the class :Item is part of a col- lection, by linking that individual to another in the class bibo:Collection. schema:mentions is useful to indicate that an instance of foaf:Person and/or schema:Place is mentioned in an individual of the class bibo:Collection. :isSectionOf connects instances of :File to instances of :Fonds. :repository connects instances of foaf:Organization class to instances of bibo:Collection class, in order to describe that an organization can be a repository of collections or items. arkivo ontology is licensed under a Creative Commons Attribution 3.0 Un- ported License and it can be downloaded at http://purl.org/arkivo. For more details about the full list of classes and properties see also the documentation at https://github.com/ArkivoTeam/ARKIVO. 3 http://bibliontology.com/ 4 http://www.geonames.org/ontology 114 A. Adamou, E. Daga and L. Isaksen (eds.) 3 Case Study: the Józef Pilsudski Digital Archival Collections The Józef Pilsudski Institute of America5 was established in 1943 in New York City for the purpose of continuing the work of the Institute for Research of Modern History of Poland established in Warsaw in 1923. The Polish State was re-established in 1918 in the aftermath of the Great War and after several regional wars and uprisings, the borders were settled in 1922. Soon after a group of historians and officers begun to travel around the country to collect archival documentation. At the beginning of World War II, part of the archives were evacuated and landed in Washington, eventually creating the seed of the Institute archival collections, which grew in time by donations from politicians, officers and organizations of prewar Poland and Polish diaspora. Today, the Institute has some 240 linear meters, namely 2 million pages, of archives covering mostly the Polish, European and American history of late 19th and 20th century. The collection includes documents, photographs, films, posters, periodicals, books, personal memoirs of diplomats, and political and military leaders, as well as collection of paintings by Polish and European masters. For the last nine years, the archival collections are being digitized, and gradually put online. The main objective of the historical research is to understand the past through the study of historical sources, such as documents stored in archives. In this context, researches are mainly interested in detecting facts (e.g., peo- ple, places, events) cited in the documents in order to analyze them, discover relationships and draw inferences. arkivo ontology, unlike, e.g., EAD, provides elements to represent both the hierarchical structure of archival documents and the historical data expressed in them. As an example, in the following we report the description (in Turtle language) of one of the document stored in the Józef Pilsudski Institute archive, namely the “Letter to comrades in London”. Such document has been wrote by Pilsudski in 1898, and it contains a mention of different people and places, as depicted in Figure 1. :LetterToComradesInLondon a bibo:Letter . :A701.001.012 a :File . :A701.001 a :Fonds . :PilsudskiInstitute a foaf:Organization . :PilsudskiJosef a foaf:Person . :JedrzejowskiBoleslaw a foaf:Person . :MalinowskiAleksander a foaf:Person . :Sachalin a schema:Place . :Bialystok a schema:Place . :LetterToComradesInLondon schema:isPartOf :A701.001.012 . :A701.001.012 schema:isSectionOf :A701.001 . :A701.001 :repository :PilsudskiInstitute . :LetterToComradesInLondon dc:creator :PilsudskiJosef . 5 http://www.pilsudski.org/ 2nd Workshop on Humanities in the Semantic Web (WHiSe 2017) 115 Fig. 1. A graphical example of entities and relationships in Pilsudski digitized collec- tions using arkivo. :LetterToComradesInLondon schema:mentions :JedrzejowskiBoleslaw . :LetterToComradesInLondon schema:mentions :MalinowskiAleksander . :LetterToComradesInLondon schema:mentions :Sachalin . :LetterToComradesInLondon schema:mentions :Bialystok . Finally we report that, actually, in the version of arkivo used for the Józef Pilsudski archival collections are stored about 270,000 triples, and it is populated by more than 130,000 individuals. In detail, there are 13,326 individuals related to items, 15,678 titles, 6,458 authors, 29,280 persons mentioned, 47,185 places, and 28,039 dates. 4 Conclusion and Future Work In this paper we briefly presented arkivo, an ontology designed to model the archival description of historical document collections. In the paper we also show the current usage of arkivo in the context of the historical archive of the Józef Pilsudski Institute of America. Currently, we are working on the realization of an ontology-based digital archive. Future work will include the implementation of automated and adaptive on- tology population processes exploiting the techniques presented in [12, 11], as 116 A. Adamou, E. Daga and L. Isaksen (eds.) well as the investigation of user interfaces aimed at providing the user with a rich interface to explore interesting relationships that arise from encountering a single item or file in the archive. It should help the users to find the unexpected and hidden knowledge accumulated both in the archive and in the Web. References 1. Giovanni Adorni, Marco Maratea, Laura Pandolfo, and Luca Pulina. An ontology for historical research documents. In International Conference on Web Reasoning and Rule Systems, pages 11–18. Springer, 2015. 2. Tim Berners-Lee, James Hendler, Ora Lassila, et al. The semantic web. Scientific american, 284(5):28–37, 2001. 3. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data – the story so far. Semantic services, interoperability and web applications: emerging concepts, pages 205–227, 2009. 4. Martin Doerr. The cidoc conceptual reference module: an ontological approach to semantic interoperability of metadata. AI magazine, 24(3):75, 2003. 5. John H Gennari, Mark A Musen, Ray W Fergerson, William E Grosso, Monica Crubézy, Henrik Eriksson, Natalya F Noy, and Samson W Tu. The evolution of protégé: an environment for knowledge-based systems development. International Journal of Human-computer studies, 58(1):89–123, 2003. 6. Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter Patel- Schneider, and Ulrike Sattler. Owl 2: The next step for owl. Web Semantics: Science, Services and Agents on the World Wide Web, 6(4):309–322, 2008. 7. Stephan Grimm, Andreas Abecker, Johanna Völker, and Rudi Studer. Ontologies and the semantic web. In Handbook of Semantic Web Technologies, pages 507–579. Springer, 2011. 8. Rebecca S Guenther. Mods: the metadata object description schema. Portal: libraries and the academy, 3(1):137–150, 2003. 9. Sebastian Kruk, Bernhard Haslhofer, P Piotr, Adam Westerski, and Tomasz Woroniecki. The role of ontologies in semantic digital libraries. In European Net- worked Knowledge Organization Systems (NKOS) Workshop, 2006. 10. Sebastian Ryszard Kruk and Bill McDaniel. Semantic digital libraries. Springer, 2009. 11. Laura Pandolfo and Luca Pulina. Adnoto: A self-adaptive system for automatic ontology-based annotation of unstructured documents. In To appear in Proc. of the 30th International Conference on Industrial, Engineering, Other Applications of Applied Intelligent Systems. Springer, 2017. 12. Laura Pandolfo, Luca Pulina, and Giovanni Adorni. A framework for automatic population of ontology-based digital libraries. In AI* IA 2016 Advances in Artificial Intelligence, pages 406–417. Springer, 2016. 13. Peter F Patel-Schneider. Analyzing schema.org. In International Semantic Web Conference, pages 261–276. Springer, 2014. 14. Daniel V Pitti. Encoded archival description: An introduction and overview. 1999. 15. Ryan Shaw, Raphaël Troncy, and Lynda Hardman. Lode: Linking open descrip- tions of events. In Asian Semantic Web Conference, pages 153–167. Springer, 2009. 16. Mike Uschold and Michael Gruninger. Ontologies: Principles, methods and appli- cations. The knowledge engineering review, 11(02):93–136, 1996. 17. Stuart L Weibel and Traugott Koch. The dublin core metadata initiative. D-lib magazine, 6(12):1082–9873, 2000.