Proceedings of the 5th International Workshop on Semantic Digital Archives (SDA 2015)

World Views - A Digital Archive Infrastructure for the Georg Eckert Institute for International Textbook Research

Lena-Luise Stahn1, Steffen Hennicke1, Ernesto William De Luca1
1 Georg Eckert Institute - Leibniz Institute for International Textbook Research
{stahn,hennicke,deluca}@leibniz-gei.de

Abstract

This paper outlines the aims of the newly established project World Views. It mainly describes work still to be done, as the project has only recently started and is still in its initial consolidation phase. It presents an overview of the planned infrastructure, which will serve as a digital archive for the field of textbook research. To this end, a data middleware is to be implemented that integrates and standardizes the various GEI data existing in diverse forms and embeds it in a broader semantic context, thus enabling “World Views”.

1 Introduction

Textbook research constitutes a rather diverse area of interest and research. By bringing together professionals engaged in textbook research and their manifold knowledge and expertise, the Georg Eckert Institute1 (GEI) is the central research institution in this field. This role gives rise to various new research projects which aim at promoting and using the textbook as a medium for research in the historical sciences. These projects generate large amounts of data which are characterized above all by their heterogeneity and which are, furthermore, bound to specific infrastructures tailored to the requirements of the different kinds of data. It has become apparent that data curation has not been thoroughly considered in the projects’ workflows, and standardization and archiving strategies have mostly been neglected. To this day this has resulted in collections containing valuable and important data which, however, exist in parallel in separate environments, mostly without interfaces or linking possibilities.
This runs counter to the “Good Scientific Practice” [8] postulated by the German Research Foundation (DFG): the projects were funded with public money, and their data and results constitute valuable scientific knowledge worthy of long-term curation and preservation so as to allow sustainable usage. Often the existing data is not known even within the GEI, which results in duplicated work. Even when data has been “discovered” by other researchers, its reuse is often difficult or even impossible because of legacy or out-of-date data formats. Additionally, this situation has a negative effect on information retrieval and reuse by external parties, since the GEI data is neither interlinked nor enriched with information from external sources. Despite the rich diversity and variety of the available research data, the GEI infrastructure lacks semantic contextualization.

The new project World Views, nationally funded (BMBF) and started in February 2015, is an effort to address the aforementioned issues. The absence of joint data storage is considered the main cause of this situation. This led to the decision to focus on establishing a suitable infrastructure which makes it possible to integrate the data of each existing project and which additionally serves as a standardizing basis for future project environments, eventually also leading to the implementation of a long-term curation strategy. One joint search index, working on all project data simultaneously, is intended to improve information retrieval. Another question concerns the semantic enrichment of the (meta-)data, which is already a common method in other research environments [10, 11, 12].

1 http://www.gei.de/home.html

1.1 Related work

Guideline papers about research infrastructures and research data reuse in the Humanities were found in [11] and in several DINI papers, e.g., [4].
A more international view was found in [6]. In Germany, mostly CLARIN-D [1] and DARIAH-DE [10, 11, 12] are used to build large information infrastructures, which is why these systems have to be considered during the evaluation and decision process. In case the World Views infrastructure requires a generic framework apart from these facilities, [5] and [3] will provide a basis for evaluating the repository software. The adaptation and use of the DTABf is shown in [7], revealing some of the system’s main advantages.

The remainder of this paper is organized as follows: Part 2 provides an overview of the status quo at the GEI and its various data collections, giving two examples of GEI data (2.1 and 2.2) in order to provide a more precise picture of how diversely the data is handled. A summary of the drawbacks resulting from this infrastructure closes that section. Part 3 discusses the aims of World Views and concludes the paper with a summary of the steps which have been taken so far.

2 GEI data

An overview of the GEI systems and their technical specifics illustrates the big gap between current research and information infrastructure guidelines [13, p. 11] and the actual situation.

Edumeres2, the information and communication portal for international educational media research, provides, amongst other things, access to the GEI’s working papers, with manually edited metadata and the papers held as PDF; edu.data holds information on textbook systems worldwide; edu.experts is a database of textbook research professionals. The structure is implemented in Typo3, where every module has its own search and browse UI, partly with its own website as well (edu.data with a Typo3 backend, edu.experts planned as a Semantic MediaWiki).

The Curricula Workstation provides central access to German and international curricula and also aims to create an archive of curricula.
As they are mostly printed, the curricula need to be scanned and stored in a DSpace repository, whereas the metadata is manually exported from the library OPAC. Parallel to the OPAC, the VuFind-based TextBookCat provides a search entry point for the textbook collection, with additional facets and its own Solr index.

Infrastructures and web presentations resulting from scientific projects3 also form a big data pool, e.g. “Nuances”, providing teaching materials in various multimedia-based forms, or “Children and their world”, which experiments with topic modeling and again has its own Solr index. As such project proposals require a suitable web-based presentation, a new one is implemented every time, accompanied by its own system, which seldom corresponds to the existing infrastructures; this behavior is not expected to change in the future. When a project's funding ends, these systems can no longer be maintained appropriately and become the institute's legacy data, e.g. “DeuFraMat”.

2 http://www.edumeres.net/nc/en/information/home.html
3 http://www.gei.de/en/projects/current-projects.html

2.1 Data example: GEI-Digital

The GEI hosts one of the biggest research libraries in the field of textbook research. One of the goals is to make this library fit for the future by giving it a so-called hybrid profile through digitizing its content. For this purpose, the project GEI-Digital has been initiated, in which the historic German-language holdings are being converted into a machine-readable format. An adequate research corpus has been created which can be used in diverse research areas. The presentation platform provides, for every book page, a digital image (generated by external providers) and an automatically generated full-text file (via OCR); the books are mostly printed in Gothic type.
This provides a basis for Digital Humanities tools: a couple of projects, e.g. “Children and their world”, have started to apply methods such as topic modeling to this corpus. In June 2015 the database contained ca. 3,500 digitized and indexed textbooks covering a time span from 1648 to 1918 (ca. 900,000 digitized single documents). The metadata formats used are METS for structural data and MODS for bibliographic descriptions, accessible through the GEI’s OAI-PMH interface. Data integration takes place, amongst others, in Europeana4 and the Deutsche Digitale Bibliothek5 (DDB).

The corpus can be searched on metadata level as well as on full-text level. A faceted search is also provided; it uses the division into collections, which is based on metadata created especially for GEI-Digital describing the type (atlases or storybooks) or subject (geography, history) of the textbook, plus the time frame in which it was used. Browsing options include the common bibliographic data. For the visualization of the digitized images the intranda viewer6 is used. For each digitized document, additional (meta-)data such as ToC, thumbnail gallery, bibliographic data (partly also in English) and full text are provided and can be downloaded as METS/XML, MARCXML and DC via the OAI interface, and as Europeana Semantic Elements (ESE), OPAC/PICA, and PDF. Figure 1 shows an example screenshot.

The open source software Goobi7 is used for digitization; it provides an adequate environment for workflow handling and metadata editing. The metadata profile is generic; the Goobi interface has been customized and is filled in manually, except for the bibliographic data, which is harvested through the OPAC interface, thus also using the Gemeinsame Normdatei8 (GND) data and the Handle9 service provided by the GBV Common Library Network10. The same applies to the Solr index, which is built at the GBV and adapted for GEI use.
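To illustrate how a harvesting client might talk to an OAI-PMH interface such as the one mentioned above, the following Python sketch builds a `ListRecords` request and extracts a few Dublin Core fields from a response. It is a minimal sketch under stated assumptions: the base URL is a placeholder (the paper does not give the endpoint address), the function names are hypothetical, and the sample response is invented for illustration and is not GEI data. The `oai_dc` metadata prefix is the one every OAI-PMH repository is required to support; any other prefix, such as one for METS, depends on the repository's configuration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Placeholder endpoint (assumption): the paper does not state the real address.
BASE_URL = "https://example.org/oai"

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(metadata_prefix="oai_dc", set_spec=None):
    """Build the URL of an OAI-PMH ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return BASE_URL + "?" + urlencode(params)


def parse_dc_records(xml_text):
    """Return identifier, titles and dates for each record in an oai_dc response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [el.text for el in rec.iterfind(".//dc:title", NS)],
            "dates": [el.text for el in rec.iterfind(".//dc:date", NS)],
        })
    return records


# Invented sample response, for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:record-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example textbook title</dc:title>
          <dc:date>1900</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

if __name__ == "__main__":
    print(list_records_url())
    print(parse_dc_records(SAMPLE_RESPONSE))
```

A real harvester would additionally fetch the URL, follow `resumptionToken` elements for paged result lists, and handle OAI-PMH error responses; these steps are omitted here to keep the sketch short.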
For backup the GEI cooperates with the Gauß-IT-Zentrum at the TU Braunschweig, where it holds server and storage capacities. This contract also ensures the long-term preservation of the GEI-Digital data (i.e. digitized images, derivatives, metadata).

Fig. 1. Bibliographic data view in Intranda Viewer

4 http://www.europeana.eu/portal/
5 https://www.deutsche-digitale-bibliothek.de/
6 https://www.intranda.com/digiverso/intranda-viewer/intranda-viewer-overview/
7 http://www.goobi.org/en/
8 http://www.dnb.de/DE/Standardisierung/GND/gnd_node.html
9 http://www.handle.net/
10 https://www.gbv.de/?set_language=en

2.2 Data example: project EurViews11

Through a comprehensive selection, a collection of texts, maps and images from 20th and 21st century textbooks is being established with the intention of presenting which notions of Europe and Europeans are conveyed through national textbooks. Historical and contemporary textbook sources from all European and many non-European countries are being incorporated and furnished with commentaries and contextualizing information such as histories of education, both of which are written by external researchers. Translations into German, English and sometimes also French or Spanish are provided by the EurViews members.

11 http://www.eurviews.eu/nc/start.html

Although the workflow is similar, EurViews uses Typo3 as its digitization and metadata editing backend, which is completely different from the one used in GEI-Digital. The in-house DigiSource extension supports handling the digitized images and stores the (meta-)data in a MySQL database, even though they are mostly already stored in GEI-Digital. Backup is handled through the TU Braunschweig; however, no Solr index is used.

The textbook sources can be searched on metadata level and on full-text level. Additionally the sources are indexed with predefined search terms for time spans, categories, and keywords.
A short summary completes each item’s description. The collection is also accessible through a faceted search which clusters time periods, countries, and source types (i.e. the structure type of the document). Figure 2 shows a screenshot of the Typo3 backend.

Fig. 2. EurViews metadata editing backend in Typo3

Both projects show the main problem to be handled: data is stored separately in each project, partly relying on outdated software. The EurViews backend in particular now turns out to be outdated, resulting in unexpectedly high maintenance effort. The separate search indices prevent comprehensive information retrieval. Instead of helping the information-seeking textbook researcher, the separate search entry points constitute an obstacle which most people are not willing or able to overcome. Therefore, usability improvements as well as better information retrieval facilities are needed.

Furthermore, there is no connection to other GEI data: the sources' full texts and annotations as well as the commentaries could serve as a knowledge base, easily enriched using other GEI data. However, the lack of appropriate interfaces prevents their use by other GEI projects, and the custom-built Typo3 extension likewise impedes data reuse. The metadata schema is generic and filled in manually, despite the possibility of using bibliographic data from the library OPAC. Comparable paths into the context of LOD/Semantic Web are absent as well: where GEI-Digital barely uses the controlled vocabularies provided by the GND, the EurViews data lack semantic contextualization completely.

3 Aims of World Views and first steps

The aforementioned projects are just two examples of the disconnected character of the collections, in which data, in spite of its frequent use, is reproduced and stored several times over.
Additionally, there is the problem of format variety: the data is difficult to maintain, curate and preserve, and almost every new project uses a new data format. This leads to the emerging problem of data inaccessibility and thereby data loss.

The aim of World Views is therefore to consolidate the various data sources. Based on a three-tier architecture model, the project focuses on implementing a central middleware which will serve as the logic tier, where metadata integration and standardization will be carried out. The distributed and separated GEI data in its various formats (which forms the data tier of the architecture) will be drawn together, migrated into more standardized formats using metadata crosswalks, and semantically enriched using internal and external links. Eventually, the construction of one joint search index (possibly based on Lucene/Solr) is planned. It will provide a comprehensive search through the main retrieval platform Edumeres, in addition to the presentation on each project’s own platform; together these form the infrastructure’s presentation tier.

Since the project has only just started, no technical decisions have been made so far. To make informed decisions, comprehensive knowledge about the GEI's infrastructure and projects had to be gained first. The project started with an evaluation of the main technologies currently available (open source being a requirement). Here mainly Fedora, as the leading solution, has been tested, but DSpace and several other software solutions, e.g. infrastructures in the context of CLARIN-D and DARIAH-DE, are also taken into account. The evaluation process comprises creating a catalogue of requirements, testing the software on a virtual machine against both the requirements and the data to be used, and comprehensive documentation. A decision is planned for the coming months.
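The crosswalk and joint-index steps described for the logic tier could look roughly like the following Python sketch. It is only an illustration under stated assumptions, since the project has made no technical decisions yet: it assumes MODS input and a simplified Dublin Core target, the mapping table covers just a handful of common fields, and the function names, the `dc_*` and `source` field names and the sample record are all hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Simplified crosswalk table (assumption): MODS path -> Dublin Core element.
CROSSWALK = {
    "mods:titleInfo/mods:title": "title",
    "mods:name/mods:namePart": "creator",
    "mods:originInfo/mods:publisher": "publisher",
    "mods:originInfo/mods:dateIssued": "date",
    "mods:subject/mods:topic": "subject",
}


def mods_to_dc(mods_xml):
    """Map a MODS record to a dict of Dublin Core fields (each a list of values)."""
    root = ET.fromstring(mods_xml)
    dc = {}
    for path, dc_field in CROSSWALK.items():
        values = [el.text.strip() for el in root.iterfind(path, MODS_NS)
                  if el.text and el.text.strip()]
        if values:
            dc.setdefault(dc_field, []).extend(values)
    return dc


def to_index_document(dc, source, record_id):
    """Wrap a DC record in a flat document for a joint search index.

    The field names are hypothetical; a real Solr schema would define its own.
    """
    doc = {"id": f"{source}:{record_id}", "source": source}
    doc.update({f"dc_{field}": values for field, values in dc.items()})
    return doc


# Invented sample record, for illustration only.
SAMPLE_MODS = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example textbook title</title></titleInfo>
  <name><namePart>Example Author</namePart></name>
  <originInfo>
    <publisher>Example Publisher</publisher>
    <dateIssued>1900</dateIssued>
  </originInfo>
  <subject><topic>geography</topic></subject>
</mods>
""".strip()

if __name__ == "__main__":
    dc_record = mods_to_dc(SAMPLE_MODS)
    print(to_index_document(dc_record, "gei-digital", "record-1"))
```

Keeping the mapping in a plain table separates the crosswalk rules from the code that applies them, so a further source format could be added by supplying another table. Prefixing the identifier with the source system is one simple way to keep records from different projects distinguishable inside a single index.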
The choice of bibliographic metadata format does not seem to be the problem (the focus is on main formats like DC as well as METS/MODS, which are already in use), and crosswalks can be easily implemented; the main work lies in deciding on the extent of annotation. Here the infrastructure’s intended character of persistence and sustainable usage has to be taken into consideration, as it shall also provide interfaces for future projects whose focus and functionality cannot be determined yet. For this part of the project, cooperation with professionals in the textbook research field is essential, since they form the user community. Their requirements and possible future ways of use will be surveyed through workshops and evaluations of other Digital Humanities projects. World Views is thereby intended to function as a platform which prepares the data for further use in the DH context.

TEI [2, 3], as the most promising and prevalent format, is the main focus of analysis. It was mainly chosen for its applicability to annotated texts produced in the humanities; its large community is also considered a benefit. Questions such as which annotation functionalities exist and how to exploit them most adequately for the resource textbook have to be answered. They form a main part of the project’s scientific work, since no metadata formats focusing specifically on textbooks seem to be publicly available. Standardization also plays an obligatory part in the repository certification process, for example in achieving the DINI certificate [9], as is planned for World Views in the long run.

The GEI data still resides in a mostly isolated position, which stands in the way of representing the different “views” that the project title refers to. Therefore a vital point of World Views is data contextualization.
Enriching the data also embeds it in a semantic context, which comprises interlinking the GEI projects further with controlled vocabularies, up to Semantic Web applications, thus providing comprehensive information retrieval possibilities as well as enriched corpora adequate for future DH research questions.

Fig. 3. World Views schematic representation

References

[1] Kommission Selbstkontrolle in der Wissenschaft Deutsche Forschungsgemeinschaft: Vorschläge zur Sicherung guter wissenschaftlicher Praxis: Empfehlungen der Kommission "Selbstkontrolle in der Wissenschaft"; Denkschrift. Wiley-VCH (1998)
[2] DTA TEI Basisformat (2007-2015), http://www.deutschestextarchiv.de/doku/basisformat_en
[3] Dobratz, S.: Open-Source-Software zur Realisierung von Institutionellen Repositories – Überblick. Humboldt-Universität zu Berlin, Zentraleinrichtung Universitätsbibliothek, Berlin (2007)
[4] Deutsche Initiative für Netzwerkinformation e.V.: Positionspapier Forschungsdaten. Arbeitsgruppe „Elektronisches Publizieren“ (2009), http://edoc.hu-berlin.de/series/dini-schriften/2009-10/PDF/10.pdf
[5] Bagdanov, A., Katz, S., Nicolai, C., Subirats, I.: Fedora Commons 3.0 versus DSpace 1.5: Selecting an enterprise-grade repository system for FAO of the United Nations (2009)
[6] Battino Viterbo, P., Gourley, D.: Digital humanities and digital repositories: sustainable technology for sustainable communications. In: Proceedings of the 28th ACM International Conference on Design of Communication (SIGDOC '10), pp. 109-114. ACM, New York, NY, USA (2010), doi:10.1145/1878450.1878469
[7] Haaf, S., Geyken, A., Wiegand, F.: The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources (2012), doi:10.4000/jtei.1114
[8] Wissenschaftsrat: Empfehlungen zur Weiterentwicklung der wissenschaftlichen Informationsinfrastrukturen in Deutschland bis 2020, Berlin (2012), http://www.wissenschaftsrat.de/download/archiv/2359-12.pdf
[9] DINI: DINI-Zertifikat für Open-Access-Repositorien und -Publikationsdienste 2013 (2014), http://edoc.hu-berlin.de/series/dini-schriften/2013-3/PDF/3.pdf
[10] Beer, N., Herold, K., Kolbmann, W., Kollatz, Th., Romanello, M., Rose, S., Walkowski, N.-O.: Interdisciplinary Interoperability. DARIAH-DE Working Papers Nr. 3. DARIAH-DE, Göttingen (2014), urn:nbn:de:gbv:7-dariah-2014-1-0
[11] Fiedler, N., Werthmann, A., Stührenberg, M., Schonefeld, O., Bingel, J., Witt, A.: Forschungsinfrastrukturen in außeruniversitären Forschungseinrichtungen: Forschungsbericht (2014), http://dok.ids-mannheim.de/xmlui/bitstream/handle/10932/00-0230-5FEB-262D-CA01-8/Forschungsinfrastrukturen.pdf?sequence=4
[12] Puhl, J., Andorfer, P., Höckendorff, M., Schmunk, St., Stiller, J., Thoden, K.: Diskussion und Definition eines Research Data LifeCycle für die digitalen Geisteswissenschaften. DARIAH-DE Working Papers Nr. 11. DARIAH-DE, Göttingen (2015), urn:nbn:de:gbv:7-dariah-2015-4-4
[13] Research Infrastructures in the Leibniz Association (2015), http://www.leibniz-gemeinschaft.de/fileadmin/user_upload/downloads/Presse/Publikationen/Leibniz_Infrastrukturen_2-2015_web.pdf