Multimodal Datasets of the Berlin State Library
David Zellhöfer
Berlin State Library/Staatsbibliothek zu Berlin, Germany
https://staatsbibliothek-berlin.de/
david.zellhoefer@sbb.spk-berlin.de

        Abstract
To facilitate the handling of digital library content and its accompanying metadata, four multimodal
and multilingual datasets are presented that rely on the publicly available information systems
of the Berlin State Library. They range from pre-processed extracts of the library’s full main
catalog with ca. 9.8 million records, through various network graphs modeling, e.g., relations between
authors and languages, to more than half a million illustrations extracted by the day-to-day
OCR process from ca. 22,000 historical media units such as historical books.

2012 ACM Subject Classification Applied computing → Digital libraries and archives

Keywords and phrases Digital Library, Bibliographic Metadata, Graph, Digitized Media, Dataset

Supplement Material https://github.com/elektrobohemian/StabiHacks

Funding Funded by the BMBF as part of the ‘QURATOR - Curation Technologies’ project.


    1    Introduction
Most large research libraries such as the Berlin State Library handle the core challenge
of the digital transformation – the presentation of digitized content and retro-converted
digitized catalog records – as a day-to-day routine, even at large scale. Media are digitized on
a daily basis, enriched with structural information, indexed, and processed with OCR engines,
to be presented in web-based digitized collections1 or to be distributed via typical metadata
interchange interfaces such as OAI-PMH2 .
    However, new tasks for research libraries are emerging, e.g., the digital curation of the
owned collections and the provision of data for various research tasks from the wide field
of digital humanities (DH). As these use cases fall outside the traditional bibliographic use
case, i.e., the indexing and retrieval of different media, traditional bibliographic records do
not satisfy the requirements of both DH researchers and digital curators. On
the one hand, these records are often missing vital information such as named entities or
other information that can be used to enable explorative information seeking strategies. On
the other hand, these records contain very detailed information that is necessary for the
bibliographic use case while being overly complex and cryptic for DH researchers. Furthermore,
proprietary character encodings or system-specific annotations place an additional
burden on the use of the data outside the scope of common library tasks. Listing 1
illustrates this with an excerpt in Pica+, the internal format of the library management
system.
    Because of the sheer amount of data available in large libraries, a manual conversion or
augmentation of these records to fit the aforementioned needs would be very cost-intensive
and hardly possible if it had to be carried out by library staff. Thus, a machine-assisted
approach to transform traditional metadata records into datasets usable by digital curators
or DH researchers is needed to cope with this problem. A recent proof of concept [2] shows


1
    https://digital.staatsbibliothek-berlin.de/
2
    https://www.openarchives.org/OAI/openarchivesprotocol.html
            © David Zellhöfer;
            licensed under Creative Commons License CC-BY
LDK 2019 - Posters Track.
Editors: Thierry Declerck and John P. McCrae


               Listing 1 Excerpt of a bibliographic record in Pica+ format
           011 @  a1812
           011 B  a2004 - b2007
           019 @  aXD - US
           021 A  aAn @oration pronounced at Dedham on the anniversary of
                  American independence , July 4 , 1812 hby Jabez Chickering
           028 A dJabez aChickering h1753 -1812
           033 A pBoston nPrinted by Joshua Belcher
           101 @ a11
           201 B /01 014 -03 -17 t23 :01:04.000


       the feasibility of such an approach relying on methods from machine learning, data
       analysis, and traditional data management and batch processing.
          The following section presents the core characteristics of four multimodal and multilingual
       datasets based on publicly available catalog and other metadata of the Berlin State Library
       that have been transformed by the tools presented in [2]. For the sake of reproducibility and
       transparency, all scripts are made available3 under a permissive license.


           2       Characteristics of the Datasets
       All of the presented datasets are inter-linkable with the help of the so-called PPN (Pica
       production number). In most cases, the PPN can be seen as a unique identifier for analog or
       digitized media that is used in many systems of the Berlin State Library and the libraries
       of the GBV alliance4 , e.g., the central catalog. The PPN can also be used to download image
       content via the IIIF5 endpoint, or metadata and OCR content via the OAI-PMH interface.
       For some sample scenarios, refer to [7].
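
As a minimal sketch of how a PPN can be turned into an OAI-PMH request, the following Python function builds a GetRecord URL for the OAI-PMH endpoint of the digitized collections. The verb and parameter names are defined by the OAI-PMH protocol. The identifier pattern used below and the placeholder PPN are assumptions made for illustration and should be checked against [7] before use.

```python
from urllib.parse import urlencode

# OAI-PMH endpoint of the digitized collections (as given in this paper).
OAI_ENDPOINT = "https://digital.staatsbibliothek-berlin.de/oai"


def oai_get_record_url(ppn: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a PPN.

    ASSUMPTION: the record identifier follows the pattern
    'oai:digital.staatsbibliothek-berlin.de:PPN<number>'. This pattern
    is not stated in the paper and is used here for illustration only.
    """
    identifier = f"oai:digital.staatsbibliothek-berlin.de:PPN{ppn}"
    query = urlencode(
        {
            "verb": "GetRecord",                 # standard OAI-PMH verb
            "metadataPrefix": metadata_prefix,   # 'oai_dc' is Dublin Core
            "identifier": identifier,
        }
    )
    return f"{OAI_ENDPOINT}?{query}"


# Hypothetical placeholder PPN, not a real record.
example_url = oai_get_record_url("0000000000")
print(example_url)
```

The function only assembles the request URL and performs no network access, so the result can be passed to any HTTP client.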

       2.1         Extract from the Library’s Main Catalog
       This dataset [3] is derived from the Pica+ serialization of the library’s full main catalog
       from 2018, containing 9,850,467 records of analog, digitized, and born-digital material. The
       following fields have been extracted: title, author (incl. optional GND6 ID), publisher,
       place of publication, country of publication, and year of publication. To facilitate further
       processing, the publications are split by language groups (ranging from ancient to modern
       languages).
           The records are stored in a simple tab-separated, field-based text format. Records
       are separated by empty lines, whereas @ serves as a subfield indicator in case a GND ID or
       detailed location information is available. Table 1 presents a sample record, whose complete
       data can be referenced with the help of the given PPN7 . Details on the different Pica+ field
       IDs and their contents are available in [3], together with the creation script. A full list
       of available fields (in German) is also available8 .
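
The record format described above can be read with a few lines of Python. The following is a minimal sketch, assuming that each line carries three tab-separated columns (PPN, Pica+ field ID, content) as shown in Table 1. The function name and the returned structure are hypothetical, and the actual files in [3] may differ in detail.

```python
def parse_records(text: str) -> list[dict]:
    """Parse tab-separated lines into records separated by empty lines.

    ASSUMPTION: every non-empty line has the columns
    PPN <TAB> field ID <TAB> content, as in Table 1.
    """
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current record.
            if current is not None:
                records.append(current)
                current = None
            continue
        ppn, field_id, content = line.split("\t", 2)
        if current is None:
            current = {"ppn": ppn, "fields": {}}
        # '@' separates subfields inside the content column only;
        # field IDs such as '011@' are left untouched.
        current["fields"].setdefault(field_id, []).append(content.split("@"))
    if current is not None:
        records.append(current)
    return records


# Sample data taken from Table 1.
SAMPLE = (
    "0249445468\t011@\t1939\n"
    "0249445468\t019@\tXD-US\n"
    "0249445468\t021A\tThe plays of William Shakespeare in thirty-seven volumes\n"
    "0249445468\t028A\tShakespeare, William@gnd/118613723\n"
    "0249445468\t033A\tThe Limited Editions Club@New York, NY\n"
    "\n"
)

for record in parse_records(SAMPLE):
    author, gnd_id = record["fields"]["028A"][0]
    print(record["ppn"], author, gnd_id)
```

Keeping the subfields as lists preserves optional parts such as the GND ID without assuming that they are always present.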


       3
         https://github.com/elektrobohemian/StabiHacks
       4
         https://www.gbv.de/?set_language=en
       5
         https://iiif.io/
       6
         https://www.dnb.de/EN/Standardisierung/GND/gnd_node.html
       7
         http://stabikat.de/DB=1/SET=1/TTL=1/PRS=PP%7F/PPN?PPN=0249445468
       8
         https://www.gbv.de/bibliotheken/verbundbibliotheken/02Verbund/01Erschliessung/02Richtlinien/01KatRicht/inhalt.shtml


     Table 1 Sample record of PPN 0249445468

       PPN       Pica+ field ID   Content
    0249445468       011@         1939
    0249445468       019@         XD-US
    0249445468       021A         The plays of William Shakespeare in thirty-seven volumes
    0249445468       028A         Shakespeare, William@gnd/118613723
    0249445468       033A         The Limited Editions Club@New York, NY


2.2      Metadata, Title Pages, and Network Graph of the Digitized
         Content of the Berlin State Library
The dataset has been downloaded via the OAI-PMH Dublin Core endpoint of the Berlin
State Library’s Digitized Collections9 and has been converted into common tabular formats
and graph representations in GML. It contains 146,000 records of digitized material older
than 1920 in the format described in Table 2.
    In addition to the bibliographic metadata, representative images of the works have been
downloaded and resized to JPEG thumbnails of at most 512 pixels, preserving the
original aspect ratio. Title pages have been derived from structural metadata created by
scan operators and librarians. If this information was not available, the first page of the
medium has been downloaded instead. For multi-volume media, title pages are not available. As
a consequence, only 141,206 title/first-page images are present. Additionally, geo-spatial
coordinates have been added to each record using the OpenStreetMap web service10 . For
details, refer to [5].
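
To illustrate the thumbnail sizing, the following sketch computes target dimensions that preserve the aspect ratio. It assumes that the 512 pixel limit applies to the longer edge and that smaller images are not enlarged. Both points are assumptions, and the function name is hypothetical.

```python
def thumbnail_size(width: int, height: int, max_edge: int = 512) -> tuple[int, int]:
    """Return (width, height) scaled so that the longer edge is at most max_edge.

    ASSUMPTION: the limit refers to the longer edge and images that are
    already small enough keep their original size.
    """
    scale = min(1.0, max_edge / max(width, height))
    # Round to whole pixels and never return a zero-sized edge.
    return max(1, round(width * scale)), max(1, round(height * scale))


# A portrait scan of 2000 x 3000 pixels is reduced to a height of 512 pixels.
print(thumbnail_size(2000, 3000))
```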

2.3      Title, Author, Publisher, Place of Publication, and
         Language-related Network Graphs of the Library Main Catalog
Three types of graphs (in GraphML, GML, and JSON) are made available in this dataset, linking:

     authors, publishers, and places of publication (author_publisher_location);
     authors, publishers, places of publication, and titles (author_publisher_location_title);
     authors, publishers, and the language of publication (languageLink).

   The language of publication graph (languageLink) spans all of the languages mentioned above and
has 1,555,119 nodes and 1,659,596 edges (see Fig. 1 for an exemplary subgraph). Table 3
summarizes the core properties of each provided graph. For additional details, see [6].
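
The node and edge counts reported for the graphs allow a quick estimate of how densely they are connected. The following sketch computes the average degree 2E/N, assuming that the graphs are undirected. The paper does not state this, so the resulting values are only indicative.

```python
def average_degree(nodes: int, edges: int) -> float:
    """Average node degree of an undirected graph: 2 * E / N.

    ASSUMPTION: edges are undirected and counted once.
    """
    return 2 * edges / nodes


# Node and edge counts taken from Table 3.
graphs = {
    "languageLink": (1_555_119, 1_659_596),
    "ger author_publisher_location_title": (2_497_239, 3_947_482),
}

for name, (n, e) in graphs.items():
    print(f"{name}: {average_degree(n, e):.2f}")
```

Under this assumption, the languageLink graph has an average degree of about 2.13, while the German title graph reaches about 3.16.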

2.4      Extracted Illustrations of the Berlin State Library’s Digitized
         Collections
The largest dataset consists of 22,142 digitized media units11 that have been OCR-processed
with the ABBYY FineReader Engine (at least version 11) and whose full texts are
made available in ALTO XML referenced in the METS/MODS XML file of each object12 .
Based on the results of the OCR, all detected illustrations have been extracted and saved in


9
   https://digital.staatsbibliothek-berlin.de/oai
10
   https://www.openstreetmap.org/
11
   This number is subject to change as the OCR and image extraction process is ongoing.
12
   The data is available over the OAI-PMH METS/MODS endpoint of the digitized collections.




          Table 2 Description of the tabular format of the extended metadata

        Column Name          Description
        title                The title of the medium
        creator              Its creator (family name, first name)
        subject              A collection’s name as provided by the library
        type                 The type of medium
        format               A MIME type for full metadata download
        identifier           An additional identifier (most often the PPN)
        language             A 3-letter language code of the medium
        date                 The date of creation/publication or a time span
        relation             A relation to a project or collection a medium has been digitized for.
        coverage             The location of publication or origin (ranging from cities to continents)
        publisher            The publisher of the medium.
        rights               Copyright information.
        PPN                  The unique identifier that can be used to find more information about the
                             current medium in all information systems of Berlin State Library.
                             The following fields contain data that is based on different processing steps.
        spatialClean         In case of multiple entries in coverage, only the first place of origin has been
                             extracted. Additionally, characters such as question marks, brackets, or the like
                             have been removed. The entries have been normalized regarding whitespace
                             and spelling variants with the help of regular expressions.
        dateClean            As the original date may contain various format variants to indicate unclear
                             creation dates (e.g., time spans or question marks), this field contains a mapping
                             to a certain point in time.
        spatialCluster       The cluster ID determined with the help of the Jaro-Winkler distance on the
                             spatialClean string. This step is needed because the spatialClean fields still
                             contain a huge amount of orthographic variants and latinizations of geographic
                             names.
        spatialClusterName   A verbal cluster name (controlled manually).
        latitude             The latitude provided by OpenStreetMap of the spatialClusterName if the
                             location could be found.
        longitude            The longitude provided by OpenStreetMap of the spatialClusterName if the
                             location could be found.
        century              A century derived from the date.
        textCluster          A text cluster ID on the basis of a k-means clustering relying on the title field
                             with a vocabulary size of 125,000 using the tf*idf model and k=5,000.
        creatorCluster       A text cluster ID based on the creator field with k=20,000.
        titleImage           The path to the first/title page relative to the img/ subdirectory or None in
                             case of a multi-volume work.
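
The spatialCluster column in Table 2 relies on the Jaro-Winkler distance between place names. The following is a minimal, self-contained sketch of the standard Jaro-Winkler similarity (the distance is 1 minus the similarity). It is not the author's implementation, and the scaling factor p = 0.1 with a prefix limit of 4 are the commonly used defaults, assumed here. The place names in the example are illustrative only.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and not farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler similarity: Jaro boosted by the length of the common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    return 1.0 - jaro_winkler(s1, s2)


# A latinized variant is closer to 'Berlin' than an unrelated place name.
print(jaro_winkler_distance("Berlin", "Berolini"))
print(jaro_winkler_distance("Berlin", "Leipzig"))
```

Because the measure rewards a shared prefix, it tends to group orthographic variants of the same place name, which fits the clustering purpose described in the table.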


  Table 3 Graph properties per language

 Language   Graph Type                           Nodes       Edges     Records
    fry     author_publisher_location               298         264         360
    fry     author_publisher_location_title         622         726         360
    ice     author_publisher_location               505         448       1,200
    ice     author_publisher_location_title       1,509       1,393       1,200
   por      author_publisher_location             5,217       5,392       8,937
   por      author_publisher_location_title      12,848      15,219       8,937
   nor      author_publisher_location             4,948       6,049      12,016
   nor      author_publisher_location_title      15,276      21,737      12,016
   dan      author_publisher_location             9,127      11,711      20,089
   dan      author_publisher_location_title      26,144      39,278      20,089
   swe      author_publisher_location            15,350      18,367      30,628
   swe      author_publisher_location_title      41,933      61,000      30,628
   spa      author_publisher_location            24,404      27,477      42,540
   spa      author_publisher_location_title      59,339      77,779      42,540
   dut      author_publisher_location            36,503      42,128      67,000
   dut      author_publisher_location_title      94,803     127,785      67,000
    ita     author_publisher_location            71,151      95,054     158,851
    ita     author_publisher_location_title     206,656     316,282     158,851
    lat     author_publisher_location            91,584     148,224     230,588
    lat     author_publisher_location_title     273,724     469,322     230,588
    fre     author_publisher_location           174,650     204,299     380,569
    fre     author_publisher_location_title     487,053     693,245     380,569
   eng      author_publisher_location           606,112     880,989   1,309,172
   eng      author_publisher_location_title   1,778,957   2,710,807   1,309,172
    ger     author_publisher_location           705,468   1,104,502   2,316,600
    ger     author_publisher_location_title   2,497,239   3,947,482   2,316,600
   n/a      languageLink                      1,555,119   1,659,596   4,578,537




          Figure 1 Network of Authors, Publishers, and Languages of Publication (Subgraph of the
       Languages: fry, ice, por, nor, dan, swe)


       original size in JPEG format. In total, 531,484 illustrations have been extracted from 22,142
       media units, i.e., an average of 24 extracted illustrations per unit [4].
           In order to remove false positives from the corpus, e.g., stamps, hand-written signatures,
       or empty pages, pre-trained classifiers are provided as Python scripts [1]
       based on pre-trained VGGNet models implemented with Keras/TensorFlow.

            References
       1    Julia Berauer, Ralitsa Doncheva, Linh Nguyen, Luisa Rademacher, Carlos Tan, and Caglar
            Özel. Chasing Unicorns and Vampires in a Library. Technical report, HTW Berlin, 2018.
            URL: https://github.com/elektrobohemian/imi-unicorns.
       2    David Zellhöfer. Exploring Large Digital Libraries by Multimodal Criteria. In Norbert
            Fuhr, László Kovács, Thomas Risse, and Wolfgang Nejdl, editors, Research and Advanced
            Technology for Digital Libraries - 20th International Conference on Theory and Practice of
            Digital Libraries, TPDL 2016, Hannover, Germany, September 5-9, 2016, Proceedings, volume
            9819 of Lecture Notes in Computer Science, pages 307–319. Springer, 2016.
       3    David Zellhöfer. Extract from the Library’s Main Catalog, March 2019.
            URL: https://doi.org/10.5281/zenodo.2590752.
       4    David Zellhöfer. Extracted Illustrations of the Berlin State Library’s Digitized Collections,
            March 2019. URL: http://doi.org/10.5281/zenodo.2602431.
       5    David Zellhöfer. Metadata, Title Pages, and Network Graph of the Digitized Content of the
            Berlin State Library (146,000 items), March 2019.
            URL: https://doi.org/10.5281/zenodo.2582482.
       6    David Zellhöfer. Title, Author, Publisher, Place of Publication, and Language-related Network
            Graphs of the Library Main Catalog, March 2019.
            URL: https://doi.org/10.5281/zenodo.2587801.
       7    David Zellhöfer. What is a PPN and Why is it Helpful?, May 2019.
            URL: http://doi.org/10.5281/zenodo.2702544.