<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Datasets of the Berlin State Library</article-title>
      </title-group>
      <pub-date>
<year>2019</year>
      </pub-date>
      <abstract>
        <p>To facilitate the handling of digital library content and its accompanying metadata, four multimodal and multilingual datasets are presented that rely on the publicly available information systems of the Berlin State Library. They range from pre-processed extracts of the full main catalog of the library with ca. 9.8 million records, over various network graphs modeling, e.g., relations between authors and languages, to more than half a million illustrations extracted by the day-to-day OCR process from ca. 22,000 historical media units such as historical books.</p>
      </abstract>
      <kwd-group>
        <kwd>Digital Library</kwd>
        <kwd>Bibliographic Metadata</kwd>
        <kwd>Graph</kwd>
        <kwd>Digitized Media</kwd>
        <kwd>Dataset</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="acm-2012">
        <kwd>Applied computing → Digital libraries and archives</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Most large research libraries such as the Berlin State Library handle the core challenge of the digital transformation, the presentation of digitized content and retro-converted digitized catalog records, as a day-to-day routine, even at large scale. Media are digitized on a daily basis, extended with structural information, indexed, and treated with OCR engines, to be presented in web-based digitized collections (https://digital.staatsbibliothek-berlin.de/) or to be distributed via typical metadata interchange interfaces such as OAI-PMH (https://www.openarchives.org/OAI/openarchivesprotocol.html). However, new tasks for research libraries are emerging, e.g., the digital curation of the owned collections and the provision of data for various research tasks from the wide field of digital humanities (DH). As these use cases fall outside the traditional bibliographic use case, i.e., the indexing and retrieval of different media, traditional bibliographic records do not satisfy the requirements of DH researchers or digital curators. On the one hand, these records are often missing vital information such as named entities or other information that could enable explorative information-seeking strategies. On the other hand, they contain very detailed information that is necessary for the bibliographic use case while being over-complex and cryptic for DH researchers.</p>
      <p>Furthermore, proprietary character encodings or system-specific annotations put an additional burden on the usage of the data outside the scope of common library tasks. Listing 1 illustrates this phenomenon with the help of the library management system's internal Pica+ format. Because of the sheer amount of data available in large libraries, a manual conversion or augmentation of these records to fit the aforementioned needs would be very cost-intensive and hardly possible if it had to be carried out by library staff. Thus, a machine-assisted approach to transform traditional metadata records into datasets usable by digital curators or DH researchers is needed to cope with this problem. A recent proof of concept [2] shows the feasibility of such an approach relying on methods from machine learning, data analysis, and traditional data management and batch processing.</p>
      <p>Listing 1 Excerpt of a bibliographic record in Pica+ format
011 @ a1812
011 B a2004 - b2007
019 @ aXD - US
021 A aAn @oration pronounced at Dedham on the anniversary of American independence, July 4, 1812 hby Jabez Chickering
028 A dJabez aChickering h1753-1812
033 A pBoston nPrinted by Joshua Belcher
101 @ a11
201 B /01 014-03-17 t23:01:04.000</p>
      <p>The following section presents the core characteristics of four multimodal and multilingual
datasets based on publicly available catalog and other metadata of the Berlin State Library
that have been transformed by the tools presented in [2]. For the sake of reproducibility and
transparency, all scripts are made available with a permissive license
(https://github.com/elektrobohemian/StabiHacks).</p>
    </sec>
    <sec id="sec-2">
      <title>Characteristics of the Datasets</title>
      <p>All of the presented datasets can be inter-linked with the help of the so-called PPN (Pica
production number). In most cases, the PPN can be seen as a unique identifier for analog or
digitized media that is used in many systems of the Berlin State Library and the libraries
of the GBV alliance (https://www.gbv.de/?set_language=en), e.g., the central catalog. The PPN
can also be used to download image content via the IIIF (https://iiif.io/) endpoint or metadata
and OCR content via the OAI-PMH interface. For some sample scenarios, refer to [7].</p>
    </sec>
    <sec id="sec-3">
      <title>Extract from the Library’s Main Catalog</title>
      <p>This dataset [3] is derived from the Pica+ serialization of the library’s full main catalog
from 2018, containing 9,850,467 records of analog, digitized, and digital-born material. The
following fields have been extracted: title, author (incl. optional GND ID,
https://www.dnb.de/EN/Standardisierung/GND/gnd_node.html), publisher,
place of publication, country of publication, and year of publication. To facilitate further
processing, the publications are split by language groups (ranging from ancient to modern
languages).</p>
      <p>The records are stored in a simple tab-separated, field-based text format. Records
are separated by empty lines, whereas @ serves as a subfield indicator in case a GND ID or
detailed location information is available. Table 1 presents a sample record, whose complete
data can be referenced with the help of the given PPN
(http://stabikat.de/DB=1/SET=1/TTL=1/PRS=PP%7F/PPN?PPN=0249445468). Details on the different
Pica+ field IDs and their contents are available in [3], accompanied by the creation script.
A full list of available fields (in German) is also available
(https://www.gbv.de/bibliotheken/verbundbibliotheken/02Verbund/01Erschliessung/02Richtlinien/01KatRicht/inhalt.shtml).</p>
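      <p>The separators described above (empty lines between records, tabs between fields, @ before optional subfields) can be mirrored in a few lines of Python. This is only a sketch of the delimiting rules; the concrete column layout is documented with the dataset [3], and the title, name, and ID in the sample string are dummy values.</p>
      <p>
```python
def parse_records(text):
    """Parse the dump format: records separated by empty lines,
    tab-separated fields per line, and "@" marking an optional
    subfield such as a GND ID. Only the separators are modeled here;
    see the dataset documentation [3] for the actual field layout."""
    records = []
    for block in text.split("\n\n"):
        if not block.strip():
            continue
        record = []
        for line in block.splitlines():
            record.append([field.split("@") for field in line.split("\t")])
        records.append(record)
    return records

# Dummy data illustrating the separators only.
sample = "A Sample Title\tDoe, Jane@123456789\nBerlin\n\nAnother Title\tRoe, Richard\n"
print(parse_records(sample))
```
      </p>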
    </sec>
    <sec id="sec-4">
      <title>Metadata, Title Pages, and Network Graph of the Digitized</title>
    </sec>
    <sec id="sec-5">
      <title>Content of the Berlin State Library</title>
      <p>The dataset has been downloaded via the OAI-PMH Dublin Core endpoint of the Berlin
State Library’s Digitized Collections (https://digital.staatsbibliothek-berlin.de/oai) and has
been converted into common tabular formats and graph representations in GML. It contains
146,000 records of digitized material published before 1920 in the format described in Table 2.</p>
      <p>In addition to the bibliographic metadata, representative images of the works have been
downloaded and resized to JPEG thumbnails with a maximum edge length of 512 pixels,
preserving the original aspect ratio. Title pages have been derived from structural metadata
created by scan operators and librarians. If this information was not available, the first pages
of the media have been downloaded. In the case of multi-volume media, title pages are not
available. As a consequence, only 141,206 title/first-page images are present. Additionally,
geo-spatial coordinates have been added to each record using the OpenStreetMap web service
(https://www.openstreetmap.org/). For details, refer to [5].</p>
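      <p>The resizing rule for the thumbnails (longer edge capped at 512 pixels, aspect ratio preserved) can be sketched as a small size calculation; the rounding behavior is an assumption, as the paper does not specify it.</p>
      <p>
```python
def thumbnail_size(width, height, max_edge=512):
    """Target size of a thumbnail whose longer edge is capped at
    max_edge pixels while the aspect ratio is preserved. Images
    already within the limit keep their size. Rounding to the
    nearest pixel is an assumption."""
    longest = max(width, height)
    if max_edge >= longest:
        return (width, height)
    scale = max_edge / longest
    return (round(width * scale), round(height * scale))

print(thumbnail_size(1024, 768))  # → (512, 384)
```
      </p>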
    </sec>
    <sec id="sec-6">
      <title>Title, Author, Publisher, Place of Publication, and</title>
    </sec>
    <sec id="sec-7">
      <title>Language-related Network Graphs of the Library Main Catalog</title>
      <p>Three graphs (in GraphML, GML, and JSON) are made available in this dataset, linking:
authors, publishers, and places of publication (author_publisher_location);
authors, publishers, places of publication, and titles (author_publisher_location_title);
and authors, publishers, and the language of publication (languageLink).</p>
      <p>The language of publication graph spans all of the languages mentioned above and
has 1,555,119 nodes and 1,659,596 edges (see Fig. 1 for an exemplary subgraph). Table 3
summarizes the core properties of each provided graph. For additional details, see [6].</p>
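      <p>A downloaded GML file can be sanity-checked against the node and edge counts reported in Table 3 with a naive token count; this is not a full GML parser and assumes the keywords only occur as block openers, so it is merely a quick plausibility check.</p>
      <p>
```python
def gml_counts(text):
    """Naive count of node and edge blocks in a GML file, e.g. to
    sanity-check a downloaded graph against the reported sizes.
    Assumes the "node"/"edge" keywords only appear as block openers,
    not inside label values."""
    tokens = text.split()
    return (tokens.count("node"), tokens.count("edge"))

sample = 'graph [ node [ id 0 label "eng" ] node [ id 1 label "ger" ] edge [ source 0 target 1 ] ]'
print(gml_counts(sample))  # → (2, 1)
```
      </p>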
    </sec>
    <sec id="sec-8">
      <title>Extracted Illustrations of the Berlin State Library’s Digitized</title>
    </sec>
    <sec id="sec-9">
      <title>Collections</title>
      <p>The largest dataset consists of ca. 22,142 digitized media units (this number is subject to
change as the OCR and image extraction process is ongoing) that have been OCR-processed
with the ABBYY FineReader Engine (at least version 11) and whose full texts are made
available in ALTO XML, referenced in the METS/MODS XML file of each object (available
via the OAI-PMH METS/MODS endpoint of the digitized collections). Based on the results
of the OCR, all found illustrations have been extracted and saved in original size in JPEG
format. In total, 531,484 illustrations have been extracted from 22,142 media units, i.e., an
average of 24 extracted illustrations per unit [4].</p>
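      <p>Illustration regions can be located in the ALTO output via the Illustration block type and its coordinate attributes (HPOS, VPOS, WIDTH, HEIGHT). The sketch below operates on a minimal stand-in tree instead of a downloaded file and omits ALTO namespace handling for brevity; real ALTO documents qualify element names with a namespace URI.</p>
      <p>
```python
import xml.etree.ElementTree as ET

def illustration_boxes(alto_root):
    """Collect the bounding boxes of all Illustration blocks in an
    ALTO tree. Namespace handling is omitted; real ALTO files
    qualify element names with an ALTO namespace URI."""
    boxes = []
    for ill in alto_root.iter("Illustration"):
        boxes.append(tuple(int(ill.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes

# Minimal stand-in tree built in code instead of a downloaded ALTO file.
root = ET.Element("alto")
page = ET.SubElement(root, "Page")
ET.SubElement(page, "Illustration", HPOS="100", VPOS="200", WIDTH="640", HEIGHT="480")

print(illustration_boxes(root))  # → [(100, 200, 640, 480)]
```
      </p>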
      <p>In order to remove false positives, e.g., stamps, hand-written signatures, or empty pages,
from the corpus, pre-trained classifiers are provided in the form of different Python
scripts [1] based on pre-trained VGGNet models implemented with Keras/TensorFlow.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Julia</given-names>
            <surname>Berauer</surname>
          </string-name>
          , Ralitsa Doncheva, Linh Nguyen, Luisa Rademacher,
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Tan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Caglar</given-names>
            <surname>Özel</surname>
          </string-name>
          .
          <source>Chasing Unicorns and Vampires in a Library</source>
          .
          <source>Technical report</source>
          , HTW Berlin,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Zellhöfer</surname>
          </string-name>
          .
          <article-title>Exploring Large Digital Libraries by Multimodal Criteria</article-title>
          . In Norbert Fuhr, László Kovács, Thomas Risse, and Wolfgang Nejdl, editors,
          <source>Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL</source>
          <year>2016</year>
          , Hannover, Germany, September 5-
          9
          ,
          <year>2016</year>
          , Proceedings, volume
          <volume>9819</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>307</fpage>
          -
          <lpage>319</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Zellhöfer</surname>
          </string-name>
          .
          <article-title>Extract from the Library's Main Catalog</article-title>
          ,
          <year>March 2019</year>
          . URL: https://doi.org/10.5281/zenodo.2590752.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Zellhöfer</surname>
          </string-name>
          .
          <article-title>Extracted Illustrations of the Berlin State Library's Digitized Collections</article-title>
          ,
          <year>March 2019</year>
          . URL: http://doi.org/10.5281/zenodo.2602431.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Zellhöfer</surname>
          </string-name>
          .
          <article-title>Metadata, Title Pages, and Network Graph of the Digitized Content of the Berlin State Library (146,000 items)</article-title>
          ,
          <year>March 2019</year>
          . URL: https://doi.org/10.5281/zenodo.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Zellhöfer</surname>
          </string-name>
          .
          <article-title>Title, Author, Publisher, Place of Publication, and Language-related Network Graphs of the Library Main Catalog</article-title>
          ,
          <year>March 2019</year>
          . URL: https://doi.org/10.5281/zenodo.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Zellhöfer</surname>
          </string-name>
          .
          <article-title>What is a PPN and Why is it Helpful?</article-title>
          , May
          <year>2019</year>
          . URL: http://doi.org/10.5281/zenodo.2702544.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>