<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Linked Open Data Service and Portal for Pre-modern Manuscript Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eero Hyvönen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esko Ikkala</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jouni Tuominen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikko Koho</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toby Burrows</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lynn Ransom</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanno Wijsman</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HELDIG - Helsinki Centre for Digital Humanities, University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institut de recherche et d'histoire des textes</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Oxford e-Research Centre, University of Oxford</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Schoenberg Institute for Manuscript Studies, University of Pennsylvania</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Semantic Computing Research Group (SeCo), Aalto University</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <fpage>220</fpage>
      <lpage>229</lpage>
      <abstract>
        <p>This paper presents a Linked Open Data publishing model for aggregating data from heterogeneous, distributed pre-modern manuscript databases into a global, harmonized data model and service. Our research hypothesis is that, on top of a global data service based on ontologies and well-defined semantics, tools and applications can be created for solving novel research problems in manuscript studies using Digital Humanities methods. We describe the first results of implementing such a system in the international Mapping Manuscript Migrations project and discuss lessons learned in dealing with complex and imperfect historical data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Thousands of European pre-modern manuscripts have survived until the present day.
As the primary surviving witnesses to the world of pre-modern Europe they provide
crucial evidence for research in many disciplines, including textual and literary studies,
history, cultural heritage, and the fine arts [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As a result of changes of ownership
over the centuries, they are now spread all over the world. They often feature among
the treasures of libraries, museums, galleries, and archives, and they are frequently the
focus of exhibitions and events in these institutions.
      </p>
      <p>Over the last twenty years there has been a proliferation of digital data relating to
these manuscripts, including catalogues, specialist databases, and numerous collections
of digital images, many of them IIIF-compliant<sup>6</sup>. But there is little in the way of
coherent, interoperable digital infrastructure, with the result that large-scale discovery
and analysis requires the time-consuming exploration of numerous disparate resources.</p>
      <p>
        This paper introduces the Mapping Manuscript Migrations (MMM) project<sup>7</sup> [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
which aims to address this problem, and presents its first experiences and results from a
technical, Linked Data publishing perspective. Our goal is to build a Linked Open Data
(LOD) [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ] framework and web system for harmonizing manuscript data from various
disparate sources, in order to provide researchers with searchable and browsable access
to aggregated evidence about the history of pre-modern manuscripts.
      </p>
      <p>In the following, we first present the publishing model underlying the MMM
initiative, where data from distributed heterogeneous manuscript data silos is converted
and aggregated into a harmonized form and published as the MMM Data Service,
based on Linked Data principles and best practices of W3C<sup>8</sup>. After this, the
MMM Portal, a workbench under construction for studying the manuscripts, is described. In
conclusion, lessons learned from the work thus far are summarized, and directions for further
research are outlined.</p>
      <p>
        <fn id="fn6"><label>6</label><p>https://iiif.io</p></fn>
        <fn id="fn7"><label>7</label><p>http://mappingmanuscriptmigrations.org</p></fn>
      </p>
    </sec>
    <sec id="sec-2">
      <title>Publishing and Harmonizing Manuscript Metadata</title>
      <p>
        <bold>Publishing Model.</bold> Figure 1 depicts the publication model of the MMM system. In
its initial phase, the project brings together more than 450 000 records from four
important European and North American datasets, which use very different data models: the
Schoenberg Database of Manuscripts<sup>9</sup>, Medieval Manuscripts in Oxford Libraries<sup>10</sup>,
and Bibale<sup>11</sup> and Medium<sup>12</sup> [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] of the Institut de recherche et d’histoire des textes
(IRHT). These data are transformed (T1–T3 in Figure 1) into a unified harmonizing
data model used in the MMM Data Service platform, which is depicted in the middle of
the figure.
      </p>
      <sec id="sec-2-1">
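        <p>
          <fn id="fn8"><label>8</label><p>https://www.w3.org/standards/techs/linkeddata#w3c_all</p></fn>
          <fn id="fn9"><label>9</label><p>https://sdbm.library.upenn.edu</p></fn>
          <fn id="fn10"><label>10</label><p>https://medieval.bodleian.ox.ac.uk</p></fn>
          <fn id="fn11"><label>11</label><p>http://bibale.irht.cnrs.fr</p></fn>
          <fn id="fn12"><label>12</label><p>http://medium.irht.cnrs.fr</p></fn>
        </p>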
        <p>
          <bold>Harmonizing Data Model.</bold> As each of the source manuscript datasets involved in
the project has its own preconditions and goals, and thus follows its own data
modeling conventions, a unified data model for harmonizing them is needed. This model,
which is still under development and will be published in detail later, is based on input
from manuscript researchers with the goal of being semantically sufficient for
answering a wide range of research questions. The core classes in the model
are 1) physical manuscript objects with properties such as shelf-mark, owner and
title, and 2) provenance events that describe events related to the manuscripts, such as
production, appearance in an auction, acquisition, etc. The model makes use of the
CIDOC CRM<sup>13</sup> [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and FRBRoo [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] ontologies, where the essential distinctions
between works, expressions, manifestations, and items are considered [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. These ontology
models were chosen as the basis because they are modern standards for data
harmonization in museums and libraries, respectively, and because they support the event-based modeling
needed for provenance data, which are essentially chains of events concerning the
manuscript objects.
        </p>
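        <p>As a minimal sketch of the event-based view described above, the following Python fragment models a manuscript object together with its chain of provenance events. The names <monospace>Manuscript</monospace>, <monospace>ProvenanceEvent</monospace>, <monospace>timeline</monospace> and <monospace>owner_in</monospace> are hypothetical illustrations and not part of the MMM data model, which builds on CIDOC CRM and FRBRoo and is still under development.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProvenanceEvent:
    """One event in a manuscript's history (hypothetical structure)."""
    kind: str                      # e.g. "production", "auction", "acquisition"
    year: Optional[int]            # None when the source gives no date
    place: Optional[str] = None
    actor: Optional[str] = None


@dataclass
class Manuscript:
    """A physical manuscript object with its chain of provenance events."""
    shelf_mark: str
    title: str
    events: List[ProvenanceEvent] = field(default_factory=list)

    def timeline(self) -> List[ProvenanceEvent]:
        """Return events in chronological order; undated events go last."""
        return sorted(self.events, key=lambda e: (e.year is None, e.year or 0))

    def owner_in(self, year: int) -> Optional[str]:
        """Actor of the latest dated acquisition at or before `year`, if any."""
        owner = None
        for event in self.timeline():
            if (event.kind == "acquisition"
                    and event.year is not None
                    and event.year <= year):
                owner = event.actor
        return owner
```
]]></preformat>
        <p>Representing ownership as a query over events, rather than as a single owner property, is what allows the same record to answer questions about different points in time; undated events are kept but cannot take part in such queries.</p>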
        <p>
          The MMM publishing model facilitates the aggregation and concurrent use of
heterogeneous, distributed datasets with shared user interfaces, tools, and SPARQL queries.
In our work, the MMM Portal is implemented on top of the MMM Data Service (at the
bottom of Figure 1), but other applications (on the left of the figure) can make use of
the data in a similar way, including the web services of the original data providers. The
MMM Data Service also includes a range of tools for managing and using linked
data, such as a linked data browser, SPARQL query interfaces, and an automatic
documentation tool. These services are provided by the Linked Data Finland platform<sup>14</sup>
that is used as the basis [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for the MMM Data Service.
        </p>
        <p><bold>Data Transformation.</bold> The data harmonization work was initiated by investigating
the source datasets, starting with the Schoenberg Database. In our first demo system,
the MMM Data Service contains a version of this dataset that is mapped into the
harmonizing data model. The other datasets and transformations will be integrated into
the system later. Integrating the datasets has involved the data providers building individual
pipelines for transforming the source datasets into simple RDF formats. Three of
the datasets are customized relational databases, while the fourth, the Bodleian
Library’s catalogue, consists of XML documents encoded in accordance with the Text
Encoding Initiative (TEI) Manuscript Description guidelines<sup>15</sup>. A special pipeline has
been built by the data provider to extract and convert a selection of elements from these
TEI documents into a record-like form more suitable for transformation to RDF. In the
case of the Schoenberg Database, a new SPARQL endpoint has also been implemented
by the data provider, which is available for general use<sup>16</sup>.</p>
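        <p>The step from a TEI manuscript description to a record-like form can be sketched as follows. This is an illustration only and not the data provider's pipeline: the function name <monospace>ms_identifier</monospace>, the choice of the three extracted elements, and the sample document are assumptions, while <monospace>msDesc</monospace>, <monospace>msIdentifier</monospace>, <monospace>settlement</monospace>, <monospace>repository</monospace> and <monospace>idno</monospace> are standard TEI elements.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Namespace of TEI P5 documents.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample document used only for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc>
    <msDesc>
      <msIdentifier>
        <settlement>Oxford</settlement>
        <repository>Bodleian Library</repository>
        <idno>MS. Example 1</idno>
      </msIdentifier>
    </msDesc>
  </sourceDesc></fileDesc></teiHeader>
</TEI>"""


def ms_identifier(tei_xml: str) -> Dict[str, Optional[str]]:
    """Flatten the first msIdentifier of a TEI document into a record.

    Missing elements are reported as None instead of raising an error,
    since real catalogue records are often incomplete.
    """
    root = ET.fromstring(tei_xml)
    ident = root.find(".//tei:msDesc/tei:msIdentifier", TEI)

    def text(tag: str) -> Optional[str]:
        node = ident.find("tei:" + tag, TEI) if ident is not None else None
        return node.text.strip() if node is not None and node.text else None

    return {tag: text(tag) for tag in ("settlement", "repository", "idno")}
```
]]></preformat>
        <p>A flat record of this kind can then be mapped to RDF property by property, which is simpler than transforming the nested TEI structure directly.</p>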
        <p>Matching between the various datasets has initially focused on shared places and
persons. Two of the four data sources have now annotated their records with identifiers
from the Getty Thesaurus of Geographic Names (TGN)<sup>17</sup> and three with those from
the Virtual International Authority File (VIAF)<sup>18</sup>, as well as references to a range of
other vocabularies. These identifiers have been used to match places and persons in the
aggregated MMM project data. Matching the manuscripts themselves is more problematic,
since there is currently no standard for constructing and managing unique identifiers
for manuscripts, though an International Standard Manuscript Identifier has been under
discussion<sup>19</sup>. MMM has been testing and evaluating different approaches for assigning
identifiers, including the use of ARKs (Archival Resource Keys)<sup>20</sup> from the Bodleian
Library and in the IRHT’s Medium database.</p>
        <p>
          <fn id="fn13"><label>13</label><p>http://cidoc-crm.org</p></fn>
          <fn id="fn14"><label>14</label><p>http://ldf.fi</p></fn>
          <fn id="fn15"><label>15</label><p>http://www.tei-c.org</p></fn>
          <fn id="fn16"><label>16</label><p>https://sdbm.library.upenn.edu/sparql-space</p></fn>
          <fn id="fn17"><label>17</label><p>http://www.getty.edu/research/tools/vocabularies/tgn/index.html</p></fn>
        </p>
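        <p>Matching on shared authority identifiers can be sketched as a simple grouping step. The sketch below assumes that each record has been reduced to a triple of dataset name, local identifier, and an optional VIAF or TGN URI; the function name <monospace>match_by_authority</monospace> and this record layout are hypothetical and do not describe the project's actual matching code.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (dataset name, local record id, authority URI or None)
Record = Tuple[str, str, Optional[str]]


def match_by_authority(records: Iterable[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Group records that carry the same authority URI.

    Only groups spanning at least two datasets are returned. Records
    without an identifier cannot be matched this way and are skipped.
    """
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for dataset, local_id, uri in records:
        if uri:
            # Normalize trivial differences such as a trailing slash.
            groups[uri.strip().rstrip("/")].append((dataset, local_id))
    return {
        uri: members
        for uri, members in groups.items()
        if len({dataset for dataset, _ in members}) > 1
    }
```
]]></preformat>
        <p>The same approach does not carry over to the manuscripts themselves, because, as noted above, no shared identifier scheme exists for them.</p>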
      </sec>
    </sec>
    <sec id="sec-3">
      <title>MMM Portal for Manuscript Studies</title>
      <p>The goal of the MMM Portal is to provide a search and discovery interface for users
with or without clearly defined research questions. The portal offers four main
application perspectives based on the following classes of aggregated MMM project data:
1) Manuscripts, 2) Places, 3) People, and 4) Organizations. The instances of the core
classes can be presented to the user as a paginated table, on a map based on the
geographical information attached to the instances, and as a percentage frequency distribution
based on arbitrary properties of the instances.</p>
      <p>
        In each application perspective, the focus is on enabling the user to both explore
and browse the data freely and identify a group of instances of the core classes based
on a combination of criteria. In the Manuscripts perspective, a combination of criteria
could be manuscripts produced in Castile, including Spanish texts, previously owned by
English private collectors, currently owned by an institution in North America. Faceted
search [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is an effective paradigm for formulating such criteria in a user-friendly way.
      </p>
      <p>At the moment a first version of the Manuscripts application perspective has been
implemented. The perspectives for places, people and organizations can be constructed
in a similar fashion by re-using the components of the Manuscripts perspective. Figure
2 depicts the Manuscripts perspective with faceted search. By default all manuscripts
are shown in a paginated table. The manuscripts can be filtered by their key properties, such as shelf-mark,
author or creation date, using facet selections. Figure 3 shows an example
of a hierarchical facet selection with search functionality. Besides hierarchy, the facet
selections need to support further filtering. For example, the author facet must provide
an additional filter for the birth date of the authors. All facet selections are connected,
so whenever the user makes a selection, the value lists of the other facets are updated. This
way it is impossible to end up with an empty result set by using any combination of the
facets.</p>
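      <p>The behaviour of connected facets can be illustrated with a small in-memory sketch. It assumes single-valued properties and plain Python dictionaries as items; the function names <monospace>matches</monospace> and <monospace>facet_state</monospace> are hypothetical, and a portal working on a SPARQL endpoint would compute the same counts with queries rather than in memory.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Dict, List, Tuple

Item = Dict[str, str]


def matches(item: Item, selections: Dict[str, str]) -> bool:
    """True if the item has the selected value in every selected facet."""
    return all(item.get(facet) == value for facet, value in selections.items())


def facet_state(
    items: List[Item], facets: List[str], selections: Dict[str, str]
) -> Tuple[List[Item], Dict[str, Counter]]:
    """Return the result set and the updated value list of every facet.

    The values offered in a facet are counted over the items that satisfy
    the selections made in all *other* facets, so every offered value
    leads to at least one result.
    """
    results = [item for item in items if matches(item, selections)]
    value_counts: Dict[str, Counter] = {}
    for facet in facets:
        others = {f: v for f, v in selections.items() if f != facet}
        value_counts[facet] = Counter(
            item[facet] for item in items if facet in item and matches(item, others)
        )
    return results, value_counts
```
]]></preformat>
      <p>Because a value is only offered when at least one item carries it under the other current selections, choosing any offered value cannot produce an empty result set.</p>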
      <p>Moreover, each instance is associated with an information “home page” with an
aggregated description of the instance and how it is related to other instances. For a
person instance there could be, e.g., lists of related manuscripts based on different roles
such as author, scribe or owner.</p>
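      <p>
        <fn id="fn18"><label>18</label><p>https://viaf.org</p></fn>
        <fn id="fn19"><label>19</label><p>https://www.irht.cnrs.fr/?q=fr/agenda/manuscript-ids-identifiants-des-manuscrits</p></fn>
        <fn id="fn20"><label>20</label><p>https://www.ifla.org/best-practice-for-national-bibliographic-agencies-in-a-digital-age/node/8793</p></fn>
      </p>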
      <p>[Figure: screenshot of the Manuscripts perspective with faceted search. Labelled areas: result format, facet selections, pagination.]</p>
      <p>Figure 4 illustrates how the aggregated manuscript data (with optional filters
selected by the user) are rendered on a map with clustered map markers, based on the
creation places of the manuscripts. The numbers on clusters and markers indicate how
many manuscripts were created in the area or specific place in question. When zooming
in closer, Figure 5 shows how historical map sheets aligned on modern maps can be
used to provide contextual information. However, using historical maps in this way is
problematic in many ways, and further user interface research is needed in order not to give
the end user wrong impressions about the data. Here five dots are spread over five spots in
Paris, but many more dots are stacked in a single place, corresponding to the general
annotation “Paris” that occurs frequently in the data. In general the place annotations
can be anything between a continent and a specific building, so a method for visualizing
the varying granularity of geocoded data is obviously needed.</p>
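      <p>The effect of general place annotations on clustered markers can be seen in a simple grid-based count. This is only a sketch under stated assumptions: the function name <monospace>cluster_counts</monospace> and the fixed-size grid are illustrative and are not the clustering method used by the portal's map component.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter
from typing import Iterable, Tuple


def cluster_counts(points: Iterable[Tuple[float, float]], cell_deg: float) -> Counter:
    """Count points per grid cell of `cell_deg` degrees in latitude and longitude.

    Each key is the (row, column) index of a cell; the value is the number
    that would be printed on the cluster marker for that cell.
    """
    counts: Counter = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts
```
]]></preformat>
      <p>Records geocoded only as “Paris” all share one coordinate pair, so they stay in a single cell at every zoom level, whereas records geocoded to individual buildings separate as the cell size shrinks. The count alone therefore does not tell the user how precise the underlying annotations are.</p>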
      <p>Furthermore, the dates of the maps shown typically do not match the dates
of the underlying manuscripts (or other data), which usually also differ from each
other. For example, the place selected in Figure 5 is in fact the location
of the current building of the Bibliothèque nationale de France (BnF), which did not
exist at the time of the map (neither the BnF as an institution, nor the building). The
predecessor of the BnF, the French Royal Library, moved to this spot in the 18th century
(well after 1705), and the urban landscape depicted on the 1705 map has changed
since.</p>
      <p>The general architecture of the MMM Portal<sup>21</sup> is presented in Figure 6. The system
consists of a NodeJS<sup>22</sup> backend built with the Express framework<sup>23</sup> (in the middle) and
a frontend based on React<sup>24</sup> and Redux<sup>25</sup> (on the right). The MMM Data Service is
shown on the left. An instance<sup>26</sup> of MapWarper<sup>27</sup> (on the left) can be used for aligning
and publishing historical maps. When designing the architecture, the main goal of the
backend was to make it easy to combine attribute data from multiple SPARQL endpoints
and raster data from various spatial data sources in a React frontend.</p>
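      <p>Combining attribute data from several SPARQL endpoints can be sketched as follows. The portal backend itself is written in NodeJS; this Python fragment is only an illustration of the idea, and the function names <monospace>build_request</monospace>, <monospace>flatten</monospace> and <monospace>query_all</monospace> are hypothetical. It relies on the standard SPARQL protocol (a <monospace>query</monospace> parameter in a GET request) and the standard SPARQL JSON results format.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import json
from typing import Dict, Iterable, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SPARQL_JSON = "application/sparql-results+json"


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": SPARQL_JSON})


def flatten(result: dict) -> List[Dict[str, str]]:
    """Reduce SPARQL JSON results to plain variable-to-value rows."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in result["results"]["bindings"]
    ]


def query_all(endpoints: Iterable[str], query: str) -> List[Dict[str, str]]:
    """Run the same query on every endpoint and tag rows with their source."""
    rows: List[Dict[str, str]] = []
    for endpoint in endpoints:
        with urlopen(build_request(endpoint, query)) as response:
            for row in flatten(json.load(response)):
                rows.append({"source": endpoint, **row})
    return rows
```
]]></preformat>
      <p>Keeping the source endpoint on every row lets the frontend show where a value came from, which matters when the aggregated datasets disagree.</p>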
      <p>The data is published on the Linked Data Finland platform, which is powered by
a combination of Fuseki SPARQL servers<sup>28</sup> running in Docker containers<sup>29</sup> for storing
the primary data and a Varnish Cache web application accelerator<sup>30</sup> for routing URIs
and content negotiation.</p>
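      <p>
        <fn id="fn21"><label>21</label><p>https://github.com/SemanticComputing/mmm-web-app</p></fn>
        <fn id="fn22"><label>22</label><p>https://nodejs.org/en/</p></fn>
        <fn id="fn23"><label>23</label><p>https://expressjs.com</p></fn>
        <fn id="fn24"><label>24</label><p>https://reactjs.org</p></fn>
        <fn id="fn25"><label>25</label><p>https://redux.js.org</p></fn>
        <fn id="fn26"><label>26</label><p>http://mapwarper.onki.fi</p></fn>
        <fn id="fn27"><label>27</label><p>https://mapwarper.net</p></fn>
        <fn id="fn28"><label>28</label><p>https://jena.apache.org/documentation/fuseki2/</p></fn>
        <fn id="fn29"><label>29</label><p>https://www.docker.com</p></fn>
        <fn id="fn30"><label>30</label><p>https://varnish-cache.org</p></fn>
      </p>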
      <p>[Figure: architecture diagram of the MMM Portal. Labelled components: Map Warper; map and spatial data services (WFS, WMS, WMTS); Linked Data Finland data service; MMM Portal backend; MMM Portal user interface; GIS software, statistical tools, etc.]</p>
      <p>Digging into manuscript data has turned out to be more challenging in many ways than
expected, from both a data modeling and a technical perspective. Defining the very concept of
“the manuscript” itself raised many ontological modeling questions, since manuscripts
can be just fragments of a whole, and can be separated into parts, copied, annotated, and
united with others over time. Identifying records that describe the same manuscript can also
be very hard, in many cases probably impossible, as the manuscripts have been described in
different contexts in different ways, under different titles, and in different languages.
There is no unique identifier scheme for manuscripts, in contrast to printed books, and
library shelf-marks are not quoted consistently or accurately. The amount of data is also
fairly large, hundreds of thousands of records, which sets efficiency requirements for
the technical solutions.</p>
      <p>
        The data are often also incomplete, uncertain, and imprecise in many ways. A
major goal of the project is to map manuscript migrations, i.e., to illustrate and study
manuscripts in spatio-temporal spaces using maps and timelines. However, references to
locations are missing in many cases, the mentions refer to historical places that may not
exist on modern maps or may have changed over hundreds of years of history [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and
initially the place names mentioned were not even geocoded.
      </p>
      <p>The datasets also turned out to be fundamentally different in nature. The data models
vary across the collections, from TEI to relational models
and RDF. Most importantly, there are substantial differences in the semantic
contents of the datasets: the Schoenberg Database records primarily provenance events and
observations of manuscripts at specific points in time, based on, e.g., auction catalogs,
and does not focus on manuscripts as unique objects, while in Bibale, Medium, and the
Bodleian catalogs the main focus is on describing manuscripts as objects.</p>
      <p>The project started by creating a list of Digital Humanities research questions
relating to manuscript histories, and continued by trying to figure out what kind of data
model and data are needed to solve them. The next step was to find out, given the
constraints imposed by the actual data available, what questions can be addressed and under
what assumptions on data. Section 3 illustrated the first steps towards this ultimate goal
of the project.</p>
      <p><bold>Acknowledgements.</bold> Thanks to Kevin Page, David Lewis and Athanasios Velios for
collaborations in developing the unified data model and working on the transformations
related to the Bodleian Library data. Benjamin Heller developed the transformation from
the Schoenberg Database format to raw RDF from which it was transformed into the
unified model. Similarly, Guillaume Porte was in charge of the transformation from the
Bibale database to raw RDF. Discussions with Pip Willcox, Mitch Fraas, Doug Emery,
Emma Cawlfield, Antoine Brix, Synnøve Myking and other members of the project
team are acknowledged.</p>
      <p>Our work is funded by the Trans-Atlantic Platform under its Digging into Data
Challenge for 2017–2019. The project is led by the University of Oxford, in partnership
with the University of Pennsylvania, Aalto University and Helsinki Centre for Digital
Humanities (HELDIG) at the University of Helsinki, and the Institut de recherche et
d’histoire des textes (IRHT). The authors wish to acknowledge CSC – IT Center for
Science, Finland, for computational resources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Berman</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mostern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Southall</surname>
          </string-name>
          , H. (eds.):
          <article-title>Placing names. Enriching and integrating gazetteers</article-title>
          . Indiana University Press (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Hyvönen, E.,
          <string-name>
            <surname>Ransom</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wijsman</surname>
          </string-name>
          , H.:
          <article-title>Mapping Manuscript Migrations</article-title>
          .
          <article-title>Digging into Data for the History and Provenance of Medieval and Renaissance Manuscripts</article-title>
          .
          <source>Manuscript Studies. A Journal of the Schoenberg Institute for Manuscript Studies</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>249</fpage>
          -
          <lpage>252</lpage>
          (
          <year>2018</year>
          ), https://mss.pennpress.org/home/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Clemens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Introduction to Manuscript Studies</article-title>
          . Cornell University Press, Ithaca (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Doerr</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The CIDOC CRM - an ontological approach to semantic interoperability of metadata</article-title>
          .
          <source>AI Magazine</source>
          <volume>24</volume>
          (
          <issue>3</issue>
          ),
          <fpage>75</fpage>
          -
          <lpage>92</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Linked Data: Evolving the Web into a Global Data Space (1st edition)</article-title>
          .
          <source>Synthesis Lectures on the Semantic Web: Theory and Technology</source>
          , Morgan &amp; Claypool (
          <year>2011</year>
          ), http://linkeddatabook.com/editions/1.0/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Hyvönen, E.:
          <article-title>Publishing and Using Cultural Heritage Linked Data on the Semantic Web</article-title>
          .
          <source>Synthesis Lectures on the Semantic Web: Theory and Technology</source>
          , Morgan &amp; Claypool, Palo Alto, CA, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Hyvönen, E.,
          <string-name>
            <surname>Tuominen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alonen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Mäkelä, E.:
          <article-title>Linked Data Finland: A 7-star Model and Platform for Publishing and Re-using Linked Datasets</article-title>
          .
          <source>In: Proceedings of ESWC 2014 Demo and Poster Papers</source>
          . Springer-Verlag (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Le Bœuf</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Modeling rare and unique documents: Using FRBROO/CIDOC CRM</article-title>
          .
          <source>Journal of Archival Organization</source>
          <volume>10</volume>
          (
          <issue>2</issue>
          ),
          <fpage>96</fpage>
          -
          <lpage>106</lpage>
          (
          <year>2012</year>
          ), https://doi.org/10.1080/15332748.2012.709164
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Riva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doerr</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Žumer</surname>
          </string-name>
          , M.:
          <article-title>FRBRoo: Enabling a common view of information from memory institutions</article-title>
          .
          <source>International Cataloguing and Bibliographic Control</source>
          <volume>38</volume>
          (
          <issue>2</issue>
          ),
          <fpage>30</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Tunkelang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Faceted search</article-title>
          .
          <source>Synthesis Lectures on Information Concepts, Retrieval, and Services</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>1</fpage>
          -
          <lpage>80</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wijsman</surname>
          </string-name>
          , H.:
          <article-title>The Bibale Database at the IRHT: A Digital Tool for Researching Manuscript Provenance</article-title>
          .
          <source>Manuscript Studies. A Journal of the Schoenberg Institute for Manuscript Studies</source>
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <fpage>328</fpage>
          -
          <lpage>341</lpage>
          (
          <year>2017</year>
          ), https://repository.upenn.edu/mss_sims/vol1/iss2/10
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>