=Paper=
{{Paper
|id=Vol-2084/paper6
|storemode=property
|title=Reassembling the Republic of Letters – A Linked Data Approach
|pdfUrl=https://ceur-ws.org/Vol-2084/paper6.pdf
|volume=Vol-2084
|authors=Jouni Tuominen,Eetu Mäkelä,Eero Hyvönen,Arno Bosse,Miranda Lewis,Howard Hotson
|dblpUrl=https://dblp.org/rec/conf/dhn/TuominenMHBLH18
}}
==Reassembling the Republic of Letters – A Linked Data Approach==
<pdf width="1500px">https://ceur-ws.org/Vol-2084/paper6.pdf</pdf>
<pre>
                Reassembling the Republic of Letters
                    – A Linked Data Approach

                  Jouni Tuominen1,2 , Eetu Mäkelä1,2 , Eero Hyvönen1,2 ,
                   Arno Bosse3 , Miranda Lewis3 , and Howard Hotson3
         1 Semantic Computing Research Group (SeCo), Aalto University, Finland
                            http://seco.cs.aalto.fi
                         firstname.lastname@aalto.fi
      2 HELDIG – Helsinki Centre for Digital Humanities, University of Helsinki, Finland
                                 http://heldig.fi
                3 Faculty of History, University of Oxford, Oxford, UK
                   firstname.lastname@history.ox.ac.uk


        Abstract. Between 1500 and 1800, a revolution in postal communication al-
        lowed ordinary men and women to scatter letters across and beyond Europe. This
        exchange helped knit together what contemporaries called the respublica litter-
        aria, or Republic of Letters, a knowledge-based civil society, crucial to that era’s
        intellectual breakthroughs, and formative of many modern European values and
        institutions. To enable effective Digital Humanities research on the epistolary data
        distributed in different countries and collections, metadata about the letters have
        been aggregated, harmonised, and provided for the research community through
        the Early Modern Letters Online (EMLO) catalogue. This paper discusses the
        idea and benefits of using Linked Data as the basis for a potential future frame-
        work for EMLO, and presents our experiences with a first demonstrator imple-
        mentation of such a system.


Keywords: Semantic Web, Linked Open Data, Digital Humanities, Early Modern,
Reconciliation, Correspondence

1     Introduction
The revolution in postal communication in the early modern period allowed scholars
and ordinary people to share their thoughts via letters in an efficient manner, in Europe
and beyond. This development was a vital requirement for the respublica litteraria,
or Republic of Letters, a knowledge-based civil society, crucial to that era’s intellec-
tual breakthroughs, and formative of many modern European values and institutions.
However, for modern scholars of the subject, the scattered nature of the letters poses
challenges, as the letter manuscripts are held in different libraries, archives, and private
collections around the world.
    Digital resources on early modern learned correspondence are proliferating rapidly
but without a common framework for sharing data, tools, and systems development.
Such resources include Europeana4 , Kalliope5 , The Catalogus Epistularum
Neerlandicarum6 , Electronic Enlightenment7 , ePistolarium8 , the Mapping the Republic
of Letters project9 , and Early Modern Letters Online (EMLO)10 . To reassemble the
material and to facilitate its efficient study, coordinated discussions amongst
librarians and archivists, scholars, IT and media experts are needed to collectively
plan a shared digital infrastructure for publishing, reconciling, visualising, and
analysing correspondence. Many of these conversations have been taking place over the
last three years under the auspices of the EU COST Action IS1310 ’Reassembling the
Republic of Letters’11 .
 4
     http://www.europeana.eu
 5
     http://kalliope.staatsbibliothek-berlin.de
     This paper presents a Linked Data approach for such an infrastructure, using the
Early Modern Letters Online (EMLO) collection as a pilot dataset. EMLO is a collab-
oratively populated union catalogue of sixteenth-, seventeenth-, and eighteenth-century
letters, created by the Cultures of Knowledge project12 at the University of Oxford. It
brings manuscript, print, and electronic resources together in one space, increasing ac-
cess to and awareness of them, and allows disparate and connected correspondences to
be cross-searched, combined, analysed, and visualised.
     The paper is organized as follows. First, the general vision and process descrip-
tion in our case study of creating, aggregating, and utilizing distributed epistolary data
about letters is outlined, based on a Linked Data approach. After this, the underlying
data models, data conversion, ontology services, tooling, and use of the data service in
research are discussed.


2    A Distributed Publishing Model


Fig. 1 illustrates the overall process and setting considered in this paper. Epistolary data
from different countries is being aggregated by the EMLO service (see the directed red
arcs in the figure). The data can be accessed by the scholars using the portal. In our
experiment, the legacy EMLO data was transformed into linked data, and published as
a Linked Data Service in a SPARQL endpoint with additional services, such as content
negotiation, linked data browsing etc. based on the W3C standards and best practices of
Linked Data publishing [4]. In our experiment, the Linked Data Finland platform13 [8],
hosted by Aalto University was used14 (see the blue arrow in the figure). Using Linked
Data as a basis for aggregating and publishing the data has the following potential ben-
efits for the overall process:

 6
   http://picarta.pica.nl/DB=3.23/
 7
   http://www.e-enlightenment.com
 8
   http://ckcc.huygens.knaw.nl/epistolarium/
 9
   http://republicofletters.stanford.edu
10
   http://emlo.bodleian.ox.ac.uk
11
   http://republicofletters.net
12
   http://www.culturesofknowledge.org
13
   http://ldf.fi
14
   Due to IP restrictions the data is currently not freely available, but access is being negotiated
   with the metadata owners.
Fig. 1. Overview of the Linked Data approach in creating and aggregating distributed epistolary
data.


 1. Data aggregation. The RDF data model underlying the Semantic Web and Web of
    Data15 is very flexible and simple for combining heterogeneous data from multiple
    data silos.
 2. Support for sharing ontologies. Ontologies used in populating the metadata, such
    as historical people and places, can be shared within the community using ontology
    services [12].
 3. Crowdsourcing. When cataloguing, new resources created in the distributed content
    creation network can be shared, as suggested in [7].
 4. Support tooling. A SPARQL endpoint provides a flexible standard API for creating
    tools for data cleaning, entity linking, ontology mapping, etc.
 5. Open application development. In the same vein, the SPARQL API can be used
    in a standardized way for creating rich internet applications (RIA). No server-side
    programming or data management is needed if the API is available, which can
    simplify application development substantially and make it possible for virtually
    anyone.

    The dashed arrows in Fig. 1 illustrate the fact that the Linked Data service can
be used not only in application development, but also during the data cataloguing
process in the participating organizations. Using shared up-to-date ontology services,
disambiguated identifiers for, e.g., persons and places can be assigned more easily and
duplication of work is avoided. Also tooling for, e.g., data cleaning, reconciliation, and
duplicate checking can be shared in this way, saving human resources of the community
as a whole and leading to more accurate and interoperable metadata from the outset.
15
     http://www.w3.org/2013/data/
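To make the idea of a SPARQL endpoint as a standard API concrete, the following minimal sketch builds a SPARQL 1.1 Protocol query request using only the Python standard library. The endpoint URL is a placeholder, and the assumption that people are typed as crm:E21_Person and labelled with skos:prefLabel is ours for illustration; the request is constructed but deliberately not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption): not the address of the actual data service.
ENDPOINT = "http://example.org/emlo/sparql"

# Illustrative query: list ten people with their labels.
# The choice of skos:prefLabel for labels is an assumption of this sketch.
QUERY = """
PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?person ?label WHERE {
  ?person a crm:E21_Person ;
          skos:prefLabel ?label .
} LIMIT 10
"""


def build_request(endpoint: str, query: str) -> Request:
    """Build a 'query via GET' request as defined by the SPARQL 1.1 Protocol."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialisation of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would execute the query against a real endpoint; because only standard HTTP is involved, the same call works from a browser application, a notebook, or a cataloguing tool.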


3    Data Models and Linked Data Conversion

In order to allow scholars to efficiently study the vast amount of epistolary data from
different data sources as a whole, the data has to be made semantically interoperable,
either by mapping different data models (e.g., by using Dublin Core16 and the Dumb-
Down Principle17 ), or by providing a harmonised data model to transform the datasets
into linked data [6]. We are suggesting the use of a shared data model for all the datasets.
Unlike many other manuscript genres, letters share readily identifiable basic features
(sender, recipient, date of sending and arrival, place of origin and destination) which
facilitate the formation of a common data model.
    In the context of EMLO, we have converted the original relational database via a
straightforward conversion process using a script18 into an RDF format. The conversion
retains EMLO’s internal data model, and thus follows a simple attribute-based model-
ing approach. A letter is represented as an instance of the class ”Letter”, and it has
properties, such as ”created” (inverse property), ”was addressed to”, ”was sent from”,
”was sent to”, ”has time-span” (date), ”original calendar”, ”language”, ”repository”,
”shelfmark”, ”printed edition details”, and ”source” (the catalogue the letter belongs
to). The data model utilises CIDOC CRM19 [2] (for time spans, people, and places),
Dublin Core (for language, date, description, and subject), FOAF20 (for person names
and gender), and SKOS21 (for labels) vocabularies.
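The attribute-based model described above can be illustrated with a small sketch that serialises one letter as N-Triples. Only the CIDOC CRM and Dublin Core namespaces are real; the example.org namespace, the property names under it, and all identifiers and values are hypothetical stand-ins, not the IRIs used in the actual conversion.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/emlo/"  # hypothetical namespace for this sketch


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Serialise a plain string literal with basic N-Triples escaping."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


letter = EX + "letter/1"  # invented identifier
triples = [
    (iri(letter), iri(RDF_TYPE), iri(EX + "schema/Letter")),
    (iri(letter), iri(EX + "schema/was_addressed_to"), iri(EX + "person/2")),
    (iri(letter), iri(EX + "schema/was_sent_from"), iri(EX + "place/3")),
    (iri(letter), iri(EX + "schema/was_sent_to"), iri(EX + "place/4")),
    (iri(letter), iri(CRM + "P4_has_time-span"), iri(EX + "time/5")),
    (iri(letter), iri(DCT + "language"), literal("la")),
    (iri(letter), iri(EX + "schema/shelfmark"), literal("MS Example 6")),
]


def to_ntriples(rows) -> str:
    """Join (subject, predicate, object) rows into an N-Triples document."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```

Each property of the letter becomes exactly one triple, which is what makes it simple to merge letters from several catalogues into a single graph.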
    In addition to purely epistolary data, EMLO contains prosopographical information
related to the people in the database, modeled as events and social relationships. Events
cover activities that the people have participated in during their lives, such as birth
and death, ecclesiastic and educational activities, creations of works, travels and resi-
dences. The event metadata includes the event name, type, participants and their roles,
time span, location, and source information. We converted the prosopographical data
into RDF format using CIDOC CRM for the event-based modeling and W3C’s PROV
model [10] for representing the roles of participants in the events.
    As a continuation of this work, we have also developed Bio CRM22 [13], a semantic
data model for harmonising and interlinking heterogeneous biographical information
from different data sources. It is a domain specific extension of CIDOC CRM,
effectively providing compatibility with other cultural heritage information as well. The
data model includes structures for basic data of people, personal relations, professions,
and events with participants in different roles. One of the novelties of Bio CRM is the
VIVO/BFO-inspired23 [11], intuitive, and simple approach for the modeling of roles in
different contexts – unitary roles, binary relationships, and events.
16
   http://dublincore.org/documents/dcmi-terms/
17
   https://github.com/dcmi/repository/blob/master/mediawiki_wiki/Glossary/Dumb-Down_Principle.md
18
   http://github.com/jiemakel/anything2rdf
19
   http://cidoc-crm.org
20
   http://xmlns.com/foaf/spec/
21
   http://www.w3.org/TR/skos-reference/
22
   http://ldf.fi/schema/bioc/


4    Ontologies and Ontology Services
For authority control, shared ontologies of people, places, and other relevant entity
types, such as events, are needed. A natural starting point for creating such ontolo-
gies are the existing authority files, listings, and databases used in the data sources. In
our use case, we converted the people and places used in EMLO into RDF format, using
CIDOC CRM classes E21 Person and E53 Place. The idea is to store them in their own
graphs in a public triple-store, where they can be queried and utilized by the community
using SPARQL.
    In cases where a data source uses a shared, established authority database, it can
be used as such with a Linked Data approach. A number of authority sources such
as VIAF24 , Getty ULAN25 , and CERL Thesaurus26 already provide their data in RDF
format, which further simplifies their utilisation.
    For efficient use of the shared ontologies, we have developed the Federated SPARQL
Search Widget27 , a user interface component that can be integrated into, e.g., letter cat-
aloguing systems. Using such an approach, the different data providers already receive
strong identifiers for the people and places as part of the data input process [1], with no
need to reconcile the data later. Fig. 2 depicts an example of a SPARQL search widget
for Finnish historical people, with contextual information supporting the selection of
the correct person, including a person’s photograph, short biographical description, and
the places of activity visualised on a map.


5    Tooling for Reconciliation
When combining data from different sources, support tooling for reconciling the data
into a harmonised format is needed. In the context of EMLO, there already exists a
network of contributors – including scholars working on a specific collection or edi-
tion of correspondence, librarians, and publishers. These contributors provide metadata
pertaining to the correspondence for ingestion into EMLO. The metadata can be input
using a custom spreadsheet or via the EMLO-Collect online web form. Names of both
authors and recipients (people), and origins and destinations (places) are included in the
provided metadata. When inputting this data into EMLO, these people and places have
to be matched to existing person and place records in the EMLO database or else as-
signed new person and place IDs. A semi-automatic tool, Recon28 , has been developed
to assist with this matching process.
23
   http://vivoweb.org
24
   http://viaf.org/viaf/data/
25
   http://vocab.getty.edu
26
   http://www.cerl.org/resources/cerl_thesaurus/linkeddata
27
   http://github.com/SemanticComputing/federated-sparql-search-widget
28
   http://github.com/jiemakel/recon
           Fig. 2. A SPARQL autocompletion widget for Finnish historical people.


     Recon is designed for digital humanities scenarios where trusted accuracy is of
paramount importance. This means that: a) the matching cannot be done entirely au-
tomatically; b) the tool has to return as many potential matches as possible for the user
to consult and consider a ’match’; and c) the user has to be supported in the manual
verification process with the provision of contextual information concerning the match
candidates. Compared to reconciliation tools such as Silk [15] and OpenRefine [14], Re-
con focuses on a manual review of potential match candidates, using a browser-based
user interface to afford a simple, fast, and intuitive workflow.
     The Recon user interface is depicted in Fig. 3. The tool reads a spreadsheet of
names of people or places, possibly with contextual information, such as the years
in which a person was active (floruit). Working through the data rows, Recon runs
SPARQL queries to a triple-store containing people and places extracted from the cur-
rent EMLO database. For each person or place in the spreadsheet, a list of potential
candidate matches is offered to the user, based on the string similarity of the name, and
potentially other criteria based on the SPARQL query used in the matching process. For
example, the years of activity of a person can be used to rank candidates with suitable
birth and death years higher than those similarly named people who have lived at some
other time period. The user can specify whether there is a match or not, or leave the
row flagged as an open query when the match is uncertain, so that further investigation
can be carried out. When the spreadsheet has been processed, Recon re-exports
to the user the original data supplemented with the EMLO IDs of the matched people
or places. Where no matches have been identified, new EMLO records are created and
their IDs inserted. Following this, the revised dataset can be ingested into EMLO using
this complete list of people and place IDs.
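The candidate ranking described above can be sketched as follows. This is not Recon's implementation: Recon delegates matching to SPARQL queries, whereas this sketch ranks in memory, uses difflib from the Python standard library for string similarity, and adds a fixed bonus when the floruit year falls within a candidate's lifespan. The bonus value, the record structure, and all names and years are invented for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    emlo_id: str          # invented identifiers in this sketch
    name: str
    birth: Optional[int]
    death: Optional[int]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(name: str, floruit: Optional[int],
                    candidates: List[Candidate],
                    bonus: float = 0.2) -> List[Tuple[float, Candidate]]:
    """Score every candidate; none is dropped, so the user sees all options."""
    scored = []
    for cand in candidates:
        score = name_similarity(name, cand.name)
        alive = (floruit is not None and cand.birth is not None
                 and cand.death is not None
                 and cand.birth <= floruit <= cand.death)
        if alive:
            score += bonus  # assumed weighting, chosen only for illustration
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


people = [
    Candidate("p2", "Johannes Exemplar", 1700, 1760),
    Candidate("p3", "Maria Sample", 1600, 1650),
    Candidate("p1", "Johannes Exemplar", 1600, 1662),
]
ranked = rank_candidates("Johannes Exemplar", 1640, people)
for score, cand in ranked:
    print(f"{score:.2f}  {cand.emlo_id}  {cand.name}")
```

The two identically named candidates are separated only by their years of activity, which is the situation the contextual information is meant to resolve; the final decision is still left to the user.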


                             Fig. 3. The user interface of Recon.


    For pre-processing tabular letter metadata into a more efficient format before Recon
is used, a complementary tool called Mare29 has been developed. Mare is a map/reduce
user interface for tables. The tool is used in the EMLO spreadsheet workflow to collect
all unique people and place names from a correspondence dataset with contextualizing
information, such as the years of activity based on the dates of the letters that involve
particular people or places. A sample output of Mare is depicted in Fig. 4.
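The kind of pre-processing described above can be sketched as a small map/reduce step. This is an illustration of the idea rather than Mare's own code; the field names and the sample rows are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def unique_people_with_years(
    letters: Iterable[dict],
) -> Dict[str, Optional[Tuple[int, int]]]:
    """Collect each unique name with the range of years of the letters involving it."""
    years = defaultdict(list)
    for letter in letters:
        # Map step: emit one (name, year) pair per sender and per recipient.
        for name in (letter["sender"], letter["recipient"]):
            entry = years[name]  # registers the name even if the year is missing
            if letter.get("year") is not None:
                entry.append(letter["year"])
    # Reduce step: collapse the years of each person into (first, last).
    return {name: (min(ys), max(ys)) if ys else None
            for name, ys in sorted(years.items())}


# Invented sample rows standing in for a contributed spreadsheet.
rows = [
    {"sender": "A. Writer", "recipient": "B. Reader", "year": 1641},
    {"sender": "B. Reader", "recipient": "A. Writer", "year": 1645},
    {"sender": "A. Writer", "recipient": "C. Undated", "year": None},
]
summary = unique_people_with_years(rows)
for name, span in summary.items():
    print(name, span)
```

The resulting activity ranges are the kind of contextual information that can then be passed on to the matching step alongside each name.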
    In addition to using Recon for the semi-automated matching of newly contributed
datasets, the tool has been piloted to enable the identification and linking of records for
the same letters contained in separate catalogues within EMLO. To achieve this, Recon
is configured to run SPARQL queries across the EMLO dataset to identify potential
’matching’ letters, i.e., letters that have the same sender and recipient, and share similar
or exact data in other metadata fields, in particular repository and shelfmark references,
or printed edition details, dates, and places of origin and destination. The tool ranks the
potential duplicate matches for a given letter by taking into account the proximity of
the dates, string similarities of textual metadata fields, etc. The EMLO editors are then
able to assess whether the entries provided by different contributors in different letter
collections (whether they be listings of an early modern individual’s correspondence or
of a thematic collection) refer to the same letter; if the same letter has been entered by
different contributors, a bridge link between the two ’interpretations’ of the same letter
can be inserted in EMLO.
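The duplicate ranking described above can be sketched as a scoring function. The weighting is an assumption made for this sketch and is not taken from Recon: letters must share sender and recipient, and the score is then the mean of a date-proximity term and the string similarity of the shelfmarks. The 30-day window and all sample records are invented.

```python
from datetime import date
from difflib import SequenceMatcher


def duplicate_score(a: dict, b: dict, max_days: int = 30) -> float:
    """Return a score in [0, 1]; higher means more likely the same letter."""
    # Letters with different correspondents are never treated as duplicates here.
    if a["sender"] != b["sender"] or a["recipient"] != b["recipient"]:
        return 0.0
    days_apart = abs((a["date"] - b["date"]).days)
    date_score = max(0.0, 1.0 - days_apart / max_days)
    text_score = SequenceMatcher(
        None, a["shelfmark"].lower(), b["shelfmark"].lower()
    ).ratio()
    return (date_score + text_score) / 2


# Invented records standing in for entries from two catalogues.
first = {"sender": "p1", "recipient": "p2",
         "date": date(1641, 3, 1), "shelfmark": "MS Example 12"}
same = dict(first)
later = dict(first, date=date(1641, 3, 16))
other = dict(first, sender="p9")

print(duplicate_score(first, same))
print(duplicate_score(first, later))
print(duplicate_score(first, other))
```

Sorting the candidate pairs for a given letter by this score gives the editors a ranked list to review; the decision whether two entries describe the same letter remains manual.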
29
     http://github.com/jiemakel/mare
Fig. 4. A sample output of the Mare tool listing unique people, their activity years, and places
involved in a letter catalogue.


    Whilst working with Recon, EMLO’s editors can call up records to identify
matches, which allows them to review people, place, and letter records in different
combinations and to view the correspondence metadata ’from different angles’. In
consequence, errors and partial matches are spotted more easily, and the records can be
corrected, cleaned, and augmented in tandem, as appropriate.


6     Visualisation and Analysis Tools

The epistolary data published in a structured format can be conveniently visualised
using general-purpose data visualisation and exploration tools, such as Palladio30 [3],
RAW31 , or SPARQL Faceter [9]. Palladio can not only ingest data from a spreadsheet,
but the data can also be loaded directly from a SPARQL endpoint. This allows for
the creation of live visualisations without the need to export data manually each time.
Palladio can be used, e.g., for graph, timeline, or map-based visualisations. SPARQL
Faceter allows a scholar to interactively examine a dataset by filtering it using different
facets, such as sender, recipient, origin, destination, date, or catalogue.
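The faceted filtering described above can be illustrated with a minimal in-memory sketch. SPARQL Faceter itself generates SPARQL queries on the client side; the version below only mimics the interaction on a list of records, and the field values are invented.

```python
from collections import Counter
from typing import Dict, List


def apply_facets(letters: List[dict], selections: Dict[str, str]) -> List[dict]:
    """Keep only the letters that match every selected facet value."""
    return [letter for letter in letters
            if all(letter.get(facet) == value
                   for facet, value in selections.items())]


def facet_counts(letters: List[dict], facet: str) -> Counter:
    """Count how many letters fall under each value of one facet."""
    return Counter(letter[facet] for letter in letters)


# Invented records; the facet names follow those listed in the text.
letters = [
    {"sender": "p1", "recipient": "p2", "origin": "London", "catalogue": "cat-a"},
    {"sender": "p1", "recipient": "p3", "origin": "Oxford", "catalogue": "cat-a"},
    {"sender": "p2", "recipient": "p1", "origin": "London", "catalogue": "cat-b"},
]

selected = apply_facets(letters, {"sender": "p1"})
print(len(selected))
print(facet_counts(selected, "origin"))
```

After each selection the counts of the remaining facets are recomputed, which lets a scholar see how a choice such as a particular sender narrows the origins, dates, or catalogues still available.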
    Fig. 5 visualises the temporal distribution of the catalogues included in EMLO,
using the RAW data visualization framework. One can see that EMLO contains different
catalogues (colour-coded) of letters from the time period 1500–1800, with the
highest peak representing correspondence activity in the 1640s. Fig. 6 visualises the
social relationships of Samuel Hartlib based on the prosopographical data in EMLO
(connections of two steps from Hartlib), using Palladio. The map shows the connections
Hartlib had to various locations around Europe (the size of a circle represents the
number of connections), and from the timeline one can see, e.g., that Hartlib was most
active in the 1640s. Further visualizations of Hartlib’s network using extended
prosopographical data not yet integrated into EMLO may be viewed and queried via a pilot
Shiny/R dashboard32 .
30
     http://hdlab.stanford.edu/palladio/
31
     http://app.rawgraphs.io


       Fig. 5. The temporal distribution of the correspondence in the catalogues in EMLO.


7     Discussion
This paper presented the idea of using Linked Data as a basis for aggregating, harmonising,
publishing, and using epistolary data in a distributed setting. To test and demonstrate
the ideas, the existing EMLO service data was re-used, transformed into Linked
Data, and published as a “5-star”33 Linked Data service. On top of the SPARQL endpoint
provided by the data service, further tools were created which could be utilised
by the scholarly community. The Mare and Recon tools are already in active use by
EMLO’s editors at the University of Oxford. We also demonstrated the potential of
application development on top of the linked data service, by using Palladio and RAW for
visualising the epistolary data from a digital humanities research perspective.
32
     https://idn.web.ox.ac.uk/article/cultures-knowledge-case-study
33
     http://5stardata.info

        Fig. 6. Samuel Hartlib’s social relationships visualised on a map and timeline.

    This paper focused on epistolary data only, but the Republic of Letters is of course
not only about letters, but scholarly communications and the exchange of knowledge
more broadly, including books, essays, artifacts, etc. A major benefit of the Linked
Data approach in the future is that the model is flexible enough for representing different
kinds of scholarly and cultural heritage content in an interoperable, machine
“understandable” (semantic) way, including both tangible and intangible aspects of culture
and history [6]. Based on semantic representations of knowledge, new kinds of
services based on, e.g., intelligent data analysis, Artificial Intelligence, and Knowledge
Discovery can be conceived and created.

     However, the envisioned potential and benefits also have a price tag. Legacy systems
already in use do not yet support Linked Data, and the technology is new and not con-
sistently established in IT departments. The most important challenge is, however, that
using the new model requires greater collaboration and mutual agreements between the
participating organizations, which complicates the process. One has to take into consid-
eration the shared ontologies and vocabularies used by the community, not only one’s
own preferred standards and practices. However, since in this case the final goal of the
community is to create a global view of the Republic of Letters, it is better to
avoid interoperability problems before they arise, by means of a Linked Data
infrastructure, than to try to solve them afterwards when the damage is already done [5].
As Albert Einstein put it: Intellectuals solve problems, geniuses prevent them.
Acknowledgements Our work is part of the EU COST Action project Reassembling
the Republic of Letters34 and the Cultures of Knowledge project, funded by The Andrew
W. Mellon Foundation. The work is also part of the Open Science and Research
Programme35 , funded by the Ministry of Education and Culture of Finland.
34
     http://www.republicofletters.net
35
     http://openscience.fi


References
 1. Andert, M., Berger, F., Molitor, P., Ritter, J.: An optimized platform for capturing metadata
    of historical correspondence. Digital Scholarship in the Humanities 30(4), 471–480 (2015),
    https://doi.org/10.1093/llc/fqu027
 2. Doerr, M.: The CIDOC CRM—an ontological approach to semantic interoperability of
    metadata. AI Magazine 24(3), 75–92 (2003), https://doi.org/10.1609/aimag.
    v24i3.1720
 3. Edelstein, D., Findlen, P., Ceserani, G., Winterer, C., Coleman, N.: Historical research in
    a digital age: Reflections from the mapping the republic of letters project. The American
    Historical Review 122(2), 400–424 (2017), https://doi.org/10.1093/ahr/122.
    2.400
 4. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space (1st edi-
    tion). Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool
    (2011), http://linkeddatabook.com/editions/1.0/
 5. Hyvönen, E.: Preventing interoperability problems instead of solving
    them. Semantic Web Journal 1(1–2), 33–37 (December 2010),
    http://www.semantic-web-journal.net/content/
    preventing-interoperability-problems-instead-solving-them
 6. Hyvönen, E.: Publishing and using cultural heritage linked data on the semantic
    web. Morgan & Claypool, Palo Alto, CA (2012), https://doi.org/10.2200/
    S00452ED1V01Y201210WBE003
 7. Hyvönen, E., Tuominen, J., Ikkala, E., Mäkelä, E.: Ontology services based on crowdsourc-
    ing: Case national gazetteer of historical places. In: Proceedings of the ISWC 2015 Posters &
    Demonstrations Track. CEUR-WS Proceedings (2015), http://www.ceur-ws.org/
    Vol-1486/paper_45.pdf, vol 1486
 8. Hyvönen, E., Tuominen, J., Alonen, M., Mäkelä, E.: Linked data Finland: A 7-star model
    and platform for publishing and re-using linked datasets. In: Proceedings of the ESWC
    2014 Demo and Poster Papers. Springer–Verlag (2014), https://doi.org/10.1007/
    978-3-319-11955-7_24
 9. Koho, M., Heino, E., Hyvönen, E.: SPARQL Faceter—Client-side Faceted Search Based on
    SPARQL. In: Joint Proceedings of the 4th International Workshop on Linked Media and
    the 3rd Developers Hackshop. CEUR Workshop Proceedings (2016), http://ceur-ws.
    org/Vol-1615/semdevPaper5.pdf, vol 1615
10. Lebo, T., Sahoo, S., McGuinness, D.: PROV-O: The PROV Ontology (2013), http:
    //www.w3.org/TR/2013/REC-prov-o-20130430/, W3C Recommendation 30
    April 2013
11. Smith, B., Almeida, M., Bona, J., Brochhausen, M., Ceusters, W., Courtot, M., Dipert, R.,
    Goldfain, A., Grenon, P., Hastings, J., Hogan, W., Jacuzzo, L., Johansson, I., Mungall, C.,
    Natale, D., Neuhaus, F., Overton, J., Petosa, A., Rovetto, R., Ruttenberg, A., Ressler, M.,
    Rudniki, R., Seppälä, S., Schulz, S., Zheng, J.: Basic formal ontology 2.0 – specification and
    user’s guide (2015), https://github.com/BFO-ontology/BFO/raw/master/
    docs/bfo2-reference/BFO2-Reference.pdf, June 26
12. Tuominen, J., Frosterus, M., Viljanen, K., Hyvönen, E.: ONKI SKOS server for publish-
    ing and utilizing SKOS vocabularies and ontologies as services. In: Proceedings of the 6th
    European Semantic Web Conference (ESWC 2009). pp. 768–780. Springer–Verlag (2009),
    https://doi.org/10.1007/978-3-642-02121-3_56
13. Tuominen, J., Hyvönen, E., Leskinen, P.: Bio CRM: A data model for representing biograph-
    ical data for prosopographical research. In: Biographical Data in a Digital World (BD2017)
    (2017), https://doi.org/10.5281/zenodo.1040712
14. Verborgh, R., De Wilde, M.: Using OpenRefine. Packt Publishing (2013)
15. Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and maintaining links on
    the web of data. In: Proceedings of the 8th International Semantic Web Conference
    (ISWC 2009). pp. 650–665. Springer–Verlag (2009), https://doi.org/10.1007/
    978-3-642-04930-9_41

</pre>