
Semantically Enriched Historical Data. Drawing on the Example of the Digital Edition of the "Urfehdebücher der Stadt Basel"

Christopher Pollin

christopher.pollin@uni-graz.at

Georg Vogeler

georg.vogeler@uni-graz.at

University of Graz, Centre for Information Modelling, Austria

2017

27 32

Historical data is widely recognized as a rather complex type of data: it contains records about multi-layered, context-sensitive entities and can often be represented as a graph. This paper describes the digital edition of the "Urfehdebücher der Stadt Basel" as an example of how semantic web technologies can offer comprehensive tools in response to the challenges that come with historical data. It introduces the FEDORA Commons based GAMS infrastructure, reports the workflow from XML/TEI¹ encoded historical documents to semantically enriched data in the form of XML/RDF, and describes the specific data model for the resource. Finally, the paper discusses how the data can be used beyond a standard web interface with reading and search functionalities, namely for analysis with network visualisation.

Keywords: GAMS, historical data, digital edition, semantic enrichment, Urfehde, TEI, RDF, SKOS, data visualisation

The High Level Expert Group on Scientific Data formulated their shared vision for 2030: 'Our vision is a scientific e-infrastructure that supports seamless access, use, re-use and trust of data. In a sense, the physical and technical infrastructure becomes invisible and the data themselves become the infrastructure.' [Neuroth et al. 2012] Scientific data is always related to the context of a scientific problem. Research data in the humanities, including historical data, is interlinked with its scientific discipline and tends to be complex in a specific way. [Thaller 1989] points out particular challenges concerning historical data: historical terms, for example 'Prussia', can vary in meaning depending on their spatial and temporal context. This leads to a definition of historical data by [Meroño-Peñuela / Hoekstra 2014] as the union of a static, unique primary source and dynamic secondary sources, where the latter point at the primary source in different time- and context-sensitive ways. For this reason the authors recommend describing historical data as a graph and connecting it to linked open data sources using taxonomies or ontologies on the one hand, and dereferencing services inside a digital archive on the other hand. Given this vision and the fact that historical data is multi-layered and context-sensitive, semantic web technologies can offer comprehensive tools that address these problems. The aim of this paper is to outline the process of semantic enrichment of a historical dataset, the 'Urfehdebücher der Stadt Basel', as the primary source, and its representation as a secondary source, the 'Urfehdebücher der Stadt Basel – digitale Edition' [Burghartz / Calvi / Vogeler 2016]. The process of semantically enriching and formalizing data using semantic web technologies could fulfil the vision of data becoming its own infrastructure.

¹ http://www.tei-c.org, 21.7.2017

The text of the 'Urfehdebücher der Stadt Basel – digitale Edition' was created in a small-scale teaching project by Susanna Burghartz at the University of Basel together with her students, with particular contributions by Sonia Calvi and Anna Reiman. The technical realization was developed by the Centre for Information Modelling at the University of Graz. The aim of this low-budget, student-supported project is primarily experimental: applying semantic web technologies to a historical source. 'Urfehde' can be roughly translated as 'oath of truce'. The purpose of the so-called 'Urfehde' was to settle a dispute between two conflicting parties and to bind a sentenced criminal by a unilateral oath not to take revenge on the judge. This was legal practice in most of central Europe in the late Middle Ages and the early modern period. The oaths were recorded in the so-called 'Urfehdebücher', which have survived in many archives, as demonstrated in the data of the Index Librorum Civitatum project.² The first Urfehdebuch X of the city of Basel (StadtA Basel Ratsbücher O10) records 'Urfehde' oaths from 1563 to 1569. This source can serve as an exemplary dataset because it has a significant structure that can be used in statistical analysis; in addition, with 625 entries, the dataset is large enough to contribute to research on the cultural, social, and economic history of early modern people.

The task in realising this digital edition was therefore to combine established and easy-to-use transcription workflows based on XML/TEI annotation with a conversion to RDF, in order to prepare a basis for data analysis and data publication. This raises the question of which advantages semantic web technologies can offer to scholars regarding the retrieval, visualisation and analysis of historical data in the humanities.

1.1 Related Work

Perhaps the first digital edition to make use of semantic web technologies in a similar way was the edition of the Henry III fine rolls³ [Ciula et al. 2008]. The project combined a TEI transcription with a CIDOC-CRM based ontology expressed in OWL. The RDF data of the project was not made openly accessible. The Henry III fine rolls project follows the approach of building an extended index for the digital representation of the primary source, as described by [Poupeau 2006]. A successful example of this approach is 'Sandrart.net'. This digital scholarly edition encodes data using XML/TEI. The data is made available as Linked Open Data in RDF.⁴ Similarly, the platform for historical research SYMOGIH⁵ is preparing a SPARQL endpoint. Recently, the 'Semantic Blumenbach' project has explored new approaches for linking artefacts and text [Wettlaufer et al. 2015]. It uses the 'scientific communication infrastructure' WissKI⁶ to implement semantic web methods for data acquisition, storage and re-use.

² http://www.stadtbuecher.de/literatur/schlagwort/137667, 18.07.2017
³ http://www.finerollshenry3.org.uk, 12.7.2017

All these projects follow the extended index approach. This, however, falls short when it comes to the analysis of abstract concepts beyond the classical index of named entities like places, persons, and objects. Historical texts like the Urfehdebücher need additional modelling to become data sets for historical analysis, in particular regarding the classification of criminal offence, punishment, and the social status of the people involved. In this respect, 'The Proceedings of the Old Bailey' project comes much closer to the Urfehdebücher. The project defines itself as a searchable edition of criminal trials held at London's central criminal court. XML/TEI markup of digitized text offers the possibility to search and analyse the source.⁷ The data set is accessible via an API⁸, but is not available as RDF or via a SPARQL endpoint. Therefore there is no directly comparable project in the same research area as the 'Urfehdebücher'. Common standards have still to be established and the challenges of interoperability are not solved yet.

2 The Digital Edition

The workflow of the Urfehdebücher project is embedded in GAMS⁹, which is described by [Steiner / Stigler 2017]. GAMS defines itself as an asset management system for the humanities and serves the administration, publication and long-term preservation of digital resources. It is based on the open source repository software FEDORA Commons. Using Cocoon services and project-specific content models for varying data streams, scholarly data can be stored and disseminated for public use. The data is represented as a readable web site, as archival data structures in XML, and via various APIs. GAMS implements a disseminator for RDF data via an RDF triplestore. Currently, the open source software Blazegraph¹⁰ is in use, which allows SPARQL queries and fulltext search in literals.

Expert academics transcribed and encoded the source in XML/TEI, structuring the text, marking up text-specific phenomena and normalizing places, persons or concepts. Additionally, the TEI attribute ana was used to add specific semantics to the applied TEI markup. ana is used because it can be applied globally to any element and can carry multiple interpretations. The XML/TEI source illustrates the markup.¹¹ The div element with the attribute ana="#uf_Eintrag" defines the content of the whole div as an 'Urfehde' entry representing a single case. The semantics of the first entry in the XML/TEI can be summarized as follows: the hireling 'Heinrich Peter' from 'Zurich' was judged as an offender, due to alcohol abuse on the 'kornmerkt' (grain market). This statement is encoded in the XML/TEI using the attribute ana, for example ana="#uf_male" for annotating the gender of a person. The values of the ana attribute are taken from a taxonomy of categories defined by the colleagues in Basel following their methodological approach to the source.¹² Its hierarchical structure of concepts can easily be converted into a SKOS resource. When the XML/TEI is ingested into the GAMS infrastructure, a project-specific XSLT stylesheet transforms all semantically enriched data into XML/RDF and writes the triples to the triplestore. The XML/RDF source shows the outcome of the transformation.¹³ The assertions describe the aforementioned 'Urfehde' entry and all its properties, which link to other concepts like uf:PersonOffender; their properties in turn refer either to literals or to concepts normalizing data, like the place 'Zurich'.

⁴ http://ta.sandrart.net/de, 12.07.2017
⁵ http://symogih.org, 12.7.2017
⁶ http://www.wiss-ki.eu/, 12.7.2017
⁷ https://www.oldbaileyonline.org, 12.07.2017
⁸ https://www.oldbaileyonline.org/static/API.jsp, 12.7.2017
⁹ gams.uni-graz.at, 12.7.2017
¹⁰ https://www.blazegraph.com, 12.07.2017
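The step from ana annotations to triples can be illustrated with a minimal Python sketch. It is not the project's XSLT stylesheet, which is not reproduced in this paper. The TEI fragment, the choice of persName and placeName, the value #uf_PlaceOfOrigin, the ufbas: identifiers and the property uf:hasAnnotation are all assumptions made for illustration; only ana="#uf_Eintrag", ana="#uf_male" and uf:PersonOffender are taken from the description above.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment. Only "#uf_Eintrag" and "#uf_male" are quoted in
# the paper; all other elements, attributes and values are assumed.
SAMPLE_TEI = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <div xml:id="entry.1" ana="#uf_Eintrag">
    <p>
      <persName ana="#uf_PersonOffender #uf_male">Heinrich Peter</persName>
      from <placeName ana="#uf_PlaceOfOrigin">Zurich</placeName>
    </p>
  </div>
</body>
"""


def ana_terms(element):
    """Turn a whitespace-separated ana attribute into 'uf:Name' terms."""
    tokens = element.get("ana", "").split()
    return ["uf:" + t[len("#uf_"):] for t in tokens if t.startswith("#uf_")]


def tei_to_triples(tei_xml):
    """Collect (subject, predicate, object) tuples for every annotated entry."""
    root = ET.fromstring(tei_xml)
    triples = []
    for div in root.iter("{%s}div" % TEI_NS):
        if "uf:Eintrag" not in ana_terms(div):
            continue
        # Assumed identifier scheme: "ufbas:" plus the xml:id of the div.
        entry = "ufbas:" + div.get(XML_ID, "unknown")
        triples.append((entry, "rdf:type", "uf:Eintrag"))
        counter = 0
        for element in div.iter():
            terms = ana_terms(element)
            if element is div or not terms:
                continue
            counter += 1
            node = "%s#a%d" % (entry, counter)
            # uf:hasAnnotation is a hypothetical linking property.
            triples.append((entry, "uf:hasAnnotation", node))
            for term in terms:
                triples.append((node, "rdf:type", term))
            label = "".join(element.itertext()).strip()
            triples.append((node, "rdfs:label", label))
    return triples


if __name__ == "__main__":
    for triple in tei_to_triples(SAMPLE_TEI):
        print(triple)
```

Because ana may hold several values, one element can receive several types, which is the property of the attribute that the workflow relies on. A real transformation would additionally resolve the normalized places and persons to their own resources.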

The extracted XML/RDF follows a simple RDFS schema¹⁴ which defines the entry (Eintrag) at its core. The entry represents the case and has properties to identify the offence and its classification, the persons named in the record and their roles, the type of punishment, and other properties connected directly to the entry and the legal procedure, e.g. the date of the oath (DatumUrfehde), the date of the offence (DatumTat), and the notarial authentication of the entry (NotarialSubscription). Fulltext and advanced search functionalities are implemented using SPARQL and the fulltext capabilities of the Blazegraph triplestore.¹⁵ For this purpose GAMS offers a query content model which returns XML data on demand. This can subsequently be transformed to HTML to offer additional functionalities like visualisations and data download.

3 Results and Potentials

The outcome is a semantically enriched digital edition that uses an RDF data representation¹⁶ for fulltext and advanced search.¹⁷ Using the advanced search functionalities, a user is able to employ regular expressions, constrain search results temporally and use the normalization of place names to query the data. Regular expressions are particularly useful to search for orthographic variants, e.g. eefrou?wen, which returns words like eefrouwen and eefrowen (for 'spouse'). An example of using normalized data is that a query like Kleinbasel returns all entries connected to the place named mindren Basell in the text. A navigation menu leads through the chronologically listed data. The user can collect entries from search results or while browsing the text into a personal data basket¹⁸, implemented using the local storage of the browser and simple JavaScript. Collected entries can be exported as simple CSV to be processed further with a spreadsheet application.

¹¹ gams.uni-graz.at/o:ufbas.1563/TEI_SOURCE, 12.09.2017
¹² gams.uni-graz.at/o:ufbas.kategorien/TEI_SOURCE, 12.07.2017
¹³ gams.uni-graz.at/o:ufbas.1563/RDF, 12.09.2017
¹⁴ gams.uni-graz.at/o:ufbas.schema, 15.7.2017
¹⁵ wiki.blazegraph.com/wiki/index.php/FullTextSearch, 12.7.2017
¹⁶ gams.uni-graz.at/o:ufbas.1563/RDF, 13.07.2017
¹⁷ gams.uni-graz.at/query:ufbas.search/get, 13.07.2017
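The three search constraints and the CSV export can be sketched in a few lines of Python. This is only an illustration of the logic: the edition itself runs SPARQL queries against the triplestore and keeps the basket in the browser. The record layout, the field names and the three sample entries are invented; only the spellings eefrouwen, eefrowen, mindren Basell, kornmerkt and the normalized name Kleinbasel come from the text.

```python
import csv
import io
import re

# Invented sample records; "places_norm" stands for the normalized place names.
ENTRIES = [
    {"id": "e1", "date": "1563-11-03", "text": "eefrouwen", "places_norm": []},
    {"id": "e2", "date": "1568-06-14", "text": "eefrowen mindren Basell",
     "places_norm": ["Kleinbasel"]},
    {"id": "e3", "date": "1569-01-20", "text": "kornmerkt", "places_norm": []},
]


def search(entries, pattern=None, place=None, start=None, end=None):
    """Filter entries by regular expression, normalized place and date range."""
    regex = re.compile(pattern) if pattern else None
    hits = []
    for entry in entries:
        if regex and not regex.search(entry["text"]):
            continue
        if place and place not in entry["places_norm"]:
            continue
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if start and entry["date"] < start:
            continue
        if end and entry["date"] > end:
            continue
        hits.append(entry)
    return hits


def basket_to_csv(entries):
    """Serialize collected entries as CSV text for a spreadsheet application."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "date", "text"],
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


if __name__ == "__main__":
    print([e["id"] for e in search(ENTRIES, pattern=r"eefrou?wen")])
    print(basket_to_csv(search(ENTRIES, place="Kleinbasel")))
```

The pattern eefrou?wen makes the letter u optional, so both spellings match. The place query succeeds although the string Kleinbasel never occurs in the transcribed text, because it is compared against the normalized value.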

Because the whole data set is defined as an RDF graph and the data itself has a network character, adequate forms of information visualization are possible. An exemplary scholarly question regarding the 'Urfehdebücher' could be whether female offenders of a specific type of crime were treated and punished differently than male offenders. Visualizing the relations between offenders, places, time, punishment or crime in the whole data set, or in parts of it, could open new approaches to working with the source, or make it possible to identify at a glance which category or question could be interesting. We did some experiments using the d3.js¹⁹ library to create force-directed graphs based on the result of the search for a category.
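How a search result becomes input for such a graph can be sketched as follows. The Python code below is an assumption about the data preparation, not the project's implementation: it turns a list of cases into a node-link structure (a list of nodes and a list of source/target links) of the kind commonly handed to a d3.js force layout, and counts the degree of each node. The three cases and their field names are invented; the category and the values uf:male, uf:female, 1568-06-14, tagloner and Zurich are borrowed from the paper.

```python
import json
from collections import Counter

# Invented search result for one category.
CASES = [
    {"id": "case1", "gender": "uf:male", "date": "1568-06-14",
     "occupation": "tagloner", "origin": "Zurich"},
    {"id": "case2", "gender": "uf:male", "date": "1568-06-14",
     "occupation": None, "origin": "Zurich"},
    {"id": "case3", "gender": "uf:female", "date": "1565-03-01",
     "occupation": None, "origin": None},
]

PROPERTIES = ("gender", "date", "occupation", "origin")


def build_graph(category, cases):
    """Link the category to each case and each case to its property values."""
    nodes = {category: {"id": category, "group": "category"}}
    links = []
    for case in cases:
        nodes[case["id"]] = {"id": case["id"], "group": "case"}
        links.append({"source": category, "target": case["id"]})
        for prop in PROPERTIES:
            value = case.get(prop)
            if value is None:
                continue
            # Shared values (same date, same place) become a single node.
            nodes.setdefault(value, {"id": value, "group": prop})
            links.append({"source": case["id"], "target": value})
    return {"nodes": list(nodes.values()), "links": links}


def degrees(graph):
    """Count how many links touch each node."""
    counts = Counter()
    for link in graph["links"]:
        counts[link["source"]] += 1
        counts[link["target"]] += 1
    return counts


if __name__ == "__main__":
    graph = build_graph("uf:ThreatOfPunishment", CASES)
    print(json.dumps(graph, indent=2))
    print(degrees(graph).most_common(3))
```

Values shared by several cases collapse into one node, so they act as bridges between cases. The degree count gives a first numerical counterpart to what the layout shows visually.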

The figure shows a graph of the search by category uf:ThreatOfPunishment.²⁰ The light green node in the center represents this category. Every dark blue node refers to a case reported in the 'Urfehdebuch' which is connected to the node with the value uf:male (large blue node). The light blue nodes represent cases connected to women (uf:female, large yellow node). The other paths from the case nodes represent dates (light blue), occupations (green), and places of origin (orange). The gender nodes are obviously the major bridge nodes, but other properties of the cases can serve as additional bridge nodes, e.g. when several cases share the same date (1568-06-14), the same profession (tagloner) or the same place (Zurich). The force-directed graph allows a first instantaneous interpretation: the degree and centrality of the gender nodes move the node for female offenders to the outer part of the graph. A threat of punishment was therefore much more often applied to male offenders, and alcohol abuse was a problem recorded mostly for men. Certainly, detailed research still has to establish these numbers relative to the number of all cases. The graph visualization can assist retrieval and discovery functionalities in the future.

4 Conclusion and Further Work

The example of the Urfehdebücher demonstrates that creating an XML/TEI transcription of a text which is prepared to be used as semantic web data offers new approaches for scholarly editions, fits the graph-like understanding of historical data, and makes the data more expressive and self-describing. The transformation of the textual statements into RDF is achieved with easy annotation and little programming effort. The RDF dataset can be used as a fundamental database technology in the online publication as well as for advanced research questions. The data created can be queried and visualized in ways that are beneficial for historical research. Finally, the publication of this data with semantic web technologies makes it possible to provide the data model, the taxonomy and the data itself openly in a standardized way as RDFS, SKOS and generic RDF data. Aligning the data model and the taxonomy with other resources like the Old Bailey project is an envisioned future improvement and can be the first step towards a common vocabulary.

¹⁸ gams.uni-graz.at/context:ufbas?mode=datenkorb, 13.07.2017
¹⁹ https://d3js.org, 21.07.2017
²⁰ gams.uni-graz.at/context:ufbas/StrafeStrafandrohung, 12.09.2017

5 Appendix

– TEI-Source: gams.uni-graz.at/o:ufbas.1563/TEI_SOURCE
– RDF-Source: gams.uni-graz.at/o:ufbas.1563/RDF
– Graph of uf:ThreatOfPunishment: gams.uni-graz.at/context:ufbas/StrafeStrafandrohung

[Burghartz / Calvi / Vogeler 2016] Burghartz, Susanna / Calvi, Sonia / Vogeler, Georg: Urfehdebücher der Stadt Basel – digitale Edition, Graz 2016, gams.uni-graz.at/ufbas.

[Ciula et al. 2008] Ciula, Ariana / Spence, Paul / Veira, José Miguel: Expressing complex associations in medieval historical documents. The Henry III Fine Rolls Project, in: Literary and Linguistic Computing 23 (2008), p. 311–325, DOI: 10.1093/llc/fqn018.

[Meroño-Peñuela / Hoekstra 2014] Meroño-Peñuela, Albert / Hoekstra, Rinke: What is linked historical data?, in: International Conference on Knowledge Engineering and Knowledge Management. Springer, Cham, p. 282–287.

[Neuroth et al. 2012] Neuroth, Heike / et al.: Langzeitarchivierung von Forschungsdaten. Eine Bestandsaufnahme. Hülsbusch, 2012, p. 15.

[Poupeau 2006] Poupeau, Gautier: De l'index nominum à l'ontologie. Comment mettre en lumière les réseaux sociaux dans les corpus historiques numériques?, in: Digital Humanities 2006. The First ADHO International Conference: Conference Abstracts. Université Paris-Sorbonne. 2006.

[Steiner / Stigler 2017] Steiner, Elisabeth / Stigler, Johannes: GAMS and Cirilo Client. Policies, documentation and tutorial. Graz, 2014–2017, http://gams.uni-graz.at/docs.

[Thaller 1989] Thaller, Manfred: The Need for a Theory of Historical Computing, in: Denley, Peter / et al.: History and Computing II, Manchester and New York, 1989, p. 4–6.

[Wettlaufer et al. 2015] Wettlaufer, Jörg / et al.: Semantic Blumenbach. Exploration of Text–Object Relationships with Semantic Web Technology in the History of Science, in: DSH Digital Scholarship in the Humanities 30, suppl. 1, 2015, p. 187–198.