TEI Beyond XML – Digital Scholarly Editions as Provenance Knowledge Graphs

Andreas Kuczera
University of Applied Sciences (THM)
Gießen, Germany

Abstract

This paper proposes to detach TEI semantics – a widely accepted standard for the description of textual phenomena – from its hierarchical XML framework in order to integrate its descriptive structures into a digital scholarly edition (DSE) of Hildegard von Bingen’s Liber epistolarum based on a knowledge graph enriched with provenance information.

To which problem is digitization the solution? (Nassehi, 2019) [1]

[1] Translation by the author.

Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Tara Andrews, Franziska Diehr, Thomas Efer, Andreas Kuczera and Joris van Zundert (eds.): Graph Technologies in the Humanities - Proceedings 2020, published at http://ceur-ws.org
This long paper is based on research presented at “Graph Technologies in the Humanities 2019” (January 18-19, Academy of Sciences and Literature | Mainz, Germany).

1 Introduction

The search for origins is a quintessential human activity. Scholars in the humanities – especially historians – engage in this activity by examining cultural artefacts such as texts, objects, and images. In so doing, they make use of consensus-based techniques and methods, which enable a common understanding of their findings once they are published, for example, in the form of a critical edition. And yet these scholarly standards are themselves subject to a constant process of alteration and development: historical research is now increasingly permeated by digitization and the possibilities that come with it.
Digital technology has significantly extended the methodological repertoire of researchers in the humanities, and the fact that scholars are no longer bound to paper as a medium means that a rich variety of new interpretative approaches has emerged – a state of affairs that is clearly at odds with Barbara Bordalejo’s contentious assertion that “there is no such thing as digital scholarly editing” (Bordalejo, 2018, p. 24).

The majority of today’s digital scholarly editions (DSEs) use the Text Encoding Initiative (TEI) standard in combination with XML and its inherent hierarchies, as it is widely considered to be “a well-documented format for archival long-term preservation” which allows researchers to describe “a large number of textual phenomena in general ways” (Cummings, 2018). But research data in general, and the data of DSEs in particular, is highly connected and far from easy to express within a hierarchy (Witt, 2018, pp. 222-223), a point to which I will return in Section 4. James Cummings certainly does not exaggerate when he states that the notion that “XML has difficulty with overlapping hierarchies is not, in itself, strictly a myth” (Cummings, 2018, p. i70). And things become even more complex if we begin to include the divergent perspectives of researchers concerning the transcription and edition of a given text. One crucial step towards addressing this issue is to move from the text-as-document paradigm toward what Zundert and Robinson refer to as the text-as-work paradigm (van Zundert, 2016, pp. 103-104). But as I will argue in what follows, we ought in fact to go one step further by fully recognizing that researchers in their various roles as transcribers, editors, annotators, and users are themselves a key factor in the system of textual editing.
Such an approach follows Niklas Luhmann’s observation that

    [t]he inclusion of the observer and the instruments of observation in the objects of observation themselves is a specific characteristic of universal theories. [2]

It goes without saying that this essay is not about presenting a universal theory – the point is that we would be well advised to think of editors and users as integral parts of an interconnected system. Everything contributors do, all their observations and decisions, becomes part of the DSE as a work (Kuczera and Kasper, 2019). Jeffrey C. Witt has already suggested conceiving of DSEs as multipartite networks (Witt, 2018), which is essentially a way of describing a graph. For Witt, however, researchers themselves do not form part of this network: on the textual level, he models “each text as an Ordered Hierarchy of Content Objects (OHCO)” (Witt, 2018, p. 231). [3]

[2] (Luhmann, 1987, 164) (translation by the author).

What I would like to propose instead is to model a DSE as a provenance knowledge graph which contains the entire critical apparatus, one or more transcription(s) and, if applicable, details concerning the relationships among them, as well as information on the origin of every statement. Describing textual phenomena (including the actions of the editor or editors) by means of TEI semantics has the advantage of maintaining semantic interoperability, as TEI is the established standard in the field. Moreover, TEI renders the connection between researchers and their work transparent. Made available in the form of a provenance knowledge graph, this crucial information in effect turns research data into a collection of subjective decisions made by researchers – it is then up to individual users to decide how much they trust these decisions based on the expertise and academic profile of the scholar(s) in question. To manage this highly connected trove of research data, a labeled property graph (LPG) database can be used.
With this groundwork in place, the next step will be to connect individual knowledge graphs either in part or in their entirety to a broader system of concurring and/or diverging statements and interpretations.

[3] On OHCO, see DeRose et al. (1990).

2 The Rise of Connected Research Data

To better understand the desideratum articulated in this paper, let us consider the process of digitization in the field of medieval history, a development that has taken place in at least two distinct stages.

2.1 Image Digitization

The first stage, which lasted until the end of the 1990s, was characterized by a strong focus on image digitization. In Germany, one important protagonist in the field was the dMGH project, in the course of which the volumes of the Monumenta Germaniae Historica (MGH) were scanned, saved as image files, and made accessible on the internet (Sahle and Vogeler, 2013). By way of example, Figure 1 shows the scan of a page from the MGH with a transcription of a charter of Emperor Frederick Barbarossa. In most cases, these early attempts at digitization did not allow for text to be copied out of the images, but they were still a step in the right direction: a large amount of research material was made available to researchers even if the paper copies were absent from their library.

Figure 1: Scan of MGH DDFI.2, p. 260 with a charter of Emperor Frederick Barbarossa (Source: dMGH https://www.dmgh.de/mgh_dd_f_i_2/index.htm#page/260/mode/1up)

2.2 Full Text Digitization

Around the turn of the millennium, this first stage evolved into a phase of full text digitization with projects such as Regesta Imperii Online (Schulz, 2017). Having been personally involved in this project, I can vividly remember the discussions about whether image digitization or full text digitization should be used. One major argument advanced by the proponents of image digitization was that optical character recognition (OCR) was still in its infancy and highly error-prone.
We addressed this issue by linking every full text item on our website to a scan of the corresponding book page in the Regesta Imperii, which gave users direct access to the material that was being digitized and allowed them to identify inaccuracies.

Figure 2: Full text version of the charter from Figure 1 (Source: http://www.regesta-imperii.de/id/1162-09-07_2_0_4_2_2_587_1145)

With the advent of full text digitization, large-scale computer-based text retrieval from historical documents became a possibility, and this major improvement brought with it entirely new ways of scholarly exploration.

2.3 Entities in Focus

Today, we are facing the next important step: it is now time to focus on the entities in the text.

Figure 3: Fol. 341r of the Liber epistolarum

Identifying, annotating, and connecting these entities with data from authority files like GND or Wikidata enables the interconnection of individual research projects. It also becomes possible to model scholarly interpretations and the various steps of the research process in machine-readable statements – Section 4 will describe this process using the example of a project dedicated to Hildegard von Bingen’s correspondence.

3 “Myths and Misconceptions about the TEI” – Thoughts From an Expert

In a recent article, Cummings (2018) shared his thoughts on what he perceives to be widespread myths or misconceptions concerning the TEI.

3.1 “XML is broken or dead”

The first of these myths is that “the TEI is XML (and XML is broken or dead).” As Cummings points out,

    [t]he TEI Guidelines were first expressed in SGML as a markup language and only as of TEI P4 moved to recommending XML, but even this recommendation may change in the future (Cummings, 2018, p. i59).
With SGML, there had been no problems with overlapping markup – only with the shift to XML did overlap create the need for various workarounds. [4] On the other hand, the use of XML gave access to its entire ecosystem, and XML was the rising star in the field of markup at the time. Yet there is no reason why this arrangement has to be permanent:

    [A]s new languages, technologies, and methodologies for text encoding emerge in future, the TEI Guidelines may move to them or include them as one of a set of ways to serialize digital text, so as long as they meet the basic requirements for easy long-term preservation, expressiveness, validation, integration, and mass adoption that is seen with XML. (Cummings, 2018, p. i59).

[4] https://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html

And this is precisely where our DSE of Hildegard von Bingen’s letters comes into play: its core principle is to employ TEI semantics without the hierarchical structure of XML.

3.2 The Future of XML as a Format for Text Encoding

At this juncture, I would like to share a personal observation regarding the future of XML. When we started using the format in our project Regesta Imperii Online in the early 2000s (Rübsamen and Kuczera, 2006), the project involved only comparatively basic annotation, so that any of the large number of freely available XML editors in plain XML mode was up to the task. Nowadays, many edition projects in the digital humanities employ the commercial software OxygenXML, often in combination with the virtual research environment ediarum, [5] which provides customizable GUI features. The reason for this is really quite simple: today’s annotation structures are often very complex, but OxygenXML’s author mode makes the intricate XML elements editable while conveniently hiding them from the user’s view.
There is a good reason why, in the broader field of software technology, XML is employed for purposes such as data exchange and the structuring of data in configuration files, but not for the sophisticated annotation of texts. Of course, publishers do use XML for their books, but these texts are nowhere near as deeply annotated as the ones that we are dealing with in the digital humanities today.

In fact, one could argue that the TEI community is in real danger of hitting a dead end, unless viable alternatives to XML are found in a timely fashion.

[5] https://www.bbaw.de/bbaw-digital/telota/forschungsprojekte-und-software/ediarum

3.3 “XML (and TEI) cannot handle overlapping hierarchies”

Another myth discussed by Cummings is that “XML (and TEI) cannot handle overlapping hierarchies” (Cummings, 2018, pp. i70-i71). Clearly, this is a bit of an overstatement: the TEI community has developed several mechanisms to deal with the issue of overlapping markup – at least to a certain extent. [6] But as the number of annotation hierarchies grows, these strategies do run into increasing problems. In light of this, we could rephrase Cummings’ statement as follows: “XML (and TEI) cannot handle a sufficient number of overlapping hierarchies without complicated and ultimately inadequate workarounds.” In some projects, a single annotation hierarchy may well be all that is needed – but being able to manage more of them should the necessity arise gives researchers considerably more flexibility.

[6] https://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html

To give an example, our graph-based DSE environment Codex (Kuczera and Neill, 2019) contains regions of text with up to six layers of annotation:

• Layout (page breaks, columns, alignment, etc.)
• Style (highlighted text, etc.)
• Entities (persons, places, concepts, etc.)
• Syntax
• Morphology
• Language

Customized annotation layers can easily be added to this list by the user.
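To make the overlap problem concrete, the following minimal Python sketch stores annotations from several layers as index ranges over one plain text and reports which pairs overlap without one containing the other, that is, which pairs could not be written as properly nested XML elements without a workaround. The sample offsets, the line division, and all names (`Annotation`, `crosses`, `crossing_pairs`) are illustrative assumptions and are not taken from Codex.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Annotation:
    layer: str   # e.g. "layout", "style", "entities"
    label: str   # free-form description of the annotation
    start: int   # index of the first annotated character (inclusive)
    end: int     # index after the last annotated character (exclusive)


def crosses(a: Annotation, b: Annotation) -> bool:
    """True if the two ranges overlap but neither contains the other."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


def crossing_pairs(annotations):
    """All pairs of annotations that one XML hierarchy cannot nest."""
    return [(a, b) for a, b in combinations(annotations, 2) if crosses(a, b)]


text = "friderico imperatori hildegardis"

# Hypothetical annotations: the line division is invented for illustration.
annotations = [
    Annotation("entities", "person", 0, 20),
    Annotation("layout", "line 1", 0, 14),
    Annotation("layout", "line 2", 14, 32),
    Annotation("style", "rubricated", 0, 32),
]

for a, b in crossing_pairs(annotations):
    print(f"{a.layer}/{a.label} crosses {b.layer}/{b.label}")
```

In this sketch every layer is simply another set of ranges over the same indices, so adding a layer never forces a decision about which element contains which.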
The substantial benefits of flexible, multidimensional annotation hierarchies in a DSE will be explained in detail in the following section.

4 Hildegraph: TEI without Hierarchies

In March 2020, a project based on the idea of using TEI semantics without the accompanying XML hierarchy was inaugurated under the title The Book of Letters of Hildegard von Bingen. Genesis – Structure – Composition. [7]

4.1 The Sources

The transmission history of Hildegard von Bingen’s (1098–1179) letters has taken many twists and turns. Within this complex and convoluted story, the so-called Riesen-Codex [‘giant codex’] [8] – a book of letters (Liber epistolarum) which consists of brief epistolary texts arranged to form a cohesive theological whole (see Figure 4 – the beginning of each letter is marked with larger characters and red ink) – assumes a particularly prominent position. The reason for this can be explained by two separate, albeit closely related, aspects of its reception history: first, both medieval and modern audiences are unanimous in their verdict that the Liber epistolarum is of equal importance to Hildegard’s works of visionary theology; second, the Riesen-Codex can lay claim to the special status of a last hand edition, as it was compiled by Hildegard’s staff from the entirety of her correspondence during her own lifetime and in accordance with her wishes.

[7] The project is funded by the Deutsche Forschungsgemeinschaft (DFG) https://gepris.dfg.de/gepris/projekt/429863245?language=en
[8] https://tudigit.ulb.tu-darmstadt.de/show/Hild_R_Riesencodex

Figure 4: Fol. 341r of the Liber epistolarum. Wiesbaden, Hochschul- und Landesbibliothek RheinMain (ehemals: Wiesbaden, Hessische Landesbibliothek) Hs 2 („Riesenkodex“) urn:nbn:de:hebis:43-972.

Our project is the first to present the Liber epistolarum in the form of a digital scholarly edition.
As opposed to the existing critical edition of Hildegard’s letters (van Acker and Klaes-Hachmöller, 1993-2001), which seeks to reconstruct ‘the correspondence that actually took place’ while combining different stages of transmission, our focus lies on the final authorized form that Hildegard’s letters assumed during her lifetime. Moreover, the individual letters found in the Liber epistolarum are not treated as mere historical witnesses, but rather as constituent parts of a deliberate and highly sophisticated theological-cum-literary composition.

Our edition of the Liber epistolarum is designed to be as media-neutral as possible, allowing parts of it to be printed in book form. The changes that the text underwent over time can be traced by means of a graph model, in which the genesis of the individual letters – from the oldest known version to the form that appears in the Liber epistolarum – is modeled on the basis of the pertinent manuscripts. By way of example, Figure 5 shows the interdependencies of the various versions of letter #52 found in manuscripts Z, W, M, Wr, and R.

While information concerning the evolution of the Liber epistolarum over time is stored in a graph model, the texts of the letters themselves will be transcribed in a standoff property editor with the project name hildegraph. [9] For our purposes, standoff property (SPO) means that the texts are annotated on an index base, whereas TEI-based XML markup is mainly inline. The technical outline of SPO is explained in detail in Kuczera and Neill (2019).

The project began in April 2020 with a critical transcription of the text of the Riesen-Codex. As hildegraph was not yet operational at that point, the task was initially undertaken using an adapted version of the Leiden Conventions, [10] a system that employs various types and combinations of brackets to express textual phenomena in plain text.
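The difference between bracket-based plain text and index-based annotation can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: it handles only round brackets for expanded abbreviations, the names `StandoffProperty`, `leiden_to_standoff`, and `abbreviated` are hypothetical, and the actual hildegraph data format (see Kuczera and Neill, 2019) may well differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffProperty:
    type: str   # kind of annotation, here always "expansion"
    start: int  # index of the first annotated character (inclusive)
    end: int    # index after the last annotated character (exclusive)


def leiden_to_standoff(source: str):
    """Strip round brackets and record what they enclosed as index ranges."""
    chars = []
    properties = []
    open_at = None
    for ch in source:
        if ch == "(":
            if open_at is not None:
                raise ValueError("nested round brackets are not handled here")
            open_at = len(chars)
        elif ch == ")":
            if open_at is None:
                raise ValueError("closing bracket without opening bracket")
            properties.append(StandoffProperty("expansion", open_at, len(chars)))
            open_at = None
        else:
            chars.append(ch)
    if open_at is not None:
        raise ValueError("unclosed round bracket")
    return "".join(chars), properties


def abbreviated(text: str, properties) -> str:
    """Rebuild the abbreviated reading by dropping all expansion ranges."""
    parts = []
    position = 0
    for prop in sorted(properties, key=lambda p: p.start):
        if prop.type != "expansion":
            continue
        parts.append(text[position:prop.start])
        position = prop.end
    parts.append(text[position:])
    return "".join(parts)


expanded, props = leiden_to_standoff("imp(er)atori hildeg(ardis)")
print(expanded)                      # plain text without any markup
print(props)                         # index ranges pointing into that text
print(abbreviated(expanded, props))  # abbreviated reading, derived on demand
```

Because the annotations only point into the text by index, the plain text stays free of markup, and both the expanded and the abbreviated reading can be derived from the same data.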
Currently, we are working on the transfer of these transcriptions into the hildegraph environment with the aid of TEI semantics. As noted above, our DSE uses TEI semantics without XML hierarchies – in hildegraph, multiple annotation hierarchies can coexist in one system. Table 1 shows a preliminary list of annotation types based on TEI semantics. Here it is important to keep in mind that an annotation can be assigned to multiple semantic spaces – Hildegraph is capable of managing several indexing systems at once. For example, the first line in Table 1 shows that text between lines can be identified both by the Leiden annotation leiden/supralineam and by a corresponding TEI annotation. This allows us to use different annotation layers as and when they are required, even if they violate the hierarchical structures of XML – a flexibility that does not compromise machine-readability in any way.

[9] hildegraph is derived from the Codex system described in (Kuczera and Neill, 2019) https://www.hildegraph.org/
[10] https://en.wikipedia.org/wiki/Leiden_Conventions

Figure 5: The Liber epistolarum as a theological-cum-literary composition. IS_FOLLOWED_BY edges connect letters, represented by nodes, to the ones that follow in the manuscript; IS_RELATED_TO edges connect letters to the respective reply, forming a pair of letters; and IS_COMPARABLE_WITH edges connect letters across various manuscripts whose wording or style is very similar or identical (Kuczera).

4.2 Do We Need Containment?

As the following example demonstrates, the process of translating annotations based on the Leiden Conventions into the SPO format by means of TEI semantics is not without its challenges. Figure 6 shows the end of one letter and the beginning of another. This is the transcription of the corresponding passage according to the Leiden Conventions (transcription by Sr. Maura Zatony):

... neq(ue) odiu(m) alicui(us) p(er)sonę adtendemus! sed | solius iusticię respectu equitate(m) iudicare | proponimus.
#52# [ru[friderico imp(er)atori hildeg(ardis).]ru] | [ru[A]ru] summo iudice. hęc uerba diriguntur | ad te. Ualde admirabile est q(uo)d hanc || p(er)sona(m) homo habet necessaria(m)! scilicet quę | tu rex es. Audi. Quida(m) uir stabat in excel

Figure 6: Part of fol. 341r with rubricated text

Table 1: List of annotation types (columns: Explanation, Type, Description, TEI Coding; the recoverable entries are summarized here). The rows cover text between lines (leiden/supralineam, orange text); margin notes; start and end of a text column (leiden/column, marked “//”); visible corrections (leiden/correction, red underline); corrected original spelling (leiden/sic, green underline); text written in rasura, where other text has been erased (pink underline); erased text (line through); rubricated letters (red); resolution of abbreviations (leiden/expansion, styled in blue); empty space (white space); and the start of a text line (marked “/”).

The manuscript’s red, or rubricated, characters and the capitalized ‘A’ are a signal to the reader that a new letter is about to begin. In the transcription, these parts of the text are represented by square brackets and the siglum ru: [ru[A]ru] (for Rubrum), whereas the characters in round brackets spell out the abbreviations used by the original scribes. As this system of annotation unequivocally marks which start element belongs to which end element, overlapping markup is possible.

But what is the best way to represent rubricated text by means of TEI elements? Which of the several options provided by TEI is the most promising? In an effort to find an answer to this question, we reached out to two TEI experts. One suggestion was to use the <emph> element [11] to highlight “words or phrases which are stressed or emphasized for linguistic or rhetorical effect,” [12] while an attribute could be employed to convey the information that the text’s color is red.
Here is what this approach would look like:

    friderico imp(er)atori hildeg(ardis)

[11] https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-emph.html
[12] The rhetorical aspect reminds me of Zundert’s idea of a computational edition with performative texts (van Zundert, 2019).

The other expert proposed to use the <hi> element [13] to mark “a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made,” a solution that would look like this:

    friderico imp(er)atori hildeg(ardis)

The <emph> element appeared to be a fitting choice to mark a section of text that had been highlighted for a specific purpose – but given that the red characters were graphically distinct from the surrounding text, a strong case could also be made for the <hi> element. Clearly, the rubricated text fulfilled at least two different roles: in the context of the overall layout of the page, the red characters mark the beginning of a new letter and thus serve a rhetorical and structural function, yet they also identify the sender and recipient of the letter in question. What, then, if we employed both elements to represent the distinctive red ink? Which element should come first and contain the other? And does this kind of containment even make sense?

In long discussions with various colleagues, no convincing arguments in support of the need for containment were put forward. There is simply no plausible need for it when it comes to accommodating different annotation layers like layout or rhetoric: as our Hildegraph environment attests, all of these layers can be combined in various ways without the application of hierarchies.

4.3 “So what’s the text, then?”

The brief transcription from the Liber epistolarum discussed in the previous section contains several expansions of scribal abbreviations employed in the original text. The use of abbreviations in manuscripts was a very common practice in the Middle Ages and posed few obstacles to contemporary readers.
A modern critical DSE, on the other hand, is expected to provide an expanded and normalized version of the text for convenience and ease of reference. But which version of the text should be displayed in Hildegraph’s plain text field? As a medievalist, I am inclined to argue that the version of the text that is as close as possible to the original should be shown in this prominent location; on the other hand, the expanded versions are much easier for casual users (and also for persons charged with maintaining the database) to read and understand. In the end, there are good arguments in favor of each of the two alternatives, and one of the major advantages of using SPO is that it does not force us to make a choice: as all versions can easily be converted into one another, there is simply no need to decide once and for all whether the plain text to be indexed in SPO is the original or the normalized version – this decision can be made according to the specific requirements of the individual use case.

[13] https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-hi.html

5 Digital Scholarly Editions as Provenance Knowledge Graphs

5.1 What is a Knowledge Graph?

If we continue along this line of thought, the whole DSE can be conceptualized as a provenance knowledge graph, in which every piece of information is stored together with information on where it comes from, who made which statement and when, etc.

Broader scholarly interest in knowledge graphs began to arise when Google discussed their own approach to the issue in a blog post, which essentially described an enhancement of their search engine through semantics without going into the technical details (Amit Singhal, 2012). Since then, a fair amount of research has been carried out in this area, notwithstanding the absence of a universally accepted definition of what constitutes a knowledge graph (Ehrlicher and Wöß, 2016). At times, the term has simply been used as a synonym for ontology.
According to Paulheim (2016),

    [a] knowledge graph (i) mainly describes real world entities and their interrelations, organized in a graph, (ii) defines possible classes and relations of entities in a schema, (iii) allows for potentially interrelating arbitrary entities with each other and (iv) covers various topical domains.

For Ehrlicher and Wöß (2016),

    [k]nowledge graphs are large networks of entities, their semantic types, properties, and relationships between entities,

whereas Pujara et al. (2013) point out that there are systems that

    [u]se a variety of techniques to extract new knowledge, in the form of facts, from the web. These facts are interrelated, and hence, recently this extracted knowledge has been referred to as a knowledge graph.

Another common definition [14] is that a knowledge graph represents a collection of interlinked descriptions of entities (real-world objects, events, situations, or abstract concepts), while other scholars use the term to refer to any knowledge base modeled as a graph.

[14] See, for example, https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-graph.

As if this confusion were not enough, none of these definitions says anything about the technical specifications of a knowledge graph: some explicitly mention RDF (Resource Description Framework) (Färber et al., 2016) and some suggest node properties (Ehrlicher and Wöß, 2016), but as of yet, no clear picture as to possible technical backgrounds of knowledge graph systems has emerged.

5.2 The Provenance of a Statement

The various concepts of knowledge graphs discussed in the previous section do have one thing in common: they treat information as objective truth. But in the field of the (digital) humanities, the ‘truth’ is always a matter of interpretation. The interpretative process begins the moment the very first characters of a text are transcribed.
Each of the editor’s decisions is open to discussion and constitutes a subjective statement – and that is precisely the point where provenance comes into play. Once we begin to model a DSE as a knowledge graph that includes comprehensive provenance information, we end up with a huge number of statements. Expressing all of this information in RDF would produce a huge and completely unmanageable graph, which is why we have opted to use labeled property graphs (LPGs) in our DSE.

5.3 LPG vs. RDF

RDF is a W3C standard for data exchange on the web that represents data as a graph, and this is the most important point of commonality it shares with LPGs. RDF structures information in triples in the form of subject-predicate-object, with the subject, predicate, and object being identified by Uniform Resource Identifiers (URI) (Barrasa, 2016). The statement that Emperor Frederick Barbarossa was a human being would look like this:

    (Emperor Frederick Barbarossa)-(INSTANCE_OF)-(human)

Here is the same statement translated into URIs:

    (https://www.wikidata.org/wiki/Q79789)
    (https://www.wikidata.org/wiki/Property:P31)
    (https://www.wikidata.org/wiki/Q5)

Adding his place of birth – Weingarten – would involve another triple (in Wikidata, property P19 denotes the place of birth):

    (https://www.wikidata.org/wiki/Q79789)
    (https://www.wikidata.org/wiki/Property:P19)
    (https://www.wikidata.org/wiki/Q572427)

In principle, RDF understands the world as a network of connected entities and literals. Its popularity surged with the rise of the Semantic Web, [15] which operates on the basic idea that users should publish data in structured formats with well defined semantics so that this data can be ‘understood’ by machines. Originally, this structured information was to be contained in RDF triple stores, [16] a vision that soon evolved to quad stores which added a named graph to each RDF triple. Today, the product of this evolution is commonly referred to as “semantic graph database” (Barrasa, 2016).
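As a rough illustration of the triple and quad model described above, the following Python sketch represents statements as plain tuples and matches them against simple patterns. It deliberately avoids any RDF library. The Wikidata URIs are the ones quoted above, while the named-graph URI under example.org and the names `Quad` and `match` are assumptions made for this example only.

```python
from typing import NamedTuple, Optional

WD = "https://www.wikidata.org/wiki/"


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str] = None  # named graph; None stands for the default graph


def match(store, subject=None, predicate=None, obj=None, graph=None):
    """Return all quads that fit the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj, graph)
    return [
        quad
        for quad in store
        if all(p is None or p == value for p, value in zip(pattern, quad))
    ]


store = [
    # A bare triple: Frederick Barbarossa (Q79789) is an instance of (P31) human (Q5).
    Quad(WD + "Q79789", WD + "Property:P31", WD + "Q5"),
    # The same statement placed in a hypothetical named graph that stands
    # for the statements of one particular editor.
    Quad(
        WD + "Q79789",
        WD + "Property:P31",
        WD + "Q5",
        graph="https://example.org/graph/editor-a",
    ),
]

print(len(match(store, predicate=WD + "Property:P31")))
print(match(store, graph="https://example.org/graph/editor-a"))
```

The sketch also shows the limitation at issue here: the subject, predicate, and object are opaque identifiers, so anything one wants to say about a statement has to be carried by the fourth position or by further triples.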
In LPGs, each node and edge not only has a unique and distinctive ID, but also a set of key-value pairs (or properties) that characterize it. Our example of Emperor Frederick Barbarossa and his place of birth could be expressed like this:

    (e:Entity {type:'human', wikidataId:'Q79789'})
      -[r:PLACE_OF_BIRTH {wikidataId:'P19'}]->
    (p:Place {label:'Weingarten', wikidataId:'Q572427'})

When comparing RDF triples with an LPG, it is important to keep in mind that in the latter, nodes and relationships have an internal structure. In RDF, on the other hand, a triple is composed of two nodes connected by an edge (subject-predicate-object); the subject and the relationship are each identified by a URI, and the object can be another node or a literal, so that neither nodes nor relationships have an internal structure – they are merely unique labels. As a consequence, an RDF graph could easily reach ten times the size of an LPG containing the same amount of information.

Another important difference is that RDF does not uniquely identify instances of relationships of the same type, nor does it allow instances of relationships to be qualified. In an LPG, the information is stored in the graph structure and in the internal structure of nodes and relationships. In RDF, all of this must be expressed in simple RDF triples.

In light of this, Hildegraph uses an LPG to store all of the information contained in the DSE as a statement, which allows us to explore and compare diverse (and potentially competing) interpretations. The provenance information – who made what statement when and where – is stored in the properties. Here, the versioning of graphs as discussed in Martina Bürgermeister’s contribution to this volume plays an important role.
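The idea of keeping provenance in the properties of individual relationships can be sketched with two small Python classes. This is a simplified in-memory model and not the storage layer of Hildegraph; the relationship type `IS_ADDRESSED_TO`, the property keys `statedBy` and `statedOn`, the user names, and the dates are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple generator for unique node and edge IDs


@dataclass
class Node:
    labels: tuple
    properties: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Edge:
    type: str
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))


letter = Node(("Letter",), {"number": 52})
barbarossa = Node(("Entity",), {"type": "human", "wikidataId": "Q79789"})

# Two relationships of the same type between the same two nodes. Each one
# has its own ID and carries its own provenance in its properties.
edges = [
    Edge("IS_ADDRESSED_TO", letter, barbarossa,
         {"statedBy": "editor_a", "statedOn": "2020-04-01"}),
    Edge("IS_ADDRESSED_TO", letter, barbarossa,
         {"statedBy": "editor_b", "statedOn": "2020-05-15"}),
]


def statements_by(all_edges, user):
    """All statements (edges) that a given user has made."""
    return [e for e in all_edges if e.properties.get("statedBy") == user]


for edge in statements_by(edges, "editor_b"):
    print(edge.id, edge.type, edge.properties)
```

Since each edge is an identifiable object with its own properties, agreeing or competing statements by different users can sit side by side and be filtered by author or date.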
[15] https://www.scientificamerican.com/article/the-semantic-web/
[16] https://en.wikipedia.org/wiki/Triplestore

5.4 Manuscript Structures in the Graph

The physical structure of the manuscript (and its relationship to its digital descendants) can also be modeled as a graph. In Figure 7, the main manuscript R is represented by the image in the upper part of the picture. This node is then connected to the folio nodes, which correspond to the individual folios that make up the manuscript, and which are connected by IS_FOLLOWED_BY edges that model the order of the folios within the manuscript. This part of the graph – or in other words, this subgraph – contains the information about the physical structure of the manuscript. One example of the usefulness of this information is a manuscript in which the order of the folios has been changed at some point – in a graph, both the original order and the new order can easily be modeled.

Figure 7: Physical structure of the manuscript as represented by the graph

5.5 Transcription as Connected Parts of Text

A TEI/XML-based transcription aims at expressing the layout structure of a manuscript with inline markup in one XML document. In Hildegraph, every distinct unit of text is assigned its own SPO node. The relationships between the different text blocks are represented by means of IS_NEIGHBOUR_OF edges, which encode the visual impression of vicinity in the graph. In addition, the individual passages of text are linked to a corresponding image of the folio in question. Figure 8 shows two lines of text (initium libri Epistolarum et orationum Sanctae Hildegardis) added by a later hand in the upper margin of fol. 328r.

Figure 8: Addendum by a later hand (fol. 328r.)

With XML/TEI, this text would be contained in the main body of the letter in the XML document. In Hildegraph, the added text receives its own SPO node, and the two SPO nodes are connected with an IS_NEIGHBOUR_OF edge (Figure 11).
While textual information is thus stored separately from its visual arrangement on the folio page, with all the benefits such an approach entails, the combination of text and layout can easily be examined if and when this is needed. Figure 9 shows the entire data model of Hildegraph.

Figure 9: Data model for the digital scholarly edition of the Liber epistolarum

5.6 Text as a Graph

Given the continuing lack of technical solutions for managing text directly as a graph (Kuczera, 2016), we developed our own set of standoff properties (Kuczera and Neill, 2019) based on Desmond Schmidt’s ideas (Schmidt, 2016, pp. 63–69). Since SPOs are index-based, one must select a base text for indexation. In practice, however, every version of a text can be used as base text because they can all be converted into one another. Indexing the base text makes every character of the text addressable: the characters are strung out like pearls in a long row, forming a chain of nodes in the graph which is given order by the direction of the text. All additional information is then connected to these indexed characters in a process that builds a bridge between the more transcription-related sphere of the text and the predominantly semantic and interpretative sphere of the graph. In this regard, Hildegraph goes well beyond Witt’s above-quoted proposition to model text as an ordered hierarchy of content objects (OHCO).

5.7 Transcription with Annotations of Annotations

Another SPO is created whenever a user adds an annotation, which can then be annotated again (most likely by another user) with yet another SPO, and so on. With standoff properties, every annotation is stored together with a Globally Unique Identifier (GUID) and can be traced back to the user who added it (Kuczera and Neill, 2019).
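A minimal Python sketch may help to make the index-based approach tangible. It does not reproduce the standoff property format of Kuczera and Neill (2019): the `Statement` class, its field names, the inclusive start and end indices, the user names, and the annotation values are all assumptions made for illustration. Only the base text is taken from the addendum on fol. 328r. The sketch shows an annotation pointing at character indices of the base text, a second annotation pointing at the first, and the chain of users that results.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Statement:
    """A base text or an annotation; both are statements made by a user."""
    user: str
    value: str
    kind: str = "text"
    target: Optional[str] = None  # GUID of the annotated statement, if any
    start: Optional[int] = None   # index of the first annotated character
    end: Optional[int] = None     # index of the last annotated character (inclusive)
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


def annotated_span(statements, annotation):
    """Return the part of the target statement that the annotation refers to."""
    target = statements[annotation.target]
    if annotation.start is None:
        return target.value  # the annotation refers to the whole statement
    return target.value[annotation.start:annotation.end + 1]


def provenance_chain(statements, guid):
    """List the users from an annotation back to the base text."""
    users = []
    current = statements.get(guid)
    while current is not None:
        users.append(current.user)
        current = statements.get(current.target) if current.target else None
    return users


base = Statement(
    user="transcriber",
    value="initium libri Epistolarum et orationum Sanctae Hildegardis",
)

# Locate the annotated word by searching rather than counting by hand.
first = base.value.index("Hildegardis")
last = first + len("Hildegardis") - 1

entity = Statement(user="annotator_a", value="person name", kind="entity",
                   target=base.guid, start=first, end=last)
comment = Statement(user="annotator_b", value="hypothetical comment", kind="comment",
                    target=entity.guid)

statements = {s.guid: s for s in (base, entity, comment)}

print(annotated_span(statements, entity))
print(provenance_chain(statements, comment.guid))
```

Since every statement carries its own GUID and user, an annotation of an annotation needs no special treatment: it simply targets another GUID instead of the base text.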
From this perspective, annotations can be seen as statements by a certain user: as users add these statements to the base text (which is itself a statement), and as these statements are in turn annotated by other users, the resulting knowledge graph continues to grow.

Figure 10 shows the subgraph concerned with transcription and interpretation. The manuscript consists of letters from and to Hildegard. Each of these letters is assigned one SPO node, which is connected with IS_PART_OF edges to the corresponding folio nodes. A letter can belong to one or more folio pages, and may have a predecessor or a successor connected with IS_FOLLOWED_BY edges (see Figure 9). The red, blue, and green circles on the right represent metadata, layout, and semantic content of the letters.

Figure 10: Transcriptions with annotations, entities, described by other text nodes and metadata.

5.8 Annotations as Individual Statements

Figure 11 shows how provenance is stored. Every piece of information is connected to a node which represents the user who created the statement.17 In our example, the user Andreas Kuczera has transcribed a margin note. This transcription is not stored in the SPO node of the corresponding letter, but rather in a separate SPO node which is then connected with an IS_NEIGHBOUR_OF edge to a zero-point annotation in the letter text – it is this separate storage of transcription and allocation that makes the modeling of multiple interpretations possible.

Figure 11: Every annotation can be traced to the user

17 Using TEI semantics, this information can be stored with <resp> (https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-resp.html) and related elements.

6 Conclusion

By way of conclusion, I would like to return to the brief epigraph of my essay: “To which problem is digitization the solution?” From my point of view, digitization enables researchers to publish their findings with a maximum of flexibility and transparency.
Ideally, hierarchies should only be involved in this process when they are actually needed, and not because they are forced upon us by technological limitations. One of the fundamental properties of research data in the (digital) humanities is that it is highly connected, and I would argue that scholars should be granted the capacity to store every bit of information concerning these connections even if, for the time being, a standard or suitable ontology to express them might still be lacking. From a technical perspective, graph technologies can provide us with the capability to model multiple and multidimensional layers of information. TEI semantics could be another important piece of the puzzle, but their practical utility is dramatically reduced by the limitations of the XML hierarchies with which they are currently yoked together. As the Hildegraph environment shows, there is no reason why the problematic coupling of TEI semantics and XML should continue.

References

Barrasa, J. (2016). RDF Triple Stores vs. Labeled Property Graphs: What’s the Difference? https://neo4j.com/blog/rdf-triple-store-vs-labeled-property-graph-difference/.

Bordalejo, B. (2018). Digital Versus Analogue Textual Scholarship or The Revolution is Just in the Title. Digital Philology: A Journal of Medieval Cultures, 7(1):7–28, DOI: 10.1353/dph.2018.0001.

Cummings, J. (2018). A World of Difference: Myths and Misconceptions About the TEI. Digital Scholarship in the Humanities, 34(1):i58–i79, DOI: 10.1093/llc/fqy071.

DeRose, S. J., Durand, D. G., Mylonas, E., and Renear, A. H. (1990). What Is Text, Really? Journal of Computing in Higher Education, 1(2):3–26, DOI: 10.1007/BF02941632.

Ehrlicher, L. and Wöß, W. (2016). Towards a Definition of Knowledge Graphs. In Martin, M., Cuquet, M., and Folmer, E., editors, Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS’16), number 1695 in CEUR Workshop Proceedings. http://ceur-ws.org/Vol-1695/.

Färber, M., Bartscherer, F., Menne, C., and Rettinger, A. (2016). Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. http://www.semantic-web-journal.net/content/linked-data-quality-dbpedia-freebase-opencyc-wikidata-and-yago-0.

Kuczera, A. (2016). Digital Editions Beyond XML – Graph-Based Digital Editions. In Düring, M., Jatowt, A., Preiser-Kappeller, J., and van Den Bosch, A., editors, Proceedings of the 3rd HistoInformatics Workshop on Computational History, number 1632 in CEUR Workshop Proceedings, pages 37–46. http://ceur-ws.org/Vol-1632/.

Kuczera, A. and Kasper, D. (2019). Modellierung von Zweifel – Vorbild TEI im Graphen. In Kuczera, A., Wübbena, T., and Kollatz, T., editors, Die Modellierung des Zweifels – Schlüsselideen und -konzepte zur graphbasierten Modellierung von Unsicherheiten, volume 4 Special Issue of Zeitschrift für digitale Geisteswissenschaften. DOI: 10.17175/SB004_003.

Kuczera, A. and Neill, I. (2019). The Codex – An Atlas of Relations. In Kuczera, A., Wübbena, T., and Kollatz, T., editors, Die Modellierung des Zweifels – Schlüsselideen und -konzepte zur graphbasierten Modellierung von Unsicherheiten, volume 4 Special Issue of Zeitschrift für digitale Geisteswissenschaften. DOI: 10.17175/sb004_008.

Luhmann, N. (1987). Archimedes und wir: Interviews. Number 143 in Internationaler Merve-Diskurs. Merve, Berlin.

Nassehi, A. (2019). Muster: Theorie der digitalen Gesellschaft. C.H.Beck, München, 3rd edition.

Paulheim, H. (2016). Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. Semantic Web, 8(3):489–508, DOI: 10.3233/SW-160218.

Pujara, J., Miao, H., Getoor, L., and Cohen, W. (2013). Knowledge Graph Identification. In Hutchison, D., Kanade, T., Kittler, J., Kleinberg, J. M., et al., editors, Advanced Information Systems Engineering, number 7908 in Lecture Notes in Computer Science, pages 542–557. Springer, Berlin/Heidelberg, DOI: 10.1007/978-3-642-41335-3_34.

Rübsamen, D. and Kuczera, A. (2006). Verborgen, vergessen, verloren? Perspektiven der Quellenerschließung durch die digitalen ’Regesta Imperii’. In Hering, R., Sarnowsky, J., Schäfer, C., and Schäfer, U., editors, Forschung in der digitalen Welt. Sicherung, Erschließung und Aufbereitung von Wissensbeständen, number 20 in Veröffentlichungen aus dem Staatsarchiv der Freien und Hansestadt Hamburg, pages 109–124. Hamburg University Press, DOI: 10.15460/HUP.STAHH.20.77.

Sahle, P. and Vogeler, G. (2013). Digital Monumenta Germaniae Historica (dMGH). Digital Philology: A Journal of Medieval Cultures, 2(1):135–139, DOI: 10.1353/dph.2013.0006.

Schmidt, D. A. (2016). Using Standoff Properties for Marking-up Historical Documents in the Humanities. it - Information Technology, 58(2):63–69, DOI: 10.1515/itit-2015-0030.

Schulz, J. (2017). A Review of: Regesta Imperii Online, Ed. By Deutsche Kommission für die Bearbeitung der Regesta Imperii e.V., 2001–2017. http://www.regesta-imperii.de/. RIDE, 6, DOI: 10.18716/RIDE.A.6.5.

Singhal, A. (2012). Introducing the Knowledge Graph: Things, Not Strings. https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html.

van Acker, L. and Klaes-Hachmöller, M. (1993–2001). Epistolarivm. [Hildegard von Bingen]. Brepols, Turnholti.

van Zundert, J. (2016). Barely Beyond the Book? In Driscoll, M. J. and Pierazzo, E., editors, Digital Scholarly Editing: Theories and Practices, pages 83–106. Open Book Publishers, DOI: 10.11647/OBP.0095.05.

van Zundert, J. (2019). Why the Compact Disc Was Not a Revolution And «Cityfish» Will Change Textual Scholarship, or What Is a Computational Edition? Ecdotica, 15:129–156, https://pure.knaw.nl/portal/en/publications/why-the-compact-disc-was-not-a-revolution-and-cityfish-will-chang.

Witt, J. C. (2018). Digital Scholarly Editions and API Consuming Applications. In Bleier, R., Bürgermeister, M., Klug, H. W., Neuber, F., et al., editors, Digital Scholarly Editions as Interfaces, volume 12 of Schriften des Instituts für Dokumentologie und Editorik, pages 219–247. BoD, Norderstedt, http://nbn-resolving.de/urn:nbn:de:hbz:38-91182.