Between Flexibility and Universality: Combining TAGML and XML to Enhance the Modeling of Cultural Heritage Text

Elli Bleeker, Bram Buitendijk and Ronald Haentjens Dekker
R&D group, Humanities Cluster, Royal Netherlands Academy of Arts and Sciences

Abstract
This short paper first presents a conceptual workflow of a digital scholarly editor, and then illustrates how the smaller components of the workflow can be supported and advanced by technology. The focus of the paper is on the need to encode a historical text from multiple, co-existing research perspectives. Step by step, we show how this need translates to a computational pipeline, and how this pipeline can be implemented. The case study constitutes the transformation of a TAGML document containing multiple concurrent hierarchies into an XML document with one single leading hierarchy. We argue that this data transformation requires input from the editor, who is thus actively involved in the process of text modeling.

Keywords
text modeling, text encoding, TEI XML, overlapping hierarchies, Multi-Colored Trees, editorial workflows, digital scholarly editing, TAGML, computational pipelines

1. Introduction

How can we effectively model the workflow of a scholarly editor so that we can develop computational technology to support and advance this research process? This has been the overarching research question that informs the work of the R&D group of the Humanities Cluster (part of the Royal Netherlands Academy of Arts and Sciences). Considering the vastness of the question, the fact that our research is ongoing, and the limited number of pages allowed for a short paper, this contribution focuses on two smaller aspects of this question. First, how can we model different research perspectives on the same cultural heritage text? And secondly, how can we ensure that the resulting documents can be processed by generic text analysis tools?
Our research takes place in the context of the Text As Graph (TAG) model, under development at the R&D group since 2017, which is set up to address several long-existing challenges for the digital editing of cultural heritage texts [8, 10, 23]. Text encoding can be considered an intellectual activity, as scholars are compelled to translate their interpretation of the text into a computer-readable format. This includes selecting a data model, a markup syntax, and a specific vocabulary to encode cultural heritage documents such as literary or historical texts. The choices made here directly influence the subsequent processing, querying, analysing, or repurposing of the encoded document. Ideally, then, scholarly editors base their choice of a data model and an encoding vocabulary on their research question(s), the goal(s) of the encoding project, and/or the properties of the source material. In reality, the majority of editing projects opt for XML as a data model [5]. XML is after all the de facto standard for text encoding and accordingly omnipresent: the Text Encoding Initiative (TEI) Guidelines [25] are currently based on XML; the family of X-languages offers wide support for navigating, querying, and transforming XML documents; and many text editing tools take XML as input format.

Publication note: CHR 2020: Workshop on Computational Humanities Research, November 18–20, 2020, Amsterdam, The Netherlands. Contact: elli.bleeker@di.huc.knaw.nl (E. Bleeker). ORCID: 0000-0003-2039-7300 (E. Bleeker); 0000-0002-3755-5929 (B. Buitendijk); 0000-0001-6737-7986 (R.H. Dekker). © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073. The authors are grateful for the comments of the reviewers that have helped them improve their argument and clarify their paper.
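This ubiquity comes at a price as soon as two research perspectives on the same words cross each other: overlapping elements cannot be serialized as a single well-formed XML document. A minimal sketch (the verse-line and sentence markup below is invented for illustration), using Python's standard library parser:

```python
# Two perspectives on the same text -- verse lines <l> and a sentence <s> --
# that cross each other: <s> opens inside the first <l> and closes inside the
# second. A conforming XML parser must reject this as not well-formed.
import xml.etree.ElementTree as ET

overlapping = '<root><l>The sun <s>is yellow</l><l>and the sky</s> is blue</l></root>'
try:
    ET.fromstring(overlapping)
except ET.ParseError as err:
    print("not well-formed:", err)   # the parser reports a mismatched tag
```

To express both hierarchies in XML, one of the two must be demoted to milestones, stand-off pointers, or another workaround; this is the limitation that the alternatives discussed below try to address.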
However, the single-rooted, fully ordered tree structure of XML offers only limited support for the modeling of cultural heritage texts (see section 2). The questions we address in this paper – how can we enable scholarly editors to model different research perspectives on text and make the result widely available – also involve a form of knowledge dissemination among the scholarly editing community: we want the user to understand how their abstract, conceptual idea of the text corresponds with the way the computer understands that text, how the textual data transforms during the editing process, and how they may influence these transformations. To this end, we conceptualized the workflow of a scholarly editor as a step-by-step process that can be subdivided into smaller tasks or "modular components". This conceptual workflow can be quite easily translated into a computational pipeline in which the output from one step forms the input for the next. As a result, we could develop technology to address the individual components, making conscious choices about which step(s) to automate and which step(s) need to remain a manual act because they require user input.

After briefly describing our conceptual model of the editorial workflow in section 3, we discuss how we approached the translation from workflow to data model to implementation. After introducing the data model of TAG (section 4.1), we highlight two features of the implementation: the representation of multiple perspectives on a text (section 4.2) and the export from one data format to another (section 4.3). In both cases we will indicate how these features correspond with the steps in the conceptual editorial workflow. Section 5 dives deeper into the export feature by providing a high-level description of the code flow of the export, and section 6 illustrates the transformation process with a small input and output sample.

2.
Related work

Probably the most frequently cited limitation of XML for text modeling is the fact that the tree structure does not inherently provide for overlapping, discontinuous, or non-linear structures [11, 27, 22, 8]. These types of structures are nevertheless common in cultural heritage texts: the various research perspectives from which a text can be encoded often constitute different hierarchies that (partly) overlap [6]. Since modeling a text from different research perspectives remains a widely acknowledged objective of digital scholarly editing, several alternatives to XML for text encoding have been developed, ranging from stand-off approaches to entirely new markup languages (see [11]; see also [8] for a more recent overview of these alternatives). When using XML, scholarly editors are compelled to use local, project-specific approaches or to do significant additional coding work in order to model overlapping structures in XML. As a consequence, the result is stored in a non-generic file format or depends on proprietary software, which significantly hinders any subsequent querying and interchange of the encoded documents [19, 3].

(Footnote 2: It should be noted that the implementation of the pipeline is currently at prototype level. Still, we consider our findings thus far to be relevant for a productive discussion on conceptual models of scholarly activities, and on the extent to which these can be delegated to software.)

The approach of the digital edition of Johann Wolfgang Goethe's Faust (2016) [4] is a good example of a local solution to modeling a text from multiple, overlapping perspectives. The entire text of Faust has been transcribed in TEI/XML several times. Each transcription represents a different research perspective and has a different hierarchical structure. The transcriptions are subsequently synchronized using the collation algorithm of CollateX [9], and the result is stored in a graph database similar to an MCT data structure [17].
Users can switch between perspectives in the edition's graphical user interface. Another relevant example is presented by the Annotated Turki Manuscripts from the Jarring Collection Online (ATMO) project [21]. This project combines regular embedded TEI/XML markup with Trojan Horse markup [11, 21]. Trojan Horse markup elements are a specific type of milestone elements or "markers" that carry a namespace prefix th: to differentiate them from regular milestone elements. Two related markers are linked by means of matching start and end attributes, so the regular XML <s>The sun is yellow</s> becomes <th:s th:sID="d1"/>The sun is yellow<th:s th:eID="d1"/> in Trojan Horse markup. It is quite a well-known strategy to express (partly) overlapping hierarchies in XML, and various XSLT strategies exist to convert Trojan Horse markers to regular XML content elements and vice versa [2]. In the ATMO project, transcriptions contain three hierarchical structures, none of which is permanently dominant: users are able to switch between dominant structures via an XSLT template converting the Trojan Horse milestones to regular XML content elements. A third approach to representing multiple textual perspectives making use of XML and related X-technologies is Concurrent XML [12, 7], which represents multiple concurrent hierarchical structures in XML by dividing the text into "atomic text nodes" (usually based on word division). Each node is reachable via an XPath expression locating its place in one or more hierarchies. Finally, there exist several stand-off systems, such as the Just In Time Markup system (1999–2005), in which users create "on demand user-customized versions of electronic editions of historical documents" [1], or the stand-off properties system of Desmond Schmidt and others [20]. The downsides of stand-off systems are that they often require a specialized, non-generic editing environment, and that they require a stable base text (while in practice the base text is often unstable and subject to changes).
What is more, they usually produce quite illegible transcriptions (see [21]). So even though it may not be the perfect choice for text modeling from a computational perspective, XML's substantial role in text encoding, as well as the significant number of XML-based text processing and analysis tools, makes it an undeniable community standard. Alternative text encoding approaches are therefore more likely to succeed if they take XML into account, instead of offering a completely new encoding model.

3. Conceptual model

In order to best accommodate the work of digital scholarly editors, we first mapped their research activities (see figure 1; the workflow is inspired by [26, 18, 13], among others).

Figure 1: Visualization of the workflow of a digital scholarly editor, illustrating how the intellectual activities of the editor (upper row) affect their actions (middle row) and the output of these actions (lower row).

The model has three "levels": the upper level representing the intellectual activity of the editor; the middle level showing the editor's action(s) associated with this thinking; the third and lowest level indicating the product(s) of the actions. Starting at the upper left corner, for example, we can see that the research perspective of the editor influences the conceptual modeling of the source text. When encoding, annotating, and linking the text, the editor has to make decisions about syntax, vocabulary, schemata, etc. The community's standards are also at play here, as they are with the digitization of the source text. The actions of analyzing and/or querying are again influenced by the project's research objective. These actions produce a selection of the information contained in the transcription, which can subsequently be represented in different ways: via a published edition of the text, a visualization, a data set, etc.
What we intend to demonstrate with this visualization is how an editor's research perspective(s) on the source text translates into a choice for a certain data model, a syntax, a markup vocabulary, a normalization policy, etc. In turn, these choices influence the way the text can be analyzed, queried, or represented. The visualization also shows the importance of an editor who knows what they want to do with the text in terms of querying, visualising, or exporting: knowing this at the outset helps them decide what would be the most suitable data format, markup strategy, and tools. The TAG data model primarily addresses the steps of encoding, annotating, linking, analyzing, and querying in the workflow. In the remainder of the paper, we will focus on the encoding step, showing how TAG allows for the encoding of multiple, co-existing research perspectives on the source text, and on how these encoded documents can be exported to XML for analysis or publication.

4. Implementation

4.1. Data model

In order to computationally support the encoding step in the workflow, TAG makes use of a variant of a Generalized Ordered-Descendant Directed Acyclic Graph (GODDAG) [22]. The data model is based on the Multi-Colored Trees (MCT) model, which permits nodes to be shared by multiple hierarchies that are distinguishable by color [15]. The TAG data model distinguishes Text nodes, Markup nodes, and Annotation nodes. In contrast to the mono-hierarchical, fully ordered tree that is implied by XML, the fact that TAG is based on a graph data model means that it can offer much more flexibility in modeling documents from different perspectives, including overlapping hierarchies and non-linear or discontinuous structures [10, 3]. As a result, we find that TAG is able to surpass the limitations of XML for text modeling.

4.2. Alexandria: from research perspectives to views

TAG's reference implementation is the text-repository system Alexandria.
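Before turning to Alexandria's functionality, the shared-node idea behind the MCT model can be illustrated with a minimal, hypothetical sketch (the data layout below is invented for illustration and is not TAG's actual implementation): the text nodes exist only once, while each perspective, or "color", imposes its own grouping over them, even where the groupings overlap.

```python
# Minimal multi-colored-tree idea: one shared sequence of text nodes, and
# per-color hierarchies that each group (possibly overlapping) ranges of them.
text = ["The", "sun", "is", "yellow"]

# Each color maps a markup name to an inclusive (start, end) range of text nodes.
hierarchies = {
    "verse":  {"line": (0, 2)},  # groups "The sun is"
    "syntax": {"s":    (1, 3)},  # groups "sun is yellow"
}

def spanned(color: str, name: str) -> list:
    """Return the text nodes dominated by a markup node in the given color."""
    start, end = hierarchies[color][name]
    return text[start : end + 1]

print(spanned("verse", "line"))  # ['The', 'sun', 'is']
print(spanned("syntax", "s"))    # ['sun', 'is', 'yellow']
```

Because the ranges (0, 2) and (1, 3) overlap, no single tree could hold both groupings as nested elements; in an MCT they simply co-exist as differently colored parents of the same text nodes.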
In contrast to many local or technologically complex approaches to modeling different perspectives on text (see section 2), the code of Alexandria is open source and designed to be implemented in any editorial workflow.4 Users can directly encode texts in the TAG markup language TAGML [24] and upload the TAGML documents to the Alexandria repository. This is the functionality that corresponds to the encoding step in the editorial workflow. Via Alexandria's command-line interface, users can subsequently query the TAGML documents (the querying step), or export them to other formats like PNG, SVG, DOT, and XML for analysis or publication (cf. the analyzing and representing steps). The TAGML syntax is similar to that of embedded markup systems like XML or TexMECS [14], with a markup element consisting of an open tag and a close tag: [s>This is an [del>easy<del] [add>example<add] illustration of the TAGML syntax<s]. To keep co-existing perspectives apart, markup can moreover be assigned to named layers, as in [s>This is an [del|D>easy<del|D] [add|T>example<add|T] illustration of the TAGML syntax<s], in which the del element belongs to layer D and the add element to layer T; the plain text content of both documents reads "This is an easy example illustration of the TAGML syntax".

Switching from flattened XML to hierarchically structured XML and back again is a common need in the XML community, and there exists a large body of XSLT patterns designed specifically for this purpose.5 Still, though "flattening" and "raising" the hierarchies in XML documents is by no means a novel activity, this practice usually takes place within a digital edition project, which means that the code is tailored to a specific text and rarely shared as part of the project's output. The idea presented in this paper, by contrast, is to encourage editors to create a custom, modular workflow in which they can use both Alexandria and, via the TAGML-to-XML export feature, generic XML-based tools.
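The raising step can be sketched in a few lines. The following is a deliberately simplified illustration, not Alexandria's export code; it assumes Trojan-Horse-style milestones in the th: namespace with matching sID/eID attributes (the convention discussed in section 2) and rewrites them as ordinary start and end tags.

```python
# Simplified "raising" of one flattened hierarchy: Trojan Horse milestones
# with matching sID/eID attributes become regular start and end tags.
import re

def raise_milestones(xml: str, name: str) -> str:
    # <th:name th:sID="x"/>  ->  <name>
    xml = re.sub(rf'<th:{name}\s+th:sID="[^"]*"\s*/>', f'<{name}>', xml)
    # <th:name th:eID="x"/>  ->  </name>
    xml = re.sub(rf'<th:{name}\s+th:eID="[^"]*"\s*/>', f'</{name}>', xml)
    return xml

flat = '<l><th:s th:sID="d1"/>The sun is yellow<th:s th:eID="d1"/></l>'
print(raise_milestones(flat, "s"))  # <l><s>The sun is yellow</s></l>
```

Note that raising is only safe for a hierarchy whose elements nest properly within the chosen leading hierarchy; otherwise the result would no longer be well-formed XML, which is precisely why the editor must designate one hierarchy as leading.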
While generating an XML tree from an MCT is computationally straightforward (a depth-first graph traversal), as far as we know it is unprecedented to put the user in control by including them in the conversion process, and to take this process out of a customized, project-specific environment. As a consequence, we believe that Alexandria provides a more powerful and flexible approach to text modeling. This paves the way for an editorial modeling workflow that strikes the right balance between flexibility and universality.

As mentioned above, the TAG model is currently under active development. Among other things, it is currently not possible for multiple users to collaborate in Alexandria, since the repository is now only initialized locally. We aim to allow the user of Alexandria to check out and edit a TAGML document, and to upload the document again to Alexandria. Similar to Git, this upload first diffs the master document and the edited document. Any detected changes are subsequently merged into the master document. Such a diff and merge is quite straightforward for Git (as well as for stand-off systems), but because we want to track the edits in a TAGML document on the level of the markup as well as on the level of the text, we require a more advanced diff that is very challenging to implement. A second point of attention is that thinking about data transformations and working with Alexandria on the command line requires a level of technical know-how that is not available to all scholarly editors. For that reason, we are keen to provide training in TAG text modeling via workshops, summer schools, and online Jupyter Notebooks. Similar to the present contribution, these courses aim to illustrate the relationships between a conceptual, abstract model and its technical implementation(s).

7. Conclusion

This paper discussed the text modeling approach of Alexandria and focused on the intellectual process of translating a conceptual model to a technical implementation.
Using an MCT data structure, Alexandria facilitates the modeling and encoding of cultural heritage text. Editors can express different research perspectives ("views") on the source text and subsequently export a view to XML. If the XML output contains multiple overlapping hierarchies, the editor can indicate which hierarchy should be leading. The other hierarchical structures are expressed in Trojan Horse milestone elements. The XML export function of Alexandria allows editors to build a custom pipeline in which they combine the best of both worlds: TAG's flexible text modeling capacity with XML's generic publication tools. In other words, an editorial workflow that strikes the right balance between flexibility and universality. The editor's close involvement in every step of the text modeling process contributes to their control over, and insight into, the various transformations the data undergoes.

(Footnote 5: [2] presents an extensive survey of at least seven possible approaches to raising flattened XML, four of which use XSLT.)

References

[1] G. Barwell et al. Authenticated Electronic Editions Project: A Progress Report. 2001. url: https://ro.uow.edu.au/cgi/viewcontent.cgi?article=1535&context=artspapers.
[2] D. J. Birnbaum, E. E. Beshero-Bondar, and C. Sperberg-McQueen. "Flattening and Unflattening XML Markup: a Zen Garden of XSLT and Other Tools". In: Proceedings of Balisage: The Markup Conference. Vol. 21. 2018. doi: https://doi.org/10.4242/BalisageVol21.Birnbaum01.
[3] E. Bleeker, B. Buitendijk, and R. H. Dekker. "Marking Up Microrevisions with Major Implications". In: Proceedings of Balisage: The Markup Conference. Vol. 25. 2020. doi: https://doi.org/10.4242/BalisageVol25.Bleeker01.
[4] Faustedition. Ed. by A. Bohnenkamp, S. Henke, and F. Jannidis. 2016. url: http://beta.faustedition.net/.
[5] T. Bray et al. XML 1.0. Tech. rep. W3C, 2006.
[6] J. H. Coombs, A. H. Renear, and S. J. DeRose.
"Markup Systems and the Future of Scholarly Text Processing". In: The Digital Word: Text-Based Computing in the Humanities. Ed. by G. P. Landow and P. Delany. 1993, pp. 85–118.
[7] A. Dekhtyar and I. E. Iacob. "A Framework for Management of Concurrent XML Markup". In: Data & Knowledge Engineering 52.2 (2005), pp. 185–208.
[8] R. H. Dekker and D. J. Birnbaum. "It's More Than Just Overlap: Text As Graph". In: Proceedings of Balisage: The Markup Conference. Vol. 19. 2017. doi: https://doi.org/10.4242/BalisageVol19.Dekker01.
[9] R. H. Dekker and G. Middell. CollateX. 2019. url: https://collatex.net/.
[10] R. H. Dekker et al. "TAGML: a Markup Language of Many Dimensions". In: Proceedings of Balisage: The Markup Conference. Vol. 21. 2018. doi: https://doi.org/10.4242/BalisageVol21.HaentjensDekker01.
[11] S. J. DeRose. "Markup Overlap: A Review and a Horse". In: Proceedings of Extreme Markup Languages. 2004.
[12] P. Durusau and M. O'Donnell. "Implementing Concurrent Markup in XML". In: Extreme Markup Languages. 2001.
[13] R. Hoekstra and M. Koolen. "Data Scopes for Digital History Research". In: Historical Methods: A Journal of Quantitative and Interdisciplinary History 52.2 (2019), pp. 79–94.
[14] C. Huitfeldt and C. Sperberg-McQueen. TexMECS: An Experimental Markup Meta-Language for Complex Documents. 2001. url: http://www.hit.uib.no/claus/mlcd/papers/texmecs.html.
[15] H. Jagadish and L. Lakshmanan. "Colorful XML: One Hierarchy Isn't Enough". In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. ACM, 2004. doi: https://doi.org/10.1145/1007568.1007598.
[16] JSON. JavaScript Object Notation (JSON). 2007. url: https://www.json.org/json-en.html.
[17] G. Middell. On the Value of Comparing Truly Remarkable Texts. Presented at the symposium "Knowledge Organization and Data Modeling in the Humanities". 2012. url: https://datasymposium.wordpress.com/middell/.
[18] M. Rehbein and C. Fritze.
"Hands-on Teaching Digital Humanities: a Didactic Analysis of a Summer School Course on Digital Editing". In: Digital Humanities Pedagogy. Open Book Publishers, 2012. url: http://www.openbookpublishers.com.
[19] D. Schmidt. "A Model of Versions and Layers". In: DHQ: Digital Humanities Quarterly 13.3 (2019).
[20] D. Schmidt and R. Colomb. "A Data Structure for Representing Multi-version Texts Online". In: International Journal of Human-Computer Studies 67.6 (2009), pp. 497–514.
[21] C. Sperberg-McQueen. "Representing Concurrent Document Structures Using Trojan Horse Markup". In: Proceedings of Balisage: The Markup Conference. Vol. 21. 2018. doi: https://doi.org/10.4242/BalisageVol21.Sperberg-McQueen01.
[22] C. Sperberg-McQueen and C. Huitfeldt. "GODDAG: A Data Structure for Overlapping Hierarchies". In: Lecture Notes in Computer Science. Ed. by P. King and E. Munson. Vol. 2023. Berlin: Springer-Verlag, 2000.
[23] TAG. Text As Graph. Version alexandria 2.3. 2019. url: https://huygensing.github.io/TAG/.
[24] TAGML. Text as Graph Markup Language. 2019. url: https://github.com/HuygensING/TAG/tree/master/TAGML.
[25] TEI Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.0.0. 2019. url: http://www.tei-c.org/Guidelines/P5/.
[26] J. Unsworth. "Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might our Tools Reflect This". In: Symposium on Humanities Computing: Formal Methods, Experimental Practice. King's College, London, 2000.
[27] A. Witt. "Multiple Hierarchies: New Aspects of an Old Solution". In: Proceedings of Extreme Markup Languages. 2004. url: http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Witt01/EML2004Witt01.html.