=Paper=
{{Paper
|id=Vol-3290/long_paper8134
|storemode=property
|title=Good Omens: A Collaborative Authorship Study
|pdfUrl=https://ceur-ws.org/Vol-3290/long_paper8134.pdf
|volume=Vol-3290
|authors=Leonardo Grotti,Mona Allaert,Patrick Quick
|dblpUrl=https://dblp.org/rec/conf/chr/GrottiAQ22
}}
==Good Omens: A Collaborative Authorship Study==
Leonardo Grotti, Mona Allaert and Patrick Quick

Universiteit Antwerpen, Faculty of Arts, Prinsstraat 13, B-2000, Antwerp

===Abstract===
Good Omens is a collaborative novel written by Terry Pratchett and Neil Gaiman. Rising interest in the book, amplified by the success of the recent screen adaptation, has aroused curiosity regarding its realization. We use Rolling Delta and Rolling Classify to detect stylistic signals from each author as these methods reveal authorial takeovers. The same techniques are applied to compare the screenplay of the show to the novel. The results indicate that Good Omens resembles Pratchett’s work more closely. The screenplay is correctly attributed to Gaiman, its sole author, and the comparison reveals that Gaiman may have relied less on the source material over the course of the narrative arc.

Keywords: Good Omens, Rolling Stylometry, PCA, collaborative authorship

===1. Introduction===
In 1983, Terry Pratchett published The Colour of Magic, the first book in his forty-one-book Discworld series. Although Pratchett is now recognized as one of the most popular fantasy writers of the past two decades [9], during the early 1980s he was far from the level of success he would come to enjoy. As noted by Shanahan [26], Pratchett was still working as a newspaper journalist and would not become a full-time writer until 1987. In 1985, in the early stages of his writing career, Pratchett granted his first interview as an author to Space Voyager Magazine to promote his series [14]. It was on this occasion that Pratchett met Neil Gaiman, who at the time was working for Space Voyager Magazine and conducted the interview. The two stayed in touch due to ‘a shared delight and amazement at the sheer strangeness of the universe, in stories, in obscure details, in strange old books in unregarded bookshops’ [14, p. 488].
CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium. Contact: leonardo.grotti@student.uantwerpen.be (L. Grotti); mona.allaert@uantwerpen.be (M. Allaert); patrick.quick@student.uantwerpen.be (P. Quick). GitHub: https://github.com/corvusMidnight (L. Grotti); https://github.com/MonaDT (M. Allaert); https://github.com/patrickquick (P. Quick). ORCID: 0000-0001-7914-3191 (L. Grotti). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

Five years later, they co-wrote Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch, which became an international bestseller. Unlike other notable literary collaborations (e.g., Conrad and Ford [25]; 17th-century French playwrights [7]), that between Pratchett and Gaiman was rather unproblematic. One reason for their successful partnership can be found in their similar backgrounds: both authors operated in the field of fantasy and science fiction. They also professed a love for comedy and claimed that the main objective of writing Good Omens was ‘to make the other one laugh’ [14, p. 484]. The collaboration seemed to work well in that they were both equally invested in the writing process:

‘We wrote the first draft in about nine weeks. Nine weeks of gloriously long phone calls, in which we would read each other what we’d written, and try to make the other one laugh. We’d plot, delightedly, and then hurry off the phone, determined to get to the next good bit before the other one could. We’d rewrite each other, footnote each other’s pages, and sometimes even footnote each other’s footnotes.’ [2]

Even though both Pratchett and Gaiman remained playfully evasive about attributing specific aspects to one another, it is clear that Gaiman initiated the project.
He wrote 5000 words in which he created one of the main characters, Crowley, and wrote a passage regarding a baby swap, which would come to be the premise of Good Omens. The draft was then sent for feedback to Pratchett, who suggested writing it together as a novel. In the beginning, they wrote separately: Pratchett during the day, Gaiman during the night, with a short overlap in the afternoon to compare notes. However, towards the end of the writing process Gaiman moved into Pratchett’s spare room to polish the final parts before publication [2].

Both authors kept looking back with fondness on their project and remained in touch for potential cinematic adaptations of the novel. In 2007, however, Pratchett was diagnosed with early-onset Alzheimer’s, and he passed away in 2015. At Pratchett’s request, Gaiman took it upon himself to write the screenplay for a six-part television show, which was released in 2019 and is currently awaiting its second season.

When Good Omens was written, both authors were only at the start of their careers: ‘[i]n those days Neil Gaiman was barely Neil Gaiman and Terry Pratchett was only just Terry Pratchett’ [14, p. 475]. This changed for Pratchett when the Discworld series achieved great renown. Gaiman gained popularity in the United States, predominantly due to his work as a graphic novelist. As the reputation of the authors flourished, so did that of Good Omens, which earned the status of ‘cult classic’ [14, p. 478]. Consequently, interest in the creative process of Good Omens bloomed. Even though being cryptic about the writing process of the novel was a deliberate choice [2], a stylometric approach to Good Omens could shine a light on the question of who wrote what.

===2. Material===
Good Omens takes an unlikely setting for a comedy, namely, the end of the world.
At the center of the story are two main characters: the angel Aziraphale and the demon Crowley (who was loosely based on Gaiman [11]), who operate as agents of heaven and hell on earth. The story follows their friendship that spans thousands of years and their attempts to prevent the apocalypse, which will come as a result of the birth of Satan’s son, Adam. The characters’ inability to distinguish the motives of heaven and hell in bringing about, or, rather, ensuring the end of the world, functions as the framework for the novel’s comedy.

The novel is composed of six chapters plus an appendix. The latter<sup>1</sup> contains a fictional interview regarding the collaboration between the two authors. Here, Pratchett and Gaiman make some comments regarding who wrote what. They report that most of the scenes between Adam and Anathema (witch and owner of the fictional book of prophecies The Nice and Accurate Prophecies of Agnes Nutter) were written by Pratchett, and the passages with the Four Horsemen of the Apocalypse<sup>2</sup> were of Gaiman’s hand. Moreover, they claim that Gaiman was the dominant author at the opening of the novel, whereas Pratchett had more control towards the end [14, p. 477].

With one exception, Callaway [8], scholars have not concerned themselves with the stylometric analysis of Good Omens. In her blog post, Callaway uses the function Rolling Classify (using 50 features and 5000-word-long segments), in combination with a linear Support Vector Machine (SVM), to detect authorial takeovers in Good Omens. The resulting graph highlighted that Gaiman is indeed more present at the beginning of the book<sup>3</sup> and in sections that include the Four Horsemen. In Callaway’s [8] analysis, however, Pratchett is still the predominant author and she concluded that ‘even in areas where one of the two author’s signal dominates, the other author is present. Both Gaiman and Pratchett are detectable all over their shared work’ [8].
Regarding the appendix, it is important to consider to what extent these attributions may be constructed. As previously mentioned, the two often reworked each other’s material, with Pratchett professing that ‘both can write passably in the other one’s style’ [14, p. 477]. The appendix, although written as an interview with an unnamed third person, was authored by both Pratchett and Gaiman. It is also worth mentioning that the appendix is effectively part of the novel. For that reason, it may be challenging to try and attribute certain passages to a specific author or verify their claims.

Despite its limited length and scope, the appendix of Good Omens remains the most reliable source. Interviews found online often echo the same information or refer directly to it. On the other hand, Callaway’s [8] study does not reference any source material and does not relate its results to any of the authors’ claims. Additionally, the results presented there are not replicable or comparable since (i) the reference corpus and the 50 features used to run Rolling Classify are not provided and (ii) chapter markers have not been added to the graph. Thus, our expectations are based on the content of the appendix and we only partially compare our results to Callaway’s [8].

First, we anticipate that Pratchett will be the most dominant author of Good Omens since he took upon himself the role of editor (see the interview in [5]). Also, we expect Gaiman’s style to emerge in sections of the novel involving characters (e.g. Four Horsemen of the Apocalypse) attributed to him. Finally, we forecast Pratchett’s idiom to be predominant in the later sections of the book and Gaiman’s in the earlier ones.

Regarding the screenplay, not many hypotheses can be formulated within the scope of this study. Firstly, a screenplay belongs to a different text type and cannot be reliably compared to novels. As such, its validity as a control text is limited.
Secondly, neither author publicly commented on the novel’s potential for screen adaptation, and Gaiman never shared his process in writing it. We still expect Gaiman’s style to be overwhelmingly predominant since he was the sole author. Thirdly, because the show is faithful to the narrative arc of the novel, we anticipate the screenplay to take liberally from the book.

<sup>1</sup> Perhaps fittingly titled Good Omens, The Facts (or, at least, lies that have been hallowed by time).

<sup>2</sup> For reference, War, Famine, Pollution, and Death.

<sup>3</sup> Callaway [8] enriches the graph with short notes summarizing the content of each segment in the novel.

Table 1: Reference corpus for Pratchett and Gaiman with year of publication and number of word tokens. Titles in bold are part of the Rolling Stylometry sub-corpus.

{| class="wikitable"
! Title !! Author !! Year of publication !! Tokens
|-
| Small Gods || Terry Pratchett || 1982 || 93 042
|-
| '''The Colour of Magic''' || Terry Pratchett || 1983 || 66 073
|-
| Light Fantastic || Terry Pratchett || 1986 || 65 389
|-
| Equal Rites || Terry Pratchett || 1987 || 67 691
|-
| Mort || Terry Pratchett || 1987 || 74 089
|-
| Sorcery || Terry Pratchett || 1988 || 79 805
|-
| '''Wyrd Sisters''' || Terry Pratchett || 1988 || 86 428
|-
| '''Pyramids''' || Terry Pratchett || 1989 || 88 422
|-
| '''Moving Pictures''' || Terry Pratchett || 1990 || 98 887
|-
| Lords and Ladies || Terry Pratchett || 1992 || 90 240
|-
| '''Neverwhere''' || Neil Gaiman || 1996 || 100 714
|-
| Hogfather || Terry Pratchett || 1996 || 96 104
|-
| Stardust || Neil Gaiman || 1997 || 60 552
|-
| '''Jingo''' || Terry Pratchett || 1997 || 107 876
|-
| '''American Gods''' || Neil Gaiman || 2001 || 184 086
|-
| Thief || Terry Pratchett || 2001 || 103 881
|-
| Coraline || Neil Gaiman || 2002 || 31 504
|-
| '''Anansi Boys''' || Neil Gaiman || 2005 || 111 164
|-
| '''Thud!''' || Terry Pratchett || 2005 || 113 531
|-
| Fragile Things || Neil Gaiman || 2007 || 108 356
|-
| '''M is for Magic''' || Neil Gaiman || 2007 || 47 773
|-
| '''The Graveyard Book''' || Neil Gaiman || 2008 || 69 440
|-
| Unseen Academicals || Terry Pratchett || 2009 || 137 861
|-
| The Ocean at the end of the Lane || Neil Gaiman || 2013 || 56 287
|-
| '''Trigger Warning''' || Neil Gaiman || 2015 || 101 004
|-
| Raising Steam || Terry Pratchett || 2015 || 126 088
|-
| Total || / || / || 2 366 257
|}

===3. Methodology===
As a preliminary step, we first compiled a comprehensive corpus, consisting of ten novels for Gaiman and sixteen for Pratchett. Table 1 summarizes the structure of our corpus, which consists of 2,366,257 word tokens. It must be noted that Gaiman is slightly underrepresented in the dataset since he wrote fewer novels, and we did not include any of his non-fiction work.<sup>4</sup> The novels in bold are part of the sub-corpus selected to run Rolling Stylometry functions. The novel Good Omens consists of 110,935 tokens while the screenplay of the television adaptation consists of 86,425 tokens. The entire screenplay, rather than just the dialogue, was considered since it includes detailed character and scene descriptions.

<sup>4</sup> Pratchett’s texts make up 1,495,407 word tokens, whereas Gaiman’s texts consist of 870,850 word tokens.

Modern stylometry studies often do not limit themselves to the use of a single technique [27]. Rather, scholars (e.g. see [25], [20]) have shown how the implementation of different methodologies yields better and more reliable results. Thus, to better assess the stability of our results, the present paper proposes a combination of three different methods: Principal Components Analysis (PCA), Rolling Delta, and Rolling Classify [12].

Before proceeding further, it is worth noting that here and throughout we calculate stylistic distance using ‘Burrows’s Delta’ [6]. Burrows’s Delta is a metric which combines z-transformation (i.e., standardization) of frequency with Manhattan distance [13]. Roughly, to calculate delta given the x Most Frequent Words (MFW)<sup>5</sup> in n texts, we first compute the relative frequency of each word in each document. By doing so, we obtain a representation of each document as a vector of x scores. The σ (standard deviation) of each term’s frequency across the whole corpus is then calculated.
The distance between two documents n1 and n2 is obtained by taking, for each word, the absolute difference between its relative frequency in n1 and n2 divided by the same word’s σ across the corpus, and averaging these values over the x words. Finally, the resulting deltas are collected in a distance table which is used as the basis for the cluster analysis.<sup>6</sup>

PCA is an unsupervised dimensionality reduction technique, i.e. a method that does not require ground-truth labels for the data. As such, it is often considered ideal for exploratory purposes [18]: instead of being driven to a specific solution by the researcher, PCA results are data-driven [27]. The documents are first vectorized into a 67 × 117 matrix (sixty-seven text segments by the two authors, each described by 117 MFW). Then, we normalize the resulting matrix (following L1 norm, see [18]) and scale it. PCA operates by dimensional reduction: ‘it transforms to new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables’ ([4, p. 447], originally in [17, p. 1]). In our case, PCA is ideal compared to other techniques<sup>7</sup> since (i) it offers more reliable results for smaller sets of authors and (ii) allows us to visualize the stylistic characteristics from which it was built [27].

Rolling Delta and Rolling Classify are both part of ‘Rolling Stylometry’ [12]. Rolling Stylometry is a sequential classification technique in that it operates by means of a rolling window across fixed segments of text. In other words, both functions split the text into overlapping, same-length fragments [25] and roll over it. For instance, if we take a ‘window size’ of 5000 words and a ‘step size’<sup>8</sup> of 1000, the first analyzed segment will cover the range 1–5000, the second 1001–6000, etc. Both functions allow a maximum of twelve texts in the reference corpus (e.g.
in our case, six texts for each of the two authors) and a separate test set (usually one text, Good Omens in this case). The difference between Rolling Delta and Rolling Classify lies in the way the segments are analyzed: Rolling Delta calculates the Burrows’s Delta distances of each segment in the test set from the segments of texts in the reference corpus. Rolling Classify, on the other hand, uses the texts in the reference corpus to train a classifier and then classifies the text segments from the test set. Also, while Rolling Classify allows the user to select a custom set of most frequent features, Rolling Delta does not: rather, it automatically selects a given number of most frequent features.

<sup>5</sup> We here talk about words, but it is worth noting that Burrows’s Delta works not only on words but also on other most frequent items (e.g. n-grams).

<sup>6</sup> The above explanation echoes those found in Stover and Kestemont [27] and Karsdorp, Kestemont, and Riddell [18]. For a more technical, in-depth explanation of Burrows’s Delta, see Burrows [6].

<sup>7</sup> Such as Agglomerative Clustering Analysis, see Müllner [22].

<sup>8</sup> Note that these are the parameters for Rolling Delta and not Rolling Classify, which specifies ‘slice size’ and ‘slice overlap’. Although the name and configuration of the two parameters are different, they can be considered equivalents to those of Rolling Delta. E.g., for a ‘slice size’ of 5000 with a ‘slice overlap’ of 4000, the first analyzed segment will cover the range 1–5000, the second 1001–6000, and so on [12].

A preliminary analysis revealed that the upper tail (i.e., 250 MFW) of the extracted features contained many author-related lemmas, such as Rincewind (one of the main characters of the Discworld series) and black and white (both common collocations of the word magic, strongly related to the fantasy genre).
Following Binongo [3], using only function words yields undeniable advantages: (i) because of their scarce semantic content, they are less context-dependent compared to content words, (ii) since they are not inflected, they are often found in only one form, and (iii) their usage is often unaffected by a writer’s stylistic choices. As such, we removed content words and culled personal pronouns, often considered too indicative of a specific genre or narrative style. After the process, the final MFW list, which has been used to conduct further analyses and experiments, consisted of 117 function words.<sup>9</sup>

We use PCA for two reasons: first, as a tool to narrow down our corpus. As noted above, both Rolling Stylometry functions work using a restricted corpus of twelve texts. As such, we want our novel selection to be as representative and comprehensive of each author’s style as possible. PCA allows us to identify stylistic clusters<sup>10</sup> and to make an informed decision while also identifying distinctive features for both Pratchett and Gaiman. Second, PCA is useful in giving an explorative representation of where Good Omens lies stylistically. However, PCA has a major drawback: when it comes to collaborative authorship, static visualizations may be misleading, i.e., a book may be attributed in its entirety to one author through clustering or classification. Thus, PCA cannot help scholars in assessing whether the other author had an influence on the writing process and to what extent.

Rolling Stylometry enables the model to identify authorial dominance and takeovers throughout the text rather than attributing an entire text to a specific author [12]. As such, it is especially fitting for collaborative authorship attribution [21] and was selected to give an in-depth insight into Good Omens’ style. Both Rolling Delta and Rolling Classify are implemented and can be accessed in the environment for statistical computing R [23].
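
As a rough illustration of the procedure described in this section, the sketch below combines the rolling window with the Delta computation: each window of the test text is compared with every reference text by averaging, over the MFW, the absolute difference in relative frequency divided by that word's standard deviation across the reference texts. It is a simplified stand-in written in Python, not the R implementation used in this study. The function names, the comparison against whole reference texts rather than reference segments, and the use of the population standard deviation are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocab]


def rolling_windows(tokens, window_size, step_size):
    """Yield (start_index, window) for overlapping, same-length windows."""
    for start in range(0, len(tokens) - window_size + 1, step_size):
        yield start, tokens[start:start + window_size]


def rolling_delta(test_tokens, reference, vocab, window_size, step_size):
    """Delta distance of every window of the test text to every reference text.

    reference: dict mapping a label (hypothetical, e.g. a novel title) to tokens.
    Returns a list of (start_index, {label: delta}) pairs.
    """
    ref_freqs = {label: relative_freqs(toks, vocab)
                 for label, toks in reference.items()}
    # Standard deviation of each word's relative frequency across the references.
    sigmas = [pstdev(column) for column in zip(*ref_freqs.values())]
    # Words with zero spread cannot be standardised, so they are skipped.
    usable = [i for i, sigma in enumerate(sigmas) if sigma > 0]

    results = []
    for start, window in rolling_windows(test_tokens, window_size, step_size):
        win_freqs = relative_freqs(window, vocab)
        deltas = {
            label: mean(abs(win_freqs[i] - freqs[i]) / sigmas[i] for i in usable)
            for label, freqs in ref_freqs.items()
        }
        results.append((start, deltas))
    return results


if __name__ == "__main__":
    # Toy data: two "authors" who differ only in how often they use two words.
    toy_reference = {
        "P": ["the"] * 8 + ["and"] * 2,
        "G": ["the"] * 2 + ["and"] * 8,
    }
    toy_test = ["the"] * 10 + ["and"] * 10
    for start, deltas in rolling_delta(toy_test, toy_reference, ["the", "and"], 10, 5):
        nearest = min(deltas, key=deltas.get)
        print(start, nearest, deltas)
```

With a window of 5000 tokens and a step of 1000, `rolling_windows` produces the ranges described in the text; a lower delta for a given label means that the window is stylistically closer to that reference text.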
To further improve the quality of our results obtained through Rolling Delta and Classify, we also performed a preliminary analysis of the twelve-text sub-corpus using SVM. SVM is specifically fit for text categorization due to its inductive bias and the linearly separable nature of the task [16]. Using SVM in combination with a Term Frequency-Inverse Document Frequency (TFIDF) vectorizer, we were able to test how different parameters (i.e. MFW and text segment size) affected the model’s ability to correctly distinguish between the two authors. SVM was also compared with other models (Logistic Regression and KNeighborsClassifier), which showed that SVM, in combination with a MFW ≥ 250 and segment size ≥ 1000, reached an accuracy of 1.0.<sup>11</sup>

<sup>9</sup> Here we apply a broader definition of function words; i.e., all those words that belong to a closed class [10]. For instance, we do not remove auxiliary modal verbs. For an extensive discussion of the role of function words and their uses in authorship attribution, see [19].

<sup>10</sup> Here and throughout the paper we use the term cluster to refer to the visual clusters that can be observed in the PCA visualization.

===4. Results===
Fig. 1 shows a PCA of ten novels by Gaiman (blue) and sixteen by Pratchett (black), plus the novel Good Omens (red). The plot was obtained by slicing the texts into 30,703-word segments (the length of Coraline, by Gaiman, the shortest novel in the corpus, following [18]). The plot represents how the works of the two authors cluster together, i.e., how different (or similar) they are from one another. Interestingly, PCA places the three segments of Good Omens in the middle of the plot. Such placement is partially misleading: although in the middle, Good Omens clusters closer to Pratchett’s novels than Gaiman’s. This is also confirmed by predicting the segments’ author using Burrows’s Delta, which attributes all three of them to Pratchett.
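
The input to the PCA in Fig. 1 can be pictured with a short Python sketch: each text is cut into equal-length segments and each segment becomes a row of L1-normalised MFW counts. This is only an illustration under stated assumptions and not the code used for the figure: the function names are hypothetical, the final partial segment is assumed to be dropped, and the subsequent scaling step and the PCA itself are omitted.

```python
from collections import Counter


def slice_segments(tokens, segment_size):
    """Cut tokens into consecutive full-length segments.

    Assumption: a trailing remainder shorter than segment_size is dropped.
    """
    n_full = len(tokens) // segment_size
    return [tokens[i * segment_size:(i + 1) * segment_size] for i in range(n_full)]


def mfw_vector(segment, mfw):
    """Counts of the MFW in one segment, L1-normalised so the row sums to 1."""
    counts = Counter(segment)
    raw = [counts[word] for word in mfw]
    total = sum(raw)
    return [count / total if total else 0.0 for count in raw]


def build_matrix(texts, mfw, segment_size):
    """Return (labels, rows): one row per segment, one column per MFW.

    texts: dict mapping a label (hypothetical, e.g. a title) to its tokens.
    """
    labels, rows = [], []
    for label, tokens in texts.items():
        for k, segment in enumerate(slice_segments(tokens, segment_size), start=1):
            labels.append(f"{label}_{k}")
            rows.append(mfw_vector(segment, mfw))
    return labels, rows
```

Under the drop-the-remainder assumption, a text of 110,935 tokens cut into 30,703-token segments yields three full segments, which is consistent with the three Good Omens segments mentioned above.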
With Pratchett’s novels, we can clearly observe a distinct group of works within the bottom-left quadrant. These include some of Pratchett’s late-80s to mid-90s works, such as Mort, Pyramids, Hogfather, Wyrd Sisters, and Equal Rites. In contrast, his novels from the early 2000s cluster together across the bottom and top-left quadrants. It is also worth noting that all novels written after 2007, the year in which Pratchett was diagnosed with Alzheimer’s (Unseen Academicals and Raising Steam), are clustered far away from the rest of the novels in the top-left quadrant. Furthermore, The Colour of Magic, Pratchett’s first breakthrough novel, gets clustered together with Gaiman’s early novels, which is not surprising, considering that Gaiman has acknowledged reading The Colour of Magic and has said on many occasions that he was influenced by it [15].

Following these observations, we select one novel closer in style to Good Omens (Pyramids), and two novels distinctive of that same first cluster, Wyrd Sisters and Moving Pictures, for our corpus. We deliberately include The Colour of Magic, not only because its relation to Gaiman makes it an interesting case, but also because it represents Pratchett’s early style. From the second cluster, we pick novels written between the late 90s (Jingo) and early 2000s (Thud!). We exclude novels that were written after 2007 as they represent Pratchett’s style after his diagnosis and are clustered together far from the rest.

For Gaiman, although the temporal span of publication is not as wide as Pratchett’s, we follow the same procedure. We pick one of the first novels, which was said to be influenced by Pratchett’s style (Neverwhere), two each from the early-to-mid 2000s (American Gods and Anansi Boys) and the mid-to-late 2000s (M is for Magic and The Graveyard Book), and one of his latest novels (Trigger Warning).
For Gaiman, Anansi Boys represents the closest novel to Good Omens (stylistically), while the four novels mentioned earlier come from the central, densest cluster (between the top and bottom-right quadrants) of Gaiman’s style. As a rule of thumb, our selection tries to be representative of the novel clusters in the PCA, of time of publication, and of stylistic proximity to Good Omens.

Figure 1: Principal Component Analysis of the reference corpus for Pratchett (black) and Gaiman (blue)

Figure 2 represents the diagram outputted using Rolling Delta on the twelve novels (Jingo, 1997; Pyramids, 1989; The Colour of Magic, 1983; Wyrd Sisters, 1988; Moving Pictures, 1990; Thud!, 2005, for Pratchett; and M is for Magic, 2007; Anansi Boys, 2005; American Gods, 2001; The Graveyard Book, 2008; Neverwhere, 1996; Trigger Warning, 2015, for Gaiman) selected through the above-described PCA. Pratchett’s novels are highlighted through warm colours, and Gaiman’s through cold. The horizontal axis represents Good Omens’ segments, while the vertical axis represents the delta distance for each segment compared to the reference novels. The closer a line comes to the x-axis, the more similar the novel represented by that line will be to the segment. The seven vertical lines delimit the six chapters of the book.<sup>12</sup> The text was split into 5000-word-long segments (window size), with a rolling window (step size) of 1000 words. Each segment was then analyzed using the 250 MFW.

Figure 2: Rolling Delta diagram for Good Omens. Vertical lines indicate the end of each of the six chapters

The warm-coloured lines come significantly closer to the horizontal axis for most sections of Good Omens: i.e., Pratchett’s style (mostly Jingo, Pyramids, and Moving Pictures) is predominant throughout the whole book. This is perhaps not surprising, considering that Pratchett has declared that he and Gaiman agreed that Pratchett would take on the role of editor and finalize the novel. Gaiman’s style appears to be predominant in only two sections:<sup>13</sup> around the beginning of the 6th chapter (line e) and between the 65000- and 75000-word marks, with only two smaller contributions at the very beginning of the second chapter and around the 85000-word mark.

This trend only partially fits with the authors’ claims. As noted before, we expected Gaiman to be far more present at the beginning of the novel. Here we get a short glimpse of Anansi Boys’ style, which shows up again further down in the middle of Chapter 6 (e–f). However, Pratchett’s style (Jingo and Pyramids) is prevalent throughout Chapter 2 (a–b).

Regarding the two segments in which Gaiman’s idiom (specifically, Trigger Warning, M is for Magic, and American Gods) is present for longer sections of the novel, they both coincide with two significant scenes involving the Four Horsemen of the Apocalypse. The very start of Chapter 6 introduces Death, arguably the most important Horseman, and involves Pollution while the next span (65000–75000) corresponds to the coming together of the Four Horsemen.<sup>14</sup> The fact that these two segments are attributed to Gaiman aligns with statements from the authors: ‘... the Four Horsemen and anything with maggots started with Neil’ [14, p. 478].

<sup>11</sup> See Table 2 and Table 3 in Appendix A.2 for a summary of the SVM set-up results.

<sup>12</sup> It is worth noting that Good Omens’ very first chapter (titled In the Beginning) is shorter than the segment length selected to run Rolling Delta and Rolling Classify. As such, the first line (a) in Fig. 2 and 3 appears to be outside of the text. Although this is slightly more visible in Fig. 2, we decided not to move it to retain the original partition of the book and to make Fig. 2 and 3 more comparable.

<sup>13</sup> For a more intuitive visualization of the novels see Fig. 6 in Appendix A.2. The chromatic distinction is here by author rather than by novel. Such a visualization makes it easier to see how Pratchett’s novels (as a whole) are closer to Good Omens’ style compared to Gaiman’s.

<sup>14</sup> The scene of the Four coming together happens at exactly word n° 70209. However, the previous scene is still related to the Four Horsemen, who are being followed by the other four characters.

Figure 3: Rolling Classify diagram for Good Omens. Vertical lines indicate the end of each of the six chapters

Fig. 3 is the visualization of the results of Rolling Classify<sup>15</sup> on the novel. In distinction to Delta, Rolling Classify does not plot stylometric information per novel. The horizontal x-axis represents the word count of Good Omens, with Pratchett’s presence delineated in green and Gaiman’s in red. The vertical lines on the underside of the x-axis represent to whom the segment has been attributed. The upper lines, on the other hand, denote whether the second author’s style is present and to what extent. The height of the lines on both sides indicates the degree of certainty with which the classification has been made. The vertical dotted lines represent chapter markers. The Rolling Classify method was configured using our list of 117 MFW to analyze text segments of 5000 words using 1000-word steps.

The Rolling Classify results generally confirm Rolling Delta’s output: Good Omens is predominantly composed in Pratchett’s style. We again observe that Gaiman’s style is most discernible between the 65,000- and 75,000-word marks, with a small additional contribution at 80,000. Across the intersection of Chapters 2–3 (b) and Chapters 3–4 (c), we see short instances of text segments attributed to Gaiman’s style, too. These do not correspond to the results of the Delta. Where Rolling Delta found a predominant presence of Gaiman (see Fig.
2) at the beginning of Chapter 6 (e–f), Rolling Classify attributes the segment to Pratchett, with Gaiman’s presence being detected in the background. This may be related to the difference in MFW used. While Rolling Classify allows using a custom MFW list, this is not possible for Rolling Delta.<sup>16</sup> As such, the latter analysis may have been influenced by content words (e.g. characters’ names, see Section 3) present in the unculled 250 MFW.

Compared to Callaway [8], our results attribute significantly larger chunks of Good Omens to Pratchett. The segments attributed to Gaiman in Callaway [8] are detected as Gaiman’s authorial signals in Fig. 3, but are not attributed to him (except some smaller segments).<sup>17</sup> Additionally, our results show a higher degree of certainty in attributing the segments to each author, i.e., less overlay between the two authors is present throughout the novel in Fig. 3 compared to Callaway’s [8] results.

<sup>15</sup> As a reminder, we here use the same 12 novels used for Rolling Delta: Jingo, 1997; Pyramids, 1989; The Colour of Magic, 1983; Wyrd Sisters, 1988; Moving Pictures, 1990; Thud!, 2005, for Pratchett and M is for Magic, 2007; Anansi Boys, 2005; American Gods, 2001; The Graveyard Book, 2008; Neverwhere, 1996; Trigger Warning, 2015, for Gaiman.

<sup>16</sup> In our Rolling Delta experiments, we tick the option which allows the user to cull personal pronouns.

Figure 4: Rolling Classify diagram for Good Omens screenplay with episode markers

Figure 5: Rolling Classify diagram for Good Omens novel with text-match markers

Fig. 4 and 5 test our hypotheses for the screenplay. Both figures were obtained using a slice size of 5000 with an overlap of 4000. However, while Fig. 5 uses our list of 117 MFW, the screenplay (Fig. 4) was analyzed using 1000 MFW. Fig. 4 shows that our classifier correctly attributes the screenplay to Gaiman, with a segment between Episode 5 and 6 (e)<sup>18</sup> attributed to Pratchett.
<sup>17</sup> For instance, Callaway’s [8] results show that a large portion of text before the 2000-word mark is attributed to Gaiman while ours indicate that only a smaller section at the end of Chapter 2 is to be attributed to Gaiman. A similar pattern can be observed at the very end of the novel.

<sup>18</sup> Vertical lines denote episodes.

The classification results shown in Fig. 5 are the same as Fig. 3; however, the plot of the novel is here overlaid with sixty-five vertical dotted lines representing identical passages in the screenplay.<sup>19</sup> The screenplay was compared to the novel using the text reuse detection tool Text-Matcher, which yielded the list of matches [24]. The matches comprise both dialogue and character and scene descriptions. Interestingly, the matching passages occur relatively frequently up until the 20,000-word mark of the novel. Their frequency then progressively diminishes until the 75,000-word mark. From this point of the novel onward there are no matches between the book and the screenplay. This pattern leads us to assume that Gaiman relied less on the source material as the narrative progressed. These results are compatible with the observations that can be made by comparing the novel to the series: some of the most important scenes and characters from the book have been excluded from the screen adaptation<sup>20</sup> while others are solely present in the show.<sup>21</sup> It is worth noting that further analysis is needed to explore the style of the screenplay and that our conclusions’ reliability is limited by the scope of this study.

===5. Conclusions===
The present paper aimed to explore authorial takeovers in Good Omens by Terry Pratchett and Neil Gaiman. We also compared the novel to the screenplay of the show written by Gaiman and based on the book. The application of stylometric techniques to the works of the two authors yields interesting results.
From the PCA, we can observe how Pratchett’s novels written after 2007, the year of his Alzheimer’s diagnosis, cluster differently from most of his works. This pattern denotes a shift in Pratchett’s writing style, which may be related to his neurological disease.^22 Interestingly, PCA also locates The Colour of Magic, Pratchett’s breakthrough work, next to Neverwhere, one of Gaiman’s first novels. The clustering suggests that The Colour of Magic may have influenced Gaiman’s early idiom.^23

Rolling Delta partially confirms our expectations of the novel. Pratchett’s idiom is predominant in the book. Instances of Gaiman’s style, especially throughout Chapter 6, can largely be attributed to the presence of the Four Horsemen in those sections, characters that he authored [14].

^19 The chapter markers are not present in this visualisation.
^20 For instance, the highway chase at the beginning of Chapter 6, where four bikers decide to follow the Four Horsemen in their ride along the M25 highway, is not present in the show.
^21 E.g., the finale, during which Aziraphale and Crowley switch bodies to survive the punishments of Heaven and Hell, is absent from the novel.
^22 Our conclusion is here derived from an observation of stylometric patterns and does not account for the complex nature of Alzheimer’s. There is little academic research regarding the effect of Alzheimer’s on Pratchett’s writing style. The only article on the issue was published on the Pratt School of Information’s website by one of the institute’s students (see [28]). Here, the author outlines how vocabulary complexity has not diminished but rather increased throughout Pratchett’s last novels, thus concluding (with the necessary reservations) that his neurological condition likely did not affect his writing style.
^23 This was the first of Pratchett’s works read by Gaiman [15].
Critical literature on Pratchett notes that many writers “have found after lengthy exposure to Pratchett’s prose that it has worn grooves in their heads” [1, p. 148].

Rolling Classify generally confirms the results of Rolling Delta, except for two additional shorter segments being attributed to Gaiman. Compared to the results obtained by previous studies [8], we find Pratchett to be far more predominant throughout the novel. Our results reveal a higher degree of confidence in attributing segments to each author, showing fewer overlays between Gaiman’s and Pratchett’s styles compared to Callaway’s [8].

The screenplay analysis further shows that the classifier can correctly attribute the text almost entirely to Gaiman despite the difference in genre. Based on text matches between the screenplay and the novel, we speculate that Gaiman may have relied less on the source material towards the end of the screenplay. It is worth noting that the use of the screenplay as control for the efficacy of our classifier is limited as the two texts do not belong to the same genre. Further research could explore the issue of the screenplay by retrieving other screenplays written by Pratchett and Gaiman and that of the upcoming second season of Good Omens.

6. Code and data availability

Code and datasets are available at https://zenodo.org/record/7257715

7. Acknowledgments

A special thanks to Prof. Mike Kestemont and Dr. Wouter Haverals, who supported and encouraged us during the making of this project. We also want to thank Eveline C. for allowing us to use her living room as our office.

References

[1] A. H. Alton, W. C. Spruiell, and D. Palumbo. Discworld and the Disciplines: Critical Approaches to the Terry Pratchett Works (Critical Explorations in Science Fiction and Fantasy, 45). annotated edition. McFarland & Company, 2014.
[2] BBC News. “Good Omens: How Neil Gaiman and Terry Pratchett wrote a book”. 2014. url: https://www.bbc.com/news/magazine-30512620.
[3] J. N. G. Binongo. “Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution”. In: Chance 16.2 (2003), pp. 9–17. doi: 10.1080/09332480.2003.10554843. url: https://doi.org/10.1080/09332480.2003.10554843.
[4] J. N. G. Binongo and M. W. A. Smith. “The application of principal component analysis to stylometry”. In: Literary and Linguistic Computing 14.4 (1999), pp. 445–466. doi: 10.1093/llc/14.4.445.
[5] L. Breebaart. The Annotated Pratchett File v9.0 - Words from the Master. 2016. url: https://www.lspace.org/books/apf/words-from-the-master.html.
[6] J. Burrows. “‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship”. In: Literary and Linguistic Computing 17.3 (2002), pp. 267–287. doi: 10.1093/llc/17.3.267.
[7] F. Cafiero and J. Camps. “‘Psyché’ as a Rosetta Stone? Assessing Collaborative Authorship in the French 17th Century Theatre”. In: Proceedings of the Conference on Computational Humanities Research, CHR2021, Amsterdam, The Netherlands, November 17-19, 2021. Ed. by M. Ehrmann, F. Karsdorp, M. Wevers, T. L. Andrews, M. Burghardt, M. Kestemont, E. Manjavacas, M. Piotrowski, and J. van Zundert. Vol. 2989. CEUR Workshop Proceedings. CEUR-WS.org, 2021, pp. 377–391. url: http://ceur-ws.org/Vol-2989/long_paper51.pdf.
[8] E. Callaway. Good Omens Stylometry. Elizabeth Callaway. url: http://www.elizabethcallaway.net/good-omens-stylometry.
[9] J. B. Croft. “Nice, Good, or Right: Faces of the Wise Woman in Terry Pratchett’s ‘Witches’ Novels”. In: Mythlore: A Journal of J.R.R. Tolkien, C.S. Lewis, Charles Williams, and Mythopoeic Literature 26.3 (2008).
[10] M. Deuchar. “Are function words non-language-specific in early bilingual two-word utterances?” In: Bilingualism: Language and Cognition 2.1 (1999), pp. 23–34. doi: 10.1017/s1366728999000127.
[11] G. Dougary. “Good Omens: Neil Gaiman reveals what he and Terry Pratchett shared”. 2019. url: https://www.smh.com.au/culture/tv-and-radio/good-omens-neil-gaiman-reveals-what-he-and-terry-pratchett-shared-20190603-p51u1y.html.
[12] M. Eder. “Rolling stylometry”. In: Digital Scholarship in the Humanities 31.3 (2016), pp. 457–469. doi: 10.1093/llc/fqv010.
[13] S. Evert, T. Proisl, T. Vitt, C. Schöch, F. Jannidis, and S. Pielström. “Towards a better understanding of Burrows’s Delta in literary authorship attribution”. In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature. Denver, Colorado, USA: Association for Computational Linguistics, 2015, pp. 79–88. doi: 10.3115/v1/W15-0709. url: https://aclanthology.org/W15-0709.
[14] N. Gaiman and T. Pratchett. Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch (Cover may vary). William Morrow, 1990.
[15] N. Gaiman [neilhimself]. The Colour of Magic. [Tweet]. 2018. url: https://twitter.com/neilhimself/status/1023385399694163969.
[16] T. Joachims. “Text categorization with Support Vector Machines: Learning with many relevant features”. In: Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 137–142. doi: 10.1007/bfb0026683.
[17] I. T. Jolliffe. “Principal Component Analysis and Factor Analysis”. In: Principal Component Analysis (1986), pp. 115–128. doi: 10.1007/978-1-4757-1904-8_7.
[18] F. Karsdorp, M. Kestemont, and A. Riddell. Humanities Data Analysis: Case Studies with Python. Princeton University Press, 2021.
[19] M. Kestemont, S. Moens, and J. Deploige. “Collaborative authorship in the twelfth century: A stylometric study of Hildegard of Bingen and Guibert of Gembloux”. In: Digital Scholarship in the Humanities 30.2 (2013), pp. 199–224. doi: 10.1093/llc/fqt063.
[20] M. Kestemont, J. Stover, M. Koppel, F. Karsdorp, and W. Daelemans. “Authenticating the writings of Julius Caesar”. In: Expert Systems with Applications 63 (2016), pp. 86–96. doi: 10.1016/j.eswa.2016.06.029.
[21] T. Litvinova and O. Litvinova. “Analysis and Detection of a Radical Extremist Discourse Using Stylometric Tools”. In: Digital Science 2019 (2019), pp. 30–43. doi: 10.1007/978-3-030-37737-3_3.
[22] D. Müllner. “Modern hierarchical, agglomerative clustering algorithms”. In: arXiv (2011).
[23] R Core Team, Vienna, Austria. “R: A language and environment for statistical computing”. In: R Foundation for Statistical Computing (2020). url: https://www.R-project.org/.
[24] J. Reeve. “Text-matcher”. In: Github journal (2020). doi: 10.5281/zenodo.3937738.
[25] J. Rybicki, D. Hoover, and M. Kestemont. “Collaborative authorship: Conrad, Ford and Rolling Delta”. In: Literary and Linguistic Computing 29.3 (2014), pp. 422–431. doi: 10.1093/llc/fqu016.
[26] J. Shanahan. “Terry Pratchett: Mostly Human”. In: Twenty-First-Century Popular Fiction. 1st ed. Amsterdam, Netherlands: Amsterdam University Press, 2017, p. 31.
[27] J. Stover and M. Kestemont. “Reassessing The Apuleian Corpus: A computational Approach To Authenticity”. In: The Classical Quarterly 66.2 (2016), pp. 645–672. doi: 10.1017/s0009838816000768.
[28] Were Terry Pratchett’s Final Works Affected by Alzheimer’s Disease?: An Analysis into Vocabulary Trends within the Discworld Series, Post Diagnosis. Tech. rep. 2016. url: https://studentwork.prattsi.org/dh/2016/05/08/were-terry-pratchetts-final-works-affected-by-alzheimers-disease-an-analysis-into-vocabulary-trends-within-the-discworld-series-post-diagnosis/.

A. Additional Figures and Tables

A.1. SVM set-up

Tables 2 and 3 report the results obtained with different SVM set-ups. The experiments were carried out using a 0.70–0.15–0.15 train-validation-test split. The validation set was utilized to verify the consistency of the results. The training data was limited to the 12-novel sub-corpus since the aim of these experiments was to understand whether the classifiers available in Rolling Classify could correctly attribute different text segments to each author.
Because Rolling Stylometry functions only allow a maximum of twelve texts (six for each author), using a larger training set for our models would have contradicted the experiments’ goal. That is, even if a classification algorithm trained on all available texts had achieved better results, these would not have been a reliable foundation for the Rolling Classify experiments.

Table 2: Performance of different classifiers for different segment lengths on the 12-novel sub-corpus. MFW set to 1000.

{| class="wikitable"
! Classifier !! Segment Length !! MFW !! Accuracy !! Macro-avg
|-
| svm.SVC || 250 || 1000 || 0.99 || 0.99
|-
| svm.SVC || 500 || 1000 || 1.00 || 1.00
|-
| svm.SVC || 1000 || 1000 || 1.00 || 1.00
|-
| KNeighborsClassifier || 250 || 1000 || 0.97 || 0.97
|-
| KNeighborsClassifier || 500 || 1000 || 0.99 || 0.99
|-
| KNeighborsClassifier || 1000 || 1000 || 1.00 || 1.00
|-
| LogisticRegression || 250 || 1000 || 0.99 || 0.99
|-
| LogisticRegression || 500 || 1000 || 0.99 || 0.99
|-
| LogisticRegression || 1000 || 1000 || 1.00 || 0.99
|}

Table 3: svm.SVC performance for different MFW on the 12-novel sub-corpus. Segment length set to 1000 following the results in Table 2.

{| class="wikitable"
! Classifier !! Segment Length !! MFW !! Accuracy !! Macro-avg
|-
| svm.SVC || 1000 || 50 || 0.92 || 0.93
|-
| svm.SVC || 1000 || 100 || 0.98 || 0.98
|-
| svm.SVC || 1000 || 117 || 0.99 || 0.99
|-
| svm.SVC || 1000 || 250 || 1.00 || 1.00
|-
| svm.SVC || 1000 || 500 || 1.00 || 1.00
|-
| svm.SVC || 1000 || 1000 || 1.00 || 1.00
|}

A.2. Rolling delta with color coding per author

Figure 6: Rolling Delta diagram for Good Omens (250 MFW, window size 5000, step size 1000). Gaiman’s novels are in amber, Pratchett’s in blue.
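As a rough illustration of the distance that underlies Fig. 6, the sketch below computes classic Burrows’s Delta [6]: the mean absolute difference between the z-scored relative frequencies of the most frequent words in two texts. It is a simplified stand-in written for this page, not the implementation used in the study. The function names, the use of the sample standard deviation, and the handling of words with zero variance are all assumptions.

```python
from collections import Counter
from statistics import mean, stdev


def most_frequent_words(docs, n_mfw):
    """Return the n_mfw most frequent words across all reference texts."""
    counts = Counter()
    for tokens in docs.values():
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n_mfw)]


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one (non-empty) text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def burrows_delta(docs, target_tokens, n_mfw=250):
    """Return {label: Delta(target, reference text)}.

    `docs` maps a label to a token list and needs at least two texts,
    because the z-scores use the standard deviation across texts.
    """
    vocab = most_frequent_words(docs, n_mfw)
    freqs = {label: relative_freqs(toks, vocab) for label, toks in docs.items()}
    columns = list(zip(*freqs.values()))
    mu = [mean(col) for col in columns]
    sigma = [stdev(col) for col in columns]
    # Assumption: words with zero variance carry no signal and are skipped.
    usable = [i for i, s in enumerate(sigma) if s > 0]

    def z_scores(vec):
        return [(vec[i] - mu[i]) / sigma[i] for i in usable]

    target_z = z_scores(relative_freqs(target_tokens, vocab))
    return {
        label: mean(abs(t - r) for t, r in zip(target_z, z_scores(vec)))
        for label, vec in freqs.items()
    }


# Toy usage with two tiny "reference texts".
toy_docs = {
    "A": "the cat sat on the mat and the cat slept".split(),
    "B": "a dog ran to a park and a dog barked".split(),
}
toy_delta = burrows_delta(toy_docs, toy_docs["A"])
```

Applying such a function to successive windows of the disputed text (window size 5000 and step size 1000 in Fig. 6) would give one Delta curve per reference text, with lower values indicating greater stylistic similarity.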