The GOLEM Triple Store: A Graph-based Representation of Narrative and Fiction

Franziska Pannach (f.a.pannach@rug.nl), Center for Language and Cognition (CLCG), University of Groningen
Xiaoyan Yang (xiaoyan.yang@rug.nl), Center for Language and Cognition (CLCG), University of Groningen
Noa Visser Solissa, Center for Language and Cognition (CLCG), University of Groningen
Ze Yu (z.yu@rug.nl), Center for Language and Cognition (CLCG), University of Groningen
Andreas van Cranenburgh (van.cranenburgh@rug.nl), Center for Language and Cognition (CLCG), University of Groningen
Michiel van der Ree (michiel.van.der.ree@rug.nl), Center for Information Technology (CIT), University of Groningen
Federico Pianzola (f.pianzola@rug.nl), Center for Language and Cognition (CLCG), University of Groningen

ISSN 1613-0073

In this paper, we present the GOLEM triple store, a massive triple store resource for fiction and narrative. This triple store is the first step towards a large-scale knowledge graph for stories, as well as characters and events in narratives. At the moment, it contains metadata for more than 8 million stories collected from the Archive of Our Own (AO3) [1], providing scholars with a tool to derive unique insights into fan narratives and storytelling trends over time.

Semantic Methods for Events and Stories (SEMMES) Workshop, 2024

Introduction

In this article we introduce a new resource for the large-scale study of fiction on the basis of metadata and "derived data" [2], or "mesodata" [3], that is, various textual features that make it possible to compare documents without accessing their full text. The idea is similar to that of the HathiTrust Extracted Features dataset [4], but the features encoded in the GOLEM ("Graphs and Ontologies for Literary Evolution Models") triple store are much richer, also referring to narrative and stylistic elements and to reader response data (e.g. characters, relationships, topics, readability, sentiment of comments received by the story). Similar projects exist on a smaller scale for a selection of texts in English [5], Dutch [6] and German [7]. The creation of the GOLEM triple store has been inspired by such work but will operate on a completely different scale, which requires automating the extraction of textual features for millions of stories.

The core concept of the GOLEM infrastructure is that of "programmable corpora", i.e. "research-oriented corpora providing an API" [8], which allows scripts, notebooks, and analysis pipelines to be easily reapplied to all texts in the corpora, inasmuch as they are encoded following the same principles and can be queried via the same API and SPARQL endpoint. Since GOLEM focuses primarily on derived data, there is no need for a resource-intensive XML database of texts encoded in TEI (Text Encoding Initiative) format, like that created by [8]. Only statements about the texts and their reception are stored in the database.

The rst batch of texts made available in the GOLEM triple store are gathered from the largest and most popular fanction platform, Archive of Our Own (AO3) 2 . The spread of digital and social media in the 21st century has recongured many literary dynamics, namely it has reduced the inuence of literary institutions like literary critics, publishers, and schools, creating more spaces and occasions for niche and amateur ction to become popular, thanks to reader-to-reader interactions. Fanction platforms and website for book reviews/discussion (e.g. Goodreads) are now some of the most thriving environments to study narrative, ction, and reader response. In the fanction space, readers become writers, viewers become creators and recipients become participants of their favorite (ctional) universes. Since fanction writers publish their own works independently from publishing houses and editors, their creativity knows no bounds, and is not subjected to limitations or censorship. Writers easily cross from one ctional universe (fandom) into another, or place themselves (or the reader) as characters in their favorite stories. Fanction has become an integral part of transformative fan culture, a cultural phenomenon in its own right.

Up to 2022 (the cut-off point of our data collection), more than 8.7 million stories were published on AO3 in the English language alone. Additionally, there is wide coverage of other languages, including Chinese, Italian, and Korean, as well as many so-called low-resource languages, such as Bahasa Indonesia or isiZulu. Therefore, the fanfiction domain holds immense potential for the study of user-produced narratives, readers' response, semantic and narrative modeling approaches, and for the development of natural language processing (NLP) tools for low- or under-resourced languages. While certain individual features of the fanfiction domain have been investigated, such as user-selected and user-provided tags [9], the narratives in their entirety are largely under-studied.

The GOLEM Project triple store is a large and easily accessible resource for querying AO3 data, and more data sources will be added later on. We demonstrate the potential of the triple store in three case studies.

The article is structured as follows: Section 2 describes the data and its representation in the triple store. Section 3 presents some illustrative case studies. Section 4 contains the discussion, and Section 5 describes future work and planned extensions.

Notably, the GOLEM triple store does not provide access to the full text, which remains solely accessible through the Archive of Our Own. However, text-based features relating to events and characters are planned to be incorporated in the knowledge graph, see Section 5. Table 1 explains the relevant data fields and gives examples where needed. Some values that were originally aggregated in lists, such as the additional tags in Figure 1, are split into multiple triples (e.g. one golem:keyword triple per tag) to facilitate explorability of the data, such as querying specific keywords in fandoms.
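The splitting of list-valued fields into one triple per value can be sketched as follows. This is only an illustration, not the project's ingest code: the story IRI pattern, the helper names, and the N-Triples-style output are assumptions; only the golem namespace and the keyword predicate come from the text.

```python
# Illustrative sketch: one triple per list element.
# The story IRI below is hypothetical; only the namespace is from the paper.
GOLEM = "https://golemlab.eu/graph/"


def escape_literal(value: str) -> str:
    """Escape a string so it can be written as a quoted RDF literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def list_to_triples(story_iri: str, predicate: str, values: list[str]) -> list[str]:
    """Return one N-Triples line per value instead of one aggregated list."""
    return [
        f'<{story_iri}> <{GOLEM}{predicate}> "{escape_literal(v)}" .'
        for v in values
    ]


story = "https://golemlab.eu/graph/story/12345"  # hypothetical identifier
tags = ["Angst", "Loch-Ness Monster"]
triples = list_to_triples(story, "keyword", tags)
for line in triples:
    print(line)
```

Because every tag becomes its own triple, a query can match a single keyword with a plain triple pattern rather than searching inside a serialized list.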

For internal use, the data was rst harvested from the archive.org archive 3 and stored in an internal Elasticsearch database with help of a custom ingest script 4 . This database has now been converted into triple store data and is available at: http://graph.golemlab.eu:8890/sparql via an institutional Virtuoso server 5 . Virtuoso was chosen because it scales well with growing knowledge graphs. Even holding multiple billions of triples on a single instance, single machine setup, Virtuoso still performs well, in a real-life setting. 6 .

Up to the date of this publication, metadata for 8 million stories have been made available. The data in the GOLEM triple store contains the story-related metadata in AO3 up to and including December 2022. With this cut-off we want to limit the number of stories that are potentially written with the aid of large language models, allowing for a more reliable investigation of human storytelling. While comparing human-generated stories with narratives produced by large language models could be an interesting area of research, it is not currently within the scope of the project. Extending the knowledge graph with more recent (human) user-produced stories from AO3 and other fanfiction platforms is planned for future updates.

Workflow

We transferred the Elasticsearch data into the triple store in a series of steps. First, the database was queried for stories in languages other than English. The smaller language sets were converted to triples in one step. Larger languages, such as Russian and Chinese, were processed using a batch size of 50,000 stories. The English data was queried from the database by fandom. The larger fandoms were processed in batches, while the smaller fandoms, i.e. the fandoms with few or very few stories, were queried sequentially from the database before they were converted into triples according to the schema above. This process has two time-consuming bottlenecks: the download and the import into the Virtuoso instance. This is illustrated using the example of English fanfiction stories for the Attack on Titan anime in Table 2. The Elasticsearch data (jsonl format) for this fandom has a size of 4.9 GB, resulting in 60,503 triple store entries.
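The batched processing described above could look roughly like the following sketch. It is not the project's conversion script: the helper names and the record fields (`id`, `language`) are hypothetical, and a tiny batch size stands in for the 50,000 stories mentioned in the text.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON record per non-empty line (jsonl format)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def in_batches(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Group records into lists of at most `size` items without loading all of them."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical sample export; real records carry many more fields.
sample_lines = [
    '{"id": 1, "language": "Polski"}',
    '{"id": 2, "language": "Polski"}',
    "",
    '{"id": 3, "language": "Polski"}',
    '{"id": 4, "language": "Polski"}',
    '{"id": 5, "language": "Polski"}',
]

# The text uses a batch size of 50,000 stories; 2 keeps the example readable.
batches = list(in_batches(read_jsonl(sample_lines), size=2))
print([len(batch) for batch in batches])
```

Working on a generator keeps memory use bounded by the batch size, which matters when a single fandom export is several gigabytes.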

Querying the triple store

Three small case studies are presented here to demonstrate how to use the triple store and to give more insight into the data it contains.

First, to find out which languages are contained in the data and how many stories there are per language (stories can have more than one associated language), we can write a simple SPARQL query using COUNT. The result is a list of 110 languages in total, with the top 10 by story count presented in Table 3. The same query can be made for fandoms instead of languages, yielding an ordered list of the fandoms with the most stories (e.g. Harry Potter - J.K. Rowling: 324,767; Marvel Cinematic Universe: 252,605; Supernatural: 244,182).

Next, to find out the distribution of stories per rating (e.g. Mature or Explicit) for the fandom "Artemis Fowl - Eoin Colfer", we can count the golem:rating values of all stories that have this golem:fandom value. The query yields a distribution of ratings of stories within the fandom, which is normalized and illustrated in Figure 2. We can see that this particular fandom produces stories that are largely targeted at general audiences or teen and older audiences, with only a few explicit or mature stories. In contrast, the same query for the fandom BTS (a popular Korean boy band) produces a different distribution, with a larger proportion of explicit and mature stories (see Figure 3).

Content-related fields are interesting for processing the fanfiction data in downstream tasks and for deriving additional semantic information on the stories. Therefore, the last example shows how to query for a list of summaries of fanfiction stories in a specific fandom and language that are tagged with a specific keyword.

```sparql
PREFIX golem: <https://golemlab.eu/graph/>
SELECT ?o WHERE {
  ?s golem:keyword "Angst" .
  ?s golem:fandom "Attack on Titan" .
  ?s golem:language "English" .
  ?s golem:summary ?o .
}
```

The results of this query are available at https://github.com/GOLEM-lab/triple_store/.
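A query like this can also be sent to the endpoint programmatically. The sketch below uses only the Python standard library and the standard SPARQL protocol (a GET request with a `query` parameter and a JSON results format); it assumes the endpoint accepts such requests, and the function names and the sample response are illustrative rather than part of the GOLEM tooling.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://graph.golemlab.eu:8890/sparql"  # endpoint named in the paper

QUERY = """
PREFIX golem: <https://golemlab.eu/graph/>
SELECT ?o WHERE {
  ?s golem:keyword "Angst" .
  ?s golem:fandom "Attack on Titan" .
  ?s golem:language "English" .
  ?s golem:summary ?o .
}
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def extract_values(results: dict, var: str) -> list[str]:
    """Collect the values bound to `var` from a SPARQL JSON result document."""
    return [b[var]["value"] for b in results["results"]["bindings"] if var in b]


def run_query(query: str) -> dict:
    """Send the query over the network (assumes the endpoint is reachable)."""
    with urlopen(build_request(query), timeout=60) as response:
        return json.load(response)


# Offline illustration with a made-up response in the standard JSON format.
sample = {
    "head": {"vars": ["o"]},
    "results": {"bindings": [{"o": {"type": "literal", "value": "A short summary."}}]},
}
summaries = extract_values(sample, "o")
print(summaries)
```

Calling `extract_values(run_query(QUERY), "o")` would return the summaries from the live endpoint, provided it is available and permits the request.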

Discussion

In this short paper, we present the GOLEM triple store, our effort towards a comprehensive semantic representation of fannish narratives. It provides users with manifold possibilities to study fanfiction from different viewpoints, e.g. by inspecting keywords and tags provided by the users or the distribution of romantic pairings across different fandoms. To date, the triple store contains more than 8 million stories. An overview of the statistics of the GOLEM triple store is given in Table 4.

Future Work

The presented triple store is the rst step towards a broader knowledge base for fanction narratives. In the short term, the triple store will be extended with additional reader response data, such as number of time users have bookmarked a story. It will further be extended from a story-centric view to a more complete data modelling based on the existing AO3 metadata, e.g. by modelling content collections according to various criteria. In the medium term, the GOLEM triple store will be extended towards a full-edged knowledge graph of characters and events in the fanction domain. This includes the results of character analysis, modeling essential properties of ctional characters, i.e. physiological and psychological traits, as well as narrative function of a character. Additionally, the full knowledge graph will also contain additional data on reader response (e.g. emotions felt, etc.) The project is currently developing a comprehensible ontology [13] for the modelling of (fan narratives), which will be aligned to relevant other ontologies, as closely as possible in order to maximize the interoperability with other relevant projects, like Wikidata and MiMoText [7].

We additionally plan to regularly report up-to-date statistics on the quality of the knowledge graph (such as consistency) on the project website.

Currently, the triple store only contains stories from AO3. However, we are working on including data from other sources, such as Wattpad and fanfiction.net.

Acknowledgements

This work is part of the Golem Lab: Graphs and Ontologies for Literary Evolution Models project, a 5-year (2023-2027) research project funded by the European Commission (ERC StG).

Figure 1: Example metadata for AO3 data. Source: https://archiveofourown.org/

Listing: SPARQL query counting stories per language (first case study)

```sparql
PREFIX golem: <https://golemlab.eu/graph/>
SELECT ?o (COUNT(?o) as ?oCount)
WHERE
{
  ?s golem:language ?o .
}
GROUP BY ?o
ORDER BY DESC(?oCount)
```

Listing: SPARQL query counting stories per rating in the fandom "Artemis Fowl - Eoin Colfer" (second case study)

```sparql
PREFIX golem: <https://golemlab.eu/graph/>
SELECT ?o (COUNT(?o) as ?oCount)
WHERE
{
  ?s golem:rating ?o .
  ?s golem:fandom "Artemis Fowl - Eoin Colfer" .
}
GROUP BY ?o
ORDER BY DESC(?oCount)
```
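The rating counts returned by such a query are absolute numbers, while Figures 2 and 3 show normalized distributions. A minimal sketch of that normalization step is given below. The function name is hypothetical and the counts are made-up placeholders chosen for easy arithmetic, not results from the triple store.

```python
def normalize(counts: dict[str, int]) -> dict[str, float]:
    """Turn absolute counts per rating into shares that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for an empty fandom.
        return {rating: 0.0 for rating in counts}
    return {rating: count / total for rating, count in counts.items()}


# Placeholder counts for illustration only (not actual query results).
counts = {
    "General Audiences": 50,
    "Teen And Up Audiences": 30,
    "Mature": 15,
    "Explicit": 5,
}
shares = normalize(counts)
for rating, share in shares.items():
    print(f"{rating}: {share:.2f}")
```

Normalizing makes fandoms of very different sizes directly comparable, which is what the contrast between the two figures relies on.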

Figure 2: Distribution of content ratings in the fandom "Artemis Fowl"

Figure 3: Distribution of content ratings in the fandom "BTS"

Table 1: Triple Store Predicates

| Predicate | Explanation | Example |
| golem:author | Author username (anonymised) | |
| golem:characters | Characters appearing in the story | Molly Weasley |
| golem:collections | Title of the collection that a story is part of | Good Omens Minisode Minibang 2024 |
| golem:contentWarning | Content warnings regarding level of violence/sexuality | Graphic Depictions Of Violence |
| golem:datePackaged | Date packaged for the project database | |
| golem:datePublished | Date published on AO3 | |
| golem:dateModified | Date updated by the author | |
| golem:fandom | Fictional universe(s) of the story | Good Omens (TV Show) |
| golem:keyword | User-provided content keywords | Loch-Ness Monster |
| golem:language | Language in which the story is written | English, Italiano |
| golem:numberOfChapters | Number of chapters | |
| golem:numberOfComments | Number of comments | |
| golem:numberOfKudos | Number of user approvals (similar to likes) | |
| golem:numberOfWords | Number of words | |
| golem:publicationStatus | In-Progress or Completed | |
| golem:publisher | Source platform | archiveofourown.org |
| golem:rating | Content rating, level of sexuality/violence | Teen and Up Audiences |
| golem:romanticCategory | Classification for romantic relationships within the story | F/M, Gen (no rel.) |
| golem:socialRelationships | Social, e.g. romantic or sexual, relationships between characters | Arthur/Molly Weasley |
| golem:series | Series the work is a part of, if any | |
| golem:summary | Text of the summary | |
| golem:title | Title | |

Table 2: Attack on Titan Example Workflow

| Step | Time in minutes (mm:ss) |
| Elasticsearch export | 20:00 |
| Copy jsonl files | 1:47 |
| Convert to TTL format | 3:20 |
| Import to Virtuoso | 9:57 |
| Copy TTL files for backup | 0:11 |

Table 3: Results for the first case study, stories per language

| Language | Count |
| English | 7,129,450 |
| 中文-普通话 | 448,268 |
| Русский | 148,981 |
| Español | 96,477 |
| Français | 41,006 |
| Italiano | 27,762 |
| Português brasileiro | 22,115 |
| Bahasa Indonesia | 21,605 |
| Deutsch | 17,757 |
| Polski | 15,551 |

Table 4: Triple Store Statistics (counts)

² https://archiveofourown.org
³ https://archive.org/details/AO3_final_location
⁴ Available at https://github.com/GOLEM-lab/golem-ingest
⁵ https://virtuoso.openlinksw.com/
⁶ See UniProt, https://www.w3.org/wiki/LargeTripleStores#OpenLink_Virtuoso_v7.2B_.2894.2B.2B_explicit.2C_uncounted_virtual.2Finferred.2C_in_1_instance_on_1_machine.29

Data

Apart from the textual data provided by fanfiction writers and the common metadata such as author and title, AO3 provides a wide array of additional metadata, such as user-selected content tags, characters appearing in the story, and their relationships. Particularly popular are romantic (canon and non-canon) character pairings. Users can praise and react to each other's work by giving kudos or leaving comments.

Individual stories in the triple store are identied by their story ID. Each story has a number of associated metadata items, such as summary, word count, date published and more. As of date, all predicates in the triple store are using the golem prex (https://golemlab.eu/graph/ ), derived in parts by properties from CIDOC-CRM [10], Schema.org [11], and LRMoo [12]. The triple store maintains the user-selected (upper case) and user-generated (lower case) tags via

References

[1] C. Fiesler, S. Morrison, A. S. Bruckman, An archive of their own: A case study of feminist HCI and values in design, in: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2016.
[2] OECD, Derived data element, 2005.
[3] P. Boot, Mesotext: Digitised Emblems, Modelled Annotations and Humanities Scholarship, Amsterdam University Press, 2009.
[4] J. Jett, B. Capitanu, D. Kudeki, T. Cole, Y. Hu, P. Organisciak, T. Underwood, E. Dickson Koehl, R. Dubnicek, J. S. Downie, The HathiTrust Research Center Extracted Features Dataset, 2020. doi:10.13012/R2TE-C227.
[5] A. Piper, The CONLIT Dataset of Contemporary Literature, Journal of Open Humanities Data 8, 24, Ubiquity Press, 2022. doi:10.5334/johd.88.
[6] S. Luoto, A. van Cranenburgh, Psycholinguistic dataset on language use in 1145 novels published in English and Dutch, Data in Brief 34, 106655, 2021. doi:10.1016/j.dib.2020.106655.
[7] C. Schöch, M. Hinzmann, J. Röttgermann, K. Dietz, A. Klee, Smart Modelling for Literary History, International Journal of Humanities and Arts Computing 16, Edinburgh University Press, 2022. doi:10.3366/ijhac.2022.0278.
[8] F. Fischer, I. Börner, M. Göbel, A. Hechtl, C. Kittel, C. Milling, P. Trilcke, Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama, 2019. doi:10.5281/zenodo.4284002.
[9] L. Price, Fandom, folksonomies and creativity: the case of the archive of our own, 2019.
[10] M. Doerr, The CIDOC CRM, an Ontological Approach to Schema Heterogeneity, in: Y. Kalfoglou, M. Schorlemmer, A. Sheth, S. Staab, M. Uschold (Eds.), Semantic Interoperability and Integration, Dagstuhl Seminar Proceedings (DagSemProc) 4391, Dagstuhl, Germany, 2005. doi:10.4230/DagSemProc.04391.22.
[11] R. V. Guha, D. Brickley, S. Macbeth, Schema.org: evolution of structured data on the web, Communications of the ACM 59, 2016.
[12] P. Riva, M. Žumer, FRBRoo, the IFLA library reference model, and now LRMoo: a circle of development, in: IFLA WLIC 2018 - Kuala Lumpur, Malaysia - Transform Libraries, Transform Societies, 2017.
[13] X. Yang, F. Pianzola, F. Pannach, The Golem Ontology: Theoretical and data-driven modelling of narrative and fiction, in preparation, 2024.