Non-Canonical Acts and their Topical Distribution

Christian Vrangbaek, Eva Vrangbaek, Márton Kardos, Kristoffer Nielbo, Jacob Mortensen

Aarhus University, Ringgaden 1, 8000 Aarhus C, Denmark

ISSN 1613-0073

Keywords: Apocrypha, New Testament Studies, topic modelling, classification

ORCID: 0009-0005-2011-3082 (C. Vrangbaek), 0000-0002-4941-3434 (E. Vrangbaek), 0000-0001-9652-4498 (M. Kardos), 0000-0002-5116-5070 (K. Nielbo), 0000-0002-7153-3707 (J. Mortensen)

This paper investigates how topic modelling can be used to characterize and place four apocryphal, i.e. non-canonical, "Acts stories" in a corpus of ancient Greek texts. In the research field of New Testament Apocrypha, there remains uncertainty concerning the classification of apocryphal texts. The analysis serves the purpose of creating a structured ontology to be used in classifying New Testament Apocrypha. We attempt to show that topic modelling can be a viable tool in classifying and characterizing these texts. The results show that a) our four target texts of non-canonical "Acts stories" are ambiguous and multifaceted in their topical distribution compared to other texts in the corpus, and b) topic modelling is a viable tool in this analysis.

Introduction

In the field of New Testament Studies, the classification of the heterogeneous group of non-canonical texts, i.e. apocrypha, is disputed. The main problem is that the taxonomy of modern scholars largely reproduces the ancient classifications that were shaped during the 4th- and 5th-century debate on the canon. This tends to lead to a binary classification of texts as either canonical or non-canonical and, moreover, to the use of conceptualized labels such as "Gnosticism" and "Encratite", which do not do justice to the complexity of the topical variety in these texts [1,2,3]. To contribute to this debate, we want to create an ontology, i.e., a structured framework, out of (apocryphal) textual data, with the overarching goal of establishing a computationally driven classification system for New Testament Apocrypha within the context of the semantic web. We investigate the topical distribution of four non-canonical acts. Non-canonical acts are stories from roughly the 2nd-3rd century CE about the legendary deeds and speeches of early Christian apostles [4,2,5]. In this paper, we investigate the stories called Acta Joannis, Acta Thomae, Acta Barnabae and Acta Philippi [6]. These texts have been chosen as test cases that form the basis for including more texts. By building a structured ontology, we want to contribute to defining categories, textual characteristics, and relationships between texts. While the creation of this ontology extends beyond the scope of this paper, our present study of integrating topic modelling into research on New Testament Apocrypha serves as a crucial step towards this endeavor. By utilizing ontological relationships and semantic annotations, in this context in the form of topical distributions, this study explores the potential for integrating the classical theological concepts present in New Testament Apocrypha with modern computational methods, thereby bridging the gap between traditional textual analysis and advanced semantic technologies within the context of the semantic web.
Our working question is: how can this model contribute to the discussion of classifying non-canonical acts in a corpus of ancient Greek texts?

Data and Methods

Our starting point is that digital methods are not quick and magic tools for solving complex questions [7]; rather, we find that constant and critical exchange between traditional and new methods in qualitative collaboration is the way forward [8,9,10]. Because this study originates from a research discipline in which computational methods are not embedded, we find it important to provide a detailed description of the methodological process, even at the risk of tedium, since we believe that a practical, how-to style can help to bridge the disciplinary divides.

Text Corpus and Preprocessing

The first step is to provide our text corpus. In our experiments, the corpus serves as the literary context for the target texts, for which reason a historical relation between the corpus and the target texts is needed. Our target texts, the four apocryphal acts, are written in Ancient Greek, so we have gathered a database of 2153 Ancient Greek texts. These texts are retrieved from the Perseus Corpus, First1KGreek, Pseudepigrapha.org and Deutsche Bibelgesellschaft [11,12]. The corpus was selected based on its relevance for our target texts, and it largely sets the parameters for our topic modelling at the later stages of the experiments. The Ancient Greek texts are in a solid machine-readable state. To prepare the corpus for calculations, we perform a set of preprocessing steps so that the text is cleaned and lemmatized. We remove Latin characters, digits, extra whitespace and stop words. The textual cleaning follows the logic of standard natural language processing utilities. The models for cleaning and parsing the text were built with a transformer-based pipeline called OdyCy [13]. This is a single-transformer pipeline with the following workflow: the transformer is built on Ancient Greek BERT [14], and the parser, morphologizer and lemmatizer follow the infrastructure of spaCy [15]. Because the input text had already been preprocessed before topic modelling, we set max_df and min_df to 1.0, since we did not want to ignore any terms in this experiment [16]. When the textual preprocessing is done, we move on to vectorization. Vectorization is a highly qualitative choice, almost a language-philosophical step, in which we must decide how to represent our textual corpus as numerical vectors [17].
In this case, we employ term frequency-inverse document frequency (TF-IDF), which weights a term's frequency in a document by the logarithmically scaled inverse of the share of corpus documents that contain the term [18,17,16].
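The cleaning and weighting steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the stop-word list, the function names `clean` and `tf_idf`, and the three toy sentences are hypothetical, lemmatization (done with OdyCy in the study) is left out, and the weight is the textbook form tf × log(N / df). Library implementations such as scikit-learn's TfidfVectorizer apply additional smoothing and normalization by default, and there an integer min_df of 1 is an absolute document count while a float is read as a proportion of documents.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"καί", "ὁ", "δέ"}


def clean(text):
    """Remove Latin letters and digits, collapse whitespace, drop stop words."""
    text = re.sub(r"[A-Za-z0-9]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log-scaled inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy input (not taken from the corpus of the study).
raw_texts = ["ὁ θεός καί ὁ λόγος 12", "θεός καί πόλις abc", "πόλις   πόλεμος"]
docs = [clean(text) for text in raw_texts]
weights = tf_idf(docs)
```

In the toy input, a term that occurs in only one document receives a higher weight than a term shared by two documents, which is the behaviour that makes TF-IDF useful for separating texts by their characteristic vocabulary.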

Topic Modelling: Non-Negative Matrix Factorization

Topic modelling is a much-utilized tool in natural language processing classification tasks, mostly for the purpose of assigning a category to a text in a corpus based on its most probable topic [19,20,21]. This is also partly our aim, although we see possibilities of detecting more complex layers in the topical distribution than a text merely belonging to one single category; topic modelling gives due credit to the complex topical distribution and its significance for the position of our target texts in the corpus. Topic modelling is also a way of structuring our textual data for inclusion in a future knowledge graph and similar semantic web technologies. Topic modelling assumes that a corpus of texts consists of topics and that these topics are composed of the words of the corpus. We utilize the kind of topic modelling called Non-negative Matrix Factorization (NMF) [22,23]. NMF is an approach that decomposes a high-dimensional term-document matrix, i.e., a matrix with the corpus documents in columns, the words in rows, and their occurrence values in the cell entries, into two lower-dimensional non-negative matrices. The occurrence values are, as mentioned above, chosen to be TF-IDF. Based on optimization, NMF calculates topical patterns by associating words with topics and topics with documents. The NMF model, so to say, produces latent topical patterns by grouping similar and co-occurring words in the corpus [24]. These topics are then traced back to the documents. The number of topics is important for the analytical task. When we chose too many topics, the output had no interpretable coherence, so based on our knowledge of the corpus and a trial-and-error process, we found that 10 topics gave the most robust output [25].
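To make the decomposition concrete, the following sketch factorizes a tiny term-document matrix V (terms in rows, documents in columns) into a term-topic matrix W and a topic-document matrix H, so that V ≈ W · H. It uses the multiplicative update rules associated with Lee and Seung [22], which is one common way to fit NMF; the text does not state which solver was used in the study, so the algorithm choice, the function names, and the toy matrix are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W.H, i.e. the reconstruction error."""
    approx = matmul(w, h)
    return sum(
        (a - b) ** 2 for row_v, row_a in zip(v, approx) for a, b in zip(row_v, row_a)
    ) ** 0.5


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factorise V (terms x docs) into W (terms x k) and H (k x docs)."""
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + eps for _ in range(n_docs)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_docs)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_terms)]
    return w, h


# Toy matrix: terms 0-1 occur mostly in documents 0-1, terms 2-3 in documents 2-3.
V = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.1, 1.0, 1.0],
]
W, H = nmf(V, k=2)
```

Because the updates only multiply non-negative numbers, W and H stay non-negative throughout, which is what allows each column of W to be read as a weighted word list and each column of H as a document's topical distribution. Running the factorization for several values of k and comparing both the reconstruction error and the interpretability of the word lists is one way to carry out the kind of trial-and-error selection of the topic number mentioned above.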

The 10 topics and a selection of their top words are given in the topic table. [...] like, for example, Jewish Philosophy, Tragedies and New Testament Gospels. These genre categories did not influence the topic model's calculations, but we can use them as a navigational tool to interpret the topics. For example, Topic 7 and Topic 6 have overlapping words about justice and city, but the texts dominated by Topic 6 are rhetorical texts like those of Demosthenes, whereas Topic 7 dominates many of Philo of Alexandria's texts as well as Libanius' Declamationes, which are more philosophical. These extra steps enable us to distill the words of a topic into a qualified label or notion that describes it.
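The two interpretive aids used here, a topic's top words and the texts that a topic dominates, can both be read directly from the factor matrices. The sketch below assumes a term-topic matrix W and a topic-document matrix H as produced by NMF; the function names, the numbers, and the placeholder titles are hypothetical and do not reproduce results from the study.

```python
def top_words(w, vocab, topic, n=5):
    """Return the n highest-weighted terms of one topic (one column of W)."""
    ranked = sorted(range(len(vocab)), key=lambda i: w[i][topic], reverse=True)
    return [vocab[i] for i in ranked[:n]]


def dominated_documents(h, titles, topic):
    """Return the titles whose largest topic weight belongs to the given topic."""
    result = []
    for j, title in enumerate(titles):
        column = [h[t][j] for t in range(len(h))]
        if column.index(max(column)) == topic:
            result.append(title)
    return result


# Toy values (hypothetical): 4 terms x 2 topics, and 2 topics x 3 documents.
vocab = ["νόμος", "δίκη", "πόλις", "θεός"]
W = [[0.9, 0.1], [0.7, 0.2], [0.5, 0.4], [0.0, 0.8]]
titles = ["text A", "text B", "text C"]
H = [[0.8, 0.3, 0.1], [0.2, 0.7, 0.9]]

law_words = top_words(W, vocab, topic=0, n=2)
second_topic_texts = dominated_documents(H, titles, topic=1)
```

Two topics can share top words, as Topic 6 and Topic 7 do, and still be told apart by listing the texts each one dominates; that is the step in which genre labels help with naming a topic.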

Analysis 1: Topical Distribution of Four Non-Canonical Acts Stories

In our first analysis, we interpret the topical distribution of the Acta Joannis, Acta Thomae, Acta Barnabae and Acta Philippi [27]. The results can be seen in Figure 1, which shows four pie charts of the topical distribution of the four non-canonical acts. Overall, all four target texts are dominated by Topic 2, and they generally share similar groups of topics, but the distribution is not equal. If we wanted to group the target texts together in the corpus on a coarse level, they would be placed in the group of Topic 2. However, this would not be a discovery. Where topic modelling can lead us further is in the presence and distribution of the minor topics compared to each other. The topics of Acta Joannis are distributed mainly over Topics 2, 1, 9, 3, 0, 4 and 6. Topic 1 is, like Topic 2, a theological topic. The presence of Topics 9, 3 and 4 reveals something about the content of the story, since these topics represent philosophically oriented words that show how the story engages with Hellenistic religion and philosophy. Topic 0 is the historical-political topic, which is understandable since the text narrates sequences of events and speeches. We also see a small portion of Topic 6, which concerns justice and law and resonates with a few scenes in the narrative. From the topical distribution, we can generally characterize Acta Joannis as being situated in a Jewish-Christian theological context where Hellenistic anthropocentric philosophy and religion are also present. The topical distribution of Acta Thomae is similar to that of Acta Joannis, with large Topics 2 and 1, but Acta Thomae has a more equal distribution between its subtopics 3, 9, 4 and 0. The story of Acta Thomae is set in legendary India, where Thomas is sent off as a missionary.
From manual reading, Acta Thomae and Acta Joannis are comparable in their mix of narrative and preaching, where the preaching might draw in the more philosophically weighted topics, and this relation is also visible in the topical distribution. When we inspect the topics of Acta Philippi, the distribution is markedly different from that of the two previous narratives, with Topic 0, the historical-political topic, being the third most prominent. Although the content of Acta Philippi has a much more adventurous and mythical tone, with, for example, the protagonists arriving at a city of snakes, the structural dynamic of this narrative is dominated by sequential narration in which we follow event after event [5,2]. This makes Acta Philippi a drier, journal-like text. The large presence of Topic 0 is also visible in the topical distribution of Acta Barnabae. This circumstance is probably linked to the fact that Acta Barnabae is situated in a church-political context of legitimizing the so-called autocephalous, i.e., independent, ecclesiastical unit in Cyprus in the 4th century [28].
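A topical distribution of the kind shown in the pie charts can be obtained by normalizing one document's topic weights so that they sum to 1 and then ranking the topics by their share. The sketch below assumes a topic-document matrix H with topics in rows and documents in columns; the function names, the 5% cut-off for minor topics, and the numbers are hypothetical and are not the values behind Figure 1.

```python
def topic_shares(h, doc_index):
    """Normalise one document's topic weights so that they sum to 1."""
    column = [row[doc_index] for row in h]
    total = sum(column)
    if total == 0:
        return [0.0] * len(column)
    return [value / total for value in column]


def ranked_topics(shares, threshold=0.05):
    """Topics ordered by share, keeping only those at or above the threshold."""
    order = sorted(range(len(shares)), key=lambda t: shares[t], reverse=True)
    return [(t, shares[t]) for t in order if shares[t] >= threshold]


# Toy topic-document matrix (hypothetical): 4 topics x 2 documents.
H = [
    [0.20, 0.6],
    [0.50, 0.1],
    [1.25, 0.9],
    [0.05, 0.0],
]
shares = topic_shares(H, doc_index=0)
ranking = ranked_topics(shares)
```

Comparing two texts then amounts to comparing their ranked lists: two texts may share the same dominant topic and still differ in which minor topics follow and how evenly these are spread.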

Analysis 2: Position in Corpus

In the second part of the analysis, we address how the topical distribution analyzed in the previous part affects the texts' position in the corpus. The position of each of the four target texts is visualized in Figure 2. The clusters are formed from texts with a shared dominant topic. Texts with an almost unequivocal dominance of one topic, e.g., Aristotle's Problemata and the Book of Jeremiah, are placed on the outskirts of each colored cluster, pulling away from the center of complexity. Conversely, texts that are close to the center and to other topical groups display diversity and complexity in their topical distribution. Some clusters appear as relatively demarcated cone shapes, e.g. Topic 4 of Hellenistic religion and philosophy and Topic 2 of god, people, human, whereas Topic 3 of Greek element-philosophy is flatter, which indicates that this topic is dispersed more evenly in the corpus. The topical distribution on the corpus level gives a navigational tool with which texts can be grouped with relative clarity. For example, it is noteworthy that the classic and almost foundational texts of the Ancient Greek language, the Iliad and the Odyssey, are clearly situated in Topic 4 of Greek religion and philosophy, and that later texts that try to imitate them, like Tryphiodorus' Sack of Troy (4th century CE) and Nonnus' Dionysiaca from the 5th century CE, almost a thousand years after Homer, have an almost identical topical distribution. All of the Acts stories are placed in the dark green cluster of Topic 2, the topic of god, people, human. But it is noticeable that they all show diversity in the distribution. Acta Joannis and Acta Thomae are placed more securely in the Topic 2 cluster, whereas Acta Philippi and Acta Barnabae are drawn toward the center, which might be explained by their higher percentage of the historical-political Topic 0.
When texts are placed in this corpus, it becomes important to reiterate the basic assumption of topic modelling: that all topics are produced from the words in the corpus. This means that the corpus words are constitutive of the created topics, which calls for a qualitative choice of corpus texts. Our corpus consists, as mentioned, of texts that historically have shaped our four target texts, either directly or indirectly, because the (mostly anonymous) authors were educated people in the Greco-Roman world who can be assumed to have had basic knowledge of the texts in their historical and literary context. The method of topic modelling, then, almost retraces the literary world of the authors of our texts, with the important acknowledgement that we do not have all the texts that made up the authors' literary context.
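The relation between topical dominance and position can be illustrated with a simple layout. The paper does not specify how the scatterplot coordinates were computed, so the following is only a sketch under an assumed projection: each topic is given an anchor point on the unit circle, and a text is placed at the average of the anchors weighted by its topic shares. All function names and numbers are hypothetical.

```python
import math


def topic_anchors(n_topics):
    """Place one anchor point per topic evenly on the unit circle."""
    return [
        (math.cos(2 * math.pi * t / n_topics), math.sin(2 * math.pi * t / n_topics))
        for t in range(n_topics)
    ]


def project(shares, anchors):
    """Position a text at the share-weighted average of the topic anchors."""
    x = sum(s * ax for s, (ax, _) in zip(shares, anchors))
    y = sum(s * ay for s, (_, ay) in zip(shares, anchors))
    return x, y


def distance_from_centre(point):
    """Euclidean distance of a projected text from the origin."""
    return math.hypot(*point)


anchors = topic_anchors(4)
single_topic = project([1.0, 0.0, 0.0, 0.0], anchors)  # one topic only
mostly_one = project([0.7, 0.1, 0.1, 0.1], anchors)    # dominant topic plus minor ones
evenly_mixed = project([0.25, 0.25, 0.25, 0.25], anchors)
```

Under this assumed layout, a text governed by a single topic sits on the rim, an evenly mixed text falls at the center, and a text with one dominant topic and several minor ones lies in between, which mirrors the description of single-topic texts on the outskirts and diverse texts near the center. A two-dimensional layout of this kind cannot separate every mixture, since opposite topics pull in opposite directions and cancel, which is one reason a single frozen angle of a multidimensional space should be read with care.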

Concluding Remarks

In this analysis assisted by topic modelling, we were able to characterize and place four non-canonical acts based on their topical distribution. The topical distribution of the analyzed texts will be used to map ontological relationships and enhance semantic annotations in order to classify New Testament Apocrypha; in such a classification, the topical distribution is a major component. The results of this topic modelling analysis can thus be included in a future New Testament Apocryphal Ontology. The navigational advantages of topic modelling allowed us to inspect the target texts in a qualitatively selected corpus consisting of texts from a similar historical and literary horizon. This ensured meaningful topics. Instead of characterizing and classifying the texts based on abstractions and taxonomy from the 4th- and 5th-century church-political discussions on canonization, we could characterize and classify them, or at least contribute to these tasks, on the basis of raw content, both on a small, close scale in the topical distribution from text to text and on a larger scale based on the entire corpus.

Figure 1: Pie chart visualization of topical distribution of the four target texts in the corpus of Ancient Greek texts. From top left to bottom right the four Acts stories are marked in the order Acta Joannis, Acta Thomae, Acta Philippi, Acta Barnabae.

Figure 2: Four scatterplot snapshots of the same corpus with different markings of a target text. The plots' positions are based on their topical distribution. From top left to bottom right the four Acts stories are marked in the order Acta Joannis, Acta Thomae, Acta Philippi, Acta Barnabae. The figure is a frozen image from one angle of a multidimensional space. The scatterplot is created based on the topical distribution of the four texts.

Acknowledgments

The research in this article is funded by the Carlsberg Foundation in the Semper Ardens: Accelerate-project "Computing Antiquity: Computational Research in Ancient Text Corpora." We would also like to thank Deutsche Bibelgesellschaft, Pseudepigrapha.org, The Perseus-Project and First1KGreek for providing texts.

Topics and a selection of their top words

Topic 0: ..., Hellene [πόλις, πόλεμος, βασιλεύς, ἕλλην]
Topic 1: God, soul, heaven, word, Christ [θεός, ψυχή, ..., χριστός]
Topic 2: god, lord, people, human being [θεός, κύριος, λαός, ἄνθρωπος]
Topic 3: body, matter, air, earth, water, fire [σῶμα, ὕλη, ...]
Topic 4: Zeus, ..., Cypris, Apollo [ζεύς, παῖς, ἔρως, ..., ἀπόλλων]
Topic 5: part, character, being, necessity, cause [μόριον, τρόπος, οὐσία, ἀνάγκη, αἰτία]
Topic 6: law, justice, city, witness, possession [νόμος, δίκη, πόλις, μάρτυς, χρῆμα]
Topics 7-9: ...

References

[1] O. Lehtipuu, S. Petersen, Ancient Christian Apocrypha, SBL Press, Atlanta, 2023. doi:10.2307/j.ctv2rh2cqj.
[2] F. Bovon, Die kanonische Apostelgeschichte und die apokryphen Apostelakten, in: Die Apostelgeschichte im Kontext antiker und frühchristlicher Historiographie, volume 162, Walter de Gruyter, Berlin, New York, 2009. doi:10.1515/9783110216325.
[3] D. Martin, New Testament apocrypha: Introduction and critique of a modern category, in: C. W., J. C. Edwards, C. Evans (Eds.), Ancient Christian Apocrypha, Zondervan Academic, Grand Rapids, 2022.
[4] H.-J. Klauck, The Apocryphal Acts of the Apostles: An Introduction, Baylor Univ. Press, 2008.
[5] F. Bovon, Canonical and apocryphal acts of apostles, Journal of Early Christian Studies 11 (2003).
[6] M. Bonnet, R. A. Lipsius, Acta Apostolorum Apocrypha. Vols. 1-3, Mendelssohn, Leipzig, 1903.
[7] I. Uglanova, E. Gius, The order of things. A study on topic modelling of literary texts, in: Proceedings of the Workshop on Computational Humanities Research (CHR 2020), Amsterdam, the Netherlands, 2020.
[8] P. Molitor, J. Ritter, Digital methods for intertextuality studies, it - Information Technology 62 (2020). doi:10.1515/itit-2020-0006.
[9] D. Tenen, Blunt instrumentalism: On tools and methods, in: M. K. Gold, L. F. Klein (Eds.), Debates in the Digital Humanities 2016, University of Minnesota Press, 2016.
[10] E. E. H. Vrangbaek, K. L. Nielbo, Composition and change in De ciuitate Dei: A case study of computationally assisted methods, in: Papers presented at the Eighteenth International Conference on Patristic Studies held in Oxford 2019, Peeters, 2021.
[11] Code for corpus, 2023.
[12] Code for parser, 2023.
[13] J. Kostkan, M. Kardos, J. P. B. Mortensen, K. L. Nielbo, OdyCy - a general-purpose NLP pipeline for Ancient Greek, in: Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Association for Computational Linguistics, Dubrovnik, Croatia, 2023. Accessed: Jun. 02, 2023.
[14] P. Singh, G. Rutten, E. Lefever, A pilot study for BERT language modelling and morphological analysis for Ancient and Medieval Greek, in: N. R., S. Degaetano-Ortlieb, A. Kazantseva, S. Szpakowicz (Eds.), Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021. doi:10.18653/v1/2021.latechclfl-1.15.
[15] M. Honnibal, I. Montani, S. V. Landeghem, A. Boyd, spaCy: Industrial-strength natural language processing in Python, IEEE, 2020. doi:10.5281/zenodo.1212303.
[16] L. Buitinck, API design for machine learning software: experiences from the scikit-learn project, in: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013.
[17] A. Wendland, M. Zenere, J. Niemann, Introduction to text classification: Impact of stemming and comparing TF-IDF and count vectorization as feature extraction technique, in: R. M., M. Yilmaz, P. Clarke, M. Reiner (Eds.), Communications in Computer and Information Science, volume 1442, Springer International Publishing, Cham, 2021. doi:10.1007/978-3-030-85521-5_19.
[18] D. D. Mehare, Introduction to TF-IDF: To represent importance of keyword within whole dataset, IJRASET 6 (2018). doi:10.22214/ijraset.2018.3369.
[19] T. Sherstinova, O. Mitrofanova, T. Skrebtsova, E. Zamiraylova, M. Kirina, Topic modelling with NMF vs. expert topic annotation: The case study of Russian fiction, in: L. Martínez-Villasenor, O. Herrera-Alcántara, H. Ponce, F. A. Castro-Espinoza (Eds.), Advances in Computational Intelligence, Lecture Notes in Computer Science, volume 12026, Springer International Publishing, Cham, 2020. doi:10.1007/978-3-030-608
[20] P. Kherwa, P. Bansal, Topic modeling: A comprehensive review, EAI Endorsed Transactions on Scalable Information Systems 7 (2019). doi:10.4108/eai.13-7-2018.159623. Accessed: Mar. 12, 2024.
[21] M. Gillings, A. Hardie, The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice, Digital Scholarship in the Humanities 38 (2023). doi:10.1093/llc/fqac075.
[22] D. D. Lee, H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999). doi:10.1038/44565.
[23] J. Albrecht, S. Ramachandran, C. Winkler, Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications, 1st ed., O'Reilly Media, Sebastopol, CA, 2021.
[24] N. Gillis, The why and how of nonnegative matrix factorization, 2014.
[25] S. Vajjala, B. Majumder, A. Gupta, H. Surana, Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems, 1st ed., O'Reilly Media, Beijing, Boston, Farnham, Sebastopol, Tokyo, 2020.
[26] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, D. M. Blei, Reading tea leaves: How humans interpret topic models, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, NIPS 2009, 2009.
[27] Code for topic modelling, 2023.
[28] F. Cairns, An early Byzantine pseudepigraphon: the apocryphal Acta Barnabae, Byzantinische Zeitschrift 112 (2019). doi:10.1515/bz-2019-0004.