Quotes at the fingertips: The combined approach of the BogoSlov project towards identification of Biblical material in Old Church Slavonic texts

Martin Ruskov1, Tomáš Mikulka2, Irina Podtergera3, Maxim Gavrilkov3 and Walker Thompson3

1 Department of Languages, Literatures, Cultures and Mediations, University of Milan, Piazza Sant’Alessandro 1, 20123 Milan, Italy
2 Catholic Theological Faculty, Charles University, Thákurova 3, 16000 Prague, Czech Republic
3 Institute of Slavic Studies, University of Heidelberg, Schulgasse 6, 69117 Heidelberg, Germany

Abstract
Scriptural quotations shaped and influenced the orthodox Slavic world by laying the groundwork for historical and symbolic exegesis through Old Church Slavonic (OCS) texts. The correct identification of biblical quotations is of the utmost importance for the textological as well as functional analysis of such texts. In this paper, we present a computer-assisted approach towards identifying quotations proposed by the BogoSlov project. This approach aims to combine two distinct methods of quotation identification. These are: 1) explicit rule-based algorithms and 2) quantitative embeddings from implicit language models. To make these accessible to Slavists and theologians, we aim to integrate them into a graphical user interface (GUI) built on best practices from related fields and facilitating the identification and validation of quotations, allowing as short as possible a feedback loop between expert and machine.

Keywords
Palaeoslavistics, computer-supported collaborative work, short text similarity, text reuse identification

1. Introduction
Biblical texts shaped and influenced Slavia orthodoxa by laying the groundwork for historical and symbolic exegesis, i.e., the interpretation and understanding of facts and narratives, such that historical events were seen in the light of Scriptural prototypes [1].
The Bible supplied the core quotations through which this exegesis was manifested in Old Church Slavonic (OCS) texts. Therefore, the universal and correct identification of biblical quotations is of the utmost importance for textological as well as functional textual analysis. However, this task is fraught with challenges, demanding a nuanced approach that balances philological rigour with an understanding of the broader intellectual and theological context.

Although biblical quotations have been studied for decades [2, 3], attempts to build a software tool for the automatic identification of biblical quotations have so far been limited and at best offer support for manual annotation by experts. No standalone solutions are yet available [4, 5] of the kind that have recently emerged in other contexts of text reuse [6]. Yet such technological support for quotation identification could be very impactful for Palaeoslavistics and shed light on previously unknown intertextual relations, text history, and meanings. Among other things, it could facilitate the investigation of the dual phenomena of inter- and hypertextuality, whereby texts drew on common sources and came to reference each other or themselves internally. These phenomena are strongly present in medieval Slavonic Patristic and liturgical texts owing to their extensive citation of the Scriptures; having a tool to identify such quotations would make it possible to study such aspects of these texts more deeply from a semiotic point of view.

IRCDL 2025: 21st Conference on Information and Research Science Connecting to Digital and Library Science, February 20-21 2025, Udine, Italy
Email: martin.ruskov@unimi.it (M. Ruskov); mikut5af@ktf.cuni.cz (T. Mikulka); irina.podtergera@slav.uni-heidelberg.de (I. Podtergera)
Web: https://islab.di.unimi.it/team/martin.ruskov@unimi.it (M. Ruskov); https://www.slav.uni-heidelberg.de/personal/ipodtergera.html (I. Podtergera)
ORCID: 0000-0001-5337-0636 (M. Ruskov); 0000-0003-4362-2531 (T. Mikulka); 0000-0001-9098-2746 (I. Podtergera); 0000-0001-7656-1590 (M. Gavrilkov); 0000-0002-7203-9508 (W. Thompson)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

In other historical contexts, various techniques of automatic text reuse detection have been employed, from statistical and algorithmic methods [7, 8, 9] to machine learning and transformer models [10]. Only recently have some of these started to adopt human-machine interaction techniques as well [6], but they have stopped short of making their approaches adoptable in other domains.

In the BogoSlov project (acronym for Biblical OriGins in slavOnic texts – Systems for Language-modelled Observation and Verification), we set out to develop experimental software infrastructure to assist the identification and listing of Biblical quotations in OCS texts. This infrastructure is intended to include tools for annotation, algorithms and language models, combined in a software toolset to support scholars in their efforts to identify and verify Biblical quotations.

However, due to the varying degree of literality, identification of Biblical quotations is not a straightforward task. Regardless of their origin, i.e. direct or indirect quotation (imported by the patristic source, which can be difficult to identify), quotations in general can vary in their fidelity to the original. This variation ranges from exact reproductions explicitly marked by the author, through slightly modified versions that reflect either a conscious adaptation to fit the surrounding text or a transmission error, to subtle and barely noticeable thematic allusions, which invoke Biblical themes or ideas without directly quoting the text.
At this less direct end of the spectrum, two specific challenges emerge. First, such allusions are hard to recognise because they diverge from the precise language of the original Biblical text. Secondly, there is a great deal of lexical and grammatical variation in translations of biblical texts and quotations. These issues delineate the limits of any attempt to identify quotations merely by comparing texts.

2. Project Overview
Since it would require considerable effort to build a universal tool that could identify all types of biblical quotations in any type of OCS text, the BogoSlov project is organised around two pilot studies, in which the technology will be developed and tested on well-researched types of text that are rich in biblical quotations, namely well-known homilies and hagiography. Thus, one pilot case study will focus on Vita Constantini and Vita Methodii, and in a second case study a small number of less researched texts [11] will be analysed to identify citation-related clues that might help better contextualise the texts historically.

To accomplish this, two datasets will be created within the project and integrated into an interlinked database. The first of these datasets is a corpus of known biblical quotations, an idea first elaborated by Naumow [2]. The second is a machine-readable, tagged database of OCS biblical texts.

Within the project, two major biblical sources of quotations (primary texts) will be examined: the Psalter and the Gospels, both of which are frequently quoted in medieval texts. The Psalter text is quite stable and was often known by heart at the time these texts were written, suggesting a well-established and consistent tradition. Due to this stability, quotations from the Psalter are generally easier to identify, making them suitable for an approach utilising explicit algorithmic modelling based on string similarity and longest common subsequence [12, 13].
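To illustrate the kind of explicit string comparison mentioned above, the following is a minimal sketch of a similarity score based on the longest common subsequence (LCS). It is not the project's implementation: the function names, the normalisation by the longer string and the threshold value are assumptions chosen for illustration only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string, in [0, 1]."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def rank_candidates(
    snippet: str, verses: dict[str, str], threshold: float = 0.6
) -> list[tuple[str, float]]:
    """Return (reference, score) pairs at or above the threshold, best first.

    `verses` maps a hypothetical verse reference to its text; the default
    threshold is an arbitrary illustrative value, not a tuned one.
    """
    scored = [(ref, lcs_similarity(snippet, text)) for ref, text in verses.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

With placeholder strings, `rank_candidates("abcdxf", {"x": "abcdef", "y": "zzzzzz"})` keeps only the entry `"x"`, since five of its six characters survive in order. In practice the threshold and the segmentation of the texts into comparable snippets would have to be calibrated against quotations already verified by experts.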
In contrast, Gospel quotations present a more complex challenge. This complexity arises first from the nature of the synoptic Gospels, which share many common passages, allowing the wording to be influenced by prevailing traditions. As a result, within the Christian oral tradition, there exists a form of "evangelical harmony", where it becomes difficult to strictly separate nuances originating from specific Gospels. A second source of lexical variation, which applies not only to Gospel quotations, is the fact that an important part of OCS literature was translated from Greek (and less often Latin) texts. In the process of translation, lexical variation was introduced, even within individual texts. To identify quotations with such variation, or allusions, the word embeddings of language models allow for the study of semantic, rather than syntactic, similarity.

3. Proposed Approach
The BogoSlov project builds on the idea that quotation identification should be a computer-assisted process, meaning that this work should still be driven by theologians, philologists and medievalists, but whenever possible, tedious tasks should be (semi-)automated. With this premise, we set out to research three parallel directions: 1) helping experts efficiently perform manual identification and validation of quotations on the one hand, and two automated approaches to the detection of potential quotations on the other: 2) explicit algorithms for rule-based suggestion of possible quotations, and 3) language models for suggestion of potential allusions via implicit semantic modelling. These three will be brought together through a common data representation that allows the exchange of results between them. This combined approach aims to allow for a feedback loop between expert and machine that is as short as possible.

3.1. Data Model
Combining the three aforementioned approaches, which are detailed in the following sections, is only feasible if it is supported by a reproducible and interpretable quotation representation format. A database needs to allow for the systematic storage, querying and user annotation of the identified quotation candidates. Approaches in related research [3, 14] tend to gravitate towards the use of XML formats inspired by TEI (Text Encoding Initiative). While at this point this exact format does not present particular advantages to the current effort, compatibility with it would guarantee possible future exchange of results.

Of particular interest to us is following a related established standard for text localisation, such as CTS URN [14]. These are strings (Uniform Resource Names) that contain a document identifier in their first part and text localisation (e.g. through the specification of a range) in the second. The CTS URN features the following components: urn:cts:NAMESPACE:WORK:PASSAGE, where PASSAGE could contain indications of start and end, separated by a dash (“-”), or optionally specific strings indicating a more precise subsection of the passage, separated by the at-sign (“@”) (footnote 1). A few examples from our context follow:

1. urn:cts:proiel:bible.marianus.matt:5.48 for the Gospel of Matthew 5:48 in the Codex Marianus manuscript as processed for the PROIEL treebank project (footnote 2).
2. urn:cts:titus:bible.zograph.jo:4.5@-4.5@ for a specific text in the Gospel of John 4:5 in the Codex Zographensis as presented in the TITUS corpus (footnote 3).
3. urn:cts:scripta-bulgarica:uchitelno-evangelie:5.20.cd@-5.20.cd@ for a specific text on page 20, columns C and D in the 5th sermon of Constantine of Preslav’s Didactic Gospel (Uchitelno Evangelie) as presented by the Scripta Bulgarica corpus, and externally proposed as corresponding to the previous example (footnote 4).
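The URN structure described above can be sketched as a small value class that parses a URN into its components and serialises it back. This is only an illustration of the simplified urn:cts:NAMESPACE:WORK:PASSAGE form used in this paper, not of the full CTS specification; the class name `CtsUrn`, its fields and the helper `split_subsection` are hypothetical, and the sketch assumes that subsection strings contain no dash.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    """Simplified urn:cts:NAMESPACE:WORK:PASSAGE address (illustrative only)."""

    namespace: str
    work: str
    start: str
    end: Optional[str] = None  # None when the passage is a single reference

    @classmethod
    def from_urn(cls, urn: str) -> "CtsUrn":
        # Split into at most five parts so the passage is kept whole.
        parts = urn.split(":", 4)
        if len(parts) != 5 or parts[0] != "urn" or parts[1] != "cts":
            raise ValueError(f"not a urn:cts:NAMESPACE:WORK:PASSAGE string: {urn!r}")
        _, _, namespace, work, passage = parts
        # Assumption: the first dash separates the start and end of a range.
        start, dash, end = passage.partition("-")
        return cls(namespace, work, start, end if dash else None)

    def to_urn(self) -> str:
        passage = self.start if self.end is None else f"{self.start}-{self.end}"
        return f"urn:cts:{self.namespace}:{self.work}:{passage}"


def split_subsection(point: str) -> tuple[str, Optional[str]]:
    """Split 'REFERENCE@SUBSECTION' into its two parts (subsection may be absent)."""
    reference, at, subsection = point.partition("@")
    return reference, (subsection if at else None)
```

For example, `CtsUrn.from_urn("urn:cts:proiel:bible.marianus.matt:5.48")` yields the namespace `proiel`, the work `bible.marianus.matt` and the single passage `5.48`, and `to_urn()` reproduces the input string. Keeping parsing and serialisation symmetric in this way is what would let a text selection and its address be converted into each other without loss.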
Figure 1: The conceptual data model, representing quotes in our database. [Diagram: two classes, Primary and Secondary, connected by a "quotes" association. Each class has the string attributes ns_work, URL, licence and passage, and the methods fromURN(string) and toURN(): string.]

Note that the online sources for the above examples do not support CTS URN, so it will be our responsibility to define the corresponding namespaces and link them to the original sources. With this premise, as indicated in Figure 1, quotations are required to be pairs of two similar objects representing text segments: one biblical and one medieval, e.g. the pair formed by examples 2 and 3 above. Each of these objects needs to be serialisable to and deserialisable from a URN, which combines a representation of the document and snippet attributes.

Footnotes:
1 i.e. urn:cts:NAMESPACE:WORK:PASSAGE@SUBSECTION
2 https://syntacticus.org/sentence/proiel:20180408:marianus:38407
3 https://titus.uni-frankfurt.de/texte/etcs/slav/aksl/zograph/zogra.htm?zogra071.htm#NT_Jo._4
4 http://scripta-bulgarica.eu/bg/sources/uchitelno-evangelie-na-konstantin-preslavski-tlkuvanie-vrhu-gl-4-ot-evangelie-na-yoan

3.2. Annotation
Due to the error-prone nature of the task of quotation identification, of key importance to the process is a user experience (UX) that allows experts to 1) browse for hypothetical partial (i.e. local only) alignments around the use of particular words or phrases, and 2) visualise and refine already identified quotations and allusions in context. Here, viewing a partial alignment is synonymous with contextual visualisation of a quotation. Previous research has focused on algorithms for automatic identification, but has stopped short of providing humanities researchers with an accessible interface to validate and refine automatically identified quotations or allusions [7].
In contrast, we believe that an efficient semi-automated text reuse detection process is only possible with a graphical user interface (GUI) which allows side-by-side viewing of biblical (primary) texts on one side and medieval (secondary) texts on the other, much along the lines of how this is done in text alignment software [15]. Furthermore, parallel search for specific vocabulary should be possible. This GUI-enabled tool should be able to flag potential quotations or allusions in a large input text, as well as allow for the search for short phrases or biblical references and return a list of known quotations, as discussed by Hue-Gay et al. [3]. This is intended to significantly speed up the initial stages of analysis as it is currently performed by experts. Yet the tool is intended for expert users: it provides only suggestions, and experts still need to verify the quotation candidates. Thus, the tool is only semi-automatic, and expert validation and detailed philological work remain indispensable. Human expertise is required to confirm the presence of a quotation, assess its degree of literality, and understand its function within the text.

In BogoSlov we will employ usability research to ensure that the interface of this tool is both efficient to use [16] and cognitively undemanding [17]. The collaborative workflow of existing tools, like MoreEver [18] and UE extractor [19], served as a starting point for the discussion of a possible interface that should allow users to manually locate and annotate potential quotations to support the training and verification of the language models. An early mock version of a quotation validation screen is shown in Figure 2.

Figure 2: The proposed review GUI which allows experts to select a quotation and refine the corresponding alignment.
The two panes have similar functionalities, with the left one referring to biblical corpora (primary texts) and the right one to medieval corpora (secondary texts). In the top-most search bar, users can search the corpora for the text of interest. When at least three characters are typed, an autocomplete drop-down list of matching texts appears to ease selection. The selection of a text specifies a URN up to the WORK section. Once a text is selected, the list below is populated with addresses of citation candidates from the text. Clicking on one of them updates the selection in the text below, its alignment on the other side, and the URN in the text box that represents the exact text selection. The text boxes for the two texts include the WORK and PASSAGE components of URNs, i.e. complete addresses of text ranges. Users will be able to make selections on the two texts or manually edit URNs. These actions should be interactively synchronised, so that when the selection changes, the URN is updated, and when the URN changes, the selection is updated. In other words, the text preview and the URN box offer two different affordances for quotation review and refinement.

3.3. Algorithms
There is a long-standing research tradition around text reuse identification through short text similarity algorithms [12, 13]. Typically, longer texts are broken down into sentences or text snippets of comparable length, and these short texts from the two corpora are then compared one pair at a time. Because OCS is a written tradition established late (compared to the classical languages), spans very broad language variation, and uses both the Glagolitic and Cyrillic alphabets, there are a number of accepted redactions. In other words, even common sounds and words have varying established spellings, which makes even simple lexical correspondence a challenging task.
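One conceivable way to reduce such spelling variation before any comparison is to normalise the text first. The sketch below is an assumption for illustration, not a method proposed by the project: it lowercases the text, strips combining marks such as the titlo, and collapses a handful of Cyrillic letter variants. The `VARIANTS` table is hypothetical and deliberately tiny; a usable table would have to be compiled and checked by philologists, and transliteration from Glagolitic is left out entirely.

```python
import unicodedata

# Hypothetical, deliberately small table of spelling variants.
VARIANTS = {
    "\u0461": "\u043e",        # omega -> o
    "\u0455": "\u0437",        # dze (zelo) -> ze
    "\u0456": "\u0438",        # decimal i -> i
    "\ua64b": "\u043e\u0443",  # monograph uk -> digraph o + u
}


def normalise(text: str) -> str:
    """Lowercase, drop combining marks, and collapse selected letter variants."""
    # NFD separates base letters from combining marks (titlo, breathings, ...).
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" covers nonspacing marks. Note that this also reduces
    # short i to plain i and yi to decimal i, which is then mapped to i.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return "".join(VARIANTS.get(ch, ch) for ch in stripped)
```

Two spellings that differ only in these respects then become identical strings, so that a subsequent similarity measure only has to account for the remaining lexical and morphological differences. Such normalisation discards information that may itself be philologically relevant, so the original spelling should be kept alongside the normalised form.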
Such variation calls for methods that allow greater flexibility than typical short text similarity techniques used in the context of contemporary languages, with tolerance to variation both in orthography and morphology.

3.4. Language Models
Even short text similarity algorithms that exhibit tolerance for greater variation cannot cope with rephrased texts or summarisation, which are typical for allusions [14]. In this context, recent developments in the field of language models could be helpful. In particular, semantic embeddings have emerged as a useful text quantification technique used in language models. Commonly these are word embeddings, but more recently other types, such as sentence embeddings, have also been shown to yield interesting results in text reuse detection [20, 10, 21].

On the downside, language models require corpus sizes that are unavailable for under-resourced languages such as OCS [22]. A viable alternative is training a multilingual model that includes the limited OCS resources available. Particularly interesting are results with multilingual models which exhibit improved performance over single-language ones [23, 24, 25]. Theoretically, this opens the possibility of addressing situations very typical in our context, where a Greek homily that contains Biblical quotations was later translated into OCS. Such translations lead to the need to partially align them to the OCS translation of a biblical text, which would be an even more ambitious text reuse detection task.

However, the choices surrounding the training of a multilingual model are not trivial. On the one hand, including contemporary texts introduces unwanted socio-historical biases [26]. This particular problem could be addressed by training a model dedicated to classical languages, making sure to include only corpora from relevant historical periods.
On the other hand, models trained on languages using multiple alphabets (as is the case for a mixed classical-language model) are known to achieve lower performance on the minority alphabets [27]. We are looking for ways to train RoBERTa [28] and Sentence-BERT [29] from scratch on corpora from Late Antiquity and the Early Middle Ages. One particular challenge is that OCS was written in the Glagolitic and Cyrillic alphabets with further variability related to the specific historical circumstances of the language. Consequently, we are interested in exploring how to reduce variation and/or bootstrap data by tackling the orthographic variation related to the different alphabets used in these languages.

4. Further Work
As previously mentioned, OCS texts of the Gospels underwent various revisions and refinements, sometimes aligning closely with the Greek standard and sometimes not. This led to a complicated array of textual variants circulating within the Slavic world. A comprehensive analysis of Biblical quotations in OCS texts often necessitates a comparative approach, involving not only the OCS translations themselves but also their Greek and Latin sources. By comparing the OCS text with these sources, researchers can identify the specific Biblical texts that were used, discern the influences and editorial decisions that shaped the text, and uncover any redactional layers that indicate later modifications.

Although not considered part of our project, this comparative work is essential for understanding the intellectual and theological formation of the author or translator. The degree to which an author or translator accurately quotes or adapts a Biblical text can provide insights into their level of education, familiarity with the sources, and the theological or rhetorical objectives that guided their work. The program developed within this project would help identify the affinities of these quotations with specific Gospels.
However, detailed analysis and interpretation will still be necessary to fully understand the nuances and origins of these quotations.

In conclusion, the identification and analysis of Biblical quotations in OCS texts are complex tasks that require a combination of technological assistance and deep philological and theological expertise. While semi-automatic tools can greatly aid in the process, they cannot replace the nuanced analysis that only a trained scholar can provide. Through careful comparison with Greek and Latin sources, researchers can uncover the intricate layers of influence and modification that shape these medieval texts, offering valuable insights into the intellectual world of the time.

Whereas our case studies focus on Old Church Slavonic (OCS) texts dated around the 9th and 10th centuries, we expect that the developed methodology and algorithms are adaptable to applications for Church Slavonic (12th century and beyond) and possibly other medieval languages, not excluding Latin and Greek.

Acknowledgments
This contribution has received funding from the SEED4EU+ collaborative research programme under the project “Biblical OriGins in slavOnic texts – Systems for Language-modelled Observation and Verification” (acronym: BogoSlov), scheduled to run throughout the year 2025.

References
[1] R. Picchio, The Function of Biblical Thematic Clues in the Literary Code of “Slavia Orthodoxa”, in: Slavica Hierosolymitana, volume 1, Magnes Press, Hebrew University, Jerusalem, 1977, pp. 1–31.
[2] A. Naumov, kartotece cerkiewnosłowiańskich użyć biblijnych. Cytaty biblijne w staroruskiej części Kodeksu Uspienskiego, Rocznik Slawistyczny 44 (1983) 21–29.
[3] E. Hue-Gay, L. Mellerin, E. Morlock, TEI-encoding of text reuses in the BIBLINDEX Project, Journal of Data Mining & Digital Humanities Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages (2017). doi:10.46298/jdmdh.3989.
[4] M. Moritz, A. Wiederhold, B. Pavlek, Y. Bizzoni, M. Büchler, Non-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and its Application to Bible Reuse, in: J. Su, K. Duh, X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 1849–1859. doi:10.18653/v1/D16-1190.
[5] H. Eckhoff, Automatic Alignment of the Psalterium Sinaiticum and the Septuagint Psalms, - (2021) 71–90. URL: https://www.ceeol.com/search/article-detail?id=1005901.
[6] M. Düring, M. Romanello, M. Ehrmann, K. Beelen, D. Guido, B. Deseure, E. Bunout, J. Keck, P. Apostolopoulos, impresso Text Reuse at Scale, 2023. URL: https://hal.science/hal-04151808.
[7] G. Franzini, M. Passarotti, M. Moritz, M. Büchler, Using and Evaluating TRACER for an Index Fontium Computatus of the Summa contra Gentiles of Thomas Aquinas, in: E. Cabrio, A. Mazzei, F. Tamburini (Eds.), Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), volume 2253, CEUR-WS, Torino, Italy, 2018. URL: https://ceur-ws.org/Vol-2253/#paper22.
[8] D. A. Smith, R. Cordell, A. Mullen, Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers, American Literary History 27 (2015) E1–E15. doi:10.1093/alh/ajv029.
[9] A. Vesanto, A. Nivala, H. Rantala, T. Salakoski, H. Salmi, F. Ginter, Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910, in: G. Bouma, Y. Adesam (Eds.), Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, Linköping University Electronic Press, Gothenburg, 2017, pp. 54–58. URL: https://aclanthology.org/W17-0510/.
[10] F. Periti, P. Cassotti, S. Montanelli, N. Tahmasebi, D. Schlechtweg, TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 13972–13990. URL: https://aclanthology.org/2024.emnlp-main.774.
[11] T. Mikulka, Textological, linguistic and theological features of the newly identified corpus of Old Church Slavonic homilies, Slavia 92 (2023) 610–624. doi:10.58377/slav.2023.5.05.
[12] A. Islam, D. Inkpen, Semantic text similarity using corpus-based word similarity and string similarity, ACM Trans. Knowl. Discov. Data 2 (2008) 10:1–10:25. doi:10.1145/1376815.1376819.
[13] D. W. Prakoso, A. Abdi, C. Amrit, Short text similarity measurement methods: a review, Soft Computing 25 (2021) 4699–4723. doi:10.1007/s00500-020-05479-2.
[14] M. Berti, C. Blackwell, M. Daniels, S. Strickland, K. Vincent-Dobbins, Documenting Homeric Text-Reuse in the Deipnosophistae of Athenaeus of Naucratis, Bulletin of the Institute of Classical Studies 59 (2016) 121–139. doi:10.1111/j.2041-5370.2016.12042.x.
[15] T. Yousef, S. Janicke, A Survey of Text Alignment Visualization, IEEE Transactions on Visualization and Computer Graphics 27 (2021) 1149–1159. doi:10.1109/TVCG.2020.3028975.
[16] K. Hornbæk, M. Hertzum, Technology Acceptance and User Experience: A Review of the Experiential Component in HCI, ACM Transactions on Computer-Human Interaction 24 (2017) 1–30. doi:10.1145/3127358.
[17] F. Paas, J. J. G. Van Merriënboer, Cognitive-Load Theory: Methods to Manage Working Memory Load in the Learning of Complex Tasks, Current Directions in Psychological Science 29 (2020) 394–398. doi:10.1177/0963721420922183.
[18] A. Morollon Diaz-Faes, C. S. R. Murteira, M. Ruskov, Values That Are Explicitly Present in Fairy Tales: Comparing Samples from German, Italian and Portuguese Traditions, Journal of Data Mining & Digital Humanities NLP4DH (2024). doi:10.46298/jdmdh.13120.
[19] M. Ruskov, L. Taseva, Computer-Aided Modelling of the Bilingual Word Indices to the Ninth-Century Uchitel’noe evangelie, in: Proceedings of the Workshops and Doctoral Consortium of the 26th International Conference on Theory and Practice of Digital Libraries, 2022, pp. 19–30. URL: http://ceur-ws.org/Vol-3246/03_paper-6921.pdf.
[20] P. Cassotti, L. Siciliani, M. DeGemmis, G. Semeraro, P. Basile, XL-LEXEME: WiC Pretrained Model for Cross-Lingual LEXical sEMantic changE, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1577–1585. doi:10.18653/v1/2023.acl-short.135.
[21] I. Muneer, R. M. A. Nawab, Cross-Lingual Text Reuse Detection at sentence level for English–Urdu language pair, Computer Speech & Language 75 (2022) 101381. doi:10.1016/j.csl.2022.101381.
[22] Q. Dombrowski, From Annotation to Modeling: Computational Horizons for Medieval Slavic Studies, Scripta & e-Scripta (2021) 11–21. URL: https://www.ceeol.com/search/article-detail?id=994247.
[23] Z. Wang, K. K, S. Mayhew, D. Roth, Extending Multilingual BERT to Low-Resource Languages, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2649–2656. doi:10.18653/v1/2020.findings-emnlp.240.
[24] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, I. Gurevych, How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3118–3135. doi:10.18653/v1/2021.acl-long.243.
[25] P. Singh, A. Maladry, E. Lefever, Too Many Cooks Spoil the Model: Are Bilingual Models for Slovene Better than a Large Multilingual Model?, in: J. Piskorski, M. Marcińczuk, P. Nakov, M. Ogrodniczuk, S. Pollak, P. Přibáň, P. Rybak, J. Steinberger, R. Yangarber (Eds.), Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023), Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 32–39. doi:10.18653/v1/2023.bsnlp-1.5.
[26] M. Cuscito, A. Ferrara, M. Ruskov, How BERT Speaks Shakespearean English? Evaluating Historical Bias in Contextual Language Models, 2024. URL: http://arxiv.org/abs/2402.05034, arXiv:2402.05034 [cs].
[27] J. Pfeiffer, I. Vulić, I. Gurevych, S. Ruder, UNKs Everywhere: Adapting Multilingual Language Models to New Scripts, 2021. doi:10.48550/arXiv.2012.15562, arXiv:2012.15562.
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. doi:10.48550/arXiv.1907.11692, arXiv:1907.11692.
[29] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019. doi:10.48550/arXiv.1908.10084, arXiv:1908.10084.