Ukrainian Redaction of Church Slavonic (URCS): Needs for Digitalization and Text Corpora Platform Generation. Part I. Nataliya Popovych 1, Andriy Lutskiv2, Oleksandr Mitsa 1, Olha Lyntvar3 and Andriana Ivanova 1 1 Uzhhorod National University, Narodna Square, 3, Uzhhorod, 88000, Ukraine 2 Ternopil Ivan Puluj National Technical University, Ruska 56, Ternopil, 46001, Ukraine 3 National Aviation University, Liubomyra Huzara Ave. 1, Kyiv, 03058, Ukraine Abstract The article explores the issues of digitalization and possibilities of text corpus generation for the Ukrainian Redaction of Church Slavonic (URCS). This endangered language is still in use in Liturgical Services in certain regions of Ukraine (mostly Zakarpattia and Lviv Regions), as well as on bordering territories of Slovakia, Romania, and Poland. The given research is on its initial stage. It provides a brief overview of the URCS language history, and examines the interconnections between the Ukrainian language and URCS through examples from its usage in the modern Ukrainian language, as well as in Ukrainian literature of different genres and time periods. In addition, (4) the article suggests preservation and conservation approaches and strategies to URCS digitalization, including the creation of the text corpora platform, from both the user and developer perspectives, with the aim of ensuring its survival for future generations. The given article outlines a set of issues aimed at being solved through (1) the analysis and the classification of URCS text collections, (2) review of Ukrainian corpora and corpus tools available for Ukrainian speaking target users in the open access payment free as well as on reasonable fixed-price basis, (3) corpus-based analysis provided on the examples of URCS lemmas used in the modern Ukrainian language, in the texts of Ukrainian literature of different time periods as well as comparative analysis of URCS texts and their Ukrainian translations focusing on the accuracy, adequacy and faithfulness of specialized terminology and concepts. Furthermore, the article explores the potential of creating digital text corpora for URCS by utilizing modern technologies and methods by the interdisciplinary research team of linguists and IT experts. The creation of such a corpora platform could facilitate linguistic research, including studies on vocabulary, grammar, and syntax, as well as studies on the specific terminology, historical and cultural aspects of the language. The research also highlights the need for further collaboration among linguists, and digital experts to enhance the preservation and promotion of the URCS. Keywords 1 Ukrainian Redaction of Church Slavonic (URCS), URCS text analysis platform (software), parallel corpus, comparable corpus, NLP, SketchEngine, mova.info, LancsBox 6.0. URCS, URCS text analysis platform (software). 1 COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Kharkiv, Ukraine EMAIL: nataliya.popovych@uzhnu.edu.ua (N. Popovych); l.andriy@gmail.com (A. Lutskiv); alex.mitsa@gmail.com(A.Mitsa); olha.lyntvar@npp.nau.edu.ua (O. Lyntvar); andriana.ivanova@uzhnu.edu.ua (A.Ivanova); ORCID: 0000-0001-6949-0771 (N. Popovych); 0000-0002-9250-4075 (A. Lutskiv); 0000-0002-6958-0870 (O. Mitsa); 0000-0003-4671-5514 (O. Lyntvar); 0000-0002-1733-4416 (A. Ivanova) ©️ 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 1. Introduction The need for the creation of a URCS Text Analysis Platform (Software) that will preferably include parallel and comparable text corpora is urgent and significant for several reasons. Firstly, the URCS language is rapidly disappearing from current Church (liturgical) use. Secondly, there is constant planned substitution of it by the Russian Redaction of Church Slavonic in Ukraine (e.g. in Zakarpats’ka oblast’) [1]. Finally, the preservation of the unique heritage of URCS in text and chant, which is still present in Church use now, is crucial for future generations. The subject matter requires scientific exploration, analysis, and discussion due to the absence of adequately proven research results in the English-speaking scientific community and the lack of broad insight, deep comprehension or even solid awareness of URCS language issues within the global research community today. Despite the assiduous efforts of prominent Ukrainian scholars, such as Pavlo Hrytsenko and in particular Vasyl' Nimchuk, who demonstrated unwavering dedication to addressing URCS research problems, persistent issues remained unresolved in this field. Their work has remained unknown due to possibly weak information dissemination and a lack of research result sharing and subsequently adequate research result impact outside the Ukrainian-speaking scientific community. Nevertheless, Nimchuk's PhD students and colleagues continue effectively carrying out research in this area [2]. The research focuses on the following main issues: 1. The threat of extinction facing the URCS language due to its forced substitution by the Russian Redaction of Church Slavonic by Moscow authorities [3, 1]. 2. The lack of evidence and research on the presence of URCS in Ukrainian literature of different time periods within English scientific literature, reviews, and articles on the subject matter. 3. The lack of evidence and research on the presence of URCS in modern Ukrainian language within English scientific literature, reviews, and articles on the subject matter. 4. The importance of digitalization to meet the needs and purposes of preserving URCS. 5. The main requirements of a URCS corpus or corpus-based tool. Suggestions for the best approaches to digitalizing the URCS language, taking into account a vast majority of already existing corpora, are being developed to solve different specific tasks. The outlined issues, which are aimed at being resolved, will require a series of analyses and reviews to be conducted in future research. These include: (1) analyzing and classifying URCS text collections, (2) reviewing Ukrainian corpora that are available for Ukrainian-speaking users in open access or on a reasonably fixed-price basis, (3) conducting corpus-based analyses on URCS lemmas in transliterated versions of URCS texts, modern Ukrainian texts of different genres and styles, and Ukrainian literature from different time periods. In order to meet the specific needs of researchers, it is necessary to develop a URCS query platform that includes parallel and comparable corpora of text and chant, which enable a comparative analysis of URCS texts and their Ukrainian translations. The development of such a platform should take into account the accuracy, adequacy, and faithfulness of the translation of URCS specialized terminology and concepts. Digitalization and text corpus generation are essential for preserving the Church Slavonic of Ukrainian Redaction, which is an important aspect of Ukrainian cultural heritage and religious traditions. URCS is still used in Zakarpattia, Bukovyna and Lviv regions of Ukraine, where it has been preserved as a proof of its integral part of religious and cultural life in Ukraine for centuries. In conclusion, creating a digital corpus of URCS would enable scholars and researchers to analyze its grammar (syntax), and vocabulary systematically. This would also facilitate the study of Ukrainian translations of URCS texts and allow for important amendments and corrections to be made. 2. Brief Historical Overview of the Church Slavonic of the Ukrainian Redaction (URCS): the Past and the Present The URCS language has been the subject of extensive research by Academician V. Nimchuk and scientists at the Institute of the Ukrainian Language of the National Academy of Sciences of Ukraine (Institute of the Ukrainian Language NASU), resulting in numerous works in the Ukrainian language that explore its complexities and challenges. The historical overview presented here is primarily based on the works of V. Nimchuk, N. Puriaeva, other scientists from the Institute of the Ukrainian Language at the National Academy of Sciences of Ukraine, H. Kuzems’ka, M. Skab, as well as M. Moser and the authors' own research. The authors have given due consideration and respect to other scholarly works in the Ukrainian language that focus on the history, functionality, development, and significance of the URCS language. Many ethnic languages, such as Greek, Latin, Arabic, and Old Slavonic – Church Slavonic, have been elevated to the status of regional or world sacred languages. These sacred languages often cease to be used in daily communication and instead become reserved for cult purposes, acquiring the label of “dead languages” [3, 4]. The language traditionally known as Old Church Slavonic (OCS) was the language of religious practice in the territory of East Slavia from the 11th to 13th centuries. Prior to becoming the language of Kyivan Rus’, the OCS language underwent three to four intermediate periods of its development [1, 2, 3, 4]. The first Old Church Slavonic texts were translated by St. Cyril and St. Methodius from Ancient Greek based on the language spoken by the Slavonic population of their native town Thessaloniki. When the saints arrived in Great Moravia, they had to adapt those texts to the local language spoken in the kingdom. As a result, the oldest monument of Old Church Slavonic, the Hlaholitic Leaflets, is characterized by the distinct use of the Czech and Slovak languages of that time. Then the brothers moved to Pannonia, where they had to adapt the Old Church Slavonic language for the Slavic population in that region as well. From there, the disciples of St. Cyril and St. Methodius spread the use of Old Church Slavonic to other regions in two directions: Croatia and the Bulgarian state [3,4]. Another significant period of Old Church Slavonic development occurred in the Bulgarian Empire, where it flourished and became enriched. Macedonia, which was part of the Bulgarian state and had important centers of book culture at that time, contributed to the language's enrichment too. Hence, the language obtained the colorful and distinct features of the Bulgarian and Macedonian language mix. This language, known as Old Bulgarian, was inherited by the first Eastern Slavic Church communities in the 10th century. After Christianity was introduced as the official religion in Kyivan Rus’ in 988, Old Church Slavonic of the Old Bulgarian Redaction was spread throughout Rus’. However, it immediately began to be influenced by the communicative features of the Eastern Slavic language – the language of the local population. By the end of the 11th century, this language had acquired the common characteristics of the Eastern Slavic Redaction and was used as a language of the Church alongside the Old Rus’ standard language [3, 4, 5, 6]. The East Slavic Redaction of Old Church Slavonic had distinct features of phonetics and morphology, and as books were edited in the capital of Rus’, many colloquial words began to appear in the Old Church Slavonic texts, including liturgical ones. In particular, Old Kyiv words were frequently used in them. Old Slavic texts were read in various regions of Kyivan Rus’ according to the native pronunciation of the reader. The pronunciation of the capital city, Kyiv, which was a church center and metropolitan city, was regarded as the standard and as an exemplary. For instance, it is widely accepted that in the southern and southwestern regions of Rus’, the letter "г" was pronounced as a guttural sound similar to modern Ukrainian. By the mid-13th century, the Old Slavic language had transformed into a variety that was typical of the Ukrainian language in the Kyivan state. This version of the East Slavic variant of the Old Slavic language was in use from the mid-12th century to the end of the 13th century, and since that time and later its Redactions are referred to as Church Slavonic by philologists. While modern-day Russia and Belarus have gradually developed their own Redactions of the Church Slavonic language, orthoepy of the capital and metropolitan Kyiv exerted significant influence and authority in these regions. Evidence of this can be seen in the liturgical orthography of Russian Old Believers in the north of Russia today, who still use the letter "г" as a back-palatal fricative, which partially corresponds to Ukrainian, and also pronounce hard consonants before "e," similar to the Ukrainian pronunciation [3, 5]. Starting from the end of the 18th century, two redactions of the Church Slavonic language coexisted in various Ukrainian denominations: the Old Kyiv or the Church Slavonic language of Ukrainian Redaction in the Greek Catholic Church, which was displaced from Right-Bank Ukraine to the territory of the Austrian Empire and continued to develop as the Ukrainian Greek Catholic Church (UGCC) in 1795, and the Old Moscow Redaction in the Orthodox Church on the territory of Ukraine. This determined the further development of the Ukrainian liturgical language in these denominations. In the UGCC, the liturgical Church Slavonic language was never used as an instrument of national assimilation of Ukrainians. The preservation of the Ukrainian pronunciation allowed it to be perceived as a chronological (old Ukrainian), functional (church, as opposed to secular) and stylistic (highly literary, as opposed to everyday spoken language) variant of the Ukrainian language. The URCS language was perceived by Greek Catholics as the language of their native faith, rite, and therefore, their native (not foreign) language [5, 6, 7, 8, 9]. 2.1. The URCS CS Language Today Following the forced displacement of the autochthonous Ukrainian Redaction of the Church Slavonic language by the Russian Redaction, there are now only few remaining regions in modern Ukraine where this language is still actively used in liturgical practice. These regions likely include three oblasts of Ukraine, namely Lviv, Zakarpattia, and Chernivtsi. Within the Lviv oblast, the Univ Lavra of Holy Ascension (UGCC) is considered to be the primary center for the continuous and constant use of the Ukrainian Redaction of the Church Slavonic language [10]. The Ukrainian Greek-Catholic Church (UGCC) on the whole has a natural inclination to preserve the Ukrainian Redaction of the Church Slavonic language in its liturgical practices, given that it is the official liturgical language of this Church. This language, being the liturgical matrix of the UGCC, forms the basis for all other liturgical translations, even though they are now used more frequently than the URCS language [11]. The URCS language is used as a main liturgical language alongside modern Ukrainian in the Greek Catholic Diocese of Mukachevo in Zakarpatska oblast [12]. 2.2. Main Features of URCS Pronunciation Thus, we may claim that every CS redaction has their own orthoepy, which significantly impacts the way the text is transliterated [7]. By transliteration we mean the transfer of a text lettering into the target alphabet. Hence, it looks more like a reproduction of a language at a phonetic level preserving its characteristic features. Every Slavic Church is also reflected differently in terms of phonetic representation. Regardless of the place where theologians-writers and composers created their sacred-language texts – Bulgaria, Romania, Slovenia, Ukraine or Russia, every people is bound to transliterate their texts following the language norms of their land [4, 5, 13]. Ukrainian pronunciation is engraved in the world famous “Grammar” by archbishop Meletius Smotryts’kyi, which fixed some foreign and later language forms though, still preserved Ukrainian stress and traditional pronunciation of letters: г, e, е, и; argued that l was pronounced as і and not є (according to Russian tradition) [14]. This specific Ukrainian phonetics was preserved regardless of the numerous tsarist and synodal decrees up to the end of the eighteenth century when following the Valuev Circular (1863) and Ems Ukaz (1876) the ruinous attack on Ukrainian orthoepy was launched. The language was humiliated, nicknamed ‘distorted Russian’, ‘twisted Polish’, ‘language of the lowest societal layers’[13]. But all this did not prevent the Ukrainian people from praying in the language of their ancestors. Since our aim is to highlight the importance of preserving old Ukrainian tradition of liturgical texts, we, following V. Nimchuk, claim that Ukrainian redaction of Church Slavonic, which was mainly revealed through Ukrainian orthoepy at a desolate time of lacking Ukrainian statehood acted as one of those spiritual and cultural factors that guarded the integrity of Ukrainians as an ethnic group. This redaction contributed to the formation of a single cultural and linguistic area, was a characteristic feature of the Ukrainian Church, Ukrainian identity [3, 4, 13]. To conclude, we want to stress the importance of preserving the Ukrainian redaction of Church Slavonic as a source of Ukrainian national identity, whose linguistic and cultural heritage was appropriated by Russian church culture which allowed them to distort both Ukrainian realities, create myths and build their own deceitful linguistic, historical and cultural background. 3. Ukrainian Corpora and Corpus Platforms In a previous publication, we presented a classification and overview of corpora. In this research, we use a three-fold classification of corpora that includes content-based corpora and corpus tools, functional annotation set and aim-based corpora, and generation-based corpora [14]. ● Content-based corpora and corpus tools ● Functional annotation set and aim-based corpora ● Generation-based сorpora [14] Content-based corpora and corpus tools can be further categorized into national, professional, parallel, comparable, specialized, and task-based (adaptable or mixed) [14]. Among the various corpus platforms available to Ukrainian users, we focus on two projects: the Corpus of the Ukrainian Language developed by N. Dartchuk, O. Siruk, M. Langenbach, Ya. Khodakivska, and V. Sorokin at the Institute of Philology of TKU in Kyiv [15] and the Laboratory of Ukrainian and the General Regionally Annotated Corpus of Ukrainian (GRAC) [16]. These projects are among the most developed of the Ukrainian corpora and corpus tools [14]. Mova.info is a corpus platform of the Ukrainian language, which allows users to search and analyze a large collection of Ukrainian texts. The platform contains a diverse range of texts from different genres and time periods, including literary works, scientific papers, news articles, and more. Users can search for specific words or phrases, view concordances and collocations, and perform various types of linguistic analysis. We conducted a small experiment to explore the use of URCS words in Ukrainian literature. The experiment was carried out using the mova.info corpus platform. Out of 100 randomly selected URCS words, 80 were found to be used in literary works within the corpus. This experiment highlights the connection between URCS and the literary language of Ukraine. GRAC is the Corpus of the Ukrainian language, which counts 1.875 billion tokens in its 16 version. SketchEngine is another multilingual text analysis software, which provides corpora in 14 languages, including Ukrainian. The Ukrainian corpus is presented as ukTenTen – Ukrainian corpus from the web [17]. The Ukrainian Web Corpus (ukTenTen) is a Ukrainian corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family which is a set of web corpora built using the same method with a target size 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 40 languages. Data for the Ukrainian Web 2020 corpus consists of texts from May 2014 and July–August 2020. The Wikipedia part is from December 2020. The final size of the corpus contains 2.5+ billion words [17]. There are 3,282,586,754 tokens, 2,592,516,436 words, 129,751,817 sentences, as well as 7,204,875 web pages. The Ukrainian Web 2020 corpus is lemmatized by CSTLemma and part-of-speech tagged by RFTagger using two different tagsets (MULTEXT-East Ukrainian PoS tagset, which is more-detailed and Universal Dependencies PoS tagset showing only basic parts of speech) [17]. Figure 1: Concordance Search Query Result for URCS Lemma “hlaholaty” in Ukrainian Literature Corpus on mova.info. A complete set of Sketch Engine tools is available to work with this Ukrainian Web corpus to generate ● keywords– terminology extraction of one-word units ● word lists – lists of Ukrainian words organized by frequency ● n-grams– frequency list of multi-word units ● concordance – examples in context ● text type analysis – statistics of metadata in the corpus [17]. 4. Developing the URCS Corpus Platform The development of the URCS corpus platform must take into account the previous approaches as well as the needs and expectations of users. Therefore, it is important to consider the user side during the development process. Firstly, the URCS platform must be user-friendly, take into account different types of users (ordinary people interested in the URCS vocabulary and texts, medium level experts, and linguists) and be easy to navigate. Secondly, the platform should offer a diverse collection of URCS texts, encompassing liturgical, literary, and historical works, which can be subjected to lemmatization and morphological analysis. Users should have access to a broad range of texts for linguistic research purposes. These texts are typically printed in the URCS alphabet. However, one issue that arises is the lack of consistency in the use of this alphabet in such publications. Thirdly, the platform should allow for easy downloading and exporting of texts in various formats, such as OCR-ed PDF or TXT and others as it shown on Figure 1. This will allow users to conduct their own analysis and research of uploaded corpus or corpora and download the received results. Figure 2: Options and Supported Formats on SketchEngine Platform Unfortunately, all scanned, but not OCR-ed pdf texts are not supported by the platform. In this case neither the user, nor the corpus will benefit. Users will not be able to submit their own texts or suggest corrections to existing texts. Finally, the platform should provide support and resources for users who may not be familiar with the URCS language or digital corpus research. This can include user guides, tutorials, and a help desk for technical support as it is provided by many corpus tools like LancsBox 6.0 and its former versions and corpus platforms like SketchEngine or ГРАК-16 [16, 17]. 4.1. Previous Backgrounds in Development Section 4.1 presents examples of team work results (on the material of Bible books in different languages), which serve to illustrate the developmental context and potential solutions that the team may employ for generating the URCS text corpora platform. In our previous team publications, the solutions for adaptable text corpus development for specific linguistic research were suggested [14]. It was also described the effectiveness of automated linguistic analysis using a big data approach [18, 19, 20]. It is worth providing here its data processing workflow implementation [14] and computational experiments [18]. According to the nature of input data we used approaches for Big Data processing, so software should fulfill these requirements. Developed corpus tool prototype was based on software components of Apache Hadoop ecosystem (Hortonworks Data Platform 3.1). Suggested corpus tool was implemented basing on Lambda architecture. Application was developed with Java 8 programming language and Spring Framework 5. Workflow was implemented with Apache Spark 2.3 [21] components: stages of workflow implemented as a Stanford Core NLP Pipelines in Apache Spark SQL using Spark Datasets which were well supported in Java. Pipelines were implemented by using appropriate libraries. Apache Tika with TesseractOCR used for Data ingestion of source data in binary formats (images, raster and vector PDFs, DOC, DOCX). Bliki-core and edu.umd.cloud9 libraries used for handling Wikipedia’s tags. Main workflow steps 1-9 provided by Stanford Core NLP [22] library and LangTool. Nowadays LangTool has the best support of Ukrainian language for purposes of POS-tagging and text spellchecking. Workflow step 10 was implemented with Apache Spark MLLib: TF-IDF calculation, matrix operations in LSA. Workflow step 11 was implemented with Word2VecfJava library [23]. Steps 13 and 14 were implemented with available API of Wikipedia, Dictionary of Ukrainian Language [24], Oxford Dictionaries API [25] and Glosbe API [26]. Building inverted index was done with Apache Lucene Library. Index was stored in Apache Solr which was embedded into HDP 3.1 platform. Vocabulary of gathered metadata from step 12 stored into RDBMS PostgreSQL 9.6 to minimize time of data access. Figure 3: Data Processing Workflow Computational experiments carried out on Wikipedia dumps and open text documents of Ukrainian and English texts. In the implementation process free or/and open source software were used. Data source were open or free of charge [14]. For computational experiment corpora of different editions, translations and languages of the Bible were compiled to verify the suggested approach. For experiment there were taken the books which were used for mobile applications in SQLLite format and were imported into RDBMS PostgreSQL to work with Apache Spark [18, 19]. Due to that investigation every Bible edition was treated as a subcorpus, i.e., a set of chapters. Each chapter had its own sentences and terms. After ETL the most important keywords, term POS-tags, relations between terms and other features for each chapter were obtained. After statistical processing each subcorpus (book) and its documents (book chapters) had its own characteristics. Some books were logically subdivided into stories, but the number of stories depended on translation and varies from zero to 1252 and other books contained less number of translated books. Due to those two factors and in order to provide more accurate results it was suggested to divide each book into documents by chapter criterion and to prepare custom ETLs for different types of books. After ETL there were to be done the following processing steps: sentence and word tokenization, calculation of terms frequencies, finding collocations (N-grams with high probabilities), POS-tagging, stop words filtering, lemmatization, calculation of TF-IDFs, building of term-document matrix with TF-IDFs, SVD with obtaining low- dimensional term-document matrix representation. Developed adaptable corpus also allowed to choose the custom k-value [18, 19]. Two other successful experiments were conducted, one focused on the effectiveness of automated linguistic analysis using a big data-based approach, while the other developed an adaptable corpus translation module [18, 19, 20, 27]. The developing team has also an extensive experience in developing software to meet the needs of cultural sector. In particular, the professionally developed map of Ukrainian dialects [28] displays the settlements where the entered word is still being used. The map contains a big amount of dialect data (more than 32 thousand words) and enhances language diversity preservation. The client part of an interactive map is created with the help of the library React.js, programming language JavaScript and the library of managing the sate MOBX. The server part of the informational system is written in the programming language Ruby using framework Ruby on Nails. Relational DBMS Postgresql has been used as the primary database along with Redis cache for caching some of the most frequently used data. One more development [29] is connected to displaying the toponyms taken from web-portal “Diia” developed by Ministry of Digital Transformation of Ukraine and with help of data provided by the Institute of National Remembrance. The processed data have been visualized with the help of cartographic web-service Google Maps. Every decommunized object got a pin on the map. The interactive map has been integrated to the web portal Analitycs-UA. 4.2. User Interaction Processes and Technologies’ Overview In the end of this section we provide a brief overview of the user interaction processes and technologies used as well as technical details related to the implementation. The project contains an authentication flow that is designed to allow users to securely log in to our platform and access their accounts. After successful authorization, the user has access to the main feature which is the creation of the corpora, their viewing and interaction with it. The system provides two options for corpus creation: manual text formation or based on text file generation. Furthermore, there are options for editing texts on the selected pages in the corpus and viewing experience that allows users to read content as a physical book. The project uses several technologies to deliver a seamless experience. The system uses Next.js for the front end and Nest.js for the API side. Next.js is a React-based framework that allows us to build powerful and performant user interfaces for the platform. On the back end side, Nest.js and PostgreSQL are used to manage server-side platform operations. Also being a cloud provider AWS S3 feature, it is highly scalable and secure object storage, designed to store and retrieve any amount of data from API and it is used for storing blob text files. The project leverages cutting-edge technologies and architectures to deliver a top-notch user experience. 4.3. Search Query Experiments (SketchEngine) A good example of user-friendly software is SketchEngine, which we tested by creating our own URCS corpus. Following a clear instruction [30], one of the URCS text in PDF was uploaded to the platform (scanned images have been OCRed before uploading). For our search query experiments we use the prayer book "Promin dushi" edited by Greek Catholic Diocese of Mukachevo [31]. The text is written in the URCS language, but transliterated using the letters of the modern Ukrainian alphabet based on the phonetic principle ("write as you hear"). The only difference between this URCS text and modern Ukrainian writing is the presence of stress mark on almost every word, except for function words (stop words). The example of one search query result can be seen on Figure 4. Figure 4: 3-4-Ngrams, URCS Corpus, Uploaded to SketchEngline Platform Given the diverse origins of URCS texts in terms of time and place of publication, a standardized alphabet and consistent use of diacritical marks poses a significant challenge for the development of the URCS corpus platform. The only solution might be to segment the texts into separate corpora based on their historical periods. This approach would enable scholars and researchers to analyze and compare different versions of the URCS language through creating parallel as well as comparable corpora which will also help tracing the development of its linguistic features over time. Figure 5: Psalm 103 in URCS, and in URCS using the Stress Marked Ukrainian Transliteration Despite the successful creation of the URCS corpus on the SketchEngine platform, the search results obtained are insufficient. Only the ngram search query produces accurate results while all other search functions fail to work or display inaccurate outcomes. The reason behind this issue is the mismatch between the grammar and stylistics of our URCS corpus and the Ukrainian language, resulting in only transliterated text that uses the modern Ukrainian alphabet being matched by the search functions. Figure 6: POS-tagging of URCS Search Query Showing the Wrong Initial Grammatical Forms As we can understand, the search functions are likely not able to accurately analyze the linguistic properties of the text. Solution 1: Preprocessing the URCS text may facilitate the recognition of the linguistic features of the original language, leading to more accurate search results. However, the feasibility of this solution is contingent upon the availability of tools and resources for giving equal value to the transliterated text grammatically and stylistically as original URCS text, as well as the compatibility of the transliterated URCS text with SketchEngine's search functions. Solution 2: An alternative approach is to utilize a different platform or tool that is better suited to the URCS language and corpus. The selection of an appropriate tool may require careful evaluation of its capabilities in handling the specific linguistic properties of the Church Slavonic language of the Ukrainian Redaction. Solution 3: To ensure the preservation and accessibility of the extant URCS texts, it is imperative to create curated collections of the texts and develop a tailored corpus platform that can accommodate the unique linguistic features of the URCS language and the specialized search queries required for meaningful research. Such a custom platform would require significant resources and expertise, but it would enable scholars and researchers to conduct more accurate and targeted analyses in the URCS corpora, thus contributing to a deeper understanding of this historically significant language. By taking into account the user side during the development of the URCS corpus platform, the developer team should ensure the needs and expectations of users and provide (1) a valuable resource for linguistic research and (2) preservation efforts. The main expectations of the user are as following. ● different types of text data ingestion (URCS of different publishing periods) ● text processing ● semantic tagging of each part of a corpus (e.g. UCREL Semantic Analysis System [32]) ● qualitative and quantitative analyses which are based on different statistical characteristics ● comparison with different translations of the same text and map terms in different languages ● texts which will be stored and analyzed in the corpus should be chosen only by linguists to build proper dependencies and lead to proper statistics to prevent side effects in statistics ● linguists can choose proper calculation methods of text preprocessing and analysis and these methods should be customizable [19, 20] 5. Conclusions The article highlights the significance of preserving the Ukrainian Redaction of Church Slavonic language (URCS) through digitalization and the creation of text corpora. URCS is an endangered language that is still used in liturgical services in certain regions of Ukraine and neighboring countries. The article provides a brief history of URCS and explores its connection with the Ukrainian language. It also reviews Ukrainian corpora and corpus tools, and analyzes URCS lemmas using corpus-based techniques. To preserve the URCS language, the article suggests various preservation and conservation approaches, such as creating a URCS corpora platform or a separate corpus tool like LancsBox. The platform would facilitate linguistic research on vocabulary, grammar (syntax), specific terminology, and historical and cultural aspects of the language. Additionally, the development of a URCS query platform that includes parallel and comparable corpora of texts is necessary to meet the specific needs of researchers. The article identifies several issues that need to be addressed in the future, including the morphological analysis and classification of the URCS text collections, opportunities for lemmatization and text elaboration, as well as creating various search queries for users. Overall, the article stresses the significance of preserving the Ukrainian Redaction of Church Slavonic language (URCS) as a cultural and linguistic heritage of Ukraine for future generations. URCS is not only an important liturgical language alongside modern Ukrainian, but also a valuable source of linguistic, historical, and cultural knowledge. Therefore, the article proposes digitalization and corpus-based research as a means to conserve and promote the language, and to ensure its transmission to future generations. 6. References [1] V. Nimchuk, Yakoyu movoyu molylasya davnya Ukraina, Video, 2012. URL: https://www.youtube.com/watch?v=V_Lhfv5J4WA&t=334s&ab_channel=brownianbox [2] “Izbornyk. Istoriia Ukrainy IX-XVIII st. Pershodzherela ta interpretatsii”, (2003). URL: litopys.org.ua [3] V. Nimchuk, Ukrains’ka mova – svyashchenna mova, Liudyna i svit (1992), 11–12, 28–32. [4] V. Nimchuk, Literaturni movy Kyivs’koyi Rusi, Istoriya Ukrains’koyi kul’tury, 1 (2003). URL: http://litopys.org.ua/index.html [5] V. Nimchuk, Leksyka davn'orus'koi movy, Istoriia ukrains'koi movy: Leksyka i frazeolohiia, (1983), 29—163. [6] V. Nimchuk Davn'orus'ka spadschyna v leksytsi ukrains'koi movy, Kyiv, (1992). [7] M. Moser. Cerkovnoslov’jans’ka mova ukrajins’koji redakciji v dzerkali mizhnarodnoji slavistyki. Balcania et Slavia, 2, 2022, 133-142. [8] M. Skab, Mova Tserkvy v Ukraini kintsia XX – pochatku XXI st. yak chynnyk formuvannia natsional'noi svidomosti, Bohoslovs'kyj visnyk (2013), 8, 8-16. [9] N. Puriaeva, Ukrayns'ka mova v liturhijnij praktytsi ukrayns'kykh tserkov, Problemy humanitarnykh Nauk, Seriya «Filolohiya», (2018), 42, 128-146. [10] Univ Lavra of Holy Ascension (UGCC), (2007). URL: http://studyty.org.ua/index.php?option=com_files&Itemid=52; [11] Shevchuk S., Church Slavonic is the Official Liturgical Language of the UGCC (in Ukrainian), 2020. URL: https://synod.ugcc.ua/data/glava-ugkts-tserkovnoslovyanska-mova-ofitsiynoyu- liturgiynoyu-movoyu-ugkts-4276/ [12] The Greek Catholic Eparchy of Mukachevo. (in Ukrainian), updated 2023. URL: https://mgce.uz.ua/ [13] H. Kuzems'ka, Yakoiu movoiu molylasia davnia Ukraina: Pravyla ukrains'koi transliteratsii tserkovnoslov'ians'kykh tekstiv, Kyiv, KZhD ”Sofiia”, 2012. [14] Meletii Smotrytskyi. Hramatyka / Pidhotovka faksymilnoho vydannia ta doslidzhennia pamiatky V. V. Nimchuka. — K.: Naukova dumka, (1979), 111. 492 . (Faksymile). [15] N. Dartchuk, O. Siruk, M. Langenbach, Ya. Khodakivska, and V. Sorokin, Corpus of the Ukrainian Language (Ukrainian) (2023). URL: http://www.mova.info/corpus.aspx?l1=209 [16] General Regionally Annotated Corpus of Ukrainian (GRAC) (Ukrainian, English), (2023). URL: http://uacorpus.org/Kyiv/ua [17] SketchEnengine platform, (2023). URL: https://www.sketchengine.eu/ [18] A. Lutskiv, N. Popovych, Big data-based approach to automated linguistic analysis effectiveness, Proceedings of the 2020 IEEE 3rd International Conference on Data Stream Mining and Processing, DSMP, Lviv, (2020), 438-443. [19] A. Lutskiv, N. Popovych, Big data approach to developing adaptable corpus tools CEUR Workshop Proceedings, Lviv, (2020) 374-395. [20] A. Lutskiv, N. Popovych, Adaptable Text Corpus Development for Specific Linguistic Research, Proceedings of IEEE International Scientific and Practical Conference Problems of Infocommunications. Science and Technology, Kyiv, (2019), 217-223. [21] O. Levy, Dependency-Based Word Embeddings, 2014, URL: https://www.aclweb.org/anthology/P14-2050 [22] Stanford CoreNLP 3.9.2 (updated 2018-11-29). URL: https://corenlp.run/ [23] R. M. Reese, A. S. Bhatia, Natural Language Processing with Java, 2 nd ed., Birmingham: Packt Publishing, (2018), 318. [24] Academic Dictionary of the Ukrainian Language, 2018. URL: http://sum.in.ua/ [25] Oxford Dictionaries API, 2023. URL: https://developer.oxforddictionaries.com/ [26] Glosbe API, 2023. URL: https://glosbe.com/a-api [27] A. Lutskiv, R. Lutsyshyn, Corpus-Based Translation Automation of Adaptable Corpus Translation Module, CEUR Workshop Proceedings, Lviv, (2021), 2870, 511–527. [28] O.V. Mitsa, H.V. Shumytska, V.V. Sharkan, N.F. Venzhynovych &H.I. Dulishkovych, Interactive map of dialects as the professional training tool for philology students, Information Technologies and Learning Tools, vol. 88, no. 2, , (2022), 126–138. doi: https://doi.org/10.33407/itlt.v88i2.4787 [29] M. Lupei, M. Shlahta, O.Mitsa, Y. Horoshko, H. Tsybko & V. Gorbachuk, Development of an Interactive Map Within the Implementation of Actual State and Public Directions, in 2022 12th International Conference on Advanced Computer Information Technologies (ACIT), IEEE, (2022), 384-387. [30] Create a new corpus from files, 2023. URL: https://www.sketchengine.eu/guide/create-corpus- from-files/#toggle-id-1 [31] A. Solans’kyi, Promin dushi, Uzhhorod, (2017), 830. [32] The UCREL semantic analysis system, (2023). URL: https://www.researchgate.net/publication/228881331_The_UCREL_semantic_analysis_system [33] V. Brezina, P.Weill-Tessier, & A.McEnery, #LancsBox v. 5.x. [software], 2020. URL: http://corpora.lancs.ac.uk/lancsbox. [34] L.Bilen’ka-Svystovych, N. Rybak, Tserkovnoslovianska mova. Pidruchnyk zi slovnykom, 2012. [35] H. P. Klimchuk, Cerkovnoslov’jans’ki zapozychennja v publicystyci Mikhaila Grushevs’kogo, Filolohichni studii, 2009, 3, 53-64.