=Paper=
{{Paper
|id=Vol-3688/paper9
|storemode=property
|title=Ukrainian Big Data: The Problem of Databases Localization
|pdfUrl=https://ceur-ws.org/Vol-3688/paper9.pdf
|volume=Vol-3688
|authors=Victoria Vysotska,Ihor Shubin,Maksym Mezentsev,Karen Kobernyk,Grygoryy Chetverikov
|dblpUrl=https://dblp.org/rec/conf/colins/VysotskaSMKC24
}}
==Ukrainian Big Data: The Problem of Databases Localization==
Victoria Vysotska, Ihor Shubin, Maksym Mezentsev, Karen Kobernyk and Grygoryy Chetverikov

Kharkiv National University of Radioelectronics, Nauky ave. 14, Kharkiv, 61166, Ukraine

Lviv Polytechnic National University, Stepan Bandera Street, 12, Lviv, 79013, Ukraine

COLINS-2024: 8th International Conference on Computational Linguistics and Intelligent Systems, April 12–13, 2024, Lviv, Ukraine

Email: victoria.a.vysotska@lpnu.ua (V. A. Vysotska); igor.shubin@nure.ua (I. U. Shubin); maksym.mezentsev@nure.ua (M. A. Mezentsev); karen.kobernyk@nure.ua (K. S. Kobernyk); grigorij.chetverykov@nure.ua (G. G. Chetverykov)

ORCID: 0000-0001-6417-3689 (V. A. Vysotska); 0000-0002-1073-023X (I. U. Shubin); 0009-0004-0662-2078 (M. A. Mezentsev); 0009-0009-9385-4454 (K. S. Kobernyk); 0000-0001-5293-5842 (G. G. Chetverykov)

© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

<math>\frac{(b + c)/2}{d \times 0.01} = a</math> (1)

To demonstrate the capabilities of the prototype, a dataset in the Russian language was selected. This choice is not accidental: it allows the system to be tested in conditions as close as possible to the real needs of the Ukrainian market. Translation from Russian into Ukrainian requires not only high accuracy and preservation of context, but also a deep understanding of cultural and linguistic nuances. This poses complex tasks for the system, and solving them requires applying recent advances in machine learning and natural language processing, as implemented in the translation services used in this research.

Figure 1: Chart of different translation services (maximum characters per call)

The dataset uploaded into the database contains around 140,000 rows with dependencies, in 3 tables, for short texts.
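
Because the services in Figure 1 differ in the maximum number of characters accepted per call, the number of API calls needed for a dataset of this size depends on how rows are packed into requests. The following is a minimal sketch of one possible greedy packing strategy, not the procedure used in the paper. The function name `pack_rows` and the character limits in the usage loop are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List


def pack_rows(rows: Iterable[str], max_chars: int) -> List[List[str]]:
    """Greedily group rows into batches whose total length stays within max_chars.

    Hypothetical helper: each returned batch would correspond to one API call.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for row in rows:
        if len(row) > max_chars:
            # A row longer than the limit cannot be sent in a single call.
            raise ValueError("a single row exceeds the per-call limit")
        if current and used + len(row) > max_chars:
            # The current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(row)
        used += len(row)
    if current:
        batches.append(current)
    return batches


# Sample rows taken from Table 1; the limits below are assumed values.
sample_rows = [
    "Хто автор п'єси «Ромео і Джульєта»?",
    "Як називається столиця Румунії?",
] * 50
for limit in (500, 1000, 5000):
    print(limit, len(pack_rows(sample_rows, limit)))
```

Under this assumption, a higher per-call limit yields fewer calls for the same data, which is one way to read the differing API call counts reported for the services.
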
A localized example of the short texts is shown in Table 1. The dataset also contains about 70,000 rows of long paragraphs in a single table; a localized example is shown in Table 2. This dataset is openly available under the name "RuBQ" and was previously used to test English translation accuracy. Here it is used to test localization from Russian to Ukrainian.

{| class="wikitable"
|+ Table 1: Data entry example for short texts
! Uid !! Question !! Answer
|-
| 2 || "Хто автор п'єси «Ромео і Джульєта»?" || "Шекспір"
|-
| 3 || "Як називається столиця Румунії?" || "Бухарест"
|}

The "Uid" column holds the unique id assigned to a specific row and can be used to connect related rows from different tables. In our case, a relation was made between the "Question" table and the "Answer" table.

The "Question" column is a text field that contains questions on different topics. This type of data has to be translated correctly so that it can still be understood after translation, which makes it a good indicator of logical mistakes made by translation services.

The "Answer" column has the same type as "Question". It contains the answers to the questions stored in the previous column. It also has to be translated properly; otherwise the whole row in the database becomes invalid and unusable. These database relations are visualized in Figure 2.

{| class="wikitable"
|+ Table 2: Data entry example for long texts
! Uid !! Paragraph
|-
| 3098 || Впродовж британського правління аж до отримання незалежності від Малайзії в 1965 році Сінгапур і Республіка Китай мали дипломатичні відношення, які продовжилися і після проголошення незалежності.
|}

The "Paragraph" column holds historical data stored under a specific id. The size of a paragraph can vary, and paragraphs can be much longer than this example, up to 1000 symbols.

Figure 2: First dataset database visualization

The visualized database is the MySQL instance used during the tests, and it contains three databases. The databases have many-to-many relations, with a separate table storing the ids that link the two sides of each connection.

===4. Experiment===
A series of experiments was conducted to identify the best database management system for accommodating Ukrainian language-specific data. The experiment involved acquiring a dataset of JSON data representative of language content that needs translation. This dataset was imported into a MySQL database to measure the time needed to parse large quantities of data, combined with the time spent on translation and on data transfer to the translation services. The key evaluation metrics are the percentage of translation mistakes, the translation time for a single row, and the average time needed to perform a full translation and data-storing cycle.

Through comprehensive experimentation and analysis, we aim to identify the architecture that best addresses the localization challenges in Ukrainian big data [10]. By understanding the strengths and limitations of different database solutions, we can refine database localization strategies and contribute to the development of more efficient and effective data management practices tailored to Ukrainian language requirements.

The following steps were performed during the experiment:

1. Obtained JSON data from an open source

2. Loaded the data into a MySQL cluster

3. Filtered the text from the data

4. Sent the chosen text to the translator

5. Received the data back from the translator and saved it into a new instance

A visualization of the steps performed during the experiment is shown in Figure 3.

Figure 3: Visualization of steps during the experiment

When choosing a database for the experiment, various parameters have to be considered. Table 3 compares several popular databases. MySQL was chosen for several reasons:

• MySQL is one of the most popular databases and is widely used in software engineering.

• MySQL is one of the best choices for performance and stability. It can handle large amounts of data of any kind.
• It provides quick access to the data.

• It supports different types of queries and flexible data control.

• It has convenient data management.

Overall, it is the best choice for our goals: we have to load a large amount of data, and MySQL gives us easy control over the data and lets us perform any manipulations we need.

{| class="wikitable"
|+ Table 3: Databases comparison
! Parameter !! MongoDB !! Firebase !! MySQL
|-
| Ukrainian Language Support || Yes || Yes || Yes
|-
| Database Type || NoSQL || NoSQL || Relational
|-
| Performance || Average || Average || High
|-
| Scalability || Average || High || High
|-
| Data Management || Limited || Convenient || Convenient
|-
| JSON Support || Yes || Yes || Yes
|-
| Community Support || Large || Large || Large
|-
| Security Features || Limited || Comprehensive || Comprehensive
|-
| Indexing Capabilities || Limited || Limited || Extensive
|-
| Replication and Clustering || Limited || Limited || Comprehensive
|-
| Data Integrity || Limited || High || High
|-
| Stored Procedure Support || No || No || Yes
|}

===5. Results===
The results cover two types of data (short and long texts) and six translation services, all tested against the MySQL database. They are presented in one subsection per service, each with two tables: one for short texts and one for long texts.

====5.1. Google Translate====
Table 4 shows, for short texts, the percentage of mistakes, the time used to translate a single row, and the average process time from parsing the data through translation to uploading the data to the database.

{| class="wikitable"
|+ Table 4: Results for Google service with short texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~0.720 ms || ~0.187 ms || 854 || 5.76%
|}

Table 5 shows the results for long texts.

{| class="wikitable"
|+ Table 5: Results for Google service with long texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~0.419 ms || ~0.183 ms || 18969 || 5.61%
|}

====5.2. Meta translator====
Table 6 shows the results for short texts.

{| class="wikitable"
|+ Table 6: Results for Meta translator with short texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~2.11 s || ~0.620 ms || 4270 || 4.12%
|}

Table 7 shows the results for long texts.
{| class="wikitable"
|+ Table 7: Results for Meta translator with long texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~1.56 s || ~0.578 ms || 94845 || 3.72%
|}

====5.3. Reverso====
Table 8 shows the results for short texts.

{| class="wikitable"
|+ Table 8: Results for Reverso context service with short texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~1.19 s || ~0.691 ms || 2135 || 4.16%
|}

Table 9 shows the results for long texts.

{| class="wikitable"
|+ Table 9: Results for Reverso context service with long texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~0.998 ms || ~0.648 ms || 47423 || 3.97%
|}

====5.4. Onlinetranslator.eu====
Table 10 shows the results for short texts.

{| class="wikitable"
|+ Table 10: Results for onlinetranslator.eu with short texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~2.55 s || ~0.791 ms || 2135 || 4.21%
|}

Table 11 shows the results for long texts.

{| class="wikitable"
|+ Table 11: Results for onlinetranslator.eu with long texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~1.99 s || ~0.723 ms || 47423 || 3.69%
|}

====5.5. DeepL====
Table 12 shows the results for short texts.

{| class="wikitable"
|+ Table 12: Results for DeepL service with short texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~8.72 s || ~4.94 s || 854 || 2.21%
|}

Table 13 shows the results for long texts.

{| class="wikitable"
|+ Table 13: Results for DeepL service with long texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~7.88 s || ~5.22 s || 18969 || 1.87%
|}

====5.6. Translate.ua====
Table 14 shows the results for short texts.

{| class="wikitable"
|+ Table 14: Results for translate.ua with short texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~5.92 s || ~2.32 s || 4270 || 0.89%
|}

Table 15 shows the results for long texts.

{| class="wikitable"
|+ Table 15: Results for translate.ua with long texts
! Average process time !! Time for translation of single row !! API calls number !! Mistakes %
|-
| ~5.34 s || ~2.19 s || 94845 || 0.72%
|}

===6. Discussion===
Figure 4: Visualization of Average time / Mistakes chart

Based on the results, each translator has its own benefits, and the choice of service depends on the goal. If API call speed is the priority, a suitable service should have servers with low response times and a high per-call character limit, as with Google Translate or Meta. On the other hand, the best choice for translation accuracy is a local translator: such services are better adapted to the language environment and show a low percentage of mistakes even when the translated texts provide little context.

The choice of database and its workload had minimal impact on the average row process time, because a locally hosted solution was used. If database location, server load and internet connection come into play, the time can increase significantly.

===7. Conclusion===
In discussing the Big Data field in Ukraine and the difficulties associated with localizing databases, several important topics come to light. First, there are specific reasons for localizing databases in Ukraine. By respecting national interests and guaranteeing that sensitive information stays under national control, localizing data can improve data sovereignty, security, and regulatory compliance. However, putting successful database localization techniques into practice requires a strong technological foundation, technological investment, and adherence to global data standards. Secondly, the localization of databases in Ukraine is closely connected with more general problems of data privacy and data availability. Thirdly, adherence to localization specifications could result in extra expenses and regulatory complications, which could affect corporate operations and innovation. Furthermore, data localization strategies may obstruct international collaboration and cross-border data flows, which would reduce prospects for global integration and economic growth.
In summary, resolving database localization issues in Ukraine requires a sophisticated strategy that strikes a balance between national and international concerns. By adhering to the values of availability, accountability, and inclusion, Ukraine can successfully navigate the challenges of data localization and realize the full potential of its big data ecosystem for the good of society at large.

===8. References===
[1] A. Kopp, D. Orlovskyi, S. Orekhov, An Approach and Software Prototype for Translation of Natural Language Business Rules into Database Structure, CEUR Workshop Proceedings 2870 (2021) 1274-1291.

[2] M. Garcarz, Legal Language Translation: Theory behind the Practice, CEUR Workshop Proceedings, Vol-3171 (2022) 2-2.

[3] I. Shubin, A. Kozyriev, V. Liashik, G. Chetverykov, Methods of adaptive knowledge testing based on the theory of logical networks, in: Proceedings of the 5th International Conference on Computational Linguistics and Intelligent Systems, COLINS 2021, Lviv, Ukraine, 2021, pp. 1184-1193.

[4] M. Bur, K. Stirewalt, ORM ontologies with executable derivation rules to support semantic search in large-scale data applications, in: Proceedings - ACM/IEEE 25th International Conference on Model Driven Engineering Languages and Systems, MODELS 2022: Companion Proceedings, 2022, p. 81.

[5] A. Kopp, D. Orlovskyi, S. Orekhov, An Approach and Software Prototype for Translation of Natural Language Business Rules into Database Structure, CEUR Workshop Proceedings 2870 (2021) 1274-1291.

[6] A. Kulik, A. Chukhray, O. Havrylenko, Information Technology for Creating Intelligent Computer Programs for Training in Algorithmic Tasks. Part 1: Mathematical Foundations, System Research and Information Technologies 2022(4) (2022) 27-41. doi:10.20535/SRIT.2308-8893.2021.4.02

[7] H. Falatiuk, M. Shirokopetleva, Z. Dudar, Investigation of architecture and technology stack for e-archive system, in: 2019 IEEE International Scientific-Practical Conference: Problems of Infocommunications Science and Technology, PIC S and T 2019 - Proceedings, p. 229.

[8] K. Herud, J. Baumeister, Testing Product Configuration Knowledge Bases Declaratively, in: LWDA 2022 - Workshops: Special Interest Group on Knowledge Management (FGWM), Knowledge Discovery, Data Mining, and Machine Learning (FGKD) and Special Interest Group on Database Systems (FGDB), CEUR Workshop Proceedings, vol. 3341, pp. 173-186.

[9] I. Shubin, A. Kozyriev, Method for Solving Quantifier Linear Equations for Formation of Optimal Queries to Databases, in: Computational Linguistics and Intelligent Systems 2023, Proceedings of the 7th International Conference on Computational Linguistics and Intelligent Systems, vol. 449-459.

[10] Wu Aiyan, Zhang Yongmei, Yang Shang, A Method for Scientific Cultivation Analysis Based on Knowledge Graphs, in: 12th International Conference on Electronics, Communications and Networks, CECNet 2022, 2022.