Ukrainian Big Data: The Problem Of Databases Localization

Victoria Vysotska

victoria.a.vysotska@lpnu.ua 0

Ihor Shubin

igor.shubin@nure.ua 1

Maksym Mezentsev

maksym.mezentsev@nure.ua

Karen Kobernyk

karen.kobernyk@nure.ua

Grygoryy Chetverikov

0 Kharkiv National University of Radioelectronics, Nauky ave. 14, Kharkiv, 61166, Ukraine

1 Lviv Polytechnic National University, Stepan Bandera Street, 12, Lviv, 79013, Ukraine

2024

• • • • • • • • • ⁄( ( + ) 2 × 0,01) = (1)

To demonstrate the capabilities of the prototype, a dataset in the Russian language was selected. This choice is not accidental: it allows the system to be tested under conditions as close as possible to the real needs of the Ukrainian market. Translation from Russian into Ukrainian requires not only high accuracy and preservation of context but also a deep understanding of cultural and linguistic nuances. This poses complex tasks for the translation services used in this research, whose solution relies on recent advances in machine learning and natural language processing.

The dataset uploaded into the database contains around 140 000 rows with dependencies, spread over 3 tables, for short texts; a localized example can be seen in Table 1. It also contains about 70 000 rows of long paragraphs in a single table; a localized example can be seen in Table 2.

This dataset has previously been used to test English translation accuracy and is openly available under the name “RuBQ”; here we use it to test localization from Russian to Ukrainian.

The “Uid” column holds the unique id assigned to each record and can be used to link related columns from different tables. In our case, we created a relation between the “Question” table and the “Answer” table.

The “Question” column is a text field containing questions on different topics. Such data has to be translated correctly so that it remains understandable after translation; this makes it an indicator of logical mistakes made by translation services.

The “Answer” column has the same type as “Question”. It contains the answers to the questions stored in the previous column. It also has to be translated properly; otherwise the whole row in the database becomes invalid and unusable. These database relations are visualized in Figure 2.

The “Paragraph” column holds historical data stored under a specific id. Paragraph size varies, and paragraphs can be much longer than the short texts, up to 1000 symbols.

The visualized database is the MySQL instance used during the tests; it contains three databases. The databases have many-to-many relations, with a separate table storing the Ids that link them.
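The relation described above can be illustrated with a minimal sketch. It assumes SQLite (Python's built-in `sqlite3`) as a stand-in for MySQL so that it runs without a server; the table name `question_answer`, the column names, and the sample row are hypothetical and do not come from the experiment.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE question (uid INTEGER PRIMARY KEY, question TEXT NOT NULL);
CREATE TABLE answer   (uid INTEGER PRIMARY KEY, answer   TEXT NOT NULL);
-- Hypothetical link table: stores the Ids that connect the two tables.
CREATE TABLE question_answer (
    question_uid INTEGER NOT NULL REFERENCES question(uid),
    answer_uid   INTEGER NOT NULL REFERENCES answer(uid),
    PRIMARY KEY (question_uid, answer_uid)
);
""")

# Illustrative row only; not taken from the dataset.
conn.execute("INSERT INTO question VALUES (1, 'Яка столиця України?')")
conn.execute("INSERT INTO answer VALUES (10, 'Київ')")
conn.execute("INSERT INTO question_answer VALUES (1, 10)")


def answers_for(connection, question_uid):
    """Return all answers linked to a question through the link table."""
    rows = connection.execute(
        "SELECT a.answer FROM answer a "
        "JOIN question_answer qa ON qa.answer_uid = a.uid "
        "WHERE qa.question_uid = ? ORDER BY a.uid",
        (question_uid,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the link table holds only Ids, a mistranslated question or answer does not break the relation itself, but it does make the joined row semantically unusable.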

4. Experiment

A series of experiments was conducted to identify the database management system best suited to accommodating Ukrainian language-specific data.

The experiment involved acquiring a dataset of JSON data representative of language content that needs translation. This dataset was imported into a MySQL database to measure the time needed to parse large quantities of data, together with the time spent on translation and on data transfer to the translation services.

Key evaluation metrics include the percentage of translation mistakes, the single-row translation time, and the average time needed to perform a full translation and data-storing cycle. Through comprehensive experimentation and analysis, we aim to identify the architecture that best addresses the localization challenges of Ukrainian big data [10].
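As a rough illustration of how these metrics could be aggregated, the sketch below assumes that each translated row is recorded with its translation time, its storage time, and a mistake flag. The `RowResult` structure and the way a mistake is flagged are assumptions made for this example, not the procedure used in the experiment.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RowResult:
    """Hypothetical per-row measurement record."""
    translate_seconds: float  # time spent waiting for the translation service
    store_seconds: float      # time spent reading from and writing to the database
    has_mistake: bool         # assumed to be flagged during manual review


def summarize(results):
    """Aggregate per-row measurements into the three evaluation metrics."""
    if not results:
        raise ValueError("at least one row result is required")
    mistakes = sum(1 for r in results if r.has_mistake)
    return {
        "mistake_percent": 100.0 * mistakes / len(results),
        "avg_row_translate_seconds": mean(r.translate_seconds for r in results),
        "avg_full_cycle_seconds": mean(
            r.translate_seconds + r.store_seconds for r in results
        ),
    }
```

Keeping translation time and storage time separate makes it possible to see whether the service or the database dominates the full cycle.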

By understanding the strengths and limitations of different database solutions, database localization strategies can be addressed, contributing to the development of more efficient and effective data management practices tailored to Ukrainian language requirements.

The following steps were performed during the experiment:

1. Obtained JSON data from an open source.
2. Loaded the data into a MySQL cluster.
3. Filtered the text from the data.
4. Sent the chosen text to the translator.
5. Received the data back from the translator and saved it into a new instance.

A visualization of these steps is shown in Figure 3.
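The five steps can be sketched end to end as follows. This is a minimal sketch under stated assumptions: SQLite replaces the MySQL cluster so the code is self-contained, `translate_stub` is a hypothetical placeholder for a real translation service call, and the single `question` table with `uid` and `question` fields is a simplification of the actual schema.

```python
import json
import sqlite3
import time


def translate_stub(text):
    """Hypothetical placeholder for a call to a translation service."""
    return f"[uk] {text}"


def run_pipeline(raw_json, translate=translate_stub):
    """Run steps 1-5 on a JSON string; return the target DB and per-row timings."""
    records = json.loads(raw_json)                      # step 1: JSON data

    source = sqlite3.connect(":memory:")                # step 2: load into source DB
    source.execute("CREATE TABLE question (uid INTEGER PRIMARY KEY, question TEXT)")
    source.executemany("INSERT INTO question VALUES (:uid, :question)", records)

    target = sqlite3.connect(":memory:")                # the "new instance"
    target.execute("CREATE TABLE question (uid INTEGER PRIMARY KEY, question TEXT)")

    rows = source.execute(                              # step 3: keep non-empty text
        "SELECT uid, question FROM question "
        "WHERE question IS NOT NULL AND TRIM(question) <> '' ORDER BY uid"
    ).fetchall()

    timings = []
    for uid, text in rows:
        started = time.perf_counter()
        translated = translate(text)                    # step 4: send to translator
        target.execute(                                 # step 5: save the result
            "INSERT INTO question VALUES (?, ?)", (uid, translated)
        )
        timings.append(time.perf_counter() - started)
    target.commit()
    return target, timings
```

The per-row timings cover both the translator call and the write to the new instance, which corresponds to the full translation and data-storing cycle measured in the experiment.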

When choosing a database for the experiment, various parameters should be considered to find the most suitable one. Figure 4 shows a comparison of the most popular SQL databases. MySQL was chosen for several reasons:

• MySQL is the most popular database and is widely used in the software engineering world.
• MySQL is one of the best choices for performance and stability, and it can handle large amounts of data of any kind.
• It provides quick access to the data.
• It supports different types of queries and flexible data control.
• It offers convenient data management.

Overall, it is the best choice for our goals, since we have to load a large amount of data, and MySQL gives us easy control over the data and lets us perform any manipulations we need.

5. Results

The evaluated results cover 2 types of data and 6 translation services tested against the MySQL database. The results are split into subsections, one for each service used in the experiment, each with 2 tables: one for short data and one for long data.

5.1. Google Translate

5.2. Meta translator

5.3. Reverso

5.4. Onlinetranslator.eu

5.5. DeepL

5.6. Translate.ua

6. Discussion

Based on the achieved results, each translator has its own benefits, and the choice of service depends on the goal to be achieved. If API call speed is crucial, a suitable service should have good servers with low response time and a high symbol limit, as “Google Translate” or “Meta” do. On the other hand, the best choice for translation accuracy is a local translator, since local services are better adapted to the language environment and can show a low percentage of mistakes even with little context for the translated texts.
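The symbol limit matters most for long paragraphs, which may have to be sent in several requests. The sketch below shows one possible way to split a text so that each piece fits a limit, preferring sentence boundaries. The function name `split_for_limit` and the splitting strategy are assumptions for illustration; the source does not describe how long texts were submitted to the services.

```python
import re


def split_for_limit(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A higher limit means fewer requests per paragraph, which is one reason services with generous limits can complete the full cycle faster.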

In fact, the choice of database and its workload has minimal impact on the average row processing time, because we used a locally hosted solution; once database location, server load, and internet connection come into play, the time can increase significantly.

7. Conclusion

Upon discussing the Big Data field in Ukraine and the difficulties associated with localizing databases, several important topics come to light.

First, there are specific reasons for localizing databases in Ukraine. By respecting national interests and guaranteeing that sensitive information stays under national control, localizing data can improve data sovereignty, security, and regulatory compliance. However, putting successful database localization techniques into practice requires a strong technological foundation, technological investment, and adherence to global data standards.

Secondly, there is a particular connection between the localization of databases in Ukraine and more general problems of data privacy and data availability.

Thirdly, adherence to localization specifications could result in extra expenses and regulatory complications, which could affect corporate operations and innovation. Furthermore, data localization strategies may obstruct international collaboration and cross-border data flows, which would reduce prospects for global integration and economic growth.

In summary, resolving database localization issues in Ukraine requires a sophisticated strategy that strikes a balance between national and international concerns. Ukraine can successfully navigate the challenges of data localization and realize the full potential of its big data ecosystem for the good of society at large by adhering to the values of availability, accountability, and inclusion.

8. References

[1] Kopp, Orlovskyi, Orekhov, An Approach and Software Prototype for Translation of Natural Language Business Rules into Database Structure, CEUR Workshop Proceedings 2870 (2021) 1274-1291.

[2] Garcarz, Legal Language Translation: Theory behind the Practice, CEUR Workshop Proceedings, Vol-3171 (2022) 2-2.

[3] Shubin, Kozyriev, Liashik, G. Chetverykov, Methods of adaptive knowledge testing based on the theory of logical networks, in: Proceedings of the 5th International Conference on Computational Linguistics and Intelligent Systems, COLINS 2021, Lviv, Ukraine, 2021, pp. 1184-1193.

[4] Bur, M., & Stirewalt, K. (2022). ORM ontologies with executable derivation rules to support semantic search in large-scale data applications, Proceedings - ACM/IEEE 25th International Conference on Model Driven Engineering Languages and Systems, MODELS 2022: Companion Proceedings, p. 81.

[5] Kopp, Orlovskyi, Orekhov, An Approach and Software Prototype for Translation of Natural Language Business Rules into Database Structure, CEUR Workshop Proceedings 2870 (2021) 1274-1291.

[6] Kulik, A., Chukhray, A., & Havrylenko, O. (2022). Information Technology for Creating Intelligent Computer Programs for Training in Algorithmic Tasks. Part 1: Mathematical Foundations. System Research and Information Technologies, 2022(4), 27-41. doi: 10.20535/SRIT.2308-8893.2021.4.02

[7] Falatiuk, Shirokopetleva, Dudar, Investigation of architecture and technology stack for e-archive system, in: 2019 IEEE International Scientific-Practical Conference: Problems of Infocommunications Science and Technology, PIC S and T 2019 - Proceedings, p. 229.

[8] Herud, Baumeister, Testing Product Configuration Knowledge Bases Declaratively, in: LWDA 2022 - Workshops: Special Interest Group on Knowledge Management (FGWM), Knowledge Discovery, Data Mining, and Machine Learning (FGKD) and Special Interest Group on Database Systems (FGDB), CEUR Workshop Proceedings, vol. 3341, pp. 173-186.

[9] Igor Shubin, Andrii Kozyriev, Method for Solving Quantifier Linear Equations for Formation of Optimal Queries to Databases, in: Computational Linguistics and Intelligent Systems 2023, Proceedings of the 7th International Conference on Computational Linguistics and Intelligent Systems, vol. 449-459.

[10] Wu Aiyan, Zhang Yongmei, Yang Shang. (2022). A Method for Scientific Cultivation Analysis Based on Knowledge Graphs, in: 12th International Conference on Electronics, Communications and Networks, CECNet 2022.