=Paper=
{{Paper
|id=Vol-3688/paper9
|storemode=property
|title=Ukrainian Big Data: The Problem of Databases Localization
|pdfUrl=https://ceur-ws.org/Vol-3688/paper9.pdf
|volume=Vol-3688
|authors=Victoria Vysotska,Ihor Shubin,Maksym Mezentsev,Karen Kobernyk,Grygoryy Chetverikov
|dblpUrl=https://dblp.org/rec/conf/colins/VysotskaSMKC24
}}
==Ukrainian Big Data: The Problem of Databases Localization==
<pdf width="1500px">https://ceur-ws.org/Vol-3688/paper9.pdf</pdf>
<pre>
                         Ukrainian Big Data: The Problem Of Databases
                         Localization
                         Victoria Vysotska1, Ihor Shubin2, Maksym Mezentsev3, Karen Kobernyk4 and Grygoryy
                         Chetverikov5
                         1 Kharkiv National University of Radioelectronics, Nauky ave. 14, Kharkiv, 61166, Ukraine
                         2 Lviv Polytechnic National University, Stepan Bandera Street, 12, Lviv, 79013, Ukraine


                         COLINS-2024: 8th International Conference on Computational Linguistics and Intelligent Systems, April 12–13, 2024,
                         Lviv, Ukraine
                            victoria.a.vysotska@lpnu.ua (V. A. Vysotska); igor.shubin@nure.ua (I. U. Shubin); maksym.mezentsev@nure.ua
                         (M. A. Mezentsev); karen.kobernyk@nure.ua (K.S. Kobernyk); grigorij.chetverykov@nure.ua (G. G. Chetverykov)
                           0000-0001-6417-3689 (V. A. Vysotska); 0000-0002-1073-023X (I. U. Shubin); 0009-0004-0662-2078 (M. A.
                         Mezentsev); 0009-0009-9385-4454(K.S. Kobernyk); 0000-0001-5293-5842 (G. G. Chetverykov)
                                    © 2024 Copyright for this paper by its authors.
                                    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
                         \[ \frac{d}{\frac{b + c}{2} \times 0.01} = a \tag{1} \]


   To demonstrate the capabilities of the prototype, a dataset in the Russian language was
selected. This choice is deliberate: it allows the system to be tested under conditions as close
as possible to the real needs of the Ukrainian market. Translation from Russian into Ukrainian
requires not only high accuracy and preservation of context but also a deep understanding of
cultural and linguistic nuances. This poses complex tasks for the system, which the translation
services used in this research address with recent advances in machine learning and natural
language processing.
Figure 1: Maximum number of characters per call for the different translation services

   The dataset uploaded into the database contains around 140 000 rows with dependencies in
3 tables for short texts (a localized example is shown in Table 1) and about 70 000 rows of long
paragraphs in a single table (a localized example is shown in Table 2).
   This dataset has previously been used to test English translation accuracy and is openly
available under the name “RuBQ”; here we use it to test localization from Russian to Ukrainian.

Table 1
Data entry example for short texts
 Uid    Question                                                    Answer
 2      "Хто автор п'єси «Ромео і Джульєта»?"                       "Шекспір"
 3      "Як називається столиця Румунії?"                           "Бухарест"

   The “Uid” column holds the unique id assigned to a specific row; it can be used to connect
related rows from different tables. In our case, we made a relation between the “Question” table
and the “Answer” table.
   The “Question” column is a text field that contains questions on different topics. Such data
has to be translated correctly so that it is still understandable after translation; this makes it
a useful indicator of logical mistakes made by translation services.
   The “Answer” column has the same type as “Question”. It contains the answers to the
questions stored in the previous column. It also has to be translated properly, because otherwise
the whole row in the database becomes invalid and unusable. The database relations are
visualized in Figure 2.
Table 2
Data entry example for long texts
 Uid    Paragraph
 3098   Впродовж британського правління аж до отримання незалежності від Малайзії в
        1965 році Сінгапур і Республіка Китай мали дипломатичні відношення, які
        продовжилися і після проголошення незалежності.

   The “Paragraph” column holds historical text stored under a specific id. Paragraph length
varies and can reach up to 1000 characters.


Figure 2: First dataset database visualization

   The database shown is the MySQL instance used during the tests; it contains three tables.
The tables have many-to-many relations, with a separate table storing the pairs of ids that
connect them.
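
   The relation structure described above can be sketched as follows. This is a minimal
illustration only: it uses Python's built-in sqlite3 module in place of MySQL so that it runs
without a server, and the table and column names (question, answer, question_answer) are
assumptions, since the paper does not list the actual schema.

```python
import sqlite3

def build_schema(conn):
    # Two text tables plus a link table that stores id pairs (many-to-many).
    # All names here are hypothetical; the paper does not give its schema.
    conn.executescript("""
        CREATE TABLE question (uid INTEGER PRIMARY KEY, text TEXT NOT NULL);
        CREATE TABLE answer   (uid INTEGER PRIMARY KEY, text TEXT NOT NULL);
        CREATE TABLE question_answer (
            question_uid INTEGER NOT NULL REFERENCES question(uid),
            answer_uid   INTEGER NOT NULL REFERENCES answer(uid),
            PRIMARY KEY (question_uid, answer_uid)
        );
    """)

def answers_for(conn, question_uid):
    # Follow the link table from a question to all of its answers.
    rows = conn.execute(
        "SELECT a.text FROM answer a "
        "JOIN question_answer qa ON qa.answer_uid = a.uid "
        "WHERE qa.question_uid = ? ORDER BY a.uid",
        (question_uid,),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
build_schema(conn)
# Sample row taken from Table 1.
conn.execute("INSERT INTO question VALUES (?, ?)",
             (2, "Хто автор п'єси «Ромео і Джульєта»?"))
conn.execute("INSERT INTO answer VALUES (?, ?)", (2, "Шекспір"))
conn.execute("INSERT INTO question_answer VALUES (?, ?)", (2, 2))
print(answers_for(conn, 2))
```

   Keeping the id pairs in their own table means that a mistranslated question or answer can be
corrected in one place without touching the links between rows.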

4. Experiment
A series of experiments was conducted to determine the database management system best
suited to accommodating Ukrainian language-specific data.
   The experiment involved acquiring a dataset of JSON data representative of language content
needing translation. This dataset was imported into a MySQL database in order to measure the
time needed to parse large quantities of data, together with the time spent on translation and
on data transfer to the translation services.
   Key evaluation metrics include the percentage of translation mistakes, the single-row
translation time, and the average time needed to perform a full translation and data-storing
cycle. Through comprehensive experimentation and analysis, we aim to identify the architecture
that best addresses the localization challenges of Ukrainian big data [10].
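
   The metrics named above could be collected along the following lines. This is a hedged
sketch, not the measurement code used in the experiment: the function name evaluate, the
exact-match comparison against a reference translation, and the toy dictionary translator are
all assumptions made for illustration, because the paper does not state how a mistake was
detected.

```python
import time

def evaluate(rows, reference, translate):
    """Time each translation call and count rows that differ from the reference."""
    if not rows or len(rows) != len(reference):
        raise ValueError("rows and reference must be non-empty and of equal length")
    mistakes = 0
    total_seconds = 0.0
    for source, expected in zip(rows, reference):
        start = time.perf_counter()
        result = translate(source)
        total_seconds += time.perf_counter() - start
        if result != expected:  # assumption: a mistake is any mismatch
            mistakes += 1
    return {
        "mistakes_percent": 100.0 * mistakes / len(rows),
        "avg_row_seconds": total_seconds / len(rows),
    }

# Toy Russian-to-Ukrainian lookup standing in for a real translation service.
toy_dictionary = {"столица": "столиця", "город": "місто"}

def toy_translate(text):
    return toy_dictionary.get(text, text)

report = evaluate(
    ["столица", "город", "река"],
    ["столиця", "місто", "річка"],
    toy_translate,
)
print(report)
```

   In this toy run one of the three rows is left untranslated, so the mistake percentage is one
third; the timing value depends on the machine.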
   Understanding the strengths and limitations of different database solutions makes it possible
to address database localization strategies and to contribute to more efficient and effective
data management practices tailored to Ukrainian language requirements.
The following steps were performed during the experiment:
    1. Obtained JSON data from an open source
    2. Loaded the data into a MySQL cluster
    3. Filtered the text from the data
    4. Sent the chosen text to the translator
    5. Received the data back from the translator and saved it into a new instance
The steps performed during the experiment are visualized in Figure 3.
Figure 3: Visualization of steps during the experiment
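
   The five steps listed above can be sketched end to end as follows. This is an illustrative
outline under stated assumptions, not the authors' implementation: SQLite (via Python's built-in
sqlite3 module) stands in for the MySQL cluster, the record layout with uid and text keys is
hypothetical, and the translate argument is a placeholder for a call to a real translation
service.

```python
import json
import sqlite3

def run_pipeline(json_text, translate):
    # Step 1: parse the JSON data.
    records = json.loads(json_text)

    # Step 2: load it into the source database (SQLite stands in for MySQL).
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE paragraph (uid INTEGER PRIMARY KEY, text TEXT)")
    source.executemany(
        "INSERT INTO paragraph (uid, text) VALUES (?, ?)",
        [(r["uid"], r.get("text")) for r in records],
    )

    # Step 3: keep only rows that actually contain text.
    rows = source.execute(
        "SELECT uid, text FROM paragraph "
        "WHERE text IS NOT NULL AND TRIM(text) != '' ORDER BY uid"
    ).fetchall()
    source.close()

    # Step 4: send each text to the translator.
    translated = [(uid, translate(text)) for uid, text in rows]

    # Step 5: save the result into a new database instance.
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE paragraph (uid INTEGER PRIMARY KEY, text TEXT)")
    target.executemany(
        "INSERT INTO paragraph (uid, text) VALUES (?, ?)", translated
    )
    target.commit()
    return target

sample = json.dumps([
    {"uid": 1, "text": "привет"},
    {"uid": 2, "text": ""},
    {"uid": 3, "text": "мир"},
])
# str.upper is only a placeholder for a real translation call.
localized = run_pipeline(sample, str.upper)
result = localized.execute(
    "SELECT uid, text FROM paragraph ORDER BY uid"
).fetchall()
print(result)
```

   Writing the output to a separate instance, as in step 5, leaves the source data untouched,
so a failed or poor translation run can be repeated without reloading the original dataset.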

   When choosing a database for the experiment, we considered various parameters to find the
most suitable one. Table 3 compares several popular database systems. MySQL was chosen for
several reasons:
      • MySQL is one of the most popular databases and is widely used in the software
           engineering world.
      • MySQL is one of the best choices for performance and stability. It can handle large
           amounts of data of any kind.
      • It provides quick access to the data.
      • It supports different types of queries and flexible data control.
      • It offers convenient data management.
   Overall, it is the best choice for our goals, since we have to load a large amount of data and
MySQL gives us easy control over the data and lets us perform any manipulations we need.

Table 3
Databases comparison
 Parameter                      MongoDB      Firebase         MySQL
 Ukrainian Language Support     Yes          Yes              Yes
 Database Type                  NoSQL        NoSQL            Relational
 Performance                    Average      Average          High
 Scalability                    Average      High             High
 Data Management                Limited      Convenient       Convenient
 JSON Support                   Yes          Yes              Yes
 Community Support              Large        Large            Large
 Security Features              Limited      Comprehensive    Comprehensive
 Indexing Capabilities          Limited      Limited          Extensive
 Replication and Clustering     Limited      Limited          Comprehensive
 Data Integrity                 Limited      High             High
 Stored Procedure Support       No           No               Yes


5. Results
    The evaluated results cover 2 types of data and 6 translation services, all tested against
the MySQL database. The results are split into subsections, one per service, each with 2 tables:
one for short texts and one for long texts.

5.1. Google Translate
Table 4 shows the percentage of mistakes, the time for translating a single row of short texts,
and the average process time from parsing the data through translation to uploading the data
to the database.

Table 4
Results for Google service with short texts
 Average process time Time for translation of single row   API calls number        Mistakes %
 ~0.720 ms               ~0.187 ms                         854                     5.76%


Table 5 shows time used for long texts.

Table 5
Results for Google service with long texts
 Average process time Time for translation of single row   API calls number        Mistakes %
 ~0.419 ms               ~0.183 ms                         18969                   5.61%


5.2. Meta translator
Table 6 shows time used for short texts.

Table 6
Results for Meta translator with short texts
 Average process time Time for translation of single row   API calls number        Mistakes %
 ~2.11s                  ~0.620ms                          4270                    4.12%


Table 7 shows time used for long texts.
Table 7
Results for Meta translator with long texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~1.56s                  ~0.578ms                          94 845             3.72%


5.3. Reverso
Table 8 shows time used for short texts.

Table 8
Results for Reverso context service with short texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~1.19s                 ~0.691ms                           2135               4.16%


Table 9 shows time used for long texts.

Table 9
Results for Reverso context service with long texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~0.998ms               ~0.648ms                           47423              3.97%


5.4. Onlinetranslator.eu
Table 10 shows time used for short texts.

Table 10
Results for onlinetranslator.eu with short texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~2.55s                  ~0.791ms                          2135               4.21%


Table 11 shows time used for long texts.

Table 11
Results for onlinetranslator.eu with long texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~1.99s                  ~0.723ms                          47423              3.69%


5.5. DeepL
Table 12 shows time used for short texts.

Table 12
Results for DeepL service with short texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~8.72s                 ~4.94s                             854                2.21%


Table 13 shows time used for long texts.

Table 13
Results for DeepL service with long texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~7.88s                  ~5.22s                            18969              1.87%


5.6. Translate.ua
Table 14 shows time used for short texts.

Table 14
Results for translate.ua with short texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~5.92s                  ~2.32s                            4270               0.89%


Table 15 shows time used for long texts.

Table 15
Results for translate.ua with long texts
 Average process time Time for translation of single row   API calls number   Mistakes %
 ~5.34s                  ~2.19s                            94 845             0.72%


6. Discussion
Figure 4: Visualization of Average time / Mistakes chart

    Based on the results, each translator has its own benefits, and the choice of service
depends on the goal. If API call speed is crucial, a suitable service should have fast servers
with low response times and a high character limit per call, as with “Google Translate” or
“Meta”. On the other hand, the best choice for translation accuracy is a local translator: such
services are better adapted to the language environment and show a low percentage of
mistakes even with little context for the translated texts.
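
    The number of API calls in the result tables differs between services for the same data
(for example, 854 calls versus 4270 for the short texts), which is consistent with different
per-call character limits. A minimal sketch of how rows might be grouped into calls under such
a limit is given below; the function name pack_rows, the newline separator, and the greedy
strategy are assumptions for illustration and are not taken from the experiment.

```python
def pack_rows(rows, max_chars, sep="\n"):
    """Greedily group rows into batches whose joined length stays within max_chars."""
    batches = []
    current = []
    current_len = 0
    for row in rows:
        if len(row) > max_chars:
            raise ValueError("row is longer than the per-call limit")
        # A separator is only needed when the batch already holds a row.
        extra = len(row) + (len(sep) if current else 0)
        if current and current_len + extra > max_chars:
            batches.append(current)  # batch is full: start a new API call
            current = [row]
            current_len = len(row)
        else:
            current.append(row)
            current_len += extra
    if current:
        batches.append(current)
    return batches

rows = ["aaaa", "bbb", "cc", "dddddd"]
small_limit = pack_rows(rows, max_chars=8)
large_limit = pack_rows(rows, max_chars=16)
print(len(small_limit), len(large_limit))
```

    With the same four rows, the higher limit needs fewer batches, which is the effect a high
character limit has on the number of calls a service requires.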
    The choice of database and its workload had minimal impact on the average row process
time, because we used a locally hosted solution; once database location, server load, and
internet connection come into play, the time can increase significantly.

7. Conclusion
Upon discussing the Big Data field in Ukraine and the difficulties associated with localizing
databases, several important topics come to light.
    First, there are specific reasons for localizing databases in Ukraine. By respecting national
interests and guaranteeing that sensitive information stays under national control, localizing
data can improve data sovereignty, security, and regulatory compliance. However, putting
successful database localization techniques into practice requires a strong technological
foundation, technological investment, and adherence to global data standards.
    Secondly, there is a particular connection between the localization of databases in Ukraine
and more general problems of data privacy and data availability.
    Thirdly, adherence to localization requirements could result in extra expenses and
regulatory complications, which could affect corporate operations and innovation. Furthermore,
data localization strategies may obstruct international collaboration and cross-border data
flows, which would reduce prospects for global integration and economic growth.
    In summary, resolving database localization issues in Ukraine requires a sophisticated
strategy that strikes a balance between national and international concerns. Ukraine can
successfully navigate the challenges of data localization and realize the full potential of its
big data ecosystem for the good of society at large by adhering to the values of availability,
accountability, and inclusion.

8. References
[1] A. Kopp, D. Orlovskyi, S. Orekhov, An Approach and Software Prototype for Translation of
     Natural Language Business Rules into Database Structure, CEUR Workshop Proceedings
     2870 (2021) 1274-1291.
[2] M. Garcarz, Legal Language Translation: Theory behind the Practice, CEUR Workshop
     Proceedings, Vol-3171 (2022) 2-2.
[3] I. Shubin, A. Kozyriev, V. Liashik, G. Chetverykov, Methods of adaptive knowledge testing
     based on the theory of logical networks, in: Proceedings of the 5th International Conference
     on Computational Linguistics and Intelligent Systems, COLINS 2021, Lviv, Ukraine, 2021, pp.
     1184–1193.
[4] M. Bur, K. Stirewalt, ORM ontologies with executable derivation rules to support
     semantic search in large-scale data applications, in: Proceedings - ACM/IEEE 25th
     International Conference on Model Driven Engineering Languages and Systems, MODELS 2022:
     Companion Proceedings, 2022, p. 81.
[5] A. Kopp, D. Orlovskyi, S. Orekhov, An Approach and Software Prototype for Translation of
     Natural Language Business Rules into Database Structure, CEUR Workshop Proceedings
     2870 (2021) 1274-1291.
[6] A. Kulik, A. Chukhray, O. Havrylenko, Information Technology for Creating
     Intelligent Computer Programs for Training in Algorithmic Tasks. Part 1: Mathematical
     Foundations, System Research and Information Technologies 2022(4) (2022) 27-41.
     doi:10.20535/SRIT.2308-8893.2021.4.02
[7] H. Falatiuk, M. Shirokopetleva, Z. Dudar, Investigation of architecture and technology stack
     for e-archive system, in: 2019 IEEE International Scientific-Practical Conference: Problems
     of Infocommunications Science and Technology, PIC S and T 2019 – Proceedings, p. 229.
[8] K. Herud, J. Baumeister, Testing Product Configuration Knowledge Bases Declaratively, in:
     LWDA 2022 - Workshops: Special Interest Group on Knowledge Management (FGWM),
     Knowledge Discovery, Data Mining, and Machine Learning (FGKD) and Special Interest
     Group on Database Systems (FGDB), CEUR Workshop Proceedings, vol. 3341, pp. 173-186.
[9] I. Shubin, A. Kozyriev, Method for Solving Quantifier Linear Equations for Formation
     of Optimal Queries to Databases, in: Computational Linguistics and Intelligent Systems 2023,
     Proceedings of the 7th International Conference on Computational Linguistics and Intelligent
     Systems, pp. 449-459.
[10] Wu Aiyan, Zhang Yongmei, Yang Shang, A Method for Scientific Cultivation Analysis
     Based on Knowledge Graphs, in: 12th International Conference on Electronics,
     Communications and Networks, CECNet 2022, 2022.

</pre>