Usage of Machine-based Translation Methods for Analyzing Open Data in Legal Cases

Nataliya Boyko[0000-1111-2222-3333], Lesia Mochurad[0000-0002-4957-1512], Uliana Parpan[0000-0003-1424-050X], Oleh Basystiuk[0000-0003-0064-6584]

Lviv Polytechnic National University, Lviv, Ukraine
nataliya.i.boyko@lpnu.ua, lesia.i.mochurad@lpnu.ua, uparpan35@gmail.com, obasystiuk@gmail.com

Abstract. Deep learning has completely changed approaches to machine translation. The first machine translation systems were based on rules; the next generation was based on statistics and probability theory. Today, research in deep learning has produced simple machine learning solutions that outperform the best expert systems. This paper overviews the main features of machine translation based on recurrent neural networks (RNN) for analyzing open data in legal cases. It also highlights the advantages of RNN-based systems that use the sequence-to-sequence model over statistical translation systems. Two machine translation systems based on the sequence-to-sequence model were built using the Keras and PyTorch machine learning libraries. Based on the obtained results, the libraries were analyzed and their performance was compared.

Keywords: machine translation, deep learning, recurrent neural networks, performance, keras, pytorch, sequence-to-sequence.

1 Introduction

Systems for machine translation of unstructured data from one language to another model the work of a human translator. Their quality depends on their ability to comprehend the grammar rules of the language. In translation, the main units are not single words but phrases or phraseological units that express various concepts. Only by using them can more complex ideas be expressed in the translated text [20]. The main feature of machine translation is that the input and the output may differ in length.
To work with inputs and outputs of different lengths, a recurrent neural network is used [1-6]. Initially, computer translation programs simply replaced words or phrases of one language with words or phrases of another. Such replacement, however, cannot provide a quality translation, because a quality translation requires identifying and recognizing words and whole phrases of the source language. Currently, multilingual ontological resources such as WordNet and UWN are used to handle collisions in translation.

Machine translation is a subfield of computational linguistics that studies software-based approaches to translating text between languages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CybHyg-2019: International Workshop on Cyber Hygiene, Kyiv, Ukraine, November 30, 2019.

At its most basic, machine translation replaces the words of one language with the words of another, but a translation made in this way is usually inadequate: to fully convey the meaning of a sentence and find the most suitable analog in the target language, it is often necessary to translate the whole phrase. Solving this problem with statistical and neural translation systems is a rapidly growing field that leads to improved translation, better handling of differences in linguistic typology, the translation of idioms, and the identification of anomalies [5,8].

Modern machine translation software allows its settings to be adjusted for a domain, that is, an industry or professional activity, for example meteorological reports. Limiting the scope of permissible substitutions yields a better translation result [10]. This method is especially effective in areas where formal or template-style language is used.
This means that machine translation is more effective for government and legal documents than for less standardized texts [7, 11]. The quality of the final result can also be improved through human intervention: for example, some systems provide a more accurate translation if the user indicates in advance the correct translation of some words in the text.

There are two fundamentally different approaches to constructing machine translation algorithms: rule-based and statistical. The first approach is traditional and is used by most machine translation system developers.

Rule-based machine translation (RBMT), the "classic approach" to MT, relies on linguistic information from monolingual, bilingual, or multilingual dictionaries and on the grammar rules of the source and target languages [13,15]. The system covers the basic semantic, morphological, and syntactic patterns of each language. Accordingly, to produce a translation, the system must first perform a morphological, syntactic, and semantic analysis of the text, and only then does it generate a sentence. The biggest disadvantage of rule-based translation is that, for a program to translate correctly, its database must contain all spelling variations of each word entry, and lexical selection rules must be written for all cases of ambiguity. Adaptation to new domains is in itself not a complicated process, because the basics of grammar are the same for all domains, and domain-specific settings are limited to adjusting lexical selection. Thus, rule-based machine translation is the classical method; it allows a better result to be obtained than the statistical method, but it synthesizes the translation more slowly [1,17].

Statistical machine translation is a type of text-based machine translation that is more effective when working with larger volumes of language pairs.
Language pairs are text data that contain sentences in one language and the corresponding sentences in another. Thus, statistical machine translation is capable of self-learning: the more language pairs are available to the program and the more accurately they correspond to each other, the better the result [2,19]. The term "statistical machine translation" refers to a general approach to the translation problem that is based on finding the most probable translation of a sentence using data obtained from a bilingual set of texts. An example of a bilingual set of texts is parliamentary reports, that is, minutes of debates in parliament. Bilingual parliamentary reports are issued in Canada, Hong Kong, and other countries; official documents of the European Economic Community are issued in 11 languages, and the United Nations publishes documents in several languages. These materials are therefore highly useful resources for statistical machine translation.

Such a system is based on the statistical calculation of the probability of matches. To translate, the program must have access to hundreds of millions of documents previously translated by humans. These documents serve as templates on the basis of which the system translates. The more documents there are, the higher the probability of a better translation [18, 20].

When it was launched in 2006, Google Translate was based on the statistical method of machine translation; its translation was of very low quality and was considered one of the worst translation options offered by an online translator. Today, Google uses the neural method of machine translation and competes seriously with commercial products that are not free.
The neural network approach is based on deep learning. Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader group of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Training can be supervised or unsupervised.

In recent years, hybrid machine translation (HMT) has become increasingly popular, and RNNs have become the main technology for implementing it. A recurrent neural network (RNN) is a class of artificial neural networks that has feedback connections between nodes, that is, connections from a more distant node to a less distant one. These connections allow an RNN to memorize and reproduce an entire sequence of reactions to one stimulus. From the programming point of view, such networks contain an analog of cyclic execution; from the systems point of view, they are equivalent to a finite-state machine. RNNs are generally used to handle sequences of words in natural language processing [14-17].

Word sequences are usually processed with the hidden Markov model (HMM) and the n-gram language model. A hidden Markov model is a statistical model of a process that resembles a Markov process with unknown parameters; the task is to estimate the unknown parameters from the observed ones. The obtained parameters can be used in further analysis. In an ordinary Markov model, the state is visible to the observer, so the transition probabilities are the only parameters. In an HMM, only the variables influenced by the state can be observed. Each state has a probability distribution over all possible output values. Therefore, the sequence of words generated by an HMM gives information about the sequence of states. An HMM can be considered the simplest dynamic Bayesian network.
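As a minimal sketch of how an observed word sequence carries information about hidden states, the following Python example runs the forward and Viterbi algorithms on a toy HMM. The two states, the three-word vocabulary, and every probability are invented for illustration; they are assumptions and do not come from the experiments in this paper.

```python
# Toy HMM: hidden part-of-speech states emit observed words.
# All states, words, and probabilities are illustrative assumptions.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {
    "NOUN": {"court": 0.5, "appeals": 0.1, "decision": 0.4},
    "VERB": {"court": 0.1, "appeals": 0.8, "decision": 0.1},
}


def forward(observations):
    """Return P(observations), summed over all hidden state paths."""
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for word in observations[1:]:
        alpha = {
            s: emit_p[s][word]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(observations):
    """Return the most probable hidden state sequence."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for word in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][word], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


words = ["court", "appeals", "decision"]
likelihood = forward(words)
tags = viterbi(words)
```

Here `forward` scores how likely the whole word sequence is under the model, while `viterbi` recovers the hidden states that best explain it, which is the sense in which the observed words inform us about the states.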
A Bayesian network is a graphical model in the form of a directed acyclic graph, each vertex of which corresponds to a random variable, while the arcs of the graph encode conditional independence relations between these variables. The vertices can represent variables of any type: weighted parameters, hidden variables, or hypotheses. There are effective methods for computing and learning Bayesian networks. Both exact and approximate algorithms are used for probabilistic inference in Bayesian networks [18-20].

2 Materials and Methods

At a high level, a recurrent neural network (RNN), shown in Fig. 1, processes data sequences, such as sentences, one element at a time while retaining a memory (called a state) of what has come previously in the sequence. Recurrent means that the output at the current time step becomes an input to the next time step. At each element of the sequence, the model considers not only the current input but also what it remembers about the preceding elements. The most popular cell nowadays is the LSTM (Long Short-Term Memory), which maintains a cell state as well as a carry to ensure that the signal (information in the form of a gradient) is not lost as the sequence is processed. At each time step, the LSTM considers the current word, the carry, and the cell state.

Fig. 1. Recurrent network loop.

The basic idea of an RNN is to use recursion to form a fixed-dimension vector from the input sequence of symbols. Assume that at step $t$ the vector $h_{t-1}$ holds the history of all previous words. The RNN calculates a new vector $h_t$ (its internal state), which combines all previous words $(x_1, x_2, \ldots, x_{t-1})$ and the new word $x_t$:

$h_t = \varphi_\theta(x_t, h_{t-1})$ (1)

Here $\varphi_\theta$ is a function parameterized by $\theta$ that receives the new word $x_t$ and the history $h_{t-1}$ of the words up to the $(t-1)$-th one. Initially, we can assume that $h_0$ is the zero vector.
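The recursion in Equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_recurrence` and `toy_phi` are hypothetical names, and `toy_phi` is a deliberately trivial stand-in for the learned function, chosen so that the values are easy to follow by hand.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def run_recurrence(phi: Callable[[Vector, Vector], Vector],
                   inputs: Sequence[Vector],
                   state_size: int) -> Vector:
    """Fold a variable-length sequence into one fixed-size state vector."""
    h = [0.0] * state_size      # h_0 is the zero vector
    for x_t in inputs:          # t = 1, 2, ..., T
        h = phi(x_t, h)         # h_t = phi(x_t, h_{t-1})
    return h


def toy_phi(x_t: Vector, h_prev: Vector) -> Vector:
    """Hypothetical stand-in for phi: average the input with the history."""
    return [0.5 * x + 0.5 * h for x, h in zip(x_t, h_prev)]


# Sequences of different lengths end up as states of the same dimension.
short = run_recurrence(toy_phi, [[1.0, 0.0]], state_size=2)
longer = run_recurrence(toy_phi, [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], state_size=2)
```

Whatever the length of the input, the result has `state_size` components, which is what lets a recurrent model handle inputs and outputs of different lengths.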
The recurrent activation function $\varphi_\theta$ is usually implemented as an affine transformation followed by a non-linear function:

$h_t = \tanh(W x_t + U h_{t-1} + b)$ (2)

Here $W$ is the input weight matrix, $U$ is the recurrence weight matrix, and $b$ is the bias vector. Note that this is not the only variant; there is wide scope for developing new recurrent activation functions [19].

Let us look in more detail at how text translation based on neural networks works. The idea of the algorithm is simple and consists of the following steps:

1. Encoding the input data in language A into a numerical data set;
2. Decoding the data set into language B.

Let's look at the algorithm for encoding unstructured data using an example text sentence, "Example of neural network":

Fig. 2. Visualization of input unstructured data encoding

After this simple operation, we obtain the encoded unstructured data: the text now looks like a numerical data set. At the initial stage of training, these numbers are random, because the algorithm generates them randomly. On a subsequent pass of a text that has already been encoded, the RNN will produce the same numerical data set. Decoding of unstructured data works like encoding in reverse: the input is a numerical data set, and the output is the probable text that corresponds to this data [7-12].

Now that we understand the essence of encoding and decoding of unstructured data, let's move on to the core of our task: machine translation and its general algorithm. To do this, we only have to combine the two RNNs, one for encoding and one for decoding, as shown in Fig. 3. Thus, we obtain a general way of transforming a sequence of Ukrainian words into an equivalent sequence of English words; this is the so-called sequence-to-sequence method of language translation.
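The two-step scheme above (encode, then decode) can be sketched in plain Python as follows. This is not the Keras or PyTorch model used in the experiments: the names `TanhCell`, `encode`, `decode`, and `out_proj`, the one-hot inputs, the `<eos>` marker, and greedy decoding are all assumptions made for illustration. The weights are random and untrained, so the decoded indices show only the data flow, not a real translation.

```python
import math
import random

random.seed(0)  # fixed seed so the random initial weights are reproducible


def build_vocab(sentences):
    """Map every distinct word to an integer index (the numerical encoding)."""
    vocab = {"<eos>": 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


class TanhCell:
    """Recurrent cell h_t = tanh(W x_t + U h_{t-1} + b)."""

    def __init__(self, input_size, hidden_size):
        self.W = rand_matrix(hidden_size, input_size)
        self.U = rand_matrix(hidden_size, hidden_size)
        self.b = [0.0] * hidden_size

    def step(self, x, h):
        wx, uh = matvec(self.W, x), matvec(self.U, h)
        return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, self.b)]


def encode(cell, indices, vocab_size, hidden_size):
    """Step 1: fold the source word indices into one context vector."""
    h = [0.0] * hidden_size
    for i in indices:
        h = cell.step(one_hot(i, vocab_size), h)
    return h


def decode(cell, out_proj, h, vocab_size, max_len):
    """Step 2: greedily emit the highest-scoring target index at each step."""
    result, prev = [], 0  # index 0 (<eos>) doubles as the start marker
    for _ in range(max_len):
        h = cell.step(one_hot(prev, vocab_size), h)
        scores = matvec(out_proj, h)
        prev = max(range(vocab_size), key=scores.__getitem__)
        if prev == 0:  # the model predicted end of sentence
            break
        result.append(prev)
    return result


source, target = "Приклад нейронної мережі .", "An example of a neural network ."
src_vocab, tgt_vocab = build_vocab([source]), build_vocab([target])
hidden = 8
encoder = TanhCell(len(src_vocab), hidden)
decoder = TanhCell(len(tgt_vocab), hidden)
out_proj = rand_matrix(len(tgt_vocab), hidden)

source_ids = [src_vocab[w] for w in source.split()]
context = encode(encoder, source_ids, len(src_vocab), hidden)
target_ids = decode(decoder, out_proj, context, len(tgt_vocab), max_len=10)
```

Training would adjust `W`, `U`, `b`, and `out_proj` so that `target_ids` maps back to the reference translation; in practice a library such as Keras or PyTorch handles that optimization.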
The main advantages of the method are [13-16]:

• this approach is limited only by the amount of training data and the computing power that can be allocated to the translation. Machine learning researchers invented this method only a few years ago, but such systems already work better than statistical machine translation systems, which have been developed over the last 20 years;
• the system does not depend on knowledge of any rules of the language. The algorithm itself derives these rules and constantly adapts them.

Fig. 3. Sequence-to-Sequence model.

3 Solution Analysis

Let's describe our dataset and how the data is collected. The first and most obvious way to collect data is to use open-source datasets, but this way is not always suitable, because such data can be noisy and obtaining highly accurate results from it may require a lot of resources in each particular case. The other way is to create our own dataset, which is better suited for building personalized solutions for any type of data. The main way to evaluate how noisy a dataset is is to calculate its entropy [17]:

$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \le \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log N$ (3)

where $p(X)$ is the probability of the value $X$ and $N$ is the number of possible values of $X$.

As Table 1 shows, the training data set consists of 10 phrases that are widely used in open data sources related to legal cases. We use this data to train and test our RNN-based models built on different ML libraries, and then evaluate the speed and accuracy of the models.

Table 1. Dataset for training the RNN

English | Ukrainian
An example of a neural network. | Приклад нейронної мережі.
Statement by the Chairman of the UN Security Council. | Заява голови Ради Безпеки ООН.
Accepted the application. | Прийняв заяву.
The court made an order. | Суд ухвалив рішення.
Veto the law. | Ветувати закон.
Support the resolution. | Підтримати резолюцію.
Appeal the decision. | Оскаржити рішення.
Call for the fulfillment of obligations. | Закликати до виконання зобов'язань.
Emphasizes the need for strict compliance with regulations. | Підкреслює необхідність суворого дотримання постанов.
Support all initiatives. | Підтримувати всі ініціативи.

Let's conduct experiments with two machine learning libraries for Python, PyTorch and Keras. The algorithm is based on the sequence-to-sequence learning method.

Table 2. Comparison of Keras and PyTorch libraries results

Library title | Learning time | Training loops | Loss coefficient | Translation accuracy
Keras | 4150 ms | 400 | 0.0027 | 100%
PyTorch | 5800 ms | 650 | 0.0021 | 100%

Let's look at these data in more detail:

• Learning time. The time it took to train the model. It depends mainly on the environment in which the script was run, that is, on the specifications of the PC: the computing power of the processor and its load from other processes.
• Training loops. The number of training cycles of the model. We set this value ourselves.
• Loss coefficient. The value of the loss of the trained model. It is a measure of how good the model is; lower values are better.
• Translation accuracy. The percentage of correctly translated sentences.

So, the model built on the Keras library was more efficient than the PyTorch model in terms of training time and number of training loops, while the PyTorch model reached a slightly lower loss coefficient (0.0021 against 0.0027). Because the training data set is small, both models show the maximum translation accuracy. With a larger training data set, the models are expected to produce completely different loss coefficients and translation accuracy: the training time and the loss coefficient will increase, and the accuracy will decrease.

4 Results

The article explains the main stages in the development of machine translation technologies and describes the main architectural solutions used in machine translation today. The advantages and disadvantages of several approaches, namely rule-based, statistical, and neural network-based, are described.
Considering all the factors, the most relevant organization and software approach for creating methods for analyzing open data in legal cases was chosen. Moreover, the design and software approach of two systems for the numerical encoding of unstructured data, based on different ML frameworks, was overviewed. For example, this solution is suitable for translating sentences from one language to another. For an RNN-based approach to language translation, the most popular ML libraries are Keras and PyTorch.

To perform the English-Ukrainian case study, we used English-Ukrainian and Ukrainian-English as base language pairs, as shown in Table 3. On this basis, we obtained the final comparison of the three approaches on the current data set.

Table 3. Comparison of MT approaches with English-Ukrainian and Ukrainian-English as base language pairs

English-Ukrainian MT systems | Adequacy | Fluency
Rule-based | 55.6% | 47%
Statistical | 77.2% | 87%
RNN | 98% | 96%

5 Conclusion

RNNs, like other classes of neural networks, are developing so fast that it is increasingly difficult to keep track of new, more interesting, and more sophisticated models for solving increasingly complex tasks. These sequence-based methods of training neural networks can be used in other areas, not only in machine translation. Simple examples are models that produce verbal descriptions of images, recognize voice, and maintain a conversation. In our opinion, the development of RNNs will lead to the emergence of smart assistants that can recognize the owner's voice and correctly understand the task. At the moment, RNNs are the models most frequently used in machine translation, and we think this field will also advance in the near future. According to the results of the experiment, the model based on the Keras library is more efficient for the current training data set.
Note that the research results may be considered relevant only for small data sets; translation quality and training time will change as the amount of training data increases. The next phase of this research may consist of training the models on large data volumes and analyzing and comparing the quality and speed of their work.

References

1. Gahegan, M.: On the application of inductive machine learning tools to geographical analysis, Geographical Analysis, vol. 32, pp. 113–139 (2000)
2. Zhang, C., Murayama, Y.: Testing local spatial autocorrelation using, Intern. J. of Geogr. Inform. Science, vol. 14, pp. 681–692 (2000)
3. Estivill-Castro, V., Lee, I.: Amoeba: Hierarchical clustering based on spatial proximity using Delaunay diagram, 9th Intern. Symp. on spatial data handling, pp. 26–41, Beijing, China (2000)
4. Kryvenchuk Y., Boyko N., Helzynskyy I., Helzhynska T., Danel R.: Synthesis control system physiological state of a soldier on the battlefield. CEUR. Vol. 2488. Lviv, Ukraine, p. 297–306. (2019)
5. Kang, H.-Y., Lim, B.-J., Li, K.-J.: P2P Spatial query processing by Delaunay triangulation, Lecture notes in computer science, vol. 3428, pp. 136–150, Springer/Heidelberg (2005)
6. Boehm, C., Kailing, K., Kriegel, H., Kroeger, P.: Density connected clustering with local subspace preferences, IEEE Computer Society, Proc. of the 4th IEEE Intern. conf. on data mining, pp. 27–34, Los Alamitos (2004)
7. Wang, Y., Wu, X.: Heterogeneous spatial data mining based on grid, Lecture notes in computer science, vol. 4683, pp. 503–510, B.: Springer/Heidelberg (2007)
8. Harel, D., Koren, Y.: Clustering spatial data using random walks, Proc. of the 7th ACM SIGKDD Intern. conf. on knowledge discovery and data mining, pp. 281–286, San Francisco, California (2000)
9. Turton, I., Openshaw, S., Brunsdon, C.: Testing spacetime and more complex hyperspace geographical analysis tools, Innovations in GIS 7, pp. 87–100, London: Taylor & Francis (2000)
10.
Boyko N., Pylypiv O., Peleshchak Y., Kryvenchuk Y., Campos J.: Automated document analysis for quick personal health record creation. 2nd International Workshop on Informatics and Data-Driven Medicine. IDDM 2019. Lviv. p. 208-221. (2019)
11. Kryvenchuk Y., Mykalov P., Novytskyi Y., Zakharchuk M., Malynovskyy Y., Řepka M.: Analysis of the architecture of distributed systems for the reduction of loading high-load networks. Advances in Intelligent Systems and Computing. Vol.1080. p.759-550. (2020)
12. Tung, A.K, Hou, J., Han, J.: Spatial clustering in the presence of obstacles, The 17th Intern. conf. on data engineering (ICDE’01), pp. 359–367, Heidelberg (2001)
13. Veres, O., Shakhovska N.: Elements of the formal model big date, The 11th Intern. conf. Perspective Technologies and Methods in MEMS Design, pp. 81-83, Polyana (2015)
14. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic sub-space clustering of high dimensional data, Data mining knowledge discovery, vol. 11(1), pp. 5–33 (2005)
15. Guimei, L., Jinyan, L., Sim, K., Limsoon, W.: Distance based subspace clustering with flexible dimension partitioning, Proc. of the IEEE 23rd Intern. conf. on digital object identifier, vol. 15. Iss. 20, pp. 1250–1254 (2007)
16. Aggarwal, C., Yu, P.: Finding generalized projected clusters in high dimensional spaces, ACM SIGMOD Intern. conf. on management of data, pp. 70–81 (2000)
17. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A Monte Carlo algorithm for fast projective clustering, ACM SIGMOD Intern. conf. on management of data, pp. 418–427, Madison, Wisconsin, USA (2002)
18. Kryvenchuk Y., Vovk O., Chushak-Holoborodko A., Khavalko V., Danel R.: Research of servers and protocols as means of accumulation, processing and operational transmission of measured information. Advances in Intelligent Systems and Computing. Vol.1080. p.920-934. (2020)
19.
Ankerst, M., Ester, M., Kriegel, H.-P.: Towards an effective cooperation of the user and the computer for classification, Proc. of the 6th ACM SIGKDD Intern. conf. on knowledge discovery and data mining, pp. 179–188, Boston, Massachusetts, USA (2000)
20. Peuquet, D.J.: Representations of space and time, N. Y.: Guilford Press (2002)
21. Guo, D., Peuquet, D.J., Gahegan, M.: ICEAGE: Interactive clustering and exploration of large and high-dimensional geodata, Geoinformatica, vol. 3, N. 7, pp. 229–253 (2003)
22. Boyko, N., Shakhovska, Kh., Mochurad, L., Campos, J.: Information System of Catering Selection by Using Clustering Analysis, Proceedings of the 1st International Workshop on Digital Content & Smart Multimedia (DCSMart 2019), pp. 94-106, Lviv, Ukraine (2019)
23. Boyko, N., Komarnytska, H., Kryvenchuk, Yu., Malynovskyy, Yu.: Clustering Algorithms for Economic and Psychological Analysis of Human Behavior, Proceedings of the International Workshop on Conflict Management in Global Information Networks (CMiGIN 2019), pp. 614-626, Lviv, Ukraine (2019)
24. Fedushko S., Syerov Yu., Tesak O., Onyshchuk O., Melnykova N. (2020) Advisory and Accounting Tool for Safe and Economically Optimal Choice of Online Self-Education Services. Proceedings of the International Workshop on Conflict Management in Global Information Networks (CMiGIN 2019), Lviv, Ukraine, November 29, 2019. CEUR-WS.org, Vol-2588. pp. 290-300. http://ceur-ws.org/Vol-2588/paper24.pdf
25. Yavorska T., Prihunov O., Syerov Yu. Efficiency of Using Social Networks in the Period of Library Activity in Remote Mode. CEUR Workshop Proceedings. Vol 2616: Proceedings of the 2nd International Workshop on Control, Optimisation and Analytical Processing of Social Networks (COAPSN-2020), Lviv, Ukraine, May 21, 2020. p. 214-226. http://ceur-ws.org/Vol-2616/paper18.pdf
26.
Boyko, N., Basystiuk, O.: Comparison Of Machine Learning Libraries Performance Used For Machine Translation Based On Recurrent Neural Networks, 2018 IEEE Ukraine Student, Young Professional and Women in Engineering Congress (UKRSYW), pp.78-82, Kyiv, Ukraine (2018).