Technological Features of Cross-Language Migration from PHP to Python of Software Products Working with Intensive Data Barakhnin Vladimir1, Kozhemyakina Olga2, Revun Artem3, Shashok Natalia4 Federal Research Center for Information and Computational Technologies, Novosibirsk, Russia 1 bar@ict.nsc.ru 2 olgakozhemyakina@mail.ru 3 pm@artemrevun.ru 4 n.shashok@g.nsu.ru Abstract. This article describes the methods of cross-language migration from the PHP programming language to the Python programming language on the ex- ample of modernization of the search module of the system for complex analysis of poetic texts. The choice of language was determined by the need of integration of the module in the system, and by the fact that Python has a large number of open libraries that make it possible to work with machine learning technologies, what is an advantage in the intensive data processing. The main problems of cross-language migration are considered and the methods to solve these problems are proposed. The proposed solutions concern the data security, the separation of module representation from the data model, the work with database via declara- tive mapping. In the process of cross-language migration, the search module was rewritten and all problems were solved, what allowed to integrate it into the in- formation system for complex analysis of poetic texts. Keywords: cross-language migration, PHP, Python, intensive data, text analy- sis information system, natural language processing, analysis of poetic texts. 1 Introduction In the practice of the development of the information systems, there are a number of problems associated with the need of the modernization of the source code of an appli- cation that is already in active usage. This can be caused, for example, by the migration of an information system into a new programming language. It is necessary to correctly reproduce the work of all parts of the old code so that there is no noticeable difference in the code execution time for the end user. This article describes the features of migration of a software product that works with intensive data from the PHP programming language to Python. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 411 The Python programming language, together with the Flask framework, is used to develop the applications in various scientific fields without any restrictions, for exam- ple, the creation of the framework for Covid-19 diagnostics [1] or the improvement of the user interface of the LHCb Software "control panel for nightly builds" [2]. In par- ticular, the choice of language and framework for the improvement of the interface of LHCb Software was not chosen by chance: the developers needed to provide the system support and to have clear, readable code, as well as to use the tools which are already used for developments earlier [3]. Similarly, given that the other modules in the system of complex analysis of poetic texts are written in Python, the cross-language migration with the usage of the Flask framework can be carried out to modernize the search mod- ule in the system. Let's describe the technology for solving this problem on the example of one of the modules of the information system for the automated complex analysis of poetic texts. The data on poetic texts which are extracted in the process of their multilevel analy- sis (phonetic, metro-rhythmic, strophic, lexical, etc. characteristics) in accordance with the approach described in [4], refer to intensive data, since they are extracted directly from the source texts, and are not obtained through the usage of hypothetical patterns expected from the data, and are also planned to be placed in the public domain. In ad- dition, according to [5], our system refers to the systems which are working with inten- sive data due to the constant expansion of the volume of accumulated data - both in connection with the expansion of the range of analyzed poetic texts, and in connection with the increase in the set of extracted characteristics. In accordance with the modern realities of the software development, there was a need to modernize this module, which works with the corpuses of poetic texts of large vol- ume. To achieve this goal, the following tasks were set: ─ to investigate the correctness of the considered software; ─ to detect the problems that exist in the initial code; ─ to substantiate the choice between updating and rewriting of the code; ─ to substantiate the choice of the final programming language; ─ to define the architecture of the application; ─ to develop a new version of the software product. 2 Description of the Problem The purpose of the development of the information system for the automated complex analysis of poetic texts [6], implemented at the FRC ICT, is to provide the philologists with a tool for the analysis of large-volume text corpuses. The described information system (IS) is a complex of separate modules, each of which is an independent software product that performs a specific task and, at the same time, is integrated into the system. One of the modules of the system, which is responsible for processing of big data in the database, was written using the PHP programming language, while the Bootstrap framework was used to organize the user interface. The code provided the solution to 412 all given tasks, namely, the interaction with the database and the generation of the user interface. PostgreSQL was chosen as the database management system. This relational DBMS, although not traditional when it is used in conjunction with PHP, has the ability to analyze the main linguistic objects (token/word/phrase/sentence/text), what is in de- mand when processing text information [7]. However, the further development of the system is carried out in the more universal programming language Python 3, what leads to the problem of the integration of the modules written in PHP with the new modules. During the analysis of the initial code, the following problems were found: ─ the complexity of the code, which causes a lot of internal dependencies, what leads to low scalability; ─ the inseparability of the visual representation of data and the data model; ─ critically low level of the security of IS. Thus, despite the fact that the modules written in PHP are functional, they need to be improved: if the system is functioning, but it is difficult to maintain it or expand it, then, according to general theoretical guidelines [8], it cannot be considered correct. 2.1 Approaches to Improving the Initial Code There are two approaches to improving the initial code: the refactoring and the com- plete rewriting. The refactoring is the improvement of the initial code without changing the result of its execution [9]. The periodic refactoring can make it easier to understand how the program works, and in some cases it helps to find the various errors, although the error detection is not its main purpose. Thus, the refactoring leads to the optimization of the previously written program, improves the quality of code readability, and allows to add the new features. However, the refactoring is a very time-consuming process and can take a long time. A complete rewriting of the code has the same effect as refactoring, but it is usually used either when the adding of new functionality within the old concept is unnecessarily time-consuming (for example, due to incompatible architectures), or when there is a need of the migration to another programming language. In addition, a complete rewrit- ing carries the risk of adding the new errors that may not be obvious. Also, regardless of the chosen approach, it is necessary to test the resulting product during the development process, what allows to find the errors in the operation of the modernized code and to solve them quickly, without having to return to the problem code after a while. To choose between two different approaches, it is necessary to analyze the initial code of the application for problems and their severity. After that, it is possible to make a conclusion about the need for refactoring or rewriting of the code. In addition, if there is a need to choose between different programming languages, the differences between the application deployment in probable programming lan- guages should be determined. 413 2.2 Analysis of the Inseparability of the Data Representation and Data Model from Each Other in the Initial Code Let's analyze the emerging problems in the initial code. The user interface of the module is written in HTML using the Bootstrap framework, which includes CSS and the JavaScript programming language, while some of the search fields contain dynamic parameters that the interface requests from the database during the initialization. Let's consider the part of the code under discussion that is responsible for the dy- namic construction of the "Author" field. The program code presented below makes a request to the database, and, returning the values, generates a field of the select type using them, which is then displayed as a drop-down list of authors.

Выберите автора из списка или воспользуйтесь поиском
This example shows that the work with the data is not separated from the formation of the interface. In particular, if there is a need to call the same query in another part of the system, for example, to further access to this data, it should be copied. Thus, the procedural architecture shown in the example leads to code redundancy, and, as a result, to the possible errors. 414 2.3 Analysis of the Security Level of the Search Module Another problem concerns the analysis of the level of data protection. The analysis of the protection level of the search module for poetic texts consists of the following actions: ─ the analysis of the initial code of the application; ─ the analysis of compiled SQL queries for the vulnerabilities such as SQL injection; ─ the implementation of an SQL injection attack. The SQL injection is an attack for which an attacker inserts the deleterious code into the text of a database request, what makes it possible to execute an unauthorized re- quest. In this way, a hacker can get stored data or cause damage to the database. The main reasons for the possibility of SQL injection are the incorrect processing of special characters and data types, as well as the dynamic query construction [10]. There are many ways to implement such attacks, as well as to protect against them. One of the main methods of leveling and preventing the potential threats is to check the correctness of the data which is entered from the client to the server. This method is implemented through the special characters escaping, the strict data type conversion, and the prepared or parameterized queries. However, during the analysis of the application source code, it was found that there is no verification of the correctness of the entered data, requests are made dynamically, and the data is transmitted to the DBMS in the form in which they were sent by the user, without any processing of special characters or restrictions on data types. To confirm the presence of the vulnerability, the attack was made by injecting "1 OR 1=1", entered in the filter field by metrorhythmic statistics, regardless of the se- lected metric. If the application works correctly, such a string must lead to no search results for all metrics, but instead, the entire corpus of poetic texts available in the da- tabase is displayed. Accordingly, the SQL injection attack is successful, and the user may receive incorrect data. 2.4 Differences between the Creation of Applications in PHP and Python Today, the Python programming language is more popular than the PHP programming language. This is indicated by the TIOBE index, where Python ranks second and PHP ranks ninth [11]. The TIOBE index is calculated based on key queries in the most pop- ular search engines: Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu. It is also worth noting that the TIOBE index does not indicate the best of the program- ming languages, but only reflects the activity of the global community that searches for information on each programming language. Both languages offer a large number of frameworks for development, what allows to achieve fast, secure and well-scalable applications. However, Python has a large number of open libraries that let to work with machine learning technologies in a web environment, what is an advantage when working with big data. 415 Python is a language with strong implicit dynamic typing, and PHP is a language with weak implicit dynamic typing. [12] Therefore, in the tasks where PHP allows to mix the different data types in expressions, it's necessary to look for other solutions for Python. Since Python is focused not only on Web development, for the interaction of a Py- thon program with a server, it is required to use the WSGI interaction standard [13], unlike PHP, where such technologies are not necessary. This technology is an interme- diate link between the web server and the application, what allows to process the infor- mation outside the application before sending it to the web server, what, in turn, can significantly improve the efficiency of the client-server application. PHP was not created as a programming language and was originally a scripting lan- guage embedded in html-pages, so the separation of data from the interface in PHP is a non-trivial task. In turn, Python is a general-purpose programming language, so the separation of data and interface is achieved by simple methods. 3 Modernization of the Initial Code of the Discussed Software Product Taking into account the above, the decision about the rewriting of the module in Python 3 was made. The rewriting of the module separates the data model from the module representa- tion, and also allows to solve the scalability problem and to use the data model in other tasks. The module retains its isolation, what improves the possibility of its usage in the further development of the system. As the basis of the system, the Flask framework [14] was chosen, which, using WSGI, provides simple interaction between the server and the application, has a built- in html template engine for creating the user interface, allows to evaluate the efficiency of the application using internal tools, is suitable for building an application that has various modules, and is also the project with open initial code. The strong side of the Python language is the number of libraries and tools available for it for both data processing and web programming, what distinguishes it from other languages, such as PHP. The strong side of the Flask web framework can be called its easiness, since it requires a small amount of code to get both a web application and a simple web-API from the functions and modules which are written in Python. [15] 3.1 Work of the Module with the Database The database that the described module works with is described in detail in the article [6] and does not need any changes. The task when the module is being rewritten is to combine the working database with the Python SQLAlchemy module, which uses the ORM approach. Let's consider the ER diagram of the poems database [16]. 416 Fig. 1. ER-diagram of the poems database. By initializing all the entities which are presented in the diagram as data models in the SQLAlchemy ORM, the universal tool for working with the database is get. This will allow to use the data from the database not only for search, but also for other purposes that may be required for the implementation of certain functions of the system. 3.2 Declarative Mapping of a Procedural Database in the Concepts of the SQLAlchemy Object-oriented Paradigm used in the Flask Software Package Since the tables in the database already exist, the task is to declare each table as a class in accordance with the SQLAlchemy syntax [17]. A base class is created, from which all classes created on the basis of the tables shown in the ER-diagram are inherited (see Figure 1). Next, the attributes are defined for each class according to the attributes of the corresponding table. The complexity of this declarative mapping is that for each author represented in the database, a set of tables is created according to its unique identifier, where the identifier serves as a postfix. In this case, the tables with the same structure store the data from different authors. The description of each such table as a class separately from each other is not an effective solution, because there is a code redundancy. It was decided to create a function with a numeric parameter that returns a class object depending on the specified parameter – the unique identifier of the author. This solution allows to avoid a redundancy, since there is no need to add a description of a new class to the database when a new table is being created, and it is possible to create the queries for each author dynamically without the need to modify the initial code of the program. 417 After declaring the classes, the tables can be used to display the information in the user interface, as well as to search the tables by the specified parameters. 3.3 User Interface Development Since the Flask software environment uses the Jinja2 template engine as a tool, the interface templates should be prepared. All dynamic elements in the template use the rules adopted in Jinja2, and the static part is designed to match all the rules for creating the web documents. During the template design process, at the analysis of the previous version of the interface, it was found that parts of the interface can be divided into separate templates. The separate documents were created, representing various parts that are common to most pages: the set of meta tags, two navigation panels (a general and for DB queries), and the error panel. If necessary, these templates are embedded in other html docu- ments. In particular, the set of meta tags and the general navigation panel are embedded in any html-documents, while the navigation panel for database queries is embedded in the template for searching the database, the template for viewing saved user queries, and the template for viewing all the works available in the database. In addition, the pages for user registration and authorization were rewritten. 3.4 Development of the Initial Code of Application The Flask software environment uses the concept of "blueprints" to create the various components of the application. The blueprints are used to expand the application, and they can also be applied to divide the code into logically related sets of operations. The operations in the described module can be divided into two sets: the functions for working with the data of the corpus in the database and the functions for working with the user's personal data. Based on this, the blueprints 'main' and 'auth' are written, respectively. The blueprint “main” is a set of the following operations: ─ poems_list – the output of all the poetic texts available in the database; ─ index – the output of the main page of the interface; ─ search – the output of the form used to make a query in the database for data sam- pling from the corpus; ─ search_result – the output of query results to the database; ─ save_search – the saving a specific DB query for the user; ─ saved_search – the output of all saved queries for the user. The blueprint "auth" combines the following operations: ─ signup_post – the creation of a user, ─ login_post – the authorization, ─ signup – the view of the user creation page, ─ login – the view of the authorization page, ─ signout – the exit application. 418 The save_search and saved_search operations are only available for authorized us- ers, just like in the old version of the application. The difference is that when the code was rewritten, the Flask Login library was used, which automates the process of creat- ing of a user session, and in the PHP version of the application, the session was created manually. At the attempt to open the pages corresponding to these operations without authorization, the user sees an error message and a request to log in. The search_result operation has been split into the separate functions, since the pro- cessing of a user-filled form and the building of a query are different operations. The function of the building of a query to the database is also divided into component parts, depending on which part of a query is being built, since the determination of the col- umns of the table from which the selection is made and the determination of the data filtering parameters can also be understood as two different actions. 3.5 Results The end result of cross-language migration of the application is a code which is com- pletely separate from the presentation of html pages, and which dynamically builds the queries to the database depending on the parameters set by the user. The application structure has been rebuilt (Figure 2). 4 Conclusion The technological features of cross-language migration between PHP and Python were identified on the base on the solved tasks. Compared to the PHP language, Python does not allow does not allow to embed the code in html pages, so if there are similar pages in the project, it becomes necessary to isolate the logic inherent in the embedded code. The setting up of the correct and secure database queries is done in simpler ways than in PHP, where it may be non-trivial. The Python scripts and the related frameworks may not be supported by default on the using web server, and if there is no opportunity to use another web server, it is necessary to make additional settings. There also may be the cases where the different data types are mixed in the initial code of a PHP application and the implicit but valid conversions occur. In such situa- tions, when an analog of the application in Python is being written, it is necessary to have a clear idea of what data is used in a particular situation, and to apply a secure data type conversion. This imposes the certain limitations related to error processing: it is necessary to convert all the data sent by the user. Acknowledgement. The study was carried with the support of the Russian Science Foundation (project No. 19-18-00466). 419 Fig. 2. The process of the work of the search module in the flow diagram after migration. References 1. Palimote, J., Ebenezer, O., Atu, L.: A Framework for Diagnosis of Covid-19 Infection using Deep Learning Approach. International Journal of Computer Applications in Technology, 10(2), 93–98 (2021). 2. Clemencic, M., Couturier, B., Kyriazi, S.: Improvements to the User Interface for LHCb's Software continuous integration system. Journal of Physics: Conference Series, 664 (2015). 420 3. Clemencic, M., Couturier, B.: A New Nightly Build System for LHCb. Journal of Physics: Conference Series, 513 (2014). 4. Steve K., et al: Data-intensive Science: A New Paradigm for Biodiversity Studies. BioSci- ence, 59(7), 613–620 (2009). 5. Michael F., Barbara R., Florian A.: On Testing Data-Intensive Software Systems. In: Secu- rity and Quality in Cyber-Physical Systems Engineering, pp. 129–148. Springer Interna- tional Publishing, Cham (2019). 6. Barakhnin, V., Kozhemyakina, O., Borzilova, Yu.: The Development of the Information System of the Representation of the Complex Analysis Results for the Poetic Texts. Vestnik NSU, Series: Information Technologies, 17(1), 5–17 (2019). 7. Barakhnin, V., Kozhemyakina, O., Mukhamediev, R., Borzilova, Yu., Yakunin, K.: The de- sign of the structure of the software system for processing text document corpus. Business Informatics, 13(4), 60–72 (2019). 8. Demeyer, S., Ducasse, S., Nierstrasz, O.: Object-Oriented Reengineering Patterns. 1st edn. Square Bracket Associates, Switzerland (2009). 9. Fowler, M., et al.: Refactoring: Improving the Design of Existing Code. 1st edn. Addison- Wesley, United States (1999). 10. Sokolin, D., Timokhovich, A.: Methods for Comprehensive Security of SQL Server Against SQL Injection Attacks. Academy, 3(19), 7–9 (2017). (in Russian) 11. TIOBE Index for May 2021, https://www.tiobe.com/tiobe-index/, last accessed 07 May 2021. 12. Purer, K.: PHP vs. Python vs. Ruby – The web scripting language shootout. In: Vienna Uni- versity of Technology, Institute of Computer Languages, Compilers and Languages Group, 185.307 Seminar aus Programmiersprachen (2009). 13. Smorodsky, I.: Using the WSGI Standard When Developing Web Application in Python. STUDNET, 3(5), 348–357 (2020). 14. Flask Documentation (1.1.x), https://flask.palletsprojects.com/en/1.1.x/index.html/, last ac- cessed 10 May 2021. 15. Aslam, F., Mohammed, H., Lokhande, P.: Efficient Way Of Web Development Using Py- thon And Flask. International Journal of Advanced Research in Computer Science, 6(2), 54– 57 (2015). 16. Barakhnin, V., Kozhemyakina, O., Borzilova, Yu.: The SQL Queries Optimization on the Example for the Search Module of the System for Complex Analysis of Literary Texts. Cloud of Science, 7(4), 749–763 (2020). 17. SQLAlchemy 1.4 Documentation, https://docs.sqlalchemy.org/en/14/, last accessed 10 May 2021. 421