Towards Russian Text Generation Problem Using OpenAI’s GPT-2
Oleksii Shatalov, Nataliya Ryabova
National University of Radio Electronics, Nauky av., 14, Kharkiv, 61000, Ukraine
Abstract
This work is devoted to the Natural Language Generation (NLG) problem. Modern approaches in this area based on deep neural networks are considered, in particular the most popular free software solutions for NLG built on the Transformer architecture with the pre-trained deep neural network models GPT-2 and BERT. The main problem is that most existing solutions are devoted to the English language, and only a few models are able to generate text in Russian. Moreover, the text they generate usually covers general topics rather than a specific subject area. The object of the study is the generation of contextually coherent narrow-profile text in Russian. Within the framework of the study, a model was trained to generate coherent articles of a given subject area in Russian, and a software application for interacting with it was developed.
Keywords
Natural Language Generation, Natural Language Processing, Transformers Architecture, Deep Learning, Transfer Learning, GPT-2
1. Introduction
The current rate of content growth is so high that organizations are beginning to fail to keep up with the pace they have set for themselves. Editors and copywriters do not have time to create new texts from scratch or to think over original ideas for new publications. Hiring a large additional staff can significantly increase the company's costs, which leads to lower profits. The second option is to reduce or merely maintain the speed of content production, which will also give negative results in the future, since the company will lose out to its competitors.
One way to solve the problem is to apply the latest artificial intelligence (AI), machine learning and deep learning technologies to this task, as well as to other Natural Language Processing (NLP) problems. Deep neural networks and their training have become a real breakthrough in solving basic AI problems, including NLP [1, 2, 3, 4]. This area of AI is developing rapidly, and separate directions have emerged within deep learning, such as generative deep learning and reinforcement learning, in which new deep neural network models are being developed that can solve traditionally complex AI problems faster and, most importantly, more efficiently [4]. The impressive results of deep neural networks are achieved largely thanks to modern tooling, such as the large-scale machine learning libraries TensorFlow and PyTorch with APIs for the Python language [5, 6, 7, 8, 9].
The main component of many neural language understanding and generation models is a pretrained word representation, proposed in [9, 10]. Word embeddings are the basis of deep learning for NLP. Word embeddings (word2vec, GloVe) are often pretrained on a text corpus from co-occurrence statistics. However, learning high-quality representations is in many cases a challenging task. Moreover, these word representations are applied in a context-free manner. The solution to this problem is to train contextual representations on a text corpus.
COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: oleksii.shatalov@nure.ua (O. Shatalov); nataliya.ryabova@nure.ua (N. Ryabova)
ORCID: 0000-0002-7267-6718 (O. Shatalov); 0000-0002-3608-6163 (N. Ryabova)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
In [12] the authors introduced a new type of deep contextualized word representation based on vectors from a bidirectional LSTM (Long Short-Term Memory) recurrent network. The authors called these representations ELMo (Embeddings from Language Models). Thus, deep neural network models are the most up-to-date and constantly evolving approach to solving many NLP and NLG problems.
The rest of the paper is organized as follows. The state of research and recent advances in deep learning for natural language generation are reviewed in Section 2. Section 3 gives a general description of the GPT model and describes test runs of the original model. Section 4 is devoted to the search for Russian-language resources with a large database of articles on technological topics, the development of a software script and the formation of the dataset. Section 5 contains a detailed description of the experiments, taking into account all technical details of the implementation. The experiments include two stages: model learning and model training. The most interesting parts of the experiments are training the model to generate whole texts and to generate article titles. In Section 6 the experimental results are analyzed. Section 7 considers the integration of the models with a web application and describes the main characteristics of the proposed application. Conclusions and perspectives for future work are discussed in Section 8.
2. Related Works
The neural text generation problem is analyzed in many works, for example [4, 13]. The authors describe the classic and recently proposed neural text generation models. The development of Recurrent Neural Network Language Models (RNNLMs) is discussed in detail with three training paradigms: supervised learning, reinforcement learning and adversarial training. In 2017, a new simple neural network architecture called the Transformer was proposed, based solely on the attention mechanism, without recurrence, i.e.
sequential calculations [14]. Transfer learning makes it possible to further train ready-made models. Today this technology is the most promising in deep learning and is used in the most advanced neural models for natural language text generation [15]. The next step was the demonstration that language models begin to learn NLP tasks without any explicit supervision when trained on a new dataset of millions of web pages called WebText [16, 17]. The authors, researchers from OpenAI, demonstrate how their largest model, GPT-2 (Generative Pre-Trained Transformer), works. It is a 1.5B-parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.
Recently, pretrained high-capacity neural language models have become increasingly important in natural language processing and generation. These include such deep neural networks as ELMo [12], BERT [18, 19, 20], GPT-2 and GPT-3 [15, 21, 22]. They are able to predict the next word in a sequence or a masked word anywhere in a given sequence. BERT (Bidirectional Encoder Representations from Transformers) is a neural network from Google that demonstrated the best results on a number of NLP tasks (machine translation, text analysis, chat bots, question answering, etc.). Google has released pre-trained BERT models, but they suffer from a lack of documentation. In essence, BERT is an improved version of the GPT network from OpenAI (bidirectional instead of unidirectional), also built on the Transformer architecture. BERT achieved the best results on almost all popular NLP benchmarks. Unlike BERT, another popular generative pretrained transformer, GPT-2, is designed to generate samples of synthetic text with a fully coherent narrative from any given beginning [15]. GPT-2 is also available for testing and experimentation.
Therefore, many researchers and practitioners in AI, NLP and deep learning are trying to solve their text generation problems using GPT-2. For example, copywriters and editors will be able to focus on editing texts or on writing them on ready-made topics provided by the model. For such tasks the Transformer architecture is used, which is able to capture context by processing the whole token chain at once. It should also be noted that many models trained for their own subject areas will be required, since at the moment there are no models that can generate coherent text equally well in several unrelated branches of human activity at once. The same remark applies to the multilingualism of a model.
Thus, it was decided to train GPT-2, a model from the non-profit company OpenAI, namely its medium-sized version with 355 million parameters, to generate texts in Russian about information technologies, blockchain and artificial intelligence. To train the model, transfer learning will be used, which allows additional training of ready-made models [21]. This technology is easy to apply to the chosen GPT-2 model [22]. Training a model with so many parameters requires a fairly extensive dataset in Russian. The preparation of the dataset is also described in the article.
3. GPT model overview
The GPT model is designed to predict the next word in a text; by repeating this operation it forms a complete coherent text with meaning, context, logical presentation and completeness of thought. It was originally designed to answer user questions. According to its developer, the non-profit company OpenAI, the product can be used in the future to help with speech recognition, editing of articles and publications, and control of storytelling quality. The creation of GPT (Generative Pretrained Transformer) models began with the announcement and release of GPT-1 in 2018.
At that time it was a breakthrough in text generation and in the use of transfer learning, which was then new. Nevertheless, because it was trained on a small amount of data, its performance left much to be desired. The first generation of GPT laid the foundation for continued research in this area. GPT-2, trained on more than 40 GB of data from 8 million web pages, impressed its own developers so much that the company initially released only a limited version, citing the risk of malicious use of the model to generate fake news, spam and more. GPT-2 was released in 2019, and work on GPT-3 began immediately afterwards. The third-generation GPT was made available only through a closed API for the same reason: the possibility of using it to do harm. We can also assume that the new model is so large that it cannot be distributed in the usual way. In addition, a fairly large amount of money, on the order of several million dollars, was spent on the creation of GPT-3. This is one of the key factors that prevent others from training it as effectively, even with knowledge of the mathematical apparatus behind the model.
3.1. Test runs of the original model
When starting and initializing the model, several questions had to be answered:
- What weights to use to work with the model
- What is the maximum length of the generated text
- What is the concept of "temperature" and how it affects the generation
- Consumed resources
To work with the model, weights for the PyTorch library are used, together with a special configuration file and an encoder model for storing tokens. The generated text can be of any length; however, the model's context window is 1024 tokens.
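The two generation settings raised in the questions above, the context window and the temperature, can be illustrated with a minimal sketch. It is not the GPT-2 implementation: `next_logits` is a hypothetical callable standing in for the model (it maps a list of token ids to one score per vocabulary entry), and the temperature is assumed to act in the common way, by dividing the scores before the softmax. Only the 1024-token window comes from the text.

```python
import math
import random

CONTEXT_WINDOW = 1024  # GPT-2 context size in tokens, as stated in the text


def clip_context(token_ids, window=CONTEXT_WINDOW):
    """Keep the most recent window-1 tokens so one new token still fits."""
    return token_ids[-(window - 1):]


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, prompt_ids, n_new, temperature=1.0, rng=None):
    """Sample n_new tokens one at a time.

    next_logits is a hypothetical stand-in for the model, not a real API.
    """
    rng = rng or random.Random(0)
    ids = list(prompt_ids)
    for _ in range(n_new):
        context = clip_context(ids)  # never feed more than window-1 tokens
        probs = softmax_with_temperature(next_logits(context), temperature)
        ids.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return ids
```

Under these assumptions the output can be longer than the window, but each new token is conditioned on at most the last 1023 tokens, and a higher temperature flattens the distribution so that less likely tokens are sampled more often.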
"Temperature" is a parameter that is adjusted during the generation of text by the model, it shows the degree of "madness" of the text, that is, how far the model can deviate from the examples set during training. Average consumed resources were calculated in the middle of Google Colaboratory after 100 launches of each of the original models. The results can be seen in Table 1. Table 1 Consumed resources Number of parameters Occupied disk space, GB RAM, GB GPU memory, GB GPT-2 124M 0.5 2.48 2.37 GPT-2 355M 1.42 3.46 4.14 GPT-2 774M 3.1 6.09 8.28 GPT-2 1558M 6.23 9.98 10.69 When the generation is started, an initial phrase is sent to the input, through which the context of the generated text is set. The minimum size is 1 token, the maximum is 1023 tokens. Based on observations, the larger the volume of the input phrase, the longer it takes to generate the text, it should also be noted that the increase in time is not linear. There were also several launches of all models with different input phrases of the "information technology" subject area. In Figure 1 below, you can see an example of the generated text for the input phrase "The future of machine learning". Figure 1: An example of the generated text by the original model GPT-2 124М The pre-trained model works reasonably well in English, generating grammatically correct texts while maintaining context. The 1.5 billion parameter model is expected to have more coherent text than the 124 million parameter model, and is about the same as the 774 million parameter model. But to run a larger model, more resources are needed, and they work longer. An interesting feature is that the network itself was able to generate, albeit non-existent, but valid links. Sometimes it can get stuck – repeating the same phrase. In Russian, a network of any size works very poorly – this is due to the fact that they were taught mainly in English. 
This can be verified in Figure 2, where the text "Все, что я могу сказать об этом действии, это то, что" ("All I can say about this action is that") was fed to the model input. There are several analogs: models pre-trained on Russian texts and capable of generating coherent texts on general topics. However, such a model often loops and is not able to generate adequate text on a specialized topic, because there were no corresponding texts in its training dataset. Text is generated in a style close to classical fiction. It is also worth noting that in some cases the model recognizes the text as lyrics and begins to continue it in blank verse. There is a noticeable improvement in the use of words in context, and no words appear that are absent from the explanatory dictionary. Compared with the previous experiment, there is also a noticeable improvement in the use of punctuation marks, including specific characters (for example, "?" and "!"), and in the generation of meaningful text. For further research, it was decided to take the so-called "average" model with 355 million parameters.
Figure 2: An example of generating text in Russian by the original model
4. Formation of a dataset
As follows from the above, for the model to work correctly it must be trained on a sufficiently large amount of text. Thus, the primary task before training was to search for Russian-language resources with large archives of articles on technological topics. As a result, a short list of such resources was compiled, with the approximate number of articles on each portal. The list is given in Tables 2, 3 and 4.
Table 2
List of portals for receiving thematic texts "Technologies"
Source name               Approximate number of publications
Populyarnaya Mekhanika    44000
Hightech                  10000
TJ                        5900
Rusbase                   3000
Techliga                  1000
Table 3
List of portals for receiving thematic texts "Machine learning"
Source name               Approximate number of publications
Habr                      8700
3D News                   800
ITC                       500
Robotics                  300
Korrespondent             200
IZ                        200
VC                        200
Table 4
List of portals for receiving thematic texts "Cryptocurrencies"
Source name               Approximate number of publications
ForkLog                   22000
ProBlockchain             21000
RBK                       20000
bits.media                1600
VC                        1200
Habr                      900
Given the number of articles and the approximate amount of text in them, it was decided to form the dataset from articles on the "Populyarnaya Mekhanika" and "ForkLog" portals. For further work, a software script was written that downloaded the monthly archives of the portals and saved them as HTML pages. To speed up the program, downloading was parallelized, which increased the page retrieval speed by a factor of 8. After the pages have been downloaded, they must be processed. We tried several options for processing the dataset. We got the heading under the