<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Towards Russian Text Generation Problem Using OpenAI's GPT-2</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleksii Shatalov</string-name>
          <email>oleksii.shatalov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliya Ryabova</string-name>
          <email>nataliya.ryabova@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deep Learning</institution>
          ,
          <addr-line>Transfer Learning, GPT-2</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky av., 14, Kharkiv, 61000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Natural Language Generation</institution>
          ,
          <addr-line>Natural Language Processing, Transformers Architecture</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work is devoted to the Natural Language Generation (NLG) problem. Modern approaches in this area based on deep neural networks are considered, in particular the most popular free software solutions for NLG built on the Transformers architecture with the pre-trained deep neural network models GPT-2 and BERT. The main problem is that most existing solutions target the English language, while few models are able to generate text in Russian; moreover, the text they generate usually belongs to a general topic rather than a specific subject area. The object of the study is the generation of contextually coherent, narrow-profile text in Russian. Within the framework of the study, a model was trained to generate coherent articles of a given subject area in Russian, and a software application for interacting with it was developed.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>GPT-2</kwd>
        <kwd>Natural Language Generation</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformers Architecture</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The current rate of growth of content is so great that organizations are beginning to fail to keep up
with their own set of speeds. Editors and copywriters do not have time to create new texts from
scratch, think over ideas for new publications so that they are original. Hiring a large staff of
additional staff can significantly increase the costs of the company, which will lead to lower profits.
The second option for solving the problem is to reduce or maintain the speed of content formation,
which will also give negative results in the future, since the company will be a loser in comparison
with competitors. One of the options for solving the problem is the use of the latest artificial
intelligence (AI), machine learning and deep learning technologies for such a task as well as others
related to Natural Language Processing (NLP) problems. Deep neural networks and their training has
become a real breakthrough in solving basic AI problems, including NLP [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. This area of AI
is rapidly developing, there are separate areas within deep learning, such as generative deep learning,
reinforcement learning, within which new
      </p>
      <p>
        modern models of deep neural networks are being
developed that can solve traditionally complex AI problems faster and, most importantly, more
efficiently [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The impressive results of deep neural networks are certainly achieved thanks to
modern information technologies, such as large-scale machine learning libraries TensorFlow,
PyTorch with API for Python language [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ].
      </p>
      <p>
        The main component of many neural language understanding and generating models is pretrained
word representation, proposed in [
        <xref ref-type="bibr" rid="ref9">9, 10</xref>
        ]. Word embeddings are the basis of deep learning for NLP.
Word embeddings (word2vec, GLoVe) are often pretrained on the text corpus from co-occurrence
statistics. But learning highquality representations in
many cases
is challenging task.
      </p>
      <sec id="sec-1-1">
        <title>Word representations are applied in a context free manner. So, the solution of this problem is train</title>
        <p>2021 Copyright for this paper by its authors.
contextual representations on text corpus. In the paper [12] authors introduced a new type of deep
contextualized word representation, they used vectors from bidirectional LSTM (Long Short-Term
Memory recurrent networks). This language model authors called ELMo (Embeddings from
Language Models) representations. So we can see that deep neural networks models are the most
upto-date and constantly evolving approach for solving many problems of NLP and NLG.</p>
      <p>The rest of the paper is organized in the following way. The state of research and recent advances in deep learning for natural language generation are reviewed in Section 2. In Section 3 a general description of the GPT model is given and test runs of the original model are described. Section 4 is devoted to searching Russian-language resources with a large database of articles on technological topics, developing a software script and forming the dataset. Section 5 contains a detailed description of the experiments, taking into account all technical details of the implementation. The experiments include two stages: choosing the computing device and training the models. The most interesting parts of the experiments are training the model to generate whole texts and to generate article titles. In Section 6 the experimental results are analyzed. In Section 7 the integration of the models with a web application is considered, and the main characteristics of the proposed web application are described. Conclusions and perspectives for future work are discussed in Section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        The neural text generation problem is analized in many works, for example [
        <xref ref-type="bibr" rid="ref4">4, 13</xref>
        ]. Authors
describe the classic and recently proposed neural text generation models. The development of
Recurrent Neural Networks Language Models (RNNLMs) discussed in detail with three training
paradigms: supervised learning, reinforcement learning, adversarial training. In 2017, a new simple
neural network architecture was proposed, called transformer, based solely on the attention
mechanism, without recurrence, i.e. sequential calculations [14]. Transfer learning technology allows
you to retrain ready-made models. For Today, this technology is the most promising in deep learning
and is used in the most advanced neural models for the generation of natural language texts [15]. The
next step was to demonstrate that language models begin to learn NLP tasks without any explicit
supervision when trained on a new dataset of millions of webpages called WebText [16, 17]. Authors
are researchers from OpenAI and they demonstrate how work their largest model GPT-2 (Generative
Pre-Trained Transformer). This is a 1.5B parameter Transformer that achieves state of the art results
on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.
      </p>
      <p>Recently, pretrained high-capacity neural language models have become increasingly important in natural language processing and generation. These include deep neural networks such as ELMo [12], BERT [18, 19, 20] and GPT-2/GPT-3 [15, 21, 22]. They are able to predict the next word in a sequence or a masked word anywhere in a given sequence. BERT (Bidirectional Encoder Representations from Transformers) is a neural network from Google that demonstrated the best results on a number of NLP tasks (machine translation, text analysis, chat bots, question answering, etc.). Google has released pre-trained BERT models, but they suffer from a lack of documentation. In essence, BERT from Google is an improved version of the GPT network from OpenAI (bidirectional instead of unidirectional), also built on the transformer architecture. BERT is the best in almost all popular NLP benchmarks. Unlike BERT, another popular generative pretrained transformer, GPT-2, is created for generating samples of synthetic text with a completely logical narrative, given any beginning [15]. GPT-2 is available for testing and experimenting, so many researchers and practitioners in AI, NLP and deep learning are trying to solve their text generation problems with it. For example, copywriters and editors could focus more on editing texts, or on writing them on ready-made topics that the model provides. For such tasks the Transformers architecture is used, which is able to perceive context by processing the whole token chain at once. It should also be noted that many models trained on their own subject areas will be required, since at the moment there are no models that could equally successfully generate coherent text in several unrelated branches of human activity at once. A similar remark applies to the multilingualism of the model.</p>
      <p>Thus, it was decided to train a model by the non-profit company OpenAI called GPT-2, namely the medium-sized model with 355 million parameters, to generate texts in Russian about information technologies, blockchain and artificial intelligence. To train the model, Transfer Learning will be used, which allows additional training of ready-made models [21]. This technology is easy to apply to the chosen GPT-2 [22]. For such training with so many parameters, a fairly extensive dataset in Russian is required. The preparation of the dataset is also described in this article.</p>
    </sec>
    <sec id="sec-3">
      <title>3. GPT model overview</title>
      <p>The GPT model is designed to predict the next word in a text, forming, as the operation is repeated, a complete coherent text with meaning, context, logical presentation and completeness of thought. It was originally designed to answer user questions.</p>
      <p>According to the developer, the non-profit company OpenAI, the product can be used in the future to help with speech recognition, editing articles and publications, and keeping control of storytelling quality.</p>
      <p>The creation of GPT (Generative Pre-trained Transformer) models began with the announcement and release of GPT-1 in 2018. At that time it was, of course, a breakthrough in the field of text generation and in the use of Transfer Learning technology, which was new at the time. Nevertheless, because it was trained on a small amount of data, its output left much to be desired. Still, the announcement of the first generation of GPT laid the foundation for continued research in this area.</p>
      <p>GPT-2, trained on more than 40 GB of data from 8 million web pages, impressed its own developers so much that the company initially released only a reduced version, citing possible malicious use of their brainchild to generate fake news, spam, and more. The full release of GPT-2 took place in 2019, and immediately afterwards work on GPT-3 began.</p>
      <p>For the same reason, the possibility of using it to do harm, the 3rd-generation GPT was put behind a closed API. We can also assume that the new model is so large that it cannot be distributed in the way we are used to. Among other things, a fairly large amount of money, on the order of several million dollars, was spent on the creation of GPT-3. This is one of the key factors that makes it impossible to train it as effectively even with full knowledge of the mathematical apparatus behind the model's structure.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Test runs of the original model</title>
      <p>When starting and initializing the model, several questions had to be answered:
• What weights to use to work with the model
• What the maximum length of the generated text is
• What the concept of "temperature" is and how it affects generation
• What resources are consumed</p>
      <p>To work with the model, the PyTorch weights are used, as well as a special configuration file and an encoder model for storing tokens. The maximum length of the generated text can be anything, but the model's context window is 1024 tokens. "Temperature" is a parameter adjusted during text generation; it expresses the degree of "madness" of the text, that is, how far the model may deviate from the examples seen during training. Average consumed resources were measured in the Google Colaboratory environment over 100 launches of each of the original models. The results can be seen in Table 1.</p>
      <sec id="sec-4-1">
        <title>Number of parameters</title>
      </sec>
      <sec id="sec-4-2">
        <title>Occupied disk space, GB RAM, GB</title>
      </sec>
      <sec id="sec-4-3">
        <title>GPU memory, GB</title>
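      <p>As an illustration of these generation parameters, a minimal sampling sketch is given below. It assumes the Hugging Face transformers port of the GPT-2 weights for PyTorch; the exact loading code used in the test runs may differ.</p>
      <preformat>
# Minimal sketch: sampling from pre-trained GPT-2. Assumes the Hugging Face
# "transformers" PyTorch port of the weights; model names may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")  # 355M-parameter model
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

prompt = "The future of machine learning"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=200,   # must stay within the 1024-token context window
        do_sample=True,
        temperature=0.8,  # how far the model may deviate from training examples
        top_k=40,         # the "window" of most probable next words
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
      </preformat>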
        <p>When generation is started, an initial phrase is sent to the input; it sets the context of the generated text. The minimum size is 1 token, the maximum is 1023 tokens. Based on our observations, the larger the input phrase, the longer it takes to generate the text; it should also be noted that the increase in time is not linear.</p>
        <p>All models were also launched several times with different input phrases from the "information technology" subject area. In Figure 1 below, you can see an example of the text generated for the input phrase "The future of machine learning".</p>
        <p>The pre-trained model works reasonably well in English, generating grammatically correct texts while maintaining context. As expected, the text of the 1.5-billion-parameter model is more coherent than that of the 124-million-parameter model, and about the same as that of the 774-million-parameter model. But a larger model needs more resources to run and takes longer to respond.</p>
        <p>An interesting feature is that the network was able to generate links that, although non-existent, are syntactically valid. Sometimes it gets stuck, repeating the same phrase.</p>
        <p>In Russian, a network of any size works very poorly, because the models were trained mainly on English. You can verify this in Figure 2, where the phrase "Все, что я могу сказать об этом действии, это то, что" ("All I can say about this action is that") was fed to the model input.</p>
        <p>There are several analogues: models pre-trained on texts in Russian and capable of generating coherent texts on general topics. But they often loop and cannot generate adequate text on a specialized topic, because there were no corresponding texts in the training dataset.</p>
        <p>Text generation is carried out in a style close to the works of fiction of classical literature.</p>
        <p>It is also worth noting that in some cases the model recognizes the text as lyrics and begins to extend it with blank verse.</p>
        <p>There is a noticeable improvement in the use of words in context, and made-up words absent from the explanatory dictionary no longer appear.</p>
        <p>At the same time, in comparison with the previous experiment, there is a noticeable improvement in the use of punctuation marks and specific characters (for example, "?" and "!"), in word choice, and in the generation of meaningful text.</p>
        <p>For further research, it was decided to take the so-called "average" model of 355 million parameters.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Formation of a dataset</title>
      <p>As follows from the above, for the model to work correctly it must be trained on a sufficiently large amount of text. Thus, the primary task before training the model was the search for Russian-language resources with a large database of articles on technological topics. After this research, a short list of portals was formed, with the approximate number of articles on each. The list can be found in Tables 2, 3 and 4.</p>
      <sec id="sec-5-1">
        <title>Approximate number of publications</title>
        <p>44000
10000
5900
3000
1000
Approximate number of publications
8700
800
500
300
200
200
200</p>
      </sec>
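      <p>A collection script of the following general shape can gather such articles. This is only a sketch of the approach, not the exact script used: the portal URL and the CSS selectors are placeholders, and each real portal needs its own.</p>
      <preformat>
import requests
from bs4 import BeautifulSoup

# Hypothetical portal URL; every real portal needs its own list URL.
PORTAL_URL = "https://example-tech-portal.ru/articles?page={}"

def collect_articles(pages):
    """Download article titles and bodies from a paginated article list."""
    articles = []
    for page in range(1, pages + 1):
        html = requests.get(PORTAL_URL.format(page), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for node in soup.select("article"):             # placeholder selector
            title = node.select_one("h2")
            body = node.select_one("div.article-body")  # placeholder selector
            if title and body:
                articles.append({"title": title.get_text(strip=True),
                                 "text": body.get_text(" ", strip=True)})
    return articles
      </preformat>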
    </sec>
    <sec id="sec-6">
      <title>5. Experiments to explore model training</title>
      <p>After the series of experiments described above, studies were carried out on the various devices on which the training took place.</p>
    </sec>
    <sec id="sec-7">
      <title>5.1. Choosing the main device for computing during training</title>
      <sec id="sec-7-1">
        <title>Test training of models was carried out on 3 computing devices:</title>
        <p> CPU
 GPU
 TPU</p>
        <p>On average, the learning rate on a GPU is 2 times higher than on the CPU, and the rate on the TPU is higher still. The Google Colab environment offers a TPUv2-8, which allows training to be split into 8 threads and, in theory, speeds up model training by up to 16 times compared to a GPU. Table 6 shows the elapsed training time for 1 epoch on different devices, based on measurements made during the experiments.</p>
        <p>Thus, after several test runs on different devices and measuring the elapsed time, it was decided to configure the server with a connection to a Google Cloud TPUv2-8.</p>
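        <p>A sketch of attaching a training job to the TPUv2-8 from Colab is shown below. It assumes the TensorFlow 2.x TPU workflow; the actual training scripts may wire this up differently.</p>
        <preformat>
import os
import tensorflow as tf

# The Colab runtime exposes the TPU address via this environment variable.
tpu_address = "grpc://" + os.environ.get("COLAB_TPU_ADDR", "")
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# A TPUv2-8 has 8 cores, so the strategy trains with 8-way data parallelism.
strategy = tf.distribute.TPUStrategy(resolver)
print("Replicas in sync:", strategy.num_replicas_in_sync)

# with strategy.scope():
#     model = build_model()   # model construction goes inside the scope
        </preformat>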
    </sec>
    <sec id="sec-8">
      <title>5.2. Model training</title>
      <p>As mentioned above, we decided to train 2 models: one only for generating text titles, and the second for generating whole texts, including the title. The second model was studied first.</p>
    </sec>
    <sec id="sec-9">
      <title>5.2.1. Train the model to generate whole texts</title>
      <p>In total, about 300 experiments were carried out with models of this type. We changed the markup of the texts and the learning rate, and tried different numbers of articles from each source. Ultimately, about 80% of the models suffered from "looping": a part of the text (most often a phrase or a sentence) was repeated several times, making it impossible to extend the content. A clear example of this can be seen in Figure 4.</p>
      <sec id="sec-9-1">
        <title>This loop is caused for several reasons:</title>
        <p> Model overfitting
 A small value of the "temperature" parameter, which is responsible for the probability
threshold for predicting the next word (accordingly, if the temperature is too high, then everything
that is generated will be incoherent text), is set during text generation
 A small "window" for choosing the most probable words, also set during text generation
During the tests, the most optimal values of the parameters above were formed:
 The number of epochs at which the generated text is human-readable is 1000
 The value of the "temperature" parameter was set to 0.8, since at lower values the model
began to "loop", and at higher values – to generate incoherent text
 The value of the "window" for taking the most probable subsequent words by the model was
set to 40</p>
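        <p>The interaction of these two parameters can be made concrete with a single sampling step. The following is a generic sketch of temperature plus top-k sampling, not the project's actual generation code: a low temperature or a small window concentrates probability on a few tokens, which is exactly what encourages looping.</p>
        <preformat>
import torch

def sample_next_token(logits, temperature=0.8, top_k=40):
    """One sampling step over next-token logits (unnormalized scores)."""
    logits = logits / temperature              # values below 1.0 sharpen the distribution
    top_values, top_indices = torch.topk(logits, top_k)
    probs = torch.softmax(top_values, dim=-1)  # renormalize inside the window
    choice = torch.multinomial(probs, num_samples=1)
    return int(top_indices[choice])
        </preformat>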
        <p>Also, during training, we saved and tested the models every 100 epochs, with a smaller step at small epoch numbers. These tests showed that the model had not yet learned to generate text on exactly the topic embedded in the dataset, and it began to make progress on that only after the 800th epoch of training.</p>
        <p>Because the volume of the generated text was quite small, while the chosen subject area is assistance to editors and copywriters, it was decided to filter the dataset by the number of words, with 3000 words taken as the threshold. This reduced the number of articles in the dataset to 31686. The results of testing the model confirmed our guess: the articles became longer and the coherence of the text inside them increased. An example can be seen in Figure 5.</p>
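        <p>The length filter itself is straightforward; a sketch is below, assuming the 3000-word figure is a lower bound on article length and that articles are stored as title/text records.</p>
        <preformat>
def filter_by_length(articles, min_words=3000):
    """Keep only articles long enough for full-length generation."""
    return [a for a in articles if len(a["text"].split()) >= min_words]

# Applied to the collected corpus, this left 31686 articles in the dataset.
        </preformat>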
        <p>Also, when applying Transfer Learning technology, we achieved improved results by first "unfreezing" as many layers of the model as possible and then gradually decreasing this number.</p>
    </sec>
    <sec id="sec-10">
      <title>5.2.2. Train the model to generate article titles</title>
      <p>During this work, we decided to proceed from the more general task to the narrower one: generating the title of an article on a given topic is a much narrower task than generating the entire text of an article. Thus, we reused here what was learned when training the model in the previous subsection.</p>
      <p>At first, a number of experiments were carried out retraining the original Russian-language model on the titles from the datasets presented above, but the model turned out to be more efficient when the ready-made article-generation model was retrained instead. That way the headings were already guaranteed to be on the given subject, and this additional training merely regulated the length of the generated text.</p>
      <p>Taking into account all the comments from the previous section, a dataset of titles was created. An
example of an excerpt from this dataset can be found in Figure 6.</p>
      <p>Although no service tokens are visible here, at the encoding stage the line feed character turns into a service token, by which the titles are separated both during training and at the postprocessing stage during generation.</p>
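      <p>A minimal sketch of how such a title dataset can be serialized and split back is shown below; the helper names are illustrative.</p>
      <preformat>
def write_title_dataset(titles, path):
    """One title per line; after BPE encoding the newline becomes the
    service token that separates titles during training."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(t.strip() for t in titles))

def split_generated_titles(generated):
    """Postprocessing: cut sampled text back into individual titles."""
    return [t.strip() for t in generated.split("\n") if t.strip()]
      </preformat>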
    </sec>
    <sec id="sec-11">
      <title>6. Experimental results</title>
      <p>To test the learning outcomes, the 2 networks were connected (the output of the title-generation model became the input of the article-generation model) and launched for iterative generation of 500 instances. In total, the process took about 2.5 days. Each final model weighed 1.5 GB and took some time to initialize at startup.</p>
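      <p>A sketch of this chaining is shown below. It reuses the generation parameters from Section 5 and the token budgets mentioned in Section 7 (100 tokens for titles, 500 for articles); the function names are illustrative, assuming the Hugging Face model interface.</p>
      <preformat>
import torch

def generate(model, tokenizer, prompt, max_new):
    """Sample a continuation of prompt with the settings from Section 5."""
    ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new, do_sample=True,
                             temperature=0.8, top_k=40,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def generate_instance(title_model, article_model, tokenizer, seed="\n"):
    """The title model's output becomes the article model's prompt."""
    title = generate(title_model, tokenizer, seed, max_new=100).strip()
    article = generate(article_model, tokenizer, title, max_new=500)
    return title, article
      </preformat>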
      <p>Nevertheless, the generated results were on-topic and easy for a person to understand, and 10% of all articles required almost no editing at all. Thus, the task of training such deep learning models was successfully completed. Examples of generated text are shown in Figure 7.</p>
      <p>The ability of the model system to generate special characters should also not be ignored. Figure 8 shows the ability to generate enumerated lists; the model is also capable of generating links.</p>
    </sec>
    <sec id="sec-12">
      <title>7. Integration of models with web application</title>
      <p>For ease of use, it was decided to develop a web application able to generate such articles from the user's input text, as well as an open REST API so that the application can be used by other applications. In Figure 9, you can see the use-case diagram. The application is capable of the following (see the API sketch below):
• Generating an article via the UI without entered text
• Generating an article via the UI with entered text (treated by the system as the title of the article)
• Generating titles via the REST API
• Generating an article by title via the REST API
• Generating an article without entered text via the REST API</p>
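      <p>A minimal sketch of such a REST layer is given below, assuming Flask; the endpoint paths and payload fields are illustrative, not the application's actual routes, and generate_titles / generate_instance_from are hypothetical helpers wrapping the model calls sketched in Section 6.</p>
      <preformat>
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/titles", methods=["POST"])
def titles_endpoint():
    count = int(request.json.get("count", 1))
    return jsonify({"titles": generate_titles(count)})   # wraps the title model

@app.route("/api/articles", methods=["POST"])
def articles_endpoint():
    seed = request.json.get("title", "")                 # empty: model invents one
    title, text = generate_instance_from(seed)           # wraps both models
    return jsonify({"title": title, "text": text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
      </preformat>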
      <p>The application can run on any Linux-based machine; the environment can be configured by installing the specified required libraries. The code also contains the internal logic for postprocessing the text after generating the content. The visual interface of the application can be seen in Figure 10.</p>
      <p>The application is a test one, and therefore it is very easy to overload the server: text generation will continue, but due to the resources consumed by a concurrent generation process, the speed of both will be reduced.</p>
      <p>By default, the number of tokens for generating full-size articles is 500 tokens, and for titles it is 100.</p>
    </sec>
    <sec id="sec-13">
      <title>8. Conclusions and Future Work</title>
      <p>Experiments were carried out on the selection of pre-trained models. They ended with the selection of a Russian-language model pre-trained on classical literature. Initial experiments were done in Google Colab.</p>
      <p>Next, a dataset was prepared: web pages were downloaded from the selected portals on IT topics and then processed, as described in the article. Thus, a volume of text sufficient for training the model on the given topic was provided.</p>
      <p>Further training was deployed on a Google Cloud TPU. Experiments were carried out training models on various datasets (changes in tags, number of articles, volume of text within an article), and some generation problems, such as looping, were solved. A web service for interacting with the model has also been developed.</p>
    </sec>
    <sec id="sec-14">
      <title>9. References</title>
      <p>[10] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NIPS, 2013.</p>
      <p>[11] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: EMNLP, 2014.</p>
      <p>[12] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365v2 [cs.CL], 22 Mar 2018.</p>
      <p>[13] S. Lu, Y. Zhu, W. Zhang, J. Wang, Y. Yu, Neural Text Generation: Past, Present and Beyond, arXiv preprint arXiv:1803.07133v1 [cs.CL], 15 Mar 2018.</p>
      <p>[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention Is All You Need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000-6010.</p>
      <p>[15] D. Rothman, Transformers for Natural Language Processing: Build innovative deep neural network architectures for NLP with Python, PyTorch, BERT, RoBERTa, T5, GPT-2, architecture of GPT-3, and much more, Packt Publishing Ltd, Birmingham, UK, 2021.</p>
      <p>[16] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding with unsupervised learning, Technical report, OpenAI, 2018.</p>
      <p>[17] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised Multitask Learners, Computer Science, 2019.</p>
      <p>[18] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv preprint arXiv:1810.04805v1 [cs.CL], 11 Oct 2018.</p>
      <p>[19] S. Ravichandiran, Getting Started with Google BERT, Packt Publishing Ltd, Birmingham-Mumbai, 2021.</p>
      <p>[20] J. Cage, Python Natural Language Processing (NLP) Exercises: From Basics to BERT, Amazon Kindle Edition, 2020.</p>
      <p>[21] S. Golovanov, R. Kurbanov, S. Nikolenko, K. Truskovskyi, A. Tselousov, T. Wolf, Large Scale Transfer Learning for Natural Language Generation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 28 - August 2, 2019, pp. 6053-6058.</p>
      <p>[22] P. Budzianowski, I. Vulic, Hello, It's GPT-2 - How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems, arXiv preprint arXiv:1907.05774v2 [cs.CL], 4 Aug 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <source>Deep Learning (Adaptive Computation and Machine Learning Series)</source>
          , The MIT Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <source>Neural Network Methods for Natural Language Processing</source>
          , Morgan&amp;Claypool Publishing,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <source>Neural Networks and Deep Learning</source>
          , Springer International Publishing AG,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Foster</surname>
          </string-name>
          , Generative Deep Learning. Teaching Machines to Paint, Write, Compose and Play,
          <string-name>
            <surname>O'Reilly Media</surname>
          </string-name>
          , Inc., USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bengfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bilbro</surname>
          </string-name>
          , T. Ojeda,
          <article-title>Applied Text Analysis with Python. Enabling Language-aware Data Products with Machine Learning,</article-title>
          <string-name>
            <surname>O'Reilly Media</surname>
          </string-name>
          , Inc., USA,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hobson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hannes</surname>
          </string-name>
          , Natural Language Processing in Action. Understanding, analyzing, and
          <article-title>generating text with Python</article-title>
          ,
          <source>Manning Publications Co</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ganegedara</surname>
          </string-name>
          ,
          <source>Natural Language Processing with TensorFlow. Teach language to machines using Python's deep learning library</source>
          , Packt Publishing Ltd, UK,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <source>Advanced Natural Language Processing with TensorFlow 2: Build effective real-world NLP applications using NER, RNNs, seq2seq models, Transformers, and more</source>
          , Packt Publishing Ltd, Birmingham, UK,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Natural Language Processing with PyTorch. Build Intelligent Language Applications Using Deep Learning</surname>
            ,
            <given-names>O</given-names>
          </string-name>
          <string-name>
            <surname>'Reilly Media</surname>
          </string-name>
          , Inc., USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>