<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying Machine Learning to the Task of Generating Search Queries</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan Federal University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>84</fpage>
      <lpage>97</lpage>
      <abstract>
        <p>In this paper we study two modifications of recurrent neural networks, Long Short-Term Memory networks and networks with Gated Recurrent Units, with an attention mechanism added to both, as well as the Transformer model, applied to the task of generating search engine queries. GPT-2 by OpenAI was used as the Transformer and was fine-tuned on user queries. Latent semantic analysis was carried out to identify semantic similarities between the corpus of user queries and the queries generated by the neural networks. The corpus was converted into a bag-of-words format, the TF-IDF model was applied to it, and a singular value decomposition was performed. Semantic similarity was calculated based on the cosine measure. In addition, for a more complete evaluation of the applicability of the models to the task, an expert analysis was carried out to assess the coherence of words in the artificially created queries.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>natural language generation</kwd>
        <kwd>machine learning</kwd>
        <kwd>neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Natural language generation is the process of creating meaningful phrases and sentences
in natural language. Two main approaches can be distinguished among text generation
algorithms: rule-based methods and methods based on machine learning. The first approach
can achieve high-quality texts, but it requires knowledge of the rules of the language and is
time consuming to develop [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], while the second approach depends only on training data but often makes grammatical
and semantic errors in the created texts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Currently, text generation using neural networks is being actively researched;
one of the most popular approaches is recurrent neural networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The second leading
architecture is the Transformer model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These architectures were considered for the
task of generating search queries.
      </p>
      <p>The purpose of this article is to study the above-mentioned architectures and to analyze
their quality and applicability to this task. Automatically generated search queries are
relevant because most companies do not release their query logs freely, and a search engine
must be tested while it is being developed. The generated queries can also be used to
improve the efficiency of a search engine and to optimize it.</p>
      <p>
        Search queries from users of AOL (America Online), which were anonymously
posted on the Internet in 2006, were used in this paper. Although the company did not
identify its users, personal information was present in many queries [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which
companies now try to avoid. Algorithms have been proposed to help preserve user
anonymity, but the question is whether data that can be safely published is of practical
use. To solve this problem, it is proposed to use automatically generated queries.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Subject area overview</title>
      <p>Natural language text generation algorithms are actively studied and used in many
software systems, so at the moment there is a large amount of research in this area.</p>
      <p>
        One of the first approaches is the fill-in-the-gap template system. It is used for texts
that have a predefined structure and, when only a small amount of data needs to be filled
in, this approach can automatically fill in the blanks with data obtained from spreadsheets,
databases, etc. An example of this approach is Microsoft Word mail merge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The second step was to add general-purpose programming languages to the first
approach, supporting complex conditionals, loops, etc. This approach is more powerful
and useful, but its limited linguistic capabilities make it difficult to create systems that
can generate high-quality texts.</p>
      <p>
        The next step in the development of template-based systems is the addition of
word-level grammatical functions that deal with morphology and spelling. Such functions
greatly simplify the creation of grammatically correct texts. Next, systems dynamically
create sentences from representations of the values they need to convey. This means
that systems can handle unusual cases without the need to explicitly write code for each
case, and are significantly better at generating high-quality "micro-level" writing.
Finally, in the next stage of development, systems can generate well-structured
documents that are relevant to users. For example, a text that needs to be persuasive can be
based on models of argumentation and behavior change [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        After moving from templates to dynamic text generation, it took a long time to
achieve satisfactory results. If we consider natural language text generation a
subfield of natural language processing, then there are a number of well-developed
algorithms – Markov chains [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], recurrent neural networks, long short-term memory
networks and the Transformer model. There are text generation tools based on these
methods, for example commercial Arria NLG PLC, AX Semantics, Yseop and others,
as well as open source programs Simplenlg, GPT, GPT-2, BERT, XLNet.
      </p>
      <p>
        Also, the use of generative adversarial networks for text generation is currently being
researched, since they show excellent results in the task of generating images [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Data collection</title>
        <p>User queries in English from the 2006 AOL search engine logs were selected as data for
training the neural networks. Researchers try to avoid using this data in their work, as it
can be considered privacy-sensitive, but this paper uses only the texts of the queries
themselves, without user or website identifiers, that is, without using personal
information. The initial data are presented in the form shown in Fig. 1.</p>
        <p>Queries longer than 32 words and erroneous queries containing no information were
removed from the corpus. Duplicate queries and queries containing website names have
also been removed, as they are not natural language examples. In total, 100 thousand
queries were randomly selected for training. Fig. 2 shows examples of data after
preprocessing.</p>
        <p>The queries were separated into character tokens, each character was assigned a
natural number, and the entire corpus was encoded using this dictionary.</p>
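        <p>A minimal sketch of this character-level encoding step might look as follows; the example queries and the reserved padding index are illustrative assumptions, not the authors' code.</p>
        <preformat>
# Character-level encoding: split queries into characters, map each character to a
# natural number, and encode the corpus with that dictionary (illustrative sketch).
queries = ["how to bake bread", "weather in new york"]   # toy examples

chars = sorted({ch for q in queries for ch in q})
char2idx = {ch: i + 1 for i, ch in enumerate(chars)}      # 0 is reserved for padding

encoded = [[char2idx[ch] for ch in q] for q in queries]
print(encoded[0][:10])
        </preformat>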
      </sec>
      <sec id="sec-2-2">
        <title>Recurrent networks</title>
        <p>
          Recurrent neural networks (RNN) are a family of neural networks where the
connections between the elements form a directed sequence [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. They can use their internal
memory to process sequences of arbitrary length, and they are also good at identifying
the dependencies between tokens.
        </p>
        <p>
          However, recurrent networks learn slowly, and their ability to memorize long
dependencies is limited due to the vanishing gradient problem [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          Two types of recurrent networks that are most often used for natural language text
generation were implemented – the Long Short-Term Memory network [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
and the Gated Recurrent Unit [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Studies have shown that these types of networks have
comparable accuracy and, depending on the task, one network can be more accurate
than the other.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Long short-term memory networks</title>
      <p>
        Long short-term memory network (LSTM) is a deep learning system that avoids the
vanishing and exploding gradient problems [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. LSTM networks can memorize
significantly longer sequences of characters. They use gates, which are internal
mechanisms that can control information flow. Fig. 3 shows the standard form of the LSTM
cell.
      </p>
      <p>Each cell has three gates: an input gate, an output gate and a forget gate. The forget
gate vector is calculated as</p>
      <p>f_t = σ(W_f x_t + U_f h_{t-1} + b_f),</p>
      <p>where x_t is the input vector, h_{t-1} is the output vector of the previous cell, σ is the
sigmoid function, and W_f, U_f, b_f are the weight matrices and the bias vector.</p>
      <p>Next, the input gate and the candidate value of the cell state are calculated:</p>
      <p>i_t = σ(W_i x_t + U_i h_{t-1} + b_i),</p>
      <p>ĉ_t = tanh(W_c x_t + U_c h_{t-1} + b_c).</p>
      <p>Then the new value of the cell state is calculated:</p>
      <p>c_t = f_t ∘ c_{t-1} + i_t ∘ ĉ_t,</p>
      <p>where c_{t-1} is the state of the previous cell. Finally, an output vector decides what the
next hidden state should be:</p>
      <p>o_t = σ(W_o x_t + U_o h_{t-1} + b_o),</p>
      <p>h_t = o_t ∘ tanh(c_t).</p>
      <p>The results are passed to the next cell.</p>
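      <p>To illustrate these equations, a small numpy sketch of one LSTM cell step is given below; the weights are random placeholders rather than trained parameters.</p>
      <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 8, 16                                    # illustrative sizes
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(d_h, d_in)) * 0.1 for g in "fioc"}
U = {g: rng.normal(size=(d_h, d_h)) * 0.1 for g in "fioc"}
b = {g: np.zeros(d_h) for g in "fioc"}

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_hat                                # new cell state
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    h = o * np.tanh(c)                                        # new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h))
      </preformat>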
    </sec>
    <sec id="sec-4">
      <title>Gated Recurrent Unit</title>
      <p>
        The second implemented model is a network with Gated Recurrent Unit (GRU), which
is a new generation of recurrent neural networks, similar to a long short-term memory
network [
        <xref ref-type="bibr" rid="ref11">11</xref>
          ]. However, compared to the LSTM, this type of network has fewer
parameters, and therefore these models are trained faster. The GRU has only two gates: an update
gate and a reset gate. Fig. 4 shows a standard GRU cell.
      </p>
      <p>The update gate acts as the input and forget gates of the LSTM and is calculated as</p>
      <p>z_t = σ(W_z x_t + U_z h_{t-1} + b_z).</p>
      <p>The reset gate is calculated as</p>
      <p>r_t = σ(W_r x_t + U_r h_{t-1} + b_r).</p>
      <p>The output vector of the GRU cell is calculated as</p>
      <p>h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ tanh(W_h x_t + U_h (r_t ∘ h_{t-1}) + b_h).</p>
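      <p>A corresponding numpy sketch of one GRU cell step, following the update and reset gate equations above (random placeholder weights, illustrative only), is shown below.</p>
      <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 8, 16                                    # illustrative sizes
rng = np.random.default_rng(1)
W = {g: rng.normal(size=(d_h, d_in)) * 0.1 for g in "zrh"}
U = {g: rng.normal(size=(d_h, d_h)) * 0.1 for g in "zrh"}
b = {g: np.zeros(d_h) for g in "zrh"}

def gru_step(x_t, h_prev):
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])             # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])             # reset gate
    h_hat = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])   # candidate state
    return z * h_prev + (1.0 - z) * h_hat                            # new hidden state

h = gru_step(rng.normal(size=d_in), np.zeros(d_h))
      </preformat>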
    </sec>
    <sec id="sec-5">
      <title>Attention mechanism in recurrent neural networks</title>
      <p>
        Attention mechanism is a technique used in neural networks to identify dependencies
between parts of input and output data [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        The attention mechanism allows a model to determine the importance of each word for
the prediction task by weighting the words when creating the text representation. The
approach with a single parameter per input channel was used [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]:
      </p>
      <p>e_t = h_t W_a,</p>
      <p>a_t = exp(e_t) / Σ_{i=1..T} exp(e_i),</p>
      <p>v = Σ_t a_t h_t.</p>
      <p>Here h_t is the representation of the word at time step t, W_a is the weight matrix of the
attention layer, a_t are the attention importance scores for each time step, and v is the
representation vector of the text.</p>
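      <p>A hedged Keras sketch of such a layer (one weight per input channel, a softmax over time steps, and a weighted sum of the hidden states) is given below; it approximates, rather than reproduces, the AttentionWeightedAverage layer used in the implementation described in the next subsection.</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers

class AttentionWeightedAverage(layers.Layer):
    """Computes e_t = h_t W_a, a = softmax(e), v = sum_t a_t h_t."""
    def build(self, input_shape):
        # one weight per input channel (the hidden dimension)
        self.w = self.add_weight(name="attention_w",
                                 shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):
        # h has shape (batch, time, channels)
        e = tf.squeeze(tf.matmul(h, self.w), axis=-1)          # scores per time step
        a = tf.nn.softmax(e, axis=-1)                          # attention importance scores
        v = tf.reduce_sum(h * tf.expand_dims(a, -1), axis=1)   # text representation vector
        return v
      </preformat>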
      <sec id="sec-5-1">
        <title>Implementation of recurrent networks</title>
        <p>
          All neural networks were implemented in Python 3.7 in Google Colab [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], since it
has the ability to use GPUs, which significantly decrease the training time of the
models. For the implementation of neural networks, we chose the Keras library [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], which
is a high-level API on top of TensorFlow. This library greatly simplifies the development
of neural networks, since it already has ready-made implementations of the main layers,
activation and loss functions. The Adam optimizer (Adaptive Moment Estimation [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ])
is used. It is an algorithm in which the learning rate is adjusted for each parameter.
Also, the Learning Rate Scheduler function [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is used as a callback, which allows
the learning rate to be computed by a user-defined function at each epoch.
        </p>
        <p>The general architecture of the model is shown in Fig. 5.</p>
        <p>
          The preprocessed data was divided into training and validation sets, which
constituted 80% and 20% of the corpus, respectively. The training data is fed into the
Embedding layer, which converts numbers into vectors that reflect the correspondence
between the character sequences and the projections of those sequences. The resulting
representations are input to the first LSTM layer (GRU), its output is passed to the
second LSTM (GRU) layer, and then to the third in the same way. Next, the outputs of
the Embedding layer and these three layers are combined and fed to the
AttentionWeightedAverage layer. The representation vector obtained from the attention layer is
a high-level encoding of the entire text, which is used as input to the final fully
connected layer with Softmax activation for classification [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>To check how well the model has trained at a particular epoch, the categorical
cross-entropy loss function is calculated as</p>
        <p>L(y, ŷ) = - Σ_i Σ_j y_{ij} log(ŷ_{ij}),</p>
        <p>where ŷ are the predicted values and y are the true values.</p>
        <p>We conducted experiments with changing the number of LSTM layers (GRU) in the
model (2 and 3 layers), as well as adding a Dropout layer after the Embedding layer,
which randomly excludes a given fraction of neurons to prevent the network from
overfitting and to help the model generalize better. Networks with three recurrent layers and Dropout
performed better.</p>
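        <p>Putting these pieces together, a hedged Keras sketch of the overall model from Fig. 5, in the configuration with three recurrent layers and Dropout, is shown below. The layer sizes, dropout rate and learning-rate schedule are illustrative assumptions rather than the reported hyperparameters, and the AttentionWeightedAverage class is the approximation sketched in the previous section.</p>
        <preformat>
from tensorflow.keras import layers, models, optimizers, callbacks

VOCAB_SIZE = 100    # number of distinct characters (assumption)
SEQ_LEN    = 32     # maximum query length in characters (assumption)

inp = layers.Input(shape=(SEQ_LEN,))
emb = layers.Embedding(VOCAB_SIZE, 64)(inp)
emb = layers.Dropout(0.2)(emb)                       # Dropout after the Embedding layer
rnn1 = layers.LSTM(128, return_sequences=True)(emb)  # replace LSTM with GRU for the GRU model
rnn2 = layers.LSTM(128, return_sequences=True)(rnn1)
rnn3 = layers.LSTM(128, return_sequences=True)(rnn2)

merged = layers.concatenate([emb, rnn1, rnn2, rnn3])
v = AttentionWeightedAverage()(merged)               # custom layer sketched earlier (assumption)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(v)

model = models.Model(inp, out)
model.compile(optimizer=optimizers.Adam(), loss="categorical_crossentropy")

def schedule(epoch, lr):
    return lr * 0.95                                 # illustrative decay function
lr_cb = callbacks.LearningRateScheduler(schedule)
# model.fit(x, y, validation_split=0.2, epochs=30, callbacks=[lr_cb])
        </preformat>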
        <p>
          Bidirectional models of these networks were also trained. The bidirectional recurrent
neural network is a model proposed in 1997 by Mike Schuster and Kuldip Paliwal [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ],
which allows the context of a word to be considered not only to the left of it but also to
the right of it in the sequence. A general view of bidirectional neural networks is shown in Fig. 6.
        </p>
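        <p>In Keras, this amounts to wrapping each recurrent layer, for example (the layer size is an assumption):</p>
        <preformat>
from tensorflow.keras import layers
# the wrapped layer reads the character sequence in both directions
bi_lstm = layers.Bidirectional(layers.LSTM(128, return_sequences=True))
        </preformat>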
        <p>In the case of the generating search queries task, the bidirectional model has shown
itself to be better than the unidirectional one; the obtained values of the loss function
after training the models for 30 epochs are shown in Table 1.</p>
        <sec id="sec-5-1-1">
          <title>Loss Validation Loss</title>
          <p>The value of the loss function decreased on the training data; however, on the validation
data the improvement was less significant, which suggests that in this task the bidirectional
model does not learn general patterns so much as it “memorizes” the sequences of symbols.</p>
          <p>With the help of the implemented models, queries with different "temperatures" were
generated. The temperature is a parameter that affects the chance of choosing an unlikely character.</p>
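          <p>A small sketch of how such temperature sampling can be applied to the softmax output is given below; the probabilities are illustrative.</p>
          <preformat>
import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    # rescale the predicted distribution before drawing the next character:
    # higher temperatures make unlikely characters more probable
    logits = np.log(np.asarray(probs) + 1e-9) / temperature
    scaled = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(probs), p=scaled)

probs = [0.6, 0.3, 0.1]
print(sample_with_temperature(probs, temperature=0.5))   # conservative sampling
print(sample_with_temperature(probs, temperature=1.5))   # more adventurous sampling
          </preformat>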
      </sec>
      <sec id="sec-5-2">
        <title>Transformer</title>
        <p>
          Transformer is a deep learning model, which was introduced in 2017 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. A general
view of its architecture is shown in Fig. 7.
        </p>
        <p>Transformers consist of stacks of equal numbers of encoders and decoders. Encoders
process the input sequences and encode information about them and their
characteristics. Decoders do the opposite: they process the information received from
the encoder and generate output sequences. All encoders have the same structure and
consist of two layers: self-attention and feed-forward neural network. The input
sequence being fed into the encoder first passes through the layer of internal attention,
which helps the encoder to look at other words in the input sentence while encoding a
particular word. The output of this layer is sent to the feedforward neural network. The
same network is applied independently to each word. The decoder also contains these
two layers, but in between there is an extra layer of attention that allows the decoder to
identify the relevant parts of the input sentence.</p>
        <p>Internal attention allows the model to see dependencies between the word being
processed and the other words in the input sequence, which helps to encode the word better.</p>
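        <p>For illustration, a compact numpy sketch of the scaled dot-product self-attention at the core of this mechanism is given below; it shows a single head with random matrices, whereas the real model uses multiple heads and learned projections.</p>
        <preformat>
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 5, 8                        # illustrative sizes
rng = np.random.default_rng(2)
Q = rng.normal(size=(seq_len, d_k))        # queries
K = rng.normal(size=(seq_len, d_k))        # keys
V = rng.normal(size=(seq_len, d_k))        # values

scores = Q @ K.T / np.sqrt(d_k)            # how strongly each word attends to each other word
attn = softmax(scores, axis=-1)
output = attn @ V                          # context-aware representation of each position
        </preformat>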
        <p>
          After all the decoders, a fully connected Softmax layer is used, which converts the
obtained values into probabilities, from which the largest value is then selected, and the
word corresponding to it becomes the output for this time step.
        </p>
        <p>
          GPT-2 is a large language model based on the Transformer, created by the non-profit
company OpenAI, with parameters ranging from 117 million to 1.5 billion, trained on a
dataset of 8 million web pages [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. GPT-2 learns with a simple goal: to predict the
next word given all the previous words in some text.
        </p>
        <p>GPT-2 is built using only decoder blocks, which have the same structure as the
Transformer model described above.</p>
        <p>
          GPT-2 does not use words as input but tokens obtained using the Byte Pair Encoding
(BPE) method. It is a data compression technique in which the most common pairs of
consecutive bytes of words are replaced by bytes that do not appear in those words [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
This method provides a balance between character and word representations, which
allows it to cope with large corpora of data.
        </p>
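        <p>As a toy illustration of a single BPE merge step (the classic "aaabdaaabac" example, not GPT-2's actual byte-level code), consider the following sketch.</p>
        <preformat>
from collections import Counter

word = list("aaabdaaabac")
pairs = Counter(zip(word, word[1:]))
best = max(pairs, key=pairs.get)          # ('a', 'a') is the most frequent adjacent pair

merged = []
i = 0
while len(word) > i:
    if len(word) > i + 1 and (word[i], word[i + 1]) == best:
        merged.append("Z")                # "Z" stands in for a new, unused symbol
        i += 2
    else:
        merged.append(word[i])
        i += 1
print("".join(merged))                    # ZabdZabac
        </preformat>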
        <p>Internal attention in GPT-2 also uses masking, which blocks information from
tokens to the right of the position that is being calculated.
</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>GPT-2 implementation</title>
      <p>A medium size GPT-2 model with 345 million parameters was used, consisting of 24
decoder blocks.</p>
      <p>The model was further trained (fine-tuned) on the corpus of English search queries that
was also used to train the recurrent neural networks. Search queries were then generated
using the resulting model.</p>
      <p>We used the model implementation available at https://github.com/nshepperd/gpt-2.
The model was trained for 1000 steps.
</p>
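      <p>For readers who want to reproduce the generation step, a hedged sketch using the Hugging Face transformers library (a different implementation from the nshepperd/gpt-2 code used in this work) is given below; the prompt and sampling parameters are illustrative, and fine-tuning on the query corpus is omitted.</p>
      <preformat>
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")   # the 345-million-parameter model

inputs = tokenizer("how to", return_tensors="pt")        # illustrative prompt
outputs = model.generate(**inputs, max_length=16, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
      </preformat>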
      <sec id="sec-6-1">
        <title>Latent semantic analysis</title>
        <p>Latent Semantic Analysis (LSA) is a natural language processing technique for
analyzing dependencies between collections of documents and the terms they contain [24].</p>
        <p>
          This method uses a term-document matrix that describes the frequency of occurrence
of terms in a collection of documents. The elements of such a matrix can be weighted,
for example, using TF-IDF: the weight of each element of the matrix is proportional to
the number of times the term occurs in the given document, and inversely proportional to
the number of times the term occurs in all documents of the collection. After compiling
the term-document matrix, its singular value decomposition is carried out, i.e. it is
represented as A = U S V^T, where the matrices U and V are orthogonal, and S is a diagonal
matrix whose values are called the singular values of the matrix A. This decomposition
reflects the basic structure of dependencies present in the original matrix, allowing
noise to be ignored [
          <xref ref-type="bibr" rid="ref24">25</xref>
          ].
        </p>
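        <p>A tiny numpy illustration of this decomposition and of its truncation is given below; the counts and the number of retained singular values are arbitrary.</p>
        <preformat>
import numpy as np

A = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [1., 1., 0.]])                      # toy term-document counts
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U S V^T
k = 2                                             # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # low-rank, noise-reduced approximation
        </preformat>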
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Implementation of latent semantic analysis</title>
      <p>
        To carry out latent semantic analysis, the gensim library for Python was used [
        <xref ref-type="bibr" rid="ref25">26</xref>
        ]. We
created a corpus of 10,000 documents containing human-written reference search queries.
Frequently occurring function words of the English language (prepositions, articles) and
words that occur only once were then removed from it, since they do not help to calculate
the semantic relationship between documents. Using the Dictionary class of the gensim
library, a dictionary was created with words and their indices, then using the doc2bow
method of this class, all documents were represented in a bag-of-words format. The TF-IDF
model was applied to the resulting data corpus, and the LsiModel class performed
singular value decomposition. The queries generated by the neural networks were
tokenized and, using the dictionary created on the reference corpus, transformed into the
bag-of-words format. Finally, using the MatrixSimilarity class, semantic similarities
between these corpora were calculated using the cosine measure.
      </p>
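      <p>A hedged sketch of this gensim pipeline is given below; the file name, the filtering thresholds and the number of LSI topics are assumptions rather than the parameters used in the experiments.</p>
      <preformat>
from gensim import corpora, models, similarities

reference_docs = [line.lower().split() for line in open("reference_queries.txt")]

dictionary = corpora.Dictionary(reference_docs)
dictionary.filter_extremes(no_below=2)            # drop words that occur only once

bow_corpus = [dictionary.doc2bow(doc) for doc in reference_docs]
tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=200)

index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]],
                                      num_features=lsi.num_topics)

generated = "how to cook rice".lower().split()    # a query produced by one of the models
vec = lsi[tfidf[dictionary.doc2bow(generated)]]
sims = index[vec]                                 # cosine similarity to every reference query
print(float(sims.max()), int((sims > 0.7).sum()))
      </preformat>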
      <sec id="sec-7-1">
        <title>Results of evaluating generated queries</title>
        <p>Comparing each document, in this case a query, with the documents from the corpus of
real queries, the method returns a value from -1 to 1, reflecting the semantic similarity
of the documents. The analysis results are shown in Table 2.
The corpus of real queries is varied, so the average value of the result of comparing
each generated document with all documents from the reference corpus differs only slightly
from zero. At the same time, for each query artificially created using the GRU and LSTM
networks, there are on average 16 and 14 semantically close documents, i.e. documents for
which the similarity values are greater than 0.7, and for the GPT-2 model this number is 9 documents. Also,
for each query generated by the GPT-2 model, out of 10,000 compared documents,
2684 have a value greater than 0, while for the LSTM and GRU networks the numbers are 4659 and 4000,
respectively. From this, we can conclude that LSTM and GRU used more words
semantically similar to words from the training data when generating queries than
GPT-2. This makes sense, since the first two models were trained from scratch on the input
data, while the main training of the last model took place on a completely different
corpus; it was only fine-tuned so that it generates queries with a suitable structure. It is
also important to take into account that the comparison was carried out with 10
thousand reference queries, although the models were trained on 100 thousand; therefore
not all dependencies were taken into account. However, the obtained values are
sufficient for analysis.</p>
        <p>The analysis results show that the generated queries have similar semantics to the
corpus of real user queries, but at the same time they do not repeat them literally, that
is, they are new queries in meaning.</p>
        <p>The GRU and LSTM networks were trained by characters and could have generated
non-existent words, so it was decided to test them. Each word from the queries was
checked for existence using a corpus containing more than 466 thousand English words,
available at https://github.com/dwyl/english-words. In the queries generated by the
GRU network, 141 words out of 4431 were not found, and in the queries of the LSTM
model, 166 out of 4325. The words that were not found contained typos or spelling mistakes
that the models had memorized from the training data. Therefore, it may be worth preprocessing the data
by correcting typos and errors of this kind. However, queries with typos can be useful
depending on the task in which they will be applied. So, for example, when they are
used to test a new search engine or to optimize it, they will be more relevant with typos,
as they have a greater similarity with real user queries.</p>
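        <p>A small sketch of this vocabulary check is shown below; the word-list file name (words_alpha.txt from the dwyl/english-words repository) and the example queries are assumptions.</p>
        <preformat>
with open("words_alpha.txt") as f:
    english = {line.strip().lower() for line in f}

generated_queries = ["how to bake braed", "weather in new york"]   # toy examples
unknown = [w for q in generated_queries for w in q.split() if w not in english]
print(len(unknown), unknown)      # words with typos such as "braed" end up here
        </preformat>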
        <p>Due to the fact that neural networks cannot understand the meaning of a sentence,
although they often find the correct dependencies between tokens, an expert (manual)
analysis was carried out to assess the quality of the generated search queries.</p>
        <p>From the queries generated by each model, 100 queries were randomly selected. It
was determined whether each search query makes sense, whether it is similar to a real
possible user query. It should be noted that this assessment is subjective. Queries were
considered "good" if the words in them were consistent with each other.</p>
        <p>The analysis results are shown in Table 3.
The table shows that the GRU and LSTM networks showed almost the same results,
while GPT-2 is slightly better. During the analysis, it was observed that the GPT-2
model generates shorter queries than the other two models.</p>
        <p>The results of the analysis showed that the GRU and LSTM networks have
approximately the same quality when solving the task of generating search queries, while the
GPT-2 model was worse in the automatic analysis but better in the expert judgment.
Therefore, this model is better suited for generating search queries, since the expert
judgment carries more weight than the automatic one, although for more accurate results
it would be worth carrying out this assessment with the help of additional experts.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Conclusion</title>
        <p>In the course of this work, we researched the leading models used to generate natural
language texts and their ability to solve the task of generating queries for search
engines, and we conducted a comparative analysis of them. Two neural networks were fully
implemented: a network with long short-term memory and a network with a gated
recurrent unit. The GPT-2 architecture based on the Transformer model was researched;
it was also fine-tuned using the corpus of real user queries.</p>
        <p>Latent semantic analysis showed that the GPT-2 model performs worse than the
other two networks. However, the automatic metrics for evaluating the generated text
do not always reflect the quality of the model, since at the moment it is impossible to
assess the meaningfulness of texts algorithmically. To solve this problem, an
expert analysis of the generated texts was also carried out, according to the results of
which the GPT-2 model was better than the other two models. At the same time, the
LSTM and GRU networks showed approximately the same quality according to the
results of all analyses performed.</p>
        <p>Acknowledgments. This work was supported by a subsidy from the Russian Foundation for
Basic Research, grant agreement 18-07-00964.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. van Deemter,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Krahmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Theune</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Real vs. template-based natural language generation: a false opposition? (</article-title>
          <year>2005</year>
          ) https://wwwhome.ewi.utwente.nl/~theune/PUBS/templates-squib.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Neural Text Generation: A Practical Guide (</article-title>
          <year>2017</year>
          ) https://arxiv.org/pdf/1711.09534.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>A Comprehensive Guide to Natural Language Generation (</article-title>
          <year>2019</year>
          ) https://medium.com/sciforce/a-comprehensive-guide-to-natural-language-generationdd63a4b6e548,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Arrington</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>AOL proudly releases massive amounts of user search data (</article-title>
          <year>2006</year>
          ) https://techcrunch.com/
          <year>2006</year>
          /08/06/aol-proudly
          <article-title>-releases-massive-amounts-of-usersearch-data/</article-title>
          ,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Reiter</surname>
          </string-name>
          , E.:
          <article-title>NLG vs Templates: Levels of Sophistication in Generating Text (</article-title>
          <year>2016</year>
          ). https://ehudreiter.com/
          <year>2016</year>
          /12/18/nlg-vs-templates,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gagniuc</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Markov Chains: From Theory to Implementation and Experimentation</article-title>
          . USA, NJ: John Wiley &amp; Sons (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Press,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Bar</surname>
          </string-name>
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bogin</surname>
          </string-name>
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Berant</surname>
          </string-name>
          <string-name>
            <surname>J.</surname>
          </string-name>
          , Wolf, L.:
          <article-title>Language Generation with Recurrent Generative Adversarial Networks without Pre-training (</article-title>
          <year>2017</year>
          ). https://arxiv.org/pdf/1706.01399.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rumelhart</surname>
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Learning representations by back-propagating errors (</article-title>
          <year>1986</year>
          ). http://www.cs.utoronto.ca/~hinton/absps/naturebp.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frasconi</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies (</article-title>
          <year>2001</year>
          ). https://www.bioinf.jku.at/publications/older/ch7.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long Short-Term Memory</article-title>
          (
          <year>1997</year>
          ). http://web.archive.org/web/20150526132154/http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Heck</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salem</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Simplified Minimal Gated Unit Variations for Recurrent Neural Networks (</article-title>
          <year>2017</year>
          ). https://arxiv.org/abs/1701.03452, last accessed
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate (</article-title>
          <year>2016</year>
          ). https://arxiv.org/pdf/1409.0473.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Felbo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mislove</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Søgaard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahwan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm (</article-title>
          <year>2017</year>
          ). https://arxiv.org/pdf/1708.00524.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Bisong</surname>
          </string-name>
          , E.:
          <article-title>Google Colaboratory</article-title>
          .
          <source>In: Building Machine Learning and Deep Learning Models on Google Cloud Platform</source>
          (
          <year>2019</year>
          ) Apress, Berkeley, CA.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Keras</surname>
          </string-name>
          (
          <year>2015</year>
          ). https://keras.io,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Adam: A Method for Stochastic Optimization (</article-title>
          <year>2014</year>
          ). https://arxiv.org/abs/1412.6980, last accessed
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <article-title>Learning Rate Scheduler</article-title>
          . https://keras.io/api/callbacks/learning_rate_scheduler/,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliwal</surname>
            ,
            <given-names>K.:</given-names>
          </string-name>
          <article-title>Bidirectional recurrent neural networks (</article-title>
          <year>1997</year>
          ). https://www.researchgate.net/publication/3316656_Bidirectional_recurrent_neural_networks,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
          </string-name>
          , N.:
          <article-title>Attention Is All You Need</article-title>
          (
          <year>2017</year>
          ). https://arxiv.org/pdf/1706.03762.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Language Models Are Unsupervised Multitask Learners</article-title>
          (
          <year>2018</year>
          ). https://d4mucfpksywv.cloudfront.
          <article-title>net/better-language-models/language-models</article-title>
          .pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding (</article-title>
          <year>2018</year>
          ). https://arxiv.org/pdf/
          <year>1810</year>
          .04805.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , T.,
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Language Models Are Few-Shot Learners</article-title>
          (
          <year>2019</year>
          ). https://arxiv.org/abs/
          <year>2005</year>
          .14165, last accessed
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Gage</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A New Algorithm for Data Compression (</article-title>
          <year>1994</year>
          ). https://www.derczynski.com/papers/archive/BPE_Gage.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15 24. Deerwester,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Harshman</surname>
          </string-name>
          , R.:
          <article-title>Indexing by Latent Semantic Analysis (</article-title>
          <year>1987</year>
          ). https://www.cs.bham.ac.uk/ ~pxt/IDA/lsa_ind.pdf,
          <source>last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          25.
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Getting Better Results with Latent Semantic Indexing (</article-title>
          <year>2009</year>
          ). http://citeseerx.ist.psu.edu/viewdoc/download?doi
          <source>=10.1.1.59.6406 &amp;rep=rep1&amp;type=pdf, last accessed</source>
          <year>2020</year>
          /06/15
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          26.
          <string-name>
            <surname>Rehurek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Software Framework for Topic Modelling with Large Corpora (</article-title>
          <year>2010</year>
          ).
          <source>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          . University of Malta.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>