<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Networks in the Processing of Natural Language Texts in Information Learning Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olha Tkachenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kostiantyn Tkachenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksandr Tkachenko</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Kyrychok</string-name>
          <email>r.kyrychok@kubg.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladyslav Yaskevych</string-name>
          <email>v.yaskevych@kubg.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Borys Grinchenko Kyiv Metropolitan University</institution>
          ,
          <addr-line>18/2 Bulvarno-Kudryavska str., Kyiv, 04053</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”</institution>
          ,
          <addr-line>37 Beresteyskyi ave., Kyiv, 03056</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>State University of Infrastructure and Technologies</institution>
          ,
          <addr-line>9 Kirillivska str., Kyiv, 04071</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>73</fpage>
      <lpage>87</lpage>
      <abstract>
        <p>Natural language text processing in information learning systems, including those with elements of intellectualization, is considered. Among these processes, attention is paid to simplification and normalization of the text and to highlighting the main entities of the subject area of the courses supported by the corresponding information learning system. The use of neural networks for text processing consists, in particular, of the unification of words, the formation of abbreviations, the removal of redundant clarifications, the replacement of terms (slang words), the removal of clarifying constructions and redundant symbols, the correction of errors, and paraphrasing. Text processing in such systems is based on the Transformer neural network model, whose architecture facilitates parallel processing, simplifies the use of these processes, and increases the efficiency and speed of training of the corresponding neural network. The considered neural network model effectively determines the patterns of text fragments (words, phrases) and finds connections in the training data. This accelerates natural language text processing even when only a small amount of training data is available. The use of the Transformer model contributes to the normalization of words and phrases, simplification of the text, reduction of text volume, and removal of complex wording, and thereby to the efficient and quick processing of large texts.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural language text processing</kwd>
        <kwd>neural network</kwd>
        <kwd>normalization</kwd>
        <kwd>simplification</kwd>
        <kwd>embedding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Processing of natural language in intelligent
systems (including educational ones) began
when A. Turing proposed a model for testing a
system for so-called “consciousness”.</p>
      <p>Then algorithms for finding connections
between words began to be developed, word
formation was studied, lexical/syntactic analysis
of sentences was performed, etc.</p>
      <p>All this became the basis of the first automatic
translator.</p>
      <p>
        Progress in natural language processing
began with the use of artificial deep neural
networks for text processing, which, with the
help of statistical, combinatorial, and
mathematical methods, determine the
relationships between elements themselves [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1–7</xref>
        ].
      </p>
      <p>This made it possible to dispense with such
explicit steps of natural language processing as:
• Finding connections between words.
• Word formation research.
• Lexical analysis.
• Syntactic analysis.</p>
      <sec id="sec-1-1">
        <title>And such methods as:</title>
        <p>• Normalization.
• Fragmentation.
• Tokenization.
These methods perform preprocessing of text for
neural networks.</p>
        <p>The recurrent neural network was the first model
that was able to achieve good results in natural
language processing.</p>
        <p>Recurrent neural networks process input
data sequentially and store context in blocks of
memory that are used “in time” (from iteration
to iteration).</p>
        <p>
          The Transformer model in a neural network
is a model that, unlike recurrent neural
networks, processes data not sequentially, but
simultaneously, “exploring” the context of the
entire block of data at once using its
architecture [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>The Transformer model in a neural network
consists of an encoder and a decoder, which
helps to simplify the structure of the text and
accelerate the learning of the model.</p>
        <p>
          Among the systems that solve the problems
of natural language processing, we highlight:
• ChatGPT (Generative Pre-trained
Transformer model in a neural network)
is a chatbot that simulates answers to
questions. With high accuracy, it can give
advice, write simple program code,
simulate a certain entity (for example, a
database), etc. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
• DALL-E generates an image based on the
given context, even if the given
entities are difficult or impossible to
combine in real life [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ].
• Grammarly checks the English text for
grammatical, lexical, and spelling errors
(not correcting the text immediately, but
offering options for corrections) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
• Rytr generates a variety of texts based on
topic, genre, environment, mood, etc.
(based on ChatGPT’s API) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>These systems use a Transformer model in
a neural network that helps them process text
data efficiently.</p>
        <p>In particular, the encoder is used for quick
context aggregation, and the decoder is used
for generating a new sequence of tokens:
• Words.
• Phrases.
• Individual sentences.
• Text fragments.</p>
        <p>
          ChatGPT retains large amounts of
context, which it uses to generate texts. Its training
is both unsupervised and supervised [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>During unsupervised training, the neural
network learns to generate words in the
correct order using n context words.</p>
        <p>Supervised learning takes the neural network
already trained on unlabeled data and
applies “reinforcement learning from human
feedback”, which consists of the following
steps:
1. Supervised fine-tuning.</p>
        <p>
          A small amount of demonstration data
selected by experts is used to train the SFT
(supervised fine-tuning) model [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This
stage is used only once.
2. Imitation of human preferences.
        </p>
        <p>
          Experts evaluate the output of the SFT
model, creating a new data set consisting of
“comparison data”. A reward model [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is trained on this set.
3. Proximal policy optimization.
        </p>
        <p>A reward model is used to further fine-tune
and improve the SFT model.</p>
        <p>
          Grammarly is a Ukrainian system that helps
to find and correct errors in English text [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Grammarly uses machine learning, deep
neural networks, and natural language
processing techniques to solve the problems of
parsing and improving a piece of text in natural
language [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>
          GECToR is an approach to correcting
grammatical errors in the text based on the
principle of “tag, not rewrite” [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>That is, problem areas in the text are
marked with tags, and options for their
correction are offered.</p>
        <p>GECToR contains an encoder block and two
parallel linear layers that are responsible for
detecting and correcting errors in the text.</p>
        <p>The system works by predicting the changes
that must be made to the text.</p>
        <p>Each token is assigned one of the special tags, such as:
• “keep”
• “delete”
• “append_t1”
• “replace_t2” (t1 and t2 are new tokens).</p>
        <p>There are also special tags to indicate a
change of case or of the tense of a verb.</p>
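        <p>The tag-based editing can be illustrated with a minimal Python sketch. The tag spellings follow the list in the text; the helper <monospace>apply_tags</monospace> and the example sentence are assumptions made for illustration and are not taken from the GECToR implementation, which predicts the tags with a neural network.</p>
        <preformat><![CDATA[
```python
def apply_tags(tokens, tags):
    """Apply one edit tag per source token and return the corrected tokens."""
    result = []
    for token, tag in zip(tokens, tags):
        if tag == "keep":
            result.append(token)                   # token stays unchanged
        elif tag == "delete":
            continue                               # token is dropped
        elif tag.startswith("append_"):
            result.append(token)                   # keep the token ...
            result.append(tag[len("append_"):])    # ... and add a new one after it
        elif tag.startswith("replace_"):
            result.append(tag[len("replace_"):])   # substitute the token
        else:
            raise ValueError(f"unknown tag: {tag}")
    return result


tokens = ["She", "go", "to", "to", "school"]
tags = ["keep", "replace_goes", "keep", "delete", "keep"]
corrected = apply_tags(tokens, tags)
print(" ".join(corrected))
```
]]></preformat>
        <p>Because every source token receives exactly one tag, the correction is a labeling task over the input rather than the generation of a new sentence.</p>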
        <p>
          DALL-E generates an image according to its
textual description provided by the user [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ].
        </p>
        <p>
          The connection between textual semantics
and visual representation in DALL-E is created
with the help of another OpenAI model, CLIP
(Contrastive Language-Image Pre-training) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>The CLIP model is trained on a dataset of
captions and the pictures they describe.</p>
        <p>CLIP model training includes, in particular:
• Passing images and texts through
encoders to map them into an
n-dimensional space, where the
relationship between the text and the
image is determined.
• Finding the level of similarity between
text and image embeddings.
• Training to maximize the level of
similarity between an encoded text and
the encoded image it describes.
• Using the Transformer model in a neural
network to encode text and images.</p>
        <p>After training the CLIP model, images are
generated taking into account the available
visual semantics.</p>
        <p>For this, DALL-E uses a diffusion model that is
trained to transform the encoded CLIP text
context into an encoded image and to decode
it into a human-understandable
representation.</p>
        <p>
          Rytr generates various texts for different
situations, for example, posts on social
networks or questions for an
interview [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Rytr uses ChatGPT to generate texts. Rytr
serves as a kind of wrapper for quickly making
queries to ChatGPT and filtering its
responses [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>2. Text Processing Methods</p>
        <p>Text preprocessing is an integral part of any
natural language processing system.</p>
        <p>There are many methods to perform text
optimization, fragmentation, tokenization,
finding the relationship between tokens, etc.</p>
        <p>The full cycle of natural language
processing involves:
• Tokenization.
• Lexical analysis.
• Syntactic analysis.
• Semantic analysis.
• Pragmatic analysis.</p>
        <p>Tokenization is the segmentation of text
consisting of symbols (letters, spaces,
numbers, punctuation marks, etc.) into words
and phrases.</p>
        <p>The main task of tokenization is the
unification of the text, its fragmentation, and
division into tokens (morphemes, words, or
phrases) according to defined rules.</p>
        <p>Normalization of the text involves the
unification of tokens, for example, “y.”, “year”
and “Year” should be reduced to the same form.</p>
        <p>Words are divided into separate parts using
computational morphology.</p>
        <p>The following morphemes are distinguished
in the Ukrainian language:
• Prefixes.
• Roots.
• Suffixes.
• Endings.</p>
        <p>The structure of a word can be attributed
to one of the following categories:
• Isolating, when it is impossible to divide
the word into smaller parts.
• Agglutinative, when the word can be
divided into small morphemes.
• Inflectional, when there are no clear
boundaries between morphemes and
a morpheme can carry several
grammatical meanings.
• Polysynthetic, which is similar to
agglutinative, but also allows
morphemes to be combined into
a complex of words (a group of
words) that can be a whole sentence.</p>
        <p>
          Embedding (vector representation of a
word) is a set of methods by which words are
transformed into an n-dimensional vector of
real numbers [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
        <p>
          The simplest method of representing
tokens as a vector is to use a one-hot vector
[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], which has the dimension of the dictionary
and is filled with zeros.
        </p>
        <p>The token is encoded using the
corresponding position in the dictionary,
which will be equal to 1 in this vector [23].</p>
        <p>The disadvantage of a one-hot vector is that
dictionaries can have millions of different
values, and deep neural networks usually
contain many linear layers of abstraction. Such
encoding is inefficient and is replaced by other
embedding methods, such as [24]:
• word2vec [25]
• GloVe
• fastText, etc.</p>
        <p>These methods have a small (compared to a
one-hot vector), fixed, and defined size.</p>
        <p>These methods are based on the use of
neural networks, and the goal of their training
is the vector approximation of words that are
close to each other in the context.</p>
        <p>Word2vec is a simple and efficient method of
learning vectors for tokens, which can be
created using [24, 25]:
• CBOW (continuous bag-of-words) [26].
• Skip-gram (SG) [27].
• Optimization methods (hierarchical
softmax and negative sampling) [28].</p>
        <p>The simplest model uses a single context word
and is a fully connected neural network.</p>
        <p>
          The input is a word represented by a
one-hot vector [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] of dimension V (dictionary
size), where only one value is 1, and all others
are 0.
        </p>
        <p>Multi-contextuality works according to the
same principle, only the input is not one
context word, but a set of context words.</p>
        <p>In Skip-gram, the context is predicted from a word.</p>
        <p>Before backpropagation, the error is computed as
the sum of the errors over all context words.</p>
        <p>The models discussed above are not very
complex, but they take a long time to train on
large amounts of data.</p>
        <p>Therefore, optimization methods have been
developed that speed up learning, namely:
• Hierarchical softmax.
• Negative sampling.</p>
        <p>Negative sampling is a method of
optimizing the training of word2vec models.</p>
        <p>The main problem with these models is that
during training, all the weights of the output
matrix are used.</p>
        <p>Therefore, each training iteration of a
neural network (including such a neural
network as a Transformer), which will then be
used in the processing of natural language
texts, is expensive.</p>
        <p>When using negative sampling, not all
vectors are taken from the output matrix, but
only a small part of them.</p>
        <p>A so-called “positive” vector, which
represents the predicted word, is added to
these vectors.</p>
        <p>When using negative sampling, a
mechanism must be added that will reduce the
impact of tokens that are in the context of
the “positive” one.</p>
        <p>A unigram table of a certain size is created,
which is almost a thousand times larger than
the size of the dictionary.</p>
        <p>The table is filled with token indexes.</p>
        <p>The more often a token occurs in the
training sample of the neural network, the
more times its index will occur in the unigram
table.</p>
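        <p>The unigram table can be sketched in Python as follows. The function names, the toy corpus, and the table size of 1000 slots per dictionary entry are assumptions made for illustration. Raising the counts to the power 0.75 is the convention of the original word2vec implementation and is not stated in the text; setting <monospace>power=1.0</monospace> makes the table strictly proportional to the token counts.</p>
        <preformat><![CDATA[
```python
import random
from collections import Counter


def build_unigram_table(token_counts, table_size, power=0.75):
    """Fill a table with token indexes in proportion to count ** power."""
    tokens = sorted(token_counts)                      # fixed index order
    weights = [token_counts[t] ** power for t in tokens]
    total = sum(weights)
    table = []
    for index, weight in enumerate(weights):
        # Each token gets at least one slot so rare tokens can still be drawn.
        slots = max(1, round(table_size * weight / total))
        table.extend([index] * slots)
    return tokens, table


def draw_negatives(table, positive_index, k, rng):
    """Draw k token indexes from the table, skipping the positive token."""
    negatives = []
    while len(negatives) != k:
        candidate = rng.choice(table)
        if candidate != positive_index:
            negatives.append(candidate)
    return negatives


corpus = "the cat sat on the mat the end".split()
counts = Counter(corpus)
tokens, table = build_unigram_table(counts, table_size=1000 * len(counts))
rng = random.Random(0)
positive = tokens.index("cat")
print(draw_negatives(table, positive, k=3, rng=rng))
```
]]></preformat>
        <p>Drawing a uniform random position in the table then selects frequent tokens more often without recomputing any probabilities during training.</p>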
        <p>Hierarchical softmax uses a binary tree that
represents a dictionary of size V (V words must
be leaves of the tree).</p>
        <p>The number of internal nodes is equal to
(V–1). The input matrix representing
(displaying) the words is unchanged.</p>
        <p>There is no output matrix, and instead, the
weights of the internal nodes of the binary tree
are used, the number of which is smaller than
the size of the dictionary.</p>
        <p>The created path is used to estimate the
probability of finding a word in the context.</p>
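        <p>To find a word, you need to move from the
root, selecting a branch at each internal node
using the formula:
<disp-formula><tex-math>p(n_i, \mathrm{left}) = \sigma(h \cdot r_i),</tex-math></disp-formula>
where r<sub>i</sub> are the weight coefficients of the internal
nodes.</p>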
        <p>The second method of creating embeddings
is GloVe (global vectors for word representation).</p>
        <p>It is a log-linear regression that combines
the advantages of matrix factorization and
local context methods (for example, skip-gram,
CBOW) [26, 27, 29].</p>
        <p>This approach:
• Effectively uses statistical information.
• Learns only on the non-zero
elements of the word co-occurrence matrix X,
which contains elements X<sub>ij</sub> (the value of
which is the number of times word j
occurred in context i).</p>
        <p>The following quantities are also calculated:</p>
        <p>X<sub>i</sub> = ∑<sub>k</sub> X<sub>ik</sub> is the number of times any
word occurred in context i.</p>
        <p>P<sub>ij</sub> = P(j|i) = X<sub>ij</sub>/X<sub>i</sub> is the probability of word j
appearing in context i.</p>
        <p>Consider a simple example that reveals the
connection between two words i and j. Their
relationship can be tested by examining their
co-occurrence with another word k.</p>
        <p>The following patterns are distinguished:
• If k is more related to i, then P<sub>ik</sub>/P<sub>jk</sub> will
have a value greater than 1.
• If k is more related to j, then P<sub>ik</sub>/P<sub>jk</sub> will
have a value close to 0.
• If k is almost equally related to i and j,
then P<sub>ik</sub>/P<sub>jk</sub> will have a value close to 1.</p>
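        <p>The co-occurrence counts and the probability ratio can be computed with a short Python sketch. The toy corpus, the window size, and the function names are assumptions chosen for illustration; on such a small corpus the ratios only indicate the direction of the effect.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Count X[i][j]: how often word j occurs within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    X[word][sentence[ctx]] += 1.0
    return X


def probability(X, i, j):
    """P_ij = X_ij / X_i, where X_i is the sum of row i."""
    row_total = sum(X[i].values())
    return X[i][j] / row_total


sentences = [
    "ice is solid and cold".split(),
    "steam is gas and hot".split(),
    "ice and steam are water".split(),
]
X = cooccurrence(sentences)
for k in ("solid", "gas", "and"):
    p_ice, p_steam = probability(X, "ice", k), probability(X, "steam", k)
    ratio = p_ice / p_steam if p_steam else float("inf")
    print(k, round(p_ice, 3), round(p_steam, 3), ratio)
```
]]></preformat>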
        <p>Generative neural networks are used in natural
language processing to generate a sequence of
tokens that depend on the input context and
the already generated sequence of words.</p>
        <p>The following main neural network models
are used for sequence processing: the recurrent
neural network and the Transformer neural
network.</p>
        <p>Recurrent neural networks are networks in
which connections between elements form a
directed cycle.</p>
        <p>A recurrent neural network sequentially
processes input data using an internal memory
that aggregates context and transfers it
between iterations.</p>
        <p>The recurrent neural network has the
following modifications: LSTM (long
short-term memory) [30, 31] and GRU
(gated recurrent units) [32], which improve the
internal memory mechanism.</p>
        <p>The transformer neural network, unlike the
recurrent neural network, processes the input
data not sequentially, but all at the same time.</p>
        <p>A recurrent neural network uses an internal
state (memory) to process sequential inputs,
unlike conventional feedforward neural
networks. This feature makes it possible to
process natural language effectively and to
form appropriate sentences.</p>
        <p>There are many examples of recurrent
neural network applications in natural
language processing.</p>
        <p>For example, machine translation uses two
models at once, both of which are recurrent
neural networks.</p>
        <p>One of them serves as an encoder, and the
other as a decoder.</p>
      </sec>
      <sec id="sec-1-2">
        <title>This model is called seq2seq [33].</title>
        <p>LSTM is a recurrent neural network that
processes a sequence of dynamic size
X = (x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>),
adding new content to a block of memory with
activation gates that control what information
should be forgotten and what should be
retained.</p>
        <p>At each iteration t, the memory ct and the
hidden state ht are updated.</p>
        <p>The GRU (like the LSTM) contains gates
that control the flow of information inside
the unit but does not have a separate memory
cell.</p>
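        <p>The activation of the output h<sub>t</sub> at time t is
the linear interpolation between the result of
the previous iteration h<sub>t−1</sub> and the activated
candidate for output h̃<sub>t</sub>:
<disp-formula><tex-math>h_t = (1 - z_t)\, h_{t-1} + z_t\, \tilde{h}_t,</tex-math></disp-formula>
where z<sub>t</sub> is the so-called “update gate”, which
updates the received information and is
calculated by the formula:
<disp-formula><tex-math>z_t = \sigma(W_z x_t + U_z h_{t-1}).</tex-math></disp-formula>
The candidate h̃<sub>t</sub> is calculated similarly to the component of a
recurrent neural network:
<disp-formula><tex-math>\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})),</tex-math></disp-formula>
where r<sub>t</sub> is the so-called “reset gate”,
which is calculated similarly to the “update
gate”:
<disp-formula><tex-math>r_t = \sigma(W_r x_t + U_r h_{t-1}).</tex-math></disp-formula></p>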
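        <p>One GRU iteration can be sketched in Python with NumPy as follows. The sketch uses the standard formulation with an update gate z and a reset gate r and omits bias terms; the function names, the layer sizes, and the random initialization are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, params):
    """One GRU iteration: gates, candidate output, then interpolation."""
    W_z, U_z, W_r, U_r, W, U = params
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate output
    return (1.0 - z_t) * h_prev + z_t * h_cand       # linear interpolation


rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
# Shapes alternate between input weights (W_*) and recurrent weights (U_*).
params = [
    rng.normal(scale=0.1, size=shape)
    for shape in [(hidden_size, input_size), (hidden_size, hidden_size)] * 3
]
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a sequence of five inputs
    h = gru_step(x_t, h, params)
print(h)
```
]]></preformat>
        <p>The hidden state is the only value carried between iterations, which reflects the absence of a separate memory cell in the GRU.</p>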
        <p>The most notable common feature of GRU and
LSTM is the update component, which is
missing from a conventional recurrent neural
network that simply replaces the context with
a new value.</p>
        <p>This component is computed with the new
input token and the previous hidden state.</p>
        <p>Both GRU and LSTM preserve the
current context and add new content on top of it.</p>
        <p>
          A Transformer is a neural network that
avoids recurrence, and its main component is an
attention mechanism for creating a
dependency between input and output [
          <xref ref-type="bibr" rid="ref23">34</xref>
          ].
        </p>
        <p>With the help of the Transformer neural
network, effective parallelization is carried
out, which increases the quality and ease of
learning.</p>
        <p>The Transformer model in a neural network
adheres to the encoder-decoder architecture.
The encoding and decoding blocks contain the
attention mechanism, normalization, and a
feed-forward network.</p>
        <p>The encoder consists of n identical layers.</p>
        <p>Each layer contains two sublayers. The first
is a multi-head attention mechanism, and
the second is a fully connected feed-forward
network.</p>
        <p>A residual connection followed by layer
normalization is applied around each sublayer.</p>
        <p>The decoder consists of n identical layers,
and a third sublayer is added to the two sublayers
from the encoder.</p>
        <p>This sublayer adds the context of the encoder to
the decoder using a multi-head attention mechanism.</p>
        <p>As in the encoder, each sublayer has a
residual connection and layer normalization.
Also, in the first multi-head attention
sublayer, a mask is added to prevent a position
from attending to subsequent positions (words
that have not yet been generated).</p>
        <p>The attention mechanism is the main
component of the Transformer neural
network.</p>
        <p>Attention accepts as input “query,” “key,”
and “value” in vector form.</p>
        <p>The output is a weighted sum of the value
vectors, where the weight of each value is
computed from the query and the corresponding
key.</p>
        <p>Multi-head attention consists of h
attention heads. Their outputs are concatenated
and passed through an additional layer of
weights.</p>
        <p>The input vectors V, K, and Q are the same
for all multi-head attention sublayers, except for the
sublayer that adds the encoder context to the
decoder, where V and K will be the output of
the encoder, and Q will be the current state of
the decoder.</p>
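        <p>The attention computation can be sketched in Python with NumPy as follows. Scaling the query-key products by the square root of the key dimension follows the original Transformer paper and is not stated in the text; the per-head projection matrices, the sizes, and the function names are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Weighted sum of the rows of V; weights come from query-key similarity."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


def multi_head_attention(Q, K, V, heads, W_o):
    """Run every head on its own projections, concatenate, then mix with W_o."""
    outputs = []
    for W_q, W_k, W_v in heads:
        out, _ = attention(Q @ W_q, K @ W_k, V @ W_v)
        outputs.append(out)
    return np.concatenate(outputs, axis=-1) @ W_o


rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2
d_head = d_model // h
x = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
y = multi_head_attention(x, x, x, heads, W_o)   # self-attention: Q = K = V = x
print(y.shape)
```
]]></preformat>
        <p>For the sublayer that joins the encoder and the decoder, the same function would be called with the encoder output as K and V and the decoder state as Q.</p>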
        <p>The transformer model in the neural
network uses a conventional forward
propagation neural network, which consists of
two linear transformations and an activation
function between them:</p>
        <p>FFN(x) = max(0, xW1+b1)W2+b2</p>
        <p>For more efficient work, you need to use
trained embeddings to represent
words in an n-dimensional vector space.</p>
        <p>At the output of the decoder, a linear
transformation is performed from the
embedding size to the dictionary size, and
softmax is applied to find the next word token;
this is repeated until a special end token is
generated.</p>
        <p>To optimize the Transformer neural
network, a dropout layer is used, which is
applied to the outputs of each of the sublayers
before performing the residual connection.</p>
        <p>Dropout is also applied to the inputs of the
encoder and decoder.</p>
        <p>3. Methods of Optimization of Neural Networks</p>
        <p>There are many techniques to help improve
neural networks, from updating weights to
preventing overfitting.</p>
        <p>
          The following methods can be distinguished:
• dropout
• stochastic optimizer Adam [
          <xref ref-type="bibr" rid="ref24">35</xref>
          ]
• residual connection [
          <xref ref-type="bibr" rid="ref25">36</xref>
          ].
        </p>
        <p>Neural networks with limited training data
overfit quickly and show poor results on
data outside the training set.</p>
        <p>Therefore, many methods have been
developed that try to solve this problem.</p>
        <p>
          With the dropout method [
          <xref ref-type="bibr" rid="ref26">37</xref>
          ],
neurons are randomly removed from the input or
hidden layers during training.
        </p>
        <p>A removed neuron becomes temporarily
inactive along with all its input and output
connections.</p>
        <p>In the simplest version, each neuron is
retained with a fixed probability that is close to
0.5, independently of the
others.</p>
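        <p>The simplest version of dropout can be sketched in Python with NumPy as follows. Dividing the surviving activations by the retention probability (so-called inverted dropout) is an assumption of this sketch, chosen so that no rescaling is needed at inference time; the function name and parameters are illustrative.</p>
        <preformat><![CDATA[
```python
import numpy as np


def dropout(activations, keep_prob, rng, training=True):
    """Randomly deactivate neurons; each is kept independently with keep_prob."""
    if not training:
        return activations                      # all neurons stay active at inference
    mask = rng.random(activations.shape) < keep_prob
    # Scaling by 1 / keep_prob ("inverted dropout") is an assumption of this sketch.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
print((dropped == 0).mean())   # fraction of neurons switched off in this pass
```
]]></preformat>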
        <p>The Adam stochastic optimizer is used for
first-order gradient-based optimization of
stochastic objective functions and is based on
adaptive estimates of lower-order moments.</p>
        <p>The Adam stochastic optimizer is simple to
implement, computationally efficient,
requires little memory, and is invariant to
diagonal rescaling of the gradients.</p>
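        <p>A single Adam update can be sketched in Python with NumPy as follows. The default values of the hyperparameters are those commonly quoted for Adam; the toy quadratic objective, the learning rate of 0.05, and the function name are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using estimates of the first and second gradient moments."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(theta) = sum((theta - 3)^2) as a toy objective.
theta = np.zeros(2)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)
```
]]></preformat>
        <p>Only the two moment vectors are stored in addition to the parameters, which is why the memory requirement is small.</p>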
        <p>Residual connection is used for deep neural
networks, which are difficult to train due to the
constantly growing number of layers.</p>
        <p>The main problem that arises in the training
of neural networks is the rapidly occurring
gradient explosion in large neural network
structures.</p>
        <p>To solve this problem, normalization layers
(layer normalization, batch normalization,
etc.) have been introduced, which help
networks with dozens of layers to converge
under stochastic gradient descent.</p>
        <p>Once deeper networks start to converge, the
problem of degradation occurs: as the depth
grows, the accuracy saturates and then drops
sharply.</p>
        <p>Therefore, it is possible to solve the
problem of degradation by using the residual
connection in the neural network.</p>
        <p>Denoting the desired underlying mapping as H(x),
the stacked nonlinear layers of the neural network
are made to fit the residual mapping</p>
        <p>F(x) = H(x) – x.</p>
        <p>Thus, the original mapping H(x) takes the
form</p>
        <p>F(x) + x,
which is a residual connection.</p>
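        <p>A residual connection can be sketched in Python with NumPy as follows. The choice of F as two linear layers with a ReLU between them, the sizes, and the function name are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import numpy as np


def residual_block(x, W1, W2):
    """Return F(x) + x, where F is two linear layers with a ReLU in between."""
    f_x = np.maximum(0.0, x @ W1) @ W2   # residual mapping F(x)
    return f_x + x                       # identity shortcut adds the input back


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = rng.normal(scale=0.1, size=(8, 16))
W2 = rng.normal(scale=0.1, size=(16, 8))
print(residual_block(x, W1, W2).shape)

# With zero weights F(x) = 0, so the block reduces to the identity mapping.
zero_out = residual_block(x, np.zeros((8, 16)), np.zeros((16, 8)))
print(np.array_equal(zero_out, x))
```
]]></preformat>
        <p>Because the shortcut passes the input through unchanged, an additional block can leave the representation as it is by driving F(x) toward zero, which is what counters degradation.</p>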
        <p>To create a text simplification and
normalization system, you need to perform the
following actions:
• Pre-processing of the text, which
standardizes the sentence, breaks it into
tokens and adds special tokens.
• Representation of the word in the form of
a vector, which transforms the token into
the corresponding vector space.
• Use of a generative model that creates a
text sequence depending on the received
context.</p>
        <p>Before performing the above actions, a text
database (a database of text educational
content of an information learning system or
an information learning system with elements
of intellectualization) is created with pairs:
&lt;input text, desired result&gt;.</p>
        <p>The input text is separate sentences from
various information sources that correspond
to the topic of the selected subject area (theory
of algorithms, cyber security, ontologies,
artificial intelligence).</p>
        <p>The desired result is a modified version of
the input text, which is created with the help of
experts according to the following rules:
• Compound terms (groups of words) are
converted into an abbreviation, for
example:
– “neural network” becomes “NN”
– “Turing machine” becomes “MT”
– “subject area” becomes “SA”.
• Removal of phrases (words, groups of
words) that do not significantly affect
the content of the text, for example:
“The first step in the study of neural
networks was made in 1943, when an article
by neurophysiologist Warren McCulloch and
mathematician Walter Pitts was published,
devoted to artificial neurons, as well as the
implementation of a neural network model
using electrical circuits”
turns into
“The study of NN began with an article
devoted to the implementation of NN using
electrical circuits”.</p>
        <p>• Exclusion of redundant clarifications in
the educational text, for example:
“This science (meaning artificial
intelligence) is related to neuroscience,
including cognitive neuroscience, systems
neuroscience, computational neuroscience”
becomes “This science is related to
neuroscience.”</p>
        <p>The larger the text database (the database
of the text educational content of the
information learning system or the
information learning system with elements of
intellectualization), the more the neural
network will recognize patterns and find
various connections.</p>
        <p>The creation of such a database is one of the
most important stages because its quality
directly affects the output of natural language
processing of educational content.</p>
        <p>Natural language text preprocessing is an
important part of preparing the data for
training the Transformer neural network.</p>
        <p>This network works with the content of an
informational learning system or an
informational learning system with elements
of intellectualization.</p>
        <p>The stages of processing depend on the
goals of text processing (educational content)
of the informational learning system.</p>
        <p>To simplify and normalize the text (in
particular, student answers given in natural
language), the following standardization steps
are applied:
• Conversion of all alphabetic characters in
the natural language text (for example, a
student’s answer) to lower case, so that no
capital letters remain.
• Expansion of shortened forms (but not
acronyms) into full words; for example,
“y.” becomes “year” and “etc.” becomes
“and so on” or “and the like”.
• Replacement of all variants of quotation
marks with a single type of “double
quotation marks” (quotation marks are
encoded differently depending on the
language).
• Insertion of an additional special character
&lt;t&gt; (a space &lt; &gt; may be used instead) before
and after every stand-alone letter in
English text, to check whether the letter
is an indefinite article (if it is not an
article, it is excluded from consideration,
which simplifies the text).
• Insertion of the same special character &lt;t&gt;
(or a space &lt; &gt;) before and after every
stand-alone letter in Ukrainian text, to
check whether the letter is a preposition
(if it is not a preposition, it is excluded
from consideration, which simplifies the
text).
• Insertion of the special character &lt;t&gt; (or a
space &lt; &gt;) before and after every
stand-alone single digit.
• Replacement of several consecutive spaces
in the text with one space.</p>
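        <p>The standardization steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name standardize, the expansion table, and the set of quotation-mark variants are assumptions, a space is used in place of the special character &lt;t&gt;, and the article and preposition checks are omitted.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical expansion table; the text names only "y." and "etc." as examples.
EXPANSIONS = {"y.": "year", "etc.": "and so on"}

# Assumed set of quotation-mark variants to unify into a plain double quote.
QUOTE_VARIANTS = "«»“”„‟"


def standardize(text: str) -> str:
    """Lower-case, unify quotes, expand shortened forms, isolate digits."""
    text = text.lower()
    for mark in QUOTE_VARIANTS:
        text = text.replace(mark, '"')
    # Expand shortened forms word by word; split() also drops repeated spaces.
    words = [EXPANSIONS.get(word, word) for word in text.split()]
    text = " ".join(words)
    # Surround every stand-alone single digit with the separator (a space here).
    text = re.sub(r"(?<!\w)(\d)(?!\w)", r" \1 ", text)
    # Collapse the runs of spaces introduced above into one space.
    return re.sub(r" +", " ", text).strip()


print(standardize("He wrote «Code» in 1 day  ETC."))
```
]]></preformat>
        <p>Expanding shortened forms before the digit step keeps a token such as “y.” intact until it has been looked up in the table.</p>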
        <p>After the text has been standardized, the
sentences of the answer (or educational
content) are divided into tokens at a special
delimiter symbol (possibly a space) that is
placed during the standardization and
normalization of the text.</p>
        <p>At the same time, the so-called “empty
elements” among the resulting tokens are
removed.</p>
        <p>Every resulting token sequence begins with
the special token &lt;s&gt; and ends with the special
token &lt;/s&gt;, which mark the beginning and end
of a sentence (a piece of educational content
text that ends with a period), respectively.</p>
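        <p>A minimal Python sketch of this tokenization step is given below. It assumes that the delimiter is a single space and that the sentence has already been standardized; the function name tokenize is hypothetical.</p>
        <preformat><![CDATA[
```python
def tokenize(sentence: str, delimiter: str = " ") -> list[str]:
    """Split a standardized sentence and wrap it in sentence markers."""
    # Splitting on a fixed delimiter yields "" wherever the delimiter is
    # doubled; these are the "empty elements" that have to be removed.
    tokens = [token for token in sentence.split(delimiter) if token]
    return ["<s>", *tokens, "</s>"]


print(tokenize("this science  is related to neuroscience"))
```
]]></preformat>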
        <p>All sentences after standardization and
normalization (an ordered collection of tokens
from which these sentences are created) are
used to train the corresponding neural
network.</p>
        <p>The sentences also form a dictionary in
which the number of occurrences of each token
in the training sets is counted.</p>
        <p>Tokens are sorted by this count (from the
largest to the smallest), each token receives
an index that corresponds to its position in
the sorted list, and tokens whose count is less
than the specified minimum are deleted.</p>
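        <p>The construction of the dictionary can be illustrated with the following sketch. It assumes that rare tokens are removed before the indices are assigned (the text does not fix this order); the function name build_dictionary and the threshold value are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def build_dictionary(sentences, min_count=2):
    """Count tokens, sort by count (descending), drop rare ones, index the rest."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # most_common() returns (token, count) pairs from most to least frequent.
    kept = [(token, n) for token, n in counts.most_common() if n >= min_count]
    index = {token: position for position, (token, _) in enumerate(kept)}
    return index, dict(kept)


sentences = [
    ["<s>", "neural", "network", "</s>"],
    ["<s>", "neural", "model", "</s>"],
]
index, counts = build_dictionary(sentences, min_count=2)
print(index)
```
]]></preformat>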
        <p>Vector representation of the word
(embedding).</p>
        <p>The use of embedding helps to increase the
speed and accuracy of training the neural
network that serves as the main natural
language processing model.</p>
        <p>
          The main tasks of creating a vector
representation of the words in the text of a
student’s answer given in natural language are:
• Eliminating the one-hot vector from the
design of the main model [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
• Bringing closer together the vectors of
words that occur in each other’s context.
• Finding associative connections between
individual pairs of words (word forms,
lexemes, etc.).
        </p>
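        <p>The difference between one-hot vectors and dense embeddings can be seen in a small sketch. The dense vectors below are made-up values chosen for illustration, not trained embeddings, and the helper names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def one_hot(index: int, size: int) -> list[float]:
    """Vector with 1.0 at the token's index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine(u, v) -> float:
    """Cosine similarity; 0.0 is returned if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# One-hot vectors of two different words are always orthogonal,
# so they carry no information about how related the words are.
print(cosine(one_hot(0, 3), one_hot(1, 3)))

# Made-up dense vectors: words from similar contexts point the same way.
dense = {"neuron": [0.9, 0.1], "network": [0.8, 0.3], "exam": [-0.2, 0.9]}
print(cosine(dense["neuron"], dense["network"]))
print(cosine(dense["neuron"], dense["exam"]))
```
]]></preformat>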
        <p>In the Transformer model, the encoder
accepts as input the sequence of words of the
input text, which has already undergone
primary pre-processing.</p>
        <p>The decoder accepts the encoded text
generated so far in the system (using the
trained neural network) and the context
provided by the encoder.</p>
        <p>The initial value of the generated text is the
special token &lt;s&gt;.</p>
        <p>Each newly generated token is appended to
the generated text, which is passed to the
decoder again, until the generated token
equals the special token &lt;/s&gt;.</p>
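        <p>The generation loop described above can be sketched as follows. The trained decoder is replaced here by a stand-in function, copy_decoder, which simply repeats the context; both function names and the max_len limit are assumptions added for illustration.</p>
        <preformat><![CDATA[
```python
def generate(context, next_token, max_len=50):
    """Append tokens produced by next_token until "</s>" appears."""
    generated = ["<s>"]  # initial value of the generated text
    while len(generated) < max_len:
        token = next_token(context, generated)
        generated.append(token)
        if token == "</s>":
            break
    return generated


def copy_decoder(context, generated):
    """Stand-in for a trained decoder: copies the context, then stops."""
    position = len(generated) - 1  # tokens produced so far, excluding "<s>"
    return context[position] if position < len(context) else "</s>"


print(generate(["this", "science"], copy_decoder))
```
]]></preformat>
        <p>The max_len limit is a safeguard for the case in which the model never produces the end token.</p>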
        <p>The encoder consists of three consecutive
sublayers.</p>
        <p>Each sublayer contains an attention block
and a fully connected neural network.</p>
        <p>After these sublayers, the main neural
network, which performs the processing of
natural language in the information learning
system with elements of intellectualization,
contains a dropout layer and a layer in which
the residual connection and/or normalization
is applied.</p>
        <p>The decoder, like the encoder, contains n
linear sublayers, each of which has a similar
structure to the encoder sublayers, but with
certain differences.</p>
        <p>The last vector of the generated matrix is
taken at the output of the decoder.</p>
        <p>This vector undergoes a linear
transformation followed by a softmax
activation function; the most probable entry of
the result then forms a one-hot vector that
defines the next token in the sequence.</p>
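        <p>The last step of the decoder can be sketched as below. The weights, bias, and vocabulary are made-up values, and choosing the most probable entry (greedy selection) is an assumption about how the one-hot vector is obtained from the softmax output.</p>
        <preformat><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(vector, weights, bias, vocab):
    """Linear transformation, softmax, then a one-hot choice of the next token."""
    logits = [sum(w * x for w, x in zip(row, vector)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    one_hot = [1 if i == best else 0 for i in range(len(probs))]
    return vocab[best], one_hot, probs


vocab = ["<s>", "</s>", "neuron"]          # illustrative vocabulary
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per vocabulary entry
bias = [0.0, 0.0, 0.0]
token, one_hot, probs = next_token([2.0, 1.0], weights, bias, vocab)
print(token, one_hot)
```
]]></preformat>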
        <p>Normalization and simplification of natural
language text (in particular, student answers
and the descriptions of completed individual
tasks) in the information educational system
with elements of intellectualization can be
performed under the following conditions:
• The output text is smaller than or equal in
size to the input text (the size is the
number of tokens in the sentences of the
corresponding natural language text).
• Replacement of terms (in particular,
their synonymization).
• Replacement of slang words (including
Internet slang words) with their literary
counterparts.
• Replacement of a group of words with the
corresponding established abbreviation.
• Addition, deletion, or modification of text
(for example, an answer presented in
natural language) contained in the
database of text educational content of
the information educational system with
elements of intellectualization.
• Simplification of the selected fragment of
the text (including standardization and
normalization).
• Taking into account, when simplifying a
text presented in natural language,
contexts that are either already available
in the database or generated using a
neural network and the dictionary of
contexts.</p>
        <p>Simplification of the text in the
informational educational system with
elements of intellectualization involves:
• Preparation of training data.
• A neural network model for training the
embedding.
• A neural network model for text
generation.
• Services for managing simplification
processes.</p>
        <p>An extension that uses and interacts with
the simplification system provides:
• Adding text to the database of text
educational content of the information
educational system with elements of
intellectualization.
• Simplification of the selected text from
the database (according to the above
rules).
• Simplification of the entered text, taking
into account possible contexts and the
corresponding dictionary of contexts.</p>
        <p>The collection of training data of the neural
network of the information training system
with elements of intellectualization involves:
• Primary text processing (both manual
and automatic).
• Text fragmentation (breaking into
separate sentences).
• Sentence tokenization (breaking the text
into separate tokens).
• Adding special tokens (sentence start
(&lt;ts&gt;) and sentence end (&lt;/ts&gt;)) that
facilitate interaction with the neural
network.
• Creation of a dictionary (sampling and
counting of tokens for all tokenized
sentences is performed, the dictionary is
organized in descending order by
number).</p>
        <p>The tokenized sentences and the dictionary
created from the texts of the selected subject
area are used by the embedding and sentence
generation systems.</p>
        <p>Embedding involves:
• Loading of tokenized sentences,
dictionaries, and trained vectors (if
training has already been performed).
• Generation of a unigram table from the
dictionary.
• Training of the embedding neural network.
• Saving the so-called “learned vectors”,
which then serve as “standards” in
training and in the processing of natural
language texts.
• Launching the API server for embedding
testing (checking the distance to a
specific token and finding the nearest
tokens).</p>
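        <p>The two embedding tests mentioned in the last item (distance to a specific token and nearest tokens) can be sketched without the server part. The vectors are made-up values, cosine distance is an assumed choice of metric, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def nearest_tokens(token, vectors, k=3):
    """Return the k tokens whose vectors are closest to the given token."""
    target = vectors[token]
    ranked = sorted(
        (cosine_distance(target, vector), other)
        for other, vector in vectors.items()
        if other != token
    )
    return [other for _, other in ranked[:k]]


# Made-up learned vectors, for illustration only.
vectors = {"neuron": [1.0, 0.0], "network": [0.9, 0.1], "exam": [0.0, 1.0]}
print(nearest_tokens("neuron", vectors, k=2))
```
]]></preformat>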
        <p>The operation of the Transformer model in
the neural network includes, in particular:
• Loading of tokenized sentences, trained
vectors, and trained model, if training
was already performed.
• Launch of the API server for text
normalization and simplification using a
generative model.
• Management of training with a specified
number of epochs and to a specified
accuracy, testing (on a training set), and
simplification of the entered text.</p>
        <p>When training data are processed, raw text
is accepted as input and the following are
created:
• A file with a list of input tokenized
sentences (these sentences are accepted
by the neural networks as input during
training).
• A file with a list of output tokenized
sentences (these sentences are used by
the neural networks to adjust their
internal parameters by comparison with
the input sentences).
• A dictionary in JSON format that stores
all tokens and their counts.</p>
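        <p>A possible layout of these three files is sketched below. The file names, the one-sentence-per-line format, and the function name save_training_data are assumptions; the text specifies only that the dictionary is stored as JSON.</p>
        <preformat><![CDATA[
```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def save_training_data(pairs, out_dir):
    """Write input sentences, output sentences, and a JSON token dictionary."""
    out = Path(out_dir)
    counts = Counter()
    # Hypothetical file names; one space-separated sentence per line.
    with open(out / "input.txt", "w", encoding="utf-8") as f_in, \
         open(out / "output.txt", "w", encoding="utf-8") as f_out:
        for source, target in pairs:
            f_in.write(" ".join(source) + "\n")
            f_out.write(" ".join(target) + "\n")
            counts.update(source)
            counts.update(target)
    ordered = dict(counts.most_common())  # descending by count
    (out / "dictionary.json").write_text(
        json.dumps(ordered, ensure_ascii=False, indent=2), encoding="utf-8"
    )


pairs = [(
    ["<s>", "this", "science", "(meaning", "ai)", "is", "related", "</s>"],
    ["<s>", "this", "science", "is", "related", "</s>"],
)]
with tempfile.TemporaryDirectory() as tmp:
    save_training_data(pairs, tmp)
    dictionary = json.loads(
        (Path(tmp) / "dictionary.json").read_text(encoding="utf-8")
    )
    input_lines = (Path(tmp) / "input.txt").read_text(encoding="utf-8").splitlines()
print(dictionary)
```
]]></preformat>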
        <p>The files used in embedding training
contain, in particular:
• Input weights (stored in text format,
where each token occupies a new line in
the file, the first element of the line is the
value of the token (word, number, sign,
etc.), and the other elements of the line
are the vector representation of the
token, with the values arranged in order).
• Output weights (stored in the same text
format: each token occupies a new line,
the first element of the line is the value of
the token, and the other elements are the
vector representation of that token, in
order). These weights are used in training
(as the connection between the hidden
and output layers) and to form the vector
representation of the word.
• Pairs &lt;token, vector representation&gt; that
were trained using the word2vec
embedding model.</p>
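        <p>The line format of the weight files can be read and written with a few lines of Python. The sketch assumes that the elements of a line are separated by whitespace, which the text does not state; the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def parse_vectors(lines):
    """Parse lines of the form 'token v1 v2 ... vd' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        vectors[token] = [float(value) for value in values]
    return vectors


def format_vectors(vectors):
    """Inverse of parse_vectors: one 'token v1 v2 ...' line per token."""
    return [" ".join([token, *(repr(x) for x in vector)])
            for token, vector in vectors.items()]


lines = ["neuron 0.25 -1.5 0.0", "network 0.5 0.75 -0.125", ""]
vectors = parse_vectors(lines)
print(vectors["neuron"])
```
]]></preformat>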
        <p>In the informational educational system
with elements of intellectualization, a trained
neural network (Transformer deep neural
network) is used to process text presented in
natural language (it can be a student’s answer
to a question or a description of a completed
individual task).</p>
        <p>Pre-processing of the text, which is
performed by the neural network training data
processing service, normalizes the text for its
further use in the informational training
system with elements of intellectualization.</p>
        <p>The preparation of training data is an
important step because it starts all the
processes of training the Transformer model in
a neural network in the information learning
system with elements of intellectualization.</p>
        <p>The more high-quality data are available,
the better the output of natural language text
processing will be.</p>
        <p>The sentence is divided into tokens using a
special delimiter symbol, and a tokenized
sentence is formed, that is, an array of tokens.</p>
        <p>The token library of the information
learning system is formed from these
tokenized sentences.</p>
        <p>The occurrences of each token in the
training set of the neural network are counted,
and the tokens are sorted by count in
descending order.</p>
        <p>Using the dictionary, a unigram table of the
noise distribution of words is created (so that
more frequent words are included in the
negative sample more often).</p>
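        <p>A unigram table of this kind can be sketched as follows. Raising the counts to the power 0.75 follows the original word2vec implementation and is an assumption here, as are the table size and the function names; the text states only that more frequent words should be drawn more often.</p>
        <preformat><![CDATA[
```python
import random


def build_unigram_table(counts, table_size=1000, power=0.75):
    """Fill a table in which each token appears in proportion to count**power."""
    weights = {token: n ** power for token, n in counts.items()}
    total = sum(weights.values())
    table = []
    for token, weight in weights.items():
        table.extend([token] * round(weight / total * table_size))
    return table


def sample_negatives(table, positive, k, rng):
    """Draw k noise tokens; assumes the table holds tokens other than positive."""
    negatives = []
    while len(negatives) < k:
        candidate = rng.choice(table)
        if candidate != positive:
            negatives.append(candidate)
    return negatives


counts = {"the": 100, "network": 10, "ellipsis": 1}  # illustrative counts
table = build_unigram_table(counts)
negatives = sample_negatives(table, "the", 5, random.Random(0))
print(table.count("the"), table.count("network"), table.count("ellipsis"))
```
]]></preformat>
        <p>Because drawing from the table is a uniform choice, the sampling cost does not depend on the size of the dictionary.</p>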
        <p>The Transformer neural network learns to
find various patterns and connections in the
training sample; these are used both during
training and in the further processing of
natural language texts.</p>
        <p>To train the Transformer neural network,
the following are used:
• pre-prepared sentences (from the
educational content of the online course,
which is supported by an informational
educational system with elements of
intellectualization).
• generated vector representation for
each known token.</p>
        <p>The Transformer model in neural networks
has, in particular, the following characteristics:
• the “query” and “key” in the attention
layers have a dimension of 150 n, and the
“value” has a dimension of 15 n.
• the neural network has hidden layers.
• the dimension of the hidden layers is 2<sup>k</sup>
n (6≤k≤9).
• n is the number of tokens in the input
sentence.
• n is not a constant value; it can change
dynamically.
• dropout layers are used, which during
training turn off each neuron with a
probability of 0.1, so that the quality of
natural language text processing does not
depend on any individual (possibly
redundant) neuron.</p>
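        <p>The behaviour of a dropout layer with probability 0.1 can be sketched as follows. The rescaling of the surviving values by 1/(1 - p) ("inverted dropout") is the common convention and an assumption here; the text does not specify it.</p>
        <preformat><![CDATA[
```python
import random


def dropout(values, p=0.1, rng=None, training=True):
    """Zero each value with probability p during training; rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # dropout is inactive outside training
    rng = rng or random.Random()
    keep = 1.0 - p
    return [value / keep if rng.random() >= p else 0.0 for value in values]


activations = [1.0] * 1000
dropped = dropout(activations, p=0.1, rng=random.Random(0))
print(dropped.count(0.0))  # close to 100 of the 1000 values are switched off
```
]]></preformat>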
        <p>In the informational educational system
with elements of intellectualization, training
of the Transformer neural network that is used
in the processing of natural language texts is
controlled by means of the following actions:
• Training with a given number of
iterations.
• Testing the model (the Transformer neural
network) on the training set (training
sample).
• Generation of the output text from the
entered (input) text, taking into account
the existing context.
• Saving the trained model in the
knowledge base of the informational
educational system with elements of
intellectualization.
• Using the trained model for:
– simplification of the text
– text normalization
– text standardization.
• Expansion (if necessary) of the system
dictionary.
• Expansion (if necessary) of the library of
contexts.</p>
        <p>After training a neural network, an
informational learning system with elements
of intellectualization that uses it can:
• Recognize the contexts that the neural
network has learned.
• Generate new sentences.
• Use different levels of abstraction to
recognize certain patterns and relationships
that help generate new sentences.</p>
        <p>4. Identifying Entities in Text</p>
        <p>
          An important element of natural language text
analysis is the identification of named entities in
the text (Named Entity Recognition, NER) [
          <xref ref-type="bibr" rid="ref27">38</xref>
          ]:
• Names of people.
• Geographical names.
• Names of academic disciplines of the
university.
• Monetary amounts.
• Professional terms (concepts) of a
certain subject area, etc.
        </p>
        <p>To meet the unique needs of an information
training system with intellectualization
elements, it is necessary to adapt NER, for
example, to:
• Classes of learning tasks.
• Learning objectives.
• Subject areas of training courses.
• Language of presentation of educational
content.</p>
        <p>Modern solutions for identifying entities in
natural language texts use two technologies:
• Dictionary-based selection of named
entities (NER).
• Machine detection of NER entities.</p>
        <p>For example, the task is to identify the
entity “programming language” and
extract its name.</p>
        <p>If an information learning system with
intellectualization elements is being developed
for specialty 121 “Software Engineering” or
122 “Computer Science”, then:
• The list of programming languages is
finite.
• To determine the language in a request,
it is easier to use a special dictionary in
which the authors of training courses
describe all existing or studied
programming languages at the
university.</p>
        <p>Dictionaries also make it possible to define
synonyms of names (for example, “pluses” for
the programming language C++) and to
normalize the name found in the request in
order to form a query to the database of the
information learning system with elements of
intellectualization.</p>
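        <p>Dictionary-based selection with synonyms can be sketched in a few lines. The dictionary entries and the function name find_languages are hypothetical; in the system described here the dictionary would be filled in by the authors of the training courses.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical dictionary: surface form (lower case) -> normalized name.
LANGUAGE_DICTIONARY = {
    "c++": "C++",
    "pluses": "C++",   # colloquial synonym mentioned in the text
    "python": "Python",
    "java": "Java",
}


def find_languages(request: str) -> list[str]:
    """Return normalized names of the programming languages in a request."""
    found = []
    # Tokens may contain "+" and "#" so that names such as "c++" survive.
    for token in re.findall(r"[a-z+#]+", request.lower()):
        name = LANGUAGE_DICTIONARY.get(token)
        if name and name not in found:
            found.append(name)
    return found


print(find_languages("How do pointers work in pluses and Python?"))
```
]]></preformat>
        <p>The normalized names returned by the lookup can be used directly in a database query, whichever synonym the student typed.</p>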
        <p>For the NER task, in addition to the neural
network, a rule-based approach is also used,
which relies on the rules of the language in
which students communicate with an
information learning system with elements of
intellectualization.</p>
        <p>
          In addition, several libraries are combined in
one convenient API, which allows solving basic
NLP problems for the Ukrainian (English)
language, for example [
          <xref ref-type="bibr" rid="ref28 ref29">39, 40</xref>
          ]:
• Tokenization.
• Sentence segmentation.
• Word embedding.
• Morphological tagging.
• Lemmatization.
• Normalization of phrases.
• Parsing.
• NER tagging.
• Extraction of facts.
        </p>
        <p>Based on modern technologies for
identifying data in natural language texts of
various kinds, modern solutions are
implemented for a wide range of problems
solved using an information learning system
with elements of intellectualization, in
particular:
• Registration of a student’s request (for
example, to a course database and/or
teacher), a student’s response, and a
description of the solution to an
individual assignment.
• Highlighting entities in a student’s
request, in the answer to a question, and
in a description of the solution to an
individual task.
• Prompt extraction of educational
content (at the student’s request) from
the database of additional educational
material of the course (including from
the Internet via links).
• Comparison of documents (for example,
different versions of the same answer or
of a piece of educational content).
• Classification of texts (classification of
requests, answers, and appeals, determining
the importance of a request and searching
for a ready-made answer, counseling
students, etc.).</p>
        <p>5. Understanding the Text</p>
        <p>
          Technologies and systems for understanding
natural language texts (Natural Language
Understanding, NLU) link together entities
(concepts, terms, keywords, etc.) that appear
in the text with specific relationships [
          <xref ref-type="bibr" rid="ref30">41</xref>
          ].
        </p>
        <p>Entities and connections between them form
a unified system of what is described in the text.</p>
        <p>This is the so-called “machine
understanding” of the text, which makes it
possible, for example, to give correct answers
to questions related to this description.</p>
        <p>Important for NLU is the resolution of
frequently occurring linguistic constructions,
such as:
• Ellipsis: omitted words have to be
restored (for example, in the sentence
“The speed of program execution on
computer P1 increased by 20%, and on P2
by 30%,” the words “computer” and
“increased” are missing).
• Anaphora: a pronoun replaces a noun.
• Homonymy: a word has several
interpretations (for example, the word
“course” can refer to different subject
areas):
– “finance” (exchange rate).
– “navigation/shipping” (ship’s course
at sea).
– “training” (course of educational
material, course of study).</p>
        <p>Processing natural language text entities
involves turning data into knowledge.</p>
        <p>Extracting entities in this case involves the
use of an ontological approach.</p>
        <p>This means that during the analysis of
natural language text, a semantic hierarchy of
concepts is built.</p>
        <p>Thus, an important goal of processing texts
in natural language is not just training the
appropriate neural network, but also finding
certain regularities present in these texts and
connections (relationships) between and
within concepts (terms, entities).</p>
        <p>After training, the neural network (a natural
language text processing model) sometimes
even finds ways to further simplify the
training data (the output sentences).</p>
        <p>That is, after training, the neural network
can find flaws in natural language text
processing when filling a text database (a
database of the textual educational content of
an informational learning system or an
informational learning system with elements
of intellectualization).</p>
        <p>It is also possible to achieve a higher level of
abstraction of natural language text generation
by feeding the trained neural network a text
that it has already simplified.</p>
        <p>It should be noted that the main context is
almost unchanged with such processing.</p>
        <p>However, if this chain of simplification is
continued, the neural network gradually loses
the main context and arrives at one of the
following outcomes:
• looped simplification (the simplification
stops, i.e. the text no longer changes).
• generation of texts with random words.</p>
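        <p>The first outcome, looped simplification, corresponds to reaching a fixed point and can be detected as sketched below. The trained network is replaced by a toy stand-in, drop_parenthetical, which removes one parenthesized clarification per call; all names are hypothetical.</p>
        <preformat><![CDATA[
```python
import re


def simplify_chain(text, simplify, max_rounds=10):
    """Feed the output back as input until the text stops changing."""
    history = [text]
    for _ in range(max_rounds):
        simplified = simplify(history[-1])
        if simplified == history[-1]:
            return history, True  # looped: no further changes
        history.append(simplified)
    return history, False


def drop_parenthetical(text):
    """Toy stand-in for the trained network: drop one "(...)" clarification."""
    return re.sub(r"\s*\([^()]*\)", "", text, count=1)


history, looped = simplify_chain(
    "This science (meaning AI) is related to neuroscience (broadly).",
    drop_parenthetical,
)
print(history[-1], looped)
```
]]></preformat>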
        <p>During the training of the Transformer
neural network, a pattern was observed
regarding the dimensionality of the multi-head
attention layer.</p>
        <p>The weights have dimension n×l, where n is
the size of the input sequence (a dynamic
parameter) and l is a fixed parameter of the
linear transformation.</p>
        <p>The parameter l affects the number of
so-called “found” connections, and this number
depends on the size of the embedding.</p>
        <p>The following patterns are distinguished:
• If l is larger than the size of the
embedding, then the neural network
(natural language text processing
model) will “invent” new connections.
• If l is smaller than the embedding size,
then the neural network (natural
language text processing model) may
miss certain connections.
• If l is equal to the embedding size, then the
trained neural network performs best.</p>
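        <p>The role of the parameter l can be illustrated with a single attention head written in plain Python. The sketch assumes that the n×l matrices are the query, key, and value matrices obtained by a linear transformation of the n embeddings; the random weights, the sizes, and the function names are illustrative only.</p>
        <preformat><![CDATA[
```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(embeddings, l, rng):
    """Linear transformation from embedding size d to size l (n x d -> n x l)."""
    d = len(embeddings[0])
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(l)]
               for _ in range(d)]
    return [[sum(x * weights[i][j] for i, x in enumerate(row)) for j in range(l)]
            for row in embeddings]


def attention(q, k, v):
    """Scaled dot-product attention for one head; returns output and weights."""
    scale = math.sqrt(len(q[0]))
    outputs, weights = [], []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        probs = softmax(scores)
        weights.append(probs)
        outputs.append([sum(p * v_row[c] for p, v_row in zip(probs, v))
                        for c in range(len(v[0]))])
    return outputs, weights


rng = random.Random(0)
n, d = 4, 8   # n tokens and embedding size d (illustrative values)
l = d         # the case reported above as performing best
embeddings = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
q, k, v = (project(embeddings, l, rng) for _ in range(3))
outputs, weights = attention(q, k, v)
print(len(outputs), len(outputs[0]))
```
]]></preformat>
        <p>Changing l changes only the width of the projected matrices; the n×n matrix of attention weights keeps its shape.</p>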
        <p>This pattern can be compared to the
resizing of an image when neural networks
learn to recognize an image.</p>
        <p>When the image is enlarged, its additional
pixels are formed from those next to it, and if it
is reduced, then certain pixels are removed.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>6. Conclusions</title>
      <p>A neural network that processes natural
language texts can also be used to solve other
problems, in particular:
• Simplification of documents.
• Unification of license agreements.
• Simplification of the texts of Internet
pages (opened in the browser), which
can be provided in educational courses,
as a link to additional material.
• Simplification of the texts of PDF
documents (opened in the browser),
which can be provided in training
courses as links to additional material.</p>
      <p>The approach to natural language text
processing proposed in the paper involves
normalization and simplification of the text.</p>
      <p>The Transformer model in a neural network
is used for this text processing, and good
results were obtained with it.</p>
      <p>A trained neural network can demonstrate
good results on:
• Sentences that it has already encountered
(“seen”).
• Sentences that are similar in context to
those that it already knows how to
process.
• Sentences with “damaged” (for example,
incompletely spelled or misspelled)
words.</p>
      <p>In the course of the work, various methods
of neural network optimization that prevent
exploding gradients, vanishing gradients,
overtraining, etc. were considered and
analyzed.</p>
      <p>Various natural language processing
methods were also considered.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romanovskyi</surname>
          </string-name>
          , et al.,
          <article-title>Prototyping Methodology of End-to-End Speech Analytics Software</article-title>
          ,
          <source>in: 4th International Workshop on Modern Machine Learning Technologies and Data Science</source>
          , vol.
          <volume>3312</volume>
          (
          <year>2022</year>
          )
          <fpage>76</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Iosifov</surname>
          </string-name>
          , et al.,
          <article-title>Transferability Evaluation of Speech Emotion Recognition Between Different Languages</article-title>
          ,
          <source>Advances in Computer Science for Engineering and Education</source>
          <volume>134</volume>
          (
          <year>2022</year>
          )
          <fpage>413</fpage>
          -
          <lpage>426</lpage>
          . doi: 10.1007/978-3-031-04812-8_35.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Iosifov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Iosifova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sokolov</surname>
          </string-name>
          ,
          <article-title>Sentence Segmentation from Unformatted Text using Language Modeling and Sequence Labeling Approaches</article-title>
          , in:
          <source>7th International Scientific and Practical Conference Problems of Infocommunications. Science and Technology</source>
          (
          <year>2020</year>
          )
          <fpage>335</fpage>
          -
          <lpage>337</lpage>
          . doi: 10.1109/PICST51311.2020.9468084.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Iosifov</surname>
          </string-name>
          , et al.,
          <article-title>Natural Language Technology to Ensure the Safety of Speech Information</article-title>
          , in:
          <source>Workshop on Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3187</volume>
          , no.
          <issue>1</issue>
          (
          <year>2022</year>
          )
          <fpage>216</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Iosifova</surname>
          </string-name>
          , et al.,
          <article-title>Analysis of Automatic Speech Recognition Methods</article-title>
          , in:
          <source>Workshop on Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>2923</volume>
          (
          <year>2021</year>
          )
          <fpage>252</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romanovskyi</surname>
          </string-name>
          , et al.,
          <article-title>Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition</article-title>
          ,
          <source>Advances in Computer Science for Engineering and Education IV</source>
          , vol.
          <volume>83</volume>
          (
          <year>2021</year>
          )
          <fpage>25</fpage>
          -
          <lpage>36</lpage>
          . doi: 10.1007/978-3-030-80472-5_3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O.</given-names>
            <surname>Iosifova</surname>
          </string-name>
          , et al.,
          <article-title>Techniques Comparison for Natural Language Processing</article-title>
          , in:
          <source>2nd International Workshop on Modern Machine Learning Technologies and Data Science</source>
          , vol.
          <volume>2631</volume>
          , no. I (
          <year>2020</year>
          )
          <fpage>57</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>An Introduction to Transformer Models in Neural Networks and Machine Learning</article-title>
          (
          <year>2023</year>
          ). URL: https://www.linkedin.com/pulse/introduction-transformer-models-neural-networks-machine-learning
        </mixed-citation>
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bagui</surname>
          </string-name>
          , et al.,
          <article-title>Machine Learning and Deep Learning for Phishing Email Classification using One-Hot Encoding</article-title>
          ,
          <source>J. Comput. Sci.</source>
          <volume>7</volume>
          (
          <issue>17</issue>
          ) (
          <year>2021</year>
          )
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          . doi: 10.3844/JCSSP.2021.610.623.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <source>OpenAI ChatGPT</source>
          . URL: https://chat.openai.com/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <source>OpenAI DALL-E. Generating Digital Images from Natural Language Descriptions</source>
          . URL: https://openai.com/dall-e-2
        </mixed-citation>
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Church</surname>
          </string-name>
          ,
          <article-title>Word2Vec</article-title>
          ,
          <source>Nat. Lang. Eng.</source>
          <volume>1</volume>
          (
          <issue>23</issue>
          ) (
          <year>2016</year>
          )
          <fpage>155</fpage>
          -
          <lpage>162</lpage>
          . doi: 10.1017/S1351324916000334.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>How DALL-E 2 Actually Works</article-title>
          . URL: https://www.assemblyai.com/blog/how-dall-e-2-actually-works/
        </mixed-citation>
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pickard</surname>
          </string-name>
          ,
          <article-title>Comparing word2vec and GloVe for Automatic Measurement of MWE Compositionality</article-title>
          , in:
          <source>Joint Workshop on Multiword Expressions and Electronic Lexicon</source>
          (
          <year>2020</year>
          )
          <fpage>95</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <source>Grammarly. Generative AI assistance for Writing</source>
          . URL: https://www.grammarly.com/
        </mixed-citation>
        <mixed-citation>
          [26]
          <article-title>Continuous Bag of Words (CBOW) in NLP</article-title>
          (
          <year>2023</year>
          ). URL: https://www.geeksforgeeks.org/continuous-bag-of-words-cbow-in-nlp/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <source>Rytr. AI writing assistant</source>
          . URL: https://rytr.me
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <article-title>How ChatGPT actually works</article-title>
          . URL: https://www.assemblyai.com/blog/how-chatgpt-actually-works/
        </mixed-citation>
        <mixed-citation>
          [27]
          <article-title>Skip-Gram: NLP Context Words Prediction Algorithm</article-title>
          (
          <year>2019</year>
          ). URL: https://towardsdatascience.com/skip-gram-nlp-context-words-prediction-algorithm-5bbf34f84e0c
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>Supervised Fine-tuning: customizing LLMs</article-title>
          . URL: https://medium.com/mantisnlp/supervised-fine-tuning-customizing-llms-a2c1edbf22c3
        </mixed-citation>
        <mixed-citation>
          [28]
          <article-title>Negative Sampling vs Hierarchical Softmax</article-title>
          (
          <year>2021</year>
          ). URL: https://flavien-vidal.medium.com/negative-sampling-vs-hierarchical-softmax-462d063dfca4
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pitis</surname>
          </string-name>
          ,
          <article-title>Failure Modes of Learning Reward Models for LLMs and other Sequence Models</article-title>
          , in:
          <source>The Many Facets of Preference-Based Learning</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Data</surname>
            <given-names>Right</given-names>
          </string-name>
          :
          <article-title>Building a Successful Dataset GloVe: Global Vectors for Word From the Ground Up</article-title>
          . URL: Representation, Conference on Empihttps://www.grammarly.
          <article-title>com/blog/eng rical Methods Natural Language ineering/high-quality-nlp-datasets/</article-title>
          <string-name>
            <surname>Processing</surname>
          </string-name>
          , Association for Compu-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>Experimenting with GECToR: Research tational Linguistics (</article-title>
          <year>2014</year>
          )
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . into Ensembling and Knowledge doi:
          <volume>10</volume>
          .3115/v1/
          <fpage>D14</fpage>
          -1162.
          <article-title>Distillation for Large Sequence Taggers</article-title>
          . [
          <volume>30</volume>
          ]
          <string-name>
            <given-names>Long</given-names>
            <surname>Short-Term Memory Network</surname>
          </string-name>
          <string-name>
            <surname>URL</surname>
          </string-name>
          : https://www.grammarly.com/ (
          <year>2021</year>
          ). URL: https://www. blog/engineering/experimenting-with
          <article-title>- sciencedirect.com/topics/computergector/ science/long-short-term-memory-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] CLIP:
          <article-title>Connecting text and images</article-title>
          . URL: network https://openai.com/research/clip [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , L. Dong,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          , Long Short-
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <article-title>ChatGPT Architecture: Will ChatGPT Term Memory-Networks for Machine Replace Search Engine</article-title>
          ? URL: Reading, Conference on Empirical https://opchatgpt.com/chatgpt
          <article-title>- Methods In Natural Language Procearchitecture-will-chatgpt-replace- ssing, Association for Computational search-engine/ Linguistics (</article-title>
          <year>2016</year>
          )
          <fpage>551</fpage>
          -
          <lpage>561</lpage>
          . doi:
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghannay</surname>
          </string-name>
          , et al.,
          <source>Word Embedding 10.18653/v1/D16-1053. Evaluation and Combination</source>
          , Tenth [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pasa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sperduti</surname>
          </string-name>
          , Pre-training
          <source>of International Conference on Language Recurrent Neural Networks via Linear. Resources and Evaluation (LREC'16)</source>
          , Autoencoders, 27th
          <source>International European Language Resources Conference Neural Information Association (ELRA)</source>
          (
          <year>2016</year>
          )
          <fpage>300</fpage>
          -
          <lpage>305</lpage>
          . Processing System (
          <year>2015</year>
          )
          <fpage>3572</fpage>
          -
          <lpage>3580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <article-title>One-Hot Encoding in NLP</article-title>
          . URL: [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Takezawa</surname>
          </string-name>
          , How to Implement https://www.geeksforgeeks.org/one- Seq2Seq
          <source>LSTM Model in Keras</source>
          (
          <year>2019</year>
          ).
          <article-title>hot-encoding-in-nlp/ URL: https://towardsdatascience.com/ how-to-implement-seq2seq-lstm-model -in-keras-shortcutnlp-6f355f3e5639</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , et al.,
          <article-title>Attention is All you Need</article-title>
          ,
          <source>31st International Conference on Neural Information Processing Systems</source>
          (
          <year>2017</year>
          )
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: a Method for Stochastic Optimization</article-title>
          , 3rd International Conference on Learning Representations (
          <year>2015</year>
          ). doi: 10.48550/arXiv.1412.6980.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          , et al.,
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          ,
          <source>29th IEEE Conference on Computer Vision and Pattern Recognition</source>
          (
          <year>2016</year>
          )
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi: 10.1109/cvpr.2016.90.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , et al.,
          <article-title>Dropout: a Simple Way to Prevent Neural Networks from Overfitting</article-title>
          ,
          <source>J. Mach. Learning Res</source>
          .
          <volume>15</volume>
          (
          <issue>1</issue>
          ) (
          <year>2014</year>
          )
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [38]
          <article-title>Named Entity Recognition: The Mechanism, Methods, Use Cases, and Implementation Tips</article-title>
          (
          <year>2023</year>
          ). URL: https://www.altexsoft.com/blog/named-entity-recognition/
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <source>Natural Language Processing</source>
          (
          <year>2018</year>
          ). URL: https://cseweb.ucsd.edu/~nnakashole/teaching/eisenstein-nov18.pdf
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tarwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edem</surname>
          </string-name>
          ,
          <article-title>Survey on Recurrent Neural Network in Natural Language Processing</article-title>
          ,
          <source>Int. J. Eng. Trends Technol.</source>
          <volume>48</volume>
          (
          <issue>6</issue>
          ) (
          <year>2017</year>
          )
          <fpage>301</fpage>
          -
          <lpage>304</lpage>
          . doi: 10.14445/22315381/IJETT-V48P253.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gillis</surname>
          </string-name>
          ,
          <source>Natural Language Understanding (NLU)</source>
          (
          <year>2023</year>
          ). URL: https://www.techtarget.com/searchenterpriseai/definition/natural-language-understanding-NLU
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>