<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Textual embeddings with word-type-weighted word2vec and graph neural networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Theodor Ladin</string-name>
          <email>theodor.lagin@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukáš Korel</string-name>
          <email>lukas.korel@fit.cvut.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <email>martin@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology</institution>
          ,
          <addr-line>CTU, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gymnázium Nad Štolou</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The increasing use of neural networks for semantic text analysis highlights the need for more efficient methods that do not compromise quality. We propose a lightweight approach combining traditional word embeddings with graph convolutional networks (GCNs) to improve sentence similarity recognition. By incorporating syntactic information, such as parts of speech and grammatical functions, our method reduces computational demands at least 2.5 times while maintaining accuracy that is statistically indistinguishable from that of larger models.</p>
      </abstract>
      <kwd-group>
        <kwd>text representation learning</kwd>
        <kwd>text embedding</kwd>
        <kwd>text preprocessing</kwd>
        <kwd>word2vec</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Artificial neural networks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] allow for the modeling of complex patterns and relationships within data.
However, training these models, especially large-scale architectures such as transformers, remains
a major challenge due to the immense computational resources required. This limitation becomes
particularly relevant when training models for smaller tasks where a high-performance infrastructure
may not be available.
      </p>
      <p>
        The present work focuses specifically on the task of identifying semantic equivalence between
sentences expressed using different wording. One of the most widely used methods in text-based
machine learning is Word2Vec [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which enables efficient modeling of semantic relationships between
words through vector representations. This technique has significantly improved the representation of
meaning in large-scale text corpora and has found applications in areas such as machine translation
and sentiment analysis.
      </p>
      <p>However, traditional Word2Vec models have certain limitations, especially in their ability to capture
semantic similarity at the sentence level. This challenge has been addressed by more recent models such
as BERT (Bidirectional Encoder Representations from Transformers), which extend the capabilities of
vector representations by incorporating a deeper contextual understanding. Despite their improved
accuracy, these complex architectures come with higher computational costs and longer training times.</p>
      <p>
        The focus of this study is the use of graph neural networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (GNNs), a class of deep learning models
designed for processing graph-structured data, i.e., data represented by means of nodes and edges.
GNNs are particularly well-suited for capturing intricate relationships between entities, making them
valuable in domains such as social networks, biological systems, and knowledge graphs. In natural
language processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], GNNs offer a promising way to model semantic relationships between words,
phrases, or entire sentences, capturing contextual dependencies more effectively. Unlike traditional
approaches such as recurrent neural networks (RNNs) or transformers, which focus on sequential data,
GNNs are capable of modeling complex structures and long-range dependencies more naturally.
      </p>
      <p>
        The objective of this research is to develop an efficient solution to the problem of semantic
similarity detection between sentences, with an emphasis on reducing computational requirements. To
achieve this, the study explores the use of graph matching structures and relational graph convolutional
networks. This differs from standard approaches in that it does not build upon an existing transformer [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], but
relies solely on a graph neural network to obtain the best possible results.
      </p>
      <p>
        This paper builds upon our previously published work on weighted word2vec [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background section</title>
      <sec id="sec-2-1">
        <title>2.1. Applicability of Sentence Embeddings</title>
        <p>Text embeddings transform sentences into high-dimensional vectors, typically consisting of hundreds
of dimensions, that capture their semantic meaning. These embeddings enable a range of downstream
applications, including:
• Text classification — such as sentiment analysis or topic assignment (e.g., sports, politics).
• Semantic similarity search — including paraphrase detection and more accurate information
retrieval.
• Text summarization — extracting key information for automated summaries.</p>
        <p>
          One commonly used embedding method is Word2Vec, which maps semantically similar words to
nearby points in the vector space. To compare meanings, similarity metrics such as cosine similarity [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
are used, where similar vectors yield a value close to 1. The word–vector mapping is bidirectional, allowing transitions
between words and their corresponding vectors. The final quality of the embeddings depends on various
hyperparameters used during the training process.
        </p>
        <p>
          For example, in training the Google News Vectors Negative 300 model, the Continuous Bag of
Words (CBOW) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] algorithm was used. This method selects context windows, containing words both
before and after the target word, and attempts to predict the target word based on these surrounding
words.
        </p>
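        <p>As a minimal illustration (assuming the gensim library and a local copy of the GoogleNews-vectors-negative300.bin file, neither of which is prescribed here), word vectors can be loaded and compared with cosine similarity as follows:</p>
        <preformat>
# Minimal sketch: load pre-trained Google News vectors and compare two words.
# Assumes gensim is installed and the binary model file is available locally.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def cosine(u, v):
    # Cosine similarity; values close to 1 indicate semantically similar words.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(kv["car"], kv["automobile"]))   # high similarity expected
print(kv.similarity("car", "banana"))        # gensim's built-in cosine similarity
        </preformat>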
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Bayesian Optimization</title>
        <p>
          Bayesian optimization [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is a technique for optimizing black-box functions by modeling the
probabilistic distribution of function values and selecting new evaluation points based on the expected
information gain. It is widely used for hyperparameter optimization, where the goal is to minimize
the number of function evaluations while exploring the parameter space efficiently. This approach
is especially valuable in the training of neural networks, where overfitting and memorization [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
can hinder generalization, and careful tuning of hyperparameters is crucial for performance. This
approach was especially useful when training the R-GCN network (Section 3.3), where it could affect
whether the network memorized the data or learned correctly. Bayesian optimization was therefore applied
to the dropout rate, weight decay, learning rate, and number of hidden dimensions.
        </p>
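        <p>The search itself is not tied to a particular library in this work; as one possible illustration, the sketch below uses scikit-optimize's gp_minimize over dropout, weight decay, learning rate, and the number of hidden dimensions, with a synthetic stand-in for the real validation run:</p>
        <preformat>
# Minimal sketch of Bayesian hyperparameter optimization with scikit-optimize.
# "train_and_validate" is a stand-in for a real training run; it returns a
# synthetic validation loss only so that the sketch is runnable on its own.
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [
    Real(0.0, 0.7, name="dropout"),
    Real(1e-6, 1e-2, prior="log-uniform", name="weight_decay"),
    Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(64, 512, name="hidden_dim"),
]

def train_and_validate(dropout, weight_decay, learning_rate, hidden_dim):
    return (dropout - 0.3) ** 2 + abs(learning_rate - 1e-3) + 1.0 / hidden_dim

def objective(params):
    dropout, weight_decay, learning_rate, hidden_dim = params
    return train_and_validate(dropout, weight_decay, learning_rate, hidden_dim)

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("Best hyperparameters found:", result.x)
        </preformat>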
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Graph Convolutional Networks</title>
        <p>Graph Convolutional Networks (GCNs) extend the concept of convolution from traditional
grid-structured data, such as images, to graph-structured data, enabling deep learning models to operate
in non-Euclidean domains. The core idea of GCNs is to aggregate information from a node’s local
neighborhood, allowing the network to learn representations that capture the features of the nodes and
the topology of the graph.</p>
        <p>The original GCN model has a limited receptive field that it can effectively utilize, with potential challenges
such as oversmoothing in deeper layers, vanishing gradients, and information bottlenecks. Several
extensions, such as graph attention networks (GAT), graph isomorphism networks (GIN), and relational
graph convolutional networks (R-GCN), address these limitations by introducing attention mechanisms
or more expressive aggregation functions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Text Preprocessing</title>
        <p>
          For basic natural language processing, the pre-trained spaCy model en_core_web_sm was employed.
The extracted tokens were annotated with parts of speech and syntactic dependencies, also obtained
using the spaCy model. During preprocessing, stop words (e.g., “and”, “is”, “in”) were removed, and the
remaining tokens were vectorized [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>The tokens were then transformed into a graph structure suitable for graph convolutional networks.
Syntactic dependencies were used to define the edges between nodes (tokens), while parts of speech
represented the node types (see Figure 2).</p>
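        <p>A minimal sketch of this preprocessing is shown below; it assumes spaCy with the en_core_web_sm model, and the helper name sentence_to_graph is illustrative:</p>
        <preformat>
# Minimal sketch: turn a sentence into nodes (tokens with POS tags) and
# dependency edges, skipping stop words, using spaCy's en_core_web_sm model.
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_to_graph(sentence):
    doc = nlp(sentence)
    tokens = [t for t in doc if not t.is_stop and not t.is_punct]
    nodes = [(t.i, t.text, t.pos_) for t in tokens]  # node types = POS tags
    # edge types = syntactic dependencies; the root's self-loop is excluded
    edges = [(t.head.i, t.i, t.dep_) for t in tokens if t.head.i != t.i]
    return nodes, edges

nodes, edges = sentence_to_graph("The quick brown fox jumps over the lazy dog.")
print(nodes)
print(edges)
        </preformat>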
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Graph Convolutional Network</title>
        <p>A graph convolutional network (GCN) model was constructed with an input layer of 300 dimensions and
two hidden layers consisting of 384 and 192 dimensions, respectively. These values were calculated on
smaller datasets through an iterative algorithmic approach and then validated on larger datasets.
The model also included an output layer of 300 dimensions, which aggregated the node vectors by
averaging to produce the final output representation. The activation function used was Leaky ReLU,
which, unlike the standard ReLU, allows small negative values to pass through.</p>
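        <p>A minimal sketch of a network with these dimensions is shown below. It assumes PyTorch Geometric and an illustrative class name; the paper specifies PyTorch but not the particular graph-learning modules used:</p>
        <preformat>
# Minimal sketch: GCN with a 300-dimensional input, hidden layers of 384 and
# 192 dimensions, a 300-dimensional output, Leaky ReLU activations, and mean
# aggregation of node vectors. Assumes PyTorch Geometric; not necessarily the
# exact implementation used in this work.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class SentenceGCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(300, 384)
        self.conv2 = GCNConv(384, 192)
        self.conv3 = GCNConv(192, 300)

    def forward(self, x, edge_index, batch):
        x = F.leaky_relu(self.conv1(x, edge_index))
        x = F.leaky_relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        return global_mean_pool(x, batch)  # average node vectors into one sentence vector
        </preformat>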
        <p>
          Training was performed using 50 % of the Quora Question Pairs (QQP) dataset. A further 20 % was used
as validation data and 30 % as test data. The final training process ran on an NVIDIA GeForce RTX
3070 Laptop GPU, aided by multiprocessing [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] techniques for dataset loading and efficient conversion
of individual sentences into graph structures.
        </p>
        <p>Bayesian optimization was applied to tune the hyperparameters during training.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Relational Graph Convolutional Network</title>
        <p>The relational graph convolutional network (R-GCN) differs from the standard GCN primarily in the
volume of information utilized. In this model, all available graph information was leveraged. Edge
types between nodes were defined based on syntactic dependencies. Each part of speech was assigned
its own weight matrix for nodes with the corresponding POS tag. This weight matrix was used for
information aggregation and influenced the output function (Equation 2):</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\alpha = \mathrm{SoftMax}\left(\frac{1}{T}\,\mathrm{LeakyReLU}\big(a_{\mathrm{dst}} \cdot (W h_{\mathrm{src}}) + b_{\mathrm{dst}} \cdot (W h_{\mathrm{dst}}),\, 0.1\big)\right)</tex-math>
        </disp-formula>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>h = \mathrm{Mean}\big(\alpha \cdot h_{\mathrm{src}}\big)</tex-math>
        </disp-formula>
        <p>where α represents the attention coefficients, T refers to the temperature that controls how much
the attention weights focus on the most relevant inputs, and h is the aggregated node representation
obtained as the mean of neighbor features h_src weighted by α. h_src and h_dst are the source and destination
node feature vectors, W is a learnable transformation matrix, and a_dst and b_dst are relation-specific
attention parameters applied to the transformed source and destination features, respectively. The
LeakyReLU activation introduces a small slope for negative inputs, while the SoftMax ensures that the
attention scores for each destination node sum to one over all of its neighbors.</p>
        <p>In the R-GCN, the message passing and aggregation functions were modified to incorporate separate
weight matrices for different types of relations and nodes.</p>
        <p>The dataset was split identically to the basic GCN setup. Bayesian optimization was also applied
here, with an increased dropout rate observed due to the larger number of parameters.</p>
        <p>The dimensionality of layers remained the same as in the original GCN model.</p>
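        <p>As an illustration of Equations (1) and (2), the following sketch computes the attention weights and the weighted mean aggregation for a single relation in plain PyTorch; the tensor names and shapes are assumptions, and this is a simplification rather than the full R-GCN implementation:</p>
        <preformat>
# Simplified sketch of the attention (Eq. 1) and mean aggregation (Eq. 2) for
# one relation type. h_src and h_dst are (num_neighbors, dim) matrices holding
# the source and destination node features of the edges pointing at one
# destination node (hypothetical names and shapes).
import torch
import torch.nn.functional as F

def relation_attention(h_src, h_dst, W, a_dst, b_dst, T=1.0):
    # Eq. (1): attention from transformed source and destination features
    scores = F.leaky_relu(
        (h_src @ W) @ a_dst + (h_dst @ W) @ b_dst, negative_slope=0.1
    ) / T
    alpha = torch.softmax(scores, dim=0)  # normalize over the neighbors
    # Eq. (2): aggregated representation as the weighted mean of the neighbors
    return (alpha.unsqueeze(-1) * h_src).mean(dim=0)

# Example with random tensors: 4 neighbors, 300-dimensional features.
d = 300
h_src, h_dst = torch.randn(4, d), torch.randn(4, d)
W, a_dst, b_dst = torch.randn(d, d), torch.randn(d), torch.randn(d)
print(relation_attention(h_src, h_dst, W, a_dst, b_dst, T=2.0).shape)  # torch.Size([300])
        </preformat>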
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Graph Matching Network Architecture</title>
      <p>Building upon the previously described relational graph convolutional network (R-GCN), we
developed a graph matching network (GMN) designed specifically for semantic similarity tasks between
sentence pairs. The GMN integrates the R-GCN as an internal component for encoding individual graph
representations, while introducing additional layers to effectively compare and match these encoded
graphs.</p>
      <sec id="sec-4-1">
        <title>4.1. Model Architecture</title>
        <p>The GMN consists of the following components:
• R-GCN Encoder: Each input sentence is first processed into a graph representation, which is
passed through the pre-established R-GCN model. This R-GCN encodes node and edge features
into a latent space, capturing complex relational and syntactic dependencies. The R-GCN layers
maintain the same dimensionality as previously defined, with internal transformations of node
features based on syntactic relations.
• Matching Layers: The encoded graph embeddings from the R-GCN are then fed into subsequent
fully connected layers with dimensionalities 256, 128, and 1, respectively. These layers serve
to learn a non-linear matching function that estimates the semantic similarity between the two
input graphs. The final layer outputs a scalar similarity score.
• Activation Functions and Regularization: Leaky ReLU activations are applied after each
hidden layer to maintain gradient flow and enable learning of complex decision boundaries.
Dropout is used to reduce overfitting, particularly important due to the relatively small size of
the final layers compared to the R-GCN encoder.</p>
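        <p>The matching head described above can be sketched as follows; it assumes PyTorch, concatenation of the two R-GCN sentence embeddings (the paper does not state how they are combined), and illustrative class and parameter names:</p>
        <preformat>
# Minimal sketch of the matching layers: two 300-dimensional sentence
# embeddings from the R-GCN encoder are combined and passed through fully
# connected layers of sizes 256, 128, and 1, with Leaky ReLU and dropout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingHead(nn.Module):
    def __init__(self, embed_dim=300, dropout=0.3):
        super().__init__()
        self.fc1 = nn.Linear(2 * embed_dim, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, emb_a, emb_b):
        x = torch.cat([emb_a, emb_b], dim=-1)      # combine the two sentence embeddings
        x = self.dropout(F.leaky_relu(self.fc1(x)))
        x = self.dropout(F.leaky_relu(self.fc2(x)))
        return self.fc3(x)                          # scalar similarity score (logit)

head = MatchingHead()
print(head(torch.randn(300), torch.randn(300)))
        </preformat>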
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training Procedure</title>
        <p>The GMN was trained end-to-end on the Quora Question Pairs (QQP) dataset, using a split of
50 % training, 20 % validation, and 30 % testing data. We employed Bayesian optimization to tune
hyperparameters such as the learning rate, dropout rate, and layer sizes. The loss function is based on
binary cross-entropy, reflecting the binary nature of semantic equivalence between sentence pairs.</p>
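        <p>A minimal sketch of one training step with binary cross-entropy is given below; the model, optimizer, and input names are placeholders rather than the exact implementation:</p>
        <preformat>
# Minimal sketch of a single training step with binary cross-entropy.
# "model" is assumed to produce a similarity logit for a sentence-pair input;
# "label" is 1 for semantically equivalent pairs and 0 otherwise.
import torch

def training_step(model, optimizer, graph_a, graph_b, label):
    criterion = torch.nn.BCEWithLogitsLoss()
    optimizer.zero_grad()
    logit = model(graph_a, graph_b)                       # scalar similarity score
    loss = criterion(logit.view(-1), label.view(-1).float())
    loss.backward()
    optimizer.step()
    return loss.item()
        </preformat>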
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Advantages of the Architecture</title>
        <p>By leveraging the R-GCN for rich graph-structured encoding and adding dedicated matching layers with
controlled dimensionality, the GMN effectively balances expressiveness and computational efficiency.
The dimensionality reduction from the R-GCN output to the matching layers (256, 128, 1) enables the
network to focus on the most salient features for the semantic similarity task, while reducing the risk
of overfitting.</p>
        <p>This hierarchical approach allows the GMN to capture both local relational patterns within sentences
and global matching patterns between sentence pairs, resulting in improved performance compared to
standalone R-GCN or traditional embedding comparison methods.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Employed Tools</title>
        <p>
          For tokenization and extracting sentence structures, the spaCy [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] library was used. It enables precise
identification of parts of speech and syntactic constituents, which is crucial for the approach used in
this work.
        </p>
        <p>
          The training of graph neural networks was conducted using PyTorch [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], which allowed for defining
complex GNN architectures and optimizing them effectively.
        </p>
        <p>To measure similarity, word vector representations from the Google News Word2Vec model were
employed, enabling comparison of this work’s model performance with existing embeddings such as
BERT, without attributing any accuracy increase to a better-trained Word2Vec model.</p>
        <p>Training was performed on the Quora Question Pairs (QQP) dataset, containing approximately
400,000 question pairs labeled for semantic similarity. The large size of this dataset helped to improve
the model accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Classification Performance</title>
        <p>
          The R-GCN and GCN results were compared with the work [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and a fine-tuned BERT model
optimized for sentence embeddings—specifically, the all-MiniLM-L12-v2 variant. All evaluations were
conducted on an independent test dataset balanced equally across classes (semantically equivalent vs.
nonequivalent sentence pairs). Performance metrics included accuracy, F1-score, and Area Under the
Curve (AUC) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], as shown in 1.
        </p>
        <p>
          Graph convolutional networks were compared against each other, the BERT model from the previous
test, and the weighted Word2Vec model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Again, all results were obtained from an independent test
dataset. Accuracy, F1-score, and AUC metrics are reported in Figure 4.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Statistical Analysis</title>
        <p>BERT and R-GCN showed similar performance, while GCN was clearly worse than both solutions; for
sentence pairs with differing ground truth, it produced a wider gap, so fewer sentences are marked as
similar when they are actually different, as seen in Figure 5. Testing was done on an NVIDIA GeForce RTX
3070 Laptop GPU, where the R-GCN model was 2.5 times faster than the BERT model while using the same
amount of memory.</p>
        <p>
          The significance of the differences between the embedding models was evaluated using the Friedman
test [17]. The null hypothesis that all three embedding methods perform equally was strongly rejected
with a p-value of 1.39 × 10⁻²⁹⁷. For post-hoc analysis, a Wilcoxon signed-rank test with a two-sided
alternative [18] was applied pairwise between embedding methods. We adopted the arguments of [19]
that, in machine learning, the Wilcoxon signed-rank test is more appropriate for this purpose than
the post-hoc tests presented in [20] and [21]. The Holm method [22] was used to correct for multiple
comparisons, yielding the following adjusted p-values [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]:
• BERT vs. W2V Mean: p = 4.01 × 10⁻¹⁵⁶
• BERT vs. W2V Weighted: p = 2.12 × 10⁻¹⁶
• W2V Mean vs. W2V Weighted: p = 1.46 × 10⁻¹⁸³
        </p>
        <p>Similarly, to compare the two types of graph convolutional networks and the BERT model, the Friedman
test strongly rejected the null hypothesis of equal performance with a significance of p = 4.71 × 10⁻¹⁴⁴.
For post-hoc analysis, a Wilcoxon signed-rank test with a two-sided alternative [18] was applied pairwise
between embedding methods. The Holm method [22] was used to correct for multiple comparisons,
yielding the following adjusted p-values:
• R-GCN vs. GCN: p = 9.43 × 10⁻¹⁰
• BERT vs. R-GCN: p = 2.12 × 10⁻⁴⁵
• GCN vs. BERT: p = 8.59 × 10⁻³⁸</p>
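        <p>The procedure can be illustrated with the following sketch, which assumes SciPy and statsmodels and uses synthetic per-example scores in place of the real model outputs:</p>
        <preformat>
# Minimal sketch of the statistical comparison: Friedman test across the three
# models, pairwise two-sided Wilcoxon signed-rank tests, and Holm correction.
# The score arrays are synthetic stand-ins for per-example model outputs.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
scores_bert = rng.normal(0.80, 0.05, size=1000)
scores_rgcn = rng.normal(0.79, 0.05, size=1000)
scores_gcn = rng.normal(0.70, 0.05, size=1000)

stat, p_friedman = friedmanchisquare(scores_bert, scores_rgcn, scores_gcn)
print("Friedman p-value:", p_friedman)

pairs = [
    ("R-GCN vs. GCN", scores_rgcn, scores_gcn),
    ("BERT vs. R-GCN", scores_bert, scores_rgcn),
    ("GCN vs. BERT", scores_gcn, scores_bert),
]
raw_p = [wilcoxon(a, b, alternative="two-sided").pvalue for _, a, b in pairs]
reject, adjusted_p, _, _ = multipletests(raw_p, method="holm")
for (name, _, _), p_adj in zip(pairs, adjusted_p):
    print(name, p_adj)
        </preformat>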
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The objective of this work was to develop models based on GNN capable of recognizing semantic
similarity between sentences, while maintaining a level of quality comparable to the widely used and
architecturally complex BERT model.</p>
      <p>Two graph neural network models were developed: a basic Graph Convolutional Network
(GCN) and a Relational Graph Convolutional Network (R-GCN). Their respective parameters were
compared against the BERT model. The basic GCN was unable to learn sufficient information to
outperform the simple Word2Vec baseline, demonstrating its limitations. In contrast, the R-GCN
effectively utilized all available syntactic information and was competitive with BERT in performance.</p>
      <p>The results obtained suggest the feasibility of constructing models based on vector processing of
entire paragraphs or even entire documents. Furthermore, these findings could be applied to enhance
the quality of generative transformers that incorporate graph convolutional functions in one of their
processing stages.</p>
      <p>Overall, this research demonstrates the effectiveness of graph neural networks for text-understanding
tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No.
SGS23/205/OHK3/3T/18 and by the German Research Foundation (DFG) funded project 467401796.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          , Artificial Intelligence:
          <article-title>A Modern Approach (4th Edition)</article-title>
          , Pearson,
          <year>2020</year>
          . URL: http://aima.cs.berkeley.edu/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Srinivasa-Desikan</surname>
          </string-name>
          ,
          <article-title>Natural Language Processing and Computational Linguistics: A Practical Guide to Text Analysis with Python, Gensim, spaCy, and Keras</article-title>
          , Expert Insight, Packt, Birmingham,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Introduction to Graph Neural Networks, Springer, Cham,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          , E. Loper, Natural Language Processing with Python,
          <source>O'Reilly</source>
          , Beijing,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Syncse: syntax graph-based contrastive learning of sentence embeddings</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>287</volume>
          (
          <year>2025</year>
          )
          128047. URL: https://www.sciencedirect.com/science/article/pii/S0957417425016689. doi:10.1016/j.eswa.2025.128047.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <article-title>Bert-enhanced text graph neural network for classification</article-title>
          ,
          <source>Entropy</source>
          <volume>23</volume>
          (
          <year>2021</year>
          ). URL: https://www.mdpi.com/1099-4300/23/11/1536. doi:10.3390/e23111536.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ladin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Korel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Holena</surname>
          </string-name>
          ,
          <article-title>Textual embeddings with word-type-weighted word2vec</article-title>
          ,
          <source>in: ITAT</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          . URL: https://ceur-ws.org/Vol-3792/paper4.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Gel'fand</surname>
          </string-name>
          , M. E. Saul, Trigonometry, Birkhäuser, Boston,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>OpenReview</source>
          (
          <year>2013</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          , Bayesian Optimization, Cambridge University Press, Cambridge, United Kingdom,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lin</surname>
          </string-name>
          , et al.,
          <article-title>On the over-memorization during natural, robust and catastrophic overfitting</article-title>
          ,
          <source>arXiv preprint arXiv:2310.08847</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Turkington</surname>
          </string-name>
          ,
          <article-title>Generalized vectorization, cross-products, and matrix calculus</article-title>
          , Cambridge University Press, New York, USA,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gorelick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ozsvald</surname>
          </string-name>
          ,
          <article-title>High Performance Python: Practical Performant Programming for Humans</article-title>
          , Second Edition, O'Reilly, Beijing,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Vasiliev</surname>
          </string-name>
          ,
          <article-title>Natural Language Processing with Python and spaCy: A Practical Introduction</article-title>
          , No Starch Press, San Francisco,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raschka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mirjalili</surname>
          </string-name>
          ,
          <article-title>Machine Learning with PyTorch and scikit-learn: Develop Machine Learning and Deep Learning Models with Python, Packt</article-title>
          , Birmingham,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <source>Pattern Recognition and Machine Learning</source>
          , Springer, New York,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Hollander, D. A. Wolfe, E. Chicken, Nonparametric Statistical Methods, 3rd ed., Wiley, Hoboken, New Jersey, 2013.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] E. L. Lehmann, H. J. M. D'Abrera, Nonparametrics: Statistical Methods Based on Ranks, revised 1st ed., Springer, New York, 2006.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Benavoli, G. Corani, F. Mangili, Should we really use post-hoc tests based on mean-ranks?, Journal of Machine Learning Research 17 (2016) 1–10. URL: http://jmlr.org/papers/v17/benavoli16a.html.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] S. Garcia, F. Herrera, An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] P. H. Westfall, R. D. Tobias, R. D. Wolfinger, Multiple Comparisons and Multiple Tests Using the SAS System, SAS Institute, Cary, North Carolina, 2011.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>