Textual embeddings with word-type-weighted word2vec

Theodor Ladin¹, Lukáš Korel² and Martin Holeňa²,³

¹ Gymnázium Nad Štolou, Prague, Czech Republic
² Faculty of Information Technology, CTU, Prague, Czech Republic
³ Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic

Abstract
The increasing use of artificial neural networks for knowledge processing often lacks precise knowledge representation. To address this issue, we propose using a word-type-weighted Word2Vec model to achieve more accurate representations of individual words within sentences. Our approach weights the vector embeddings of words based on part-of-speech predictions generated by the spaCy library. Experimental results demonstrate that, compared to simple Word2Vec, our model enhances the accuracy of recognizing the semantics of a sentence, while maintaining significantly lower computational requirements than large language models and various Transformer variants.

Keywords
text representation learning, text embedding, text preprocessing, word2vec

ITAT’24: Information Technologies – Applications and Theory, September 20–24, 2024, Javorna, Slovakia
theodor.lagin@gmail.com (T. Ladin); lukas.korel@fit.cvut.cz (L. Korel); martin@cs.cas.cz (M. Holeňa)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Recently, artificial intelligence (AI) and machine learning (ML) have proved to be extremely useful in most scientific fields [1, 2, 3]. Neural networks have been shown to be a very powerful tool in text analysis, predictive analytics, image recognition, and many other areas, but they lack in one respect: processing accessibility, with most neural networks for text analysis needing supercomputers for their training [4]. This creates a problem if we want to use a program with low processing cost to determine the semantic similarity of sentences. For such situations, we propose the solution explained in this paper.

The main objective of this research is to develop a lightweight algorithm for correctly predicting sentence similarity that relies on text representation only at the word level, i.e., word embeddings and part-of-speech (POS) information. By integrating a word-type-weighted Word2Vec (W2V) [5] model with POS tagging, our approach aims to provide a low-cost alternative to large text embedding models based on Transformers, which often require high-performance accelerators. In our test case with an i7-12650H processor and 2 × 16 GB DDR5 memory at 4800 MHz, we achieved approximately 170 sentences/s with a sentence transformer and 15 500 sentences/s with W2V.

The following section explains the concept of sentence embeddings and its applicability. Section 3 describes the methodology used to find the optimal weights and introduces the tools used in this task, the text preprocessing, and the weighting approach. Finally, Section 4 presents experimental results in comparison with other existing approaches.

2. Applicability of Sentence Embeddings

Textual embedding is a useful tool in natural language processing (NLP). It is a vector representation of text that helps to capture the meaning of sentences [6]. This makes it valuable for many tasks. For example, in text classification, such as sentiment analysis, sentence embeddings help determine whether a sentence is positive, negative, or neutral. They are also useful in topic classification, where they help to sort text into categories like sports, politics, or technology.

Sentence embeddings are naturally suitable for finding semantic similarities between sentences. They help in tasks such as paraphrase detection, where the goal is to find sentences with essentially the same meaning. Another important application is information retrieval, where sentence embeddings improve search results by finding documents that match a query more accurately [7]. They are also used in text summarization by picking out the most important sentences. Overall, sentence embeddings make working with text easier and more effective in many applications.
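As a small illustration of how such similarity comparisons are typically made (independent of the specific embedder introduced later in this paper), a paraphrase score can be obtained as the cosine similarity of two sentence vectors. The toy vectors below are placeholders standing in for real 300-dimensional sentence embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "sentence embeddings" standing in for real sentence vectors.
paraphrase_a = np.array([0.9, 0.1, 0.4, 0.2])
paraphrase_b = np.array([0.8, 0.2, 0.5, 0.1])
unrelated    = np.array([-0.3, 0.9, -0.2, 0.6])

print(cosine_similarity(paraphrase_a, paraphrase_b))  # high value: likely paraphrases
print(cosine_similarity(paraphrase_a, unrelated))     # low value: likely different meaning
```

The same similarity measure is used later in Section 3.4 when fitting the word-type weights.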
3. Methodology

3.1. Overview

This section outlines the methodology used to develop the word-type-weighted Word2Vec model, which we use to predict the semantic similarity of sentences. Our approach integrates word embeddings with part-of-speech information to improve accuracy without large processing costs.

3.2. Employed Tools

The corpus we used was the Microsoft Research Paraphrase Corpus [8]. It contains around 5 800 pairs of sentences. We trained the algorithm on the training set of this corpus and tested it on its test set.

We used the publicly available GoogleNews-vectors-negative300 [9] Word2Vec model; because this model is so widespread, the results are more objective and easier to compare. The model uses 300-dimensional vectors and has been trained on around 3 million different English words. Its size is around 1.6 GB.

We used the spaCy library [10] for part-of-speech (POS) tagging because of its efficiency and precision, which is crucial for fine-tuning the weights correctly.

3.3. Text Preprocessing

3.3.1. Standard Preprocessing

As the first step of preprocessing, we use the spaCy library to tag each word in a sentence, which also tokenizes the given sentence. SpaCy assigns the tags automatically, using a neural network. Then we delete all the symbols. After deleting the symbols, we apply a standard spell-checking algorithm to correct the mistakes created by deleting the symbols. After that, we employ our embedding algorithm.

This embedding algorithm starts by verifying that the word is not a stop word. If it passes this check, we check whether the word is present in our model. If the word is absent, we proceed to lemmatization and check again, followed by stemming and another check. If all of these steps are unsuccessful, we assign to the token an embedding based on its assigned tag. For instance, the embedding of John is assigned to every first name tagged as a proper noun, because embeddings for such names are missing from the model.
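A minimal sketch of this lookup cascade is given below. It assumes the GoogleNews vectors are loaded with gensim, that an English spaCy pipeline provides the tags, lemmas, and stop-word flags, and that NLTK's Porter stemmer handles the stemming step; the tag-based fallback table is illustrative (the text above only names John as the stand-in for proper nouns), and the spell-checking step is omitted.

```python
import spacy
from gensim.models import KeyedVectors
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")   # any English spaCy pipeline with a POS tagger
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
stemmer = PorterStemmer()

# Illustrative stand-in words per POS tag for tokens with no embedding of their own.
FALLBACK_BY_TAG = {"PROPN": "John"}

def token_embedding(token):
    """Return a word vector for a spaCy token, or None if the token is skipped."""
    if token.is_stop or not token.is_alpha:           # drop stop words and symbols
        return None
    for candidate in (token.text, token.lemma_, stemmer.stem(token.text)):
        if candidate in w2v:                          # surface form, then lemma, then stem
            return w2v[candidate]
    fallback = FALLBACK_BY_TAG.get(token.pos_)        # last resort: tag-based stand-in
    return w2v[fallback] if fallback and fallback in w2v else None

doc = nlp("Amrozi accused his brother of distorting the evidence.")
word_vectors = [v for v in (token_embedding(t) for t in doc) if v is not None]
```

How these word vectors are aggregated into a single sentence vector is described in Section 3.4.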
3.4. Weights

In this study, we consider a weight for each word type, denoted as w_wt, where wt is the index of the word type. For each w_wt, we assume that w_wt ∈ ℚ.

3.4.1. Text Preparation

We first divided the training text into two parts, the first containing 60 percent of the text and the second containing 40 percent. We then used our text preprocessor to vectorize these parts. Both parts were made up of pairs of sentences, where half of the pairs had the same meaning and half did not.

3.4.2. Initial Weight Optimization

Initially, we needed to make a sufficiently accurate guess close to the global minimum. To achieve this, we used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm to minimize the mean squared error [11]. We opted for this method because, when tested, it was shown to be the most accurate for this specific type of problem.

The BFGS algorithm is an iterative method for solving unconstrained nonlinear optimization problems. It belongs to the family of quasi-Newton methods, which are used to find local maxima or minima of functions. The key idea behind BFGS is to update an approximation of the Hessian matrix (or its inverse) at each iteration to improve the convergence rate.

The BFGS update formula for the inverse Hessian matrix H_{k+1} is given by:

H_{k+1} = \left( I - \frac{s_k y_k^T}{y_k^T s_k} \right) H_k \left( I - \frac{y_k s_k^T}{y_k^T s_k} \right) + \frac{s_k s_k^T}{y_k^T s_k}    (1)

where:

• H_k is the approximation of the inverse Hessian matrix at iteration k,
• s_k = x_{k+1} − x_k is the change in the vector of variables,
• y_k = ∇f(x_{k+1}) − ∇f(x_k) is the change in the gradient of the objective function,
• I is the identity matrix.

The BFGS algorithm uses this updating formula iteratively to improve the approximation of the inverse Hessian matrix, ultimately aiding in the efficient optimization of the objective function. By preconditioning the gradient, it determines the descent direction towards the local minimum for each weight. The error, or loss function, was computed as the difference between the target similarity, which could be either −1 or 1, and the cosine similarity between the embeddings. This process polarized the weights, making them highly effective as an initial guess. We also tried iterative weight adaptation without an initial guess, but it would have taken too many iterations to produce a meaningful estimate, and fewer iterations did not yield any results.
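In practice, this initialization can be run with an off-the-shelf optimizer. The sketch below is an illustration under stated assumptions rather than the exact implementation: it treats the sentence embedding as a plain POS-weighted sum of word vectors and uses scipy.optimize.minimize with method="BFGS" to fit one weight per POS tag by minimizing the mean squared error between the cosine similarity of a sentence pair and its ±1 target.

```python
import numpy as np
from scipy.optimize import minimize

POS_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ",
            "NOUN", "NUM", "PART", "PRON", "PROPN", "SCONJ", "VERB"]

def sentence_vector(tagged_vectors, weights):
    """Weighted sum of word vectors; tagged_vectors is a list of (tag index, word vector)."""
    return sum(weights[tag] * np.asarray(vec) for tag, vec in tagged_vectors)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def mse_loss(weights, pairs, targets):
    """Mean squared error between cosine similarities and the +1/-1 similarity targets."""
    sims = [cosine(sentence_vector(a, weights), sentence_vector(b, weights))
            for a, b in pairs]
    return float(np.mean((np.asarray(sims) - np.asarray(targets)) ** 2))

def fit_initial_weights(pairs, targets):
    """pairs: preprocessed sentence pairs; targets: +1 (same meaning) or -1 labels."""
    x0 = np.ones(len(POS_TAGS))                  # neutral starting point (an assumption)
    result = minimize(mse_loss, x0, args=(pairs, targets), method="BFGS")
    return result.x                              # one weight per POS tag
```

The same minimize interface with method="Nelder-Mead" also covers the gradient-free refinement described in Section 3.4.4.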
Table 1: Example output from iterations for each word type

Word type                   Abbreviation   Weight (1st iteration)   Weight (2nd iteration)   Example word
Adjective                   ADJ             1.000                    1.000                   last
Adposition                  ADP             0.210                    0.238                   across
Adverb                      ADV             0.903                    1.066                   separately
Auxiliary                   AUX             0.415                    0.396                   would
Coordinating conjunction    CCONJ           0.020                    0.007                   either
Determiner                  DET             0.071                    0.080                   every
Interjection                INTJ            0.020                   −0.006                   oh
Noun                        NOUN           −6.150                   −6.651                   brother
Numeral                     NUM             3.470                    4.467                   five
Particle                    PART            0.095                    0.037                   nt
Pronoun                     PRON            0.085                    0.100                   somebody
Proper noun                 PROPN          −0.011                   −0.585                   Amrozi
Subordinating conjunction   SCONJ           0.119                    0.112                   since
Verb                        VERB            3.514                    4.204                   reported

Figure 1: Fitted Gaussian distribution of samples (box plots of the sampled weights for each POS tag; vertical axis: weight, horizontal axis: POS tag).

3.4.3. Gaussian Distribution

The BFGS method was quite dependent on the initial conditions, so we ran this optimization a number of times while changing which sentence pairs were supposed to be similar or not. Afterward, we fitted a Gaussian distribution to the resulting ratios between weights, because we considered the ratios more important than the finalized weights themselves. Figure 1 depicts the weight ratios obtained in Table 1. We normalized the overall distribution around zero.

3.4.4. Final Weights Optimization

Subsequently, we generated random samples from the obtained Gaussian distribution. These samples were generally similar (Figure 1), although there were a few exceptions, such as nouns, caused by the larger ratio differences.

Although some estimates were worse than others, all the differences could be rectified with the method we employed last. We refined the weights that differed through an iterative process, comparing them with weights derived from the Gaussian distribution until the differences were small enough. The refinement was achieved by minimizing the logistic loss using the Nelder-Mead method. The logistic loss was calculated based on the prediction accuracy.

The logistic loss for a binary classification problem, also known as log-loss or binary cross-entropy loss, is given by:

L(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \sigma(\mathbf{x}_i \cdot \mathbf{w}) + (1 - y_i) \log\left(1 - \sigma(\mathbf{x}_i \cdot \mathbf{w})\right) \right]    (2)

where:

• N is the number of samples,
• y_i is the true label (0 or 1) for the i-th sample,
• x_i is the feature vector for the i-th sample,
• w is the weight vector,
• σ(z) is the sigmoid function defined as σ(z) = 1 / (1 + e^(−z)).

The Nelder-Mead algorithm minimizes the logistic loss function by iteratively refining a simplex with n + 1 vertices in an n-dimensional space [12]. The Nelder-Mead method is particularly effective for optimizing the logistic loss in logistic regression, especially in cases where the gradient is unavailable or the function is non-smooth. Through successive adjustments of the simplex vertices via reflection, expansion, contraction, and shrinkage, the algorithm steadily progresses toward the minimum of the logistic loss function.

3.4.5. Embedding Correction

Embeddings were too dependent on the length of their sentences. We therefore created a gradient-based corrector that modifies the embedding: it adds a corrector weight multiplied by the count of tokens in the sentence. We chose the additive weighted token count because, after many tests with different corrections, such as modification by the counts of particular word types or multiplication with a weighted token count, it was shown to be the most differentiating factor between different sentences.
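To make the final embedder concrete, the sketch below combines the per-tag weights reported later in Table 3 with the token-count corrector w_ec. It reflects one plausible reading of Sections 3.3 to 3.4.5 rather than the exact implementation: in particular, the use of a weighted sum as the aggregation and the uniform addition of w_ec times the token count to every component of the sentence vector are assumptions.

```python
import numpy as np

# Final per-tag weights reported in Table 3; other tags default to 0.060,
# and w_ec is the token-count corrector.
POS_WEIGHTS = {"ADJ": -1.330, "ADP": 0.341, "ADV": -0.616, "AUX": -0.334,
               "CCONJ": 0.126, "DET": 0.308, "INTJ": -0.143, "NOUN": 4.970,
               "NUM": -2.829, "PART": -0.396, "PRON": -0.060, "PROPN": 0.068,
               "SCONJ": -0.011, "VERB": -2.656}
DEFAULT_WEIGHT = 0.060
W_EC = -0.028

def weighted_sentence_embedding(tagged_vectors):
    """tagged_vectors: list of (POS tag, word vector) pairs for one sentence."""
    dim = len(tagged_vectors[0][1])
    sentence_vec = np.zeros(dim)
    for tag, vec in tagged_vectors:                      # POS-weighted aggregation
        sentence_vec += POS_WEIGHTS.get(tag, DEFAULT_WEIGHT) * np.asarray(vec)
    return sentence_vec + W_EC * len(tagged_vectors)     # additive length correction
```

Two sentence embeddings produced this way are then compared with the cosine similarity, as in the experiments of Section 4.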
3.5. Full Experimental Setup

Table 2 summarizes the full experimental setup of the methodology.

Table 2: Summary of experimental setup

Category                     Our choice
Dataset                      Microsoft Research Paraphrase Corpus
Minimization algorithms      BFGS, Nelder-Mead
Error functions              Logistic loss, mean squared error
Assumed distribution         Gaussian distribution
Evaluation metrics           Accuracy, F1-score, AUC
Number of executions         10
Training-testing set ratio   60 % : 40 %

4. Results

4.1. Final Weights

The resulting final weights were in some cases negative, with nouns being strongly positive. Adjectives, nouns, numerals, and verbs had the largest weights in absolute value, while other parts of speech, for instance determiners or adpositions, had weights close to zero. This most likely happened because these parts of speech have such a large impact on sentences. The final weights are shown in Table 3.

Table 3: Example of final word-type weights. The other types were equal to 0.060, but this value has almost no effect on the results because the other types occur very rarely. The final token-based embedding corrector is w_ec = −0.028.

Word type abbreviation   Weight
ADJ                      −1.330
ADP                       0.341
ADV                      −0.616
AUX                      −0.334
CCONJ                     0.126
DET                       0.308
INTJ                     −0.143
NOUN                      4.970
NUM                      −2.829
PART                     −0.396
PRON                     −0.060
PROPN                     0.068
SCONJ                    −0.011
VERB                     −2.656

4.2. Classification

We compared our approach with BERT (Bidirectional Encoder Representations from Transformers) [13] fine-tuned for sentence embeddings, namely all-MiniLM-L12-v2, which has good benchmark results¹, and with simple averaging of Word2Vec vectors without weighting. All results in this test have been obtained on the independent testing dataset. The testing dataset is balanced to contain the same number of records for each class (the same and different meanings). We used Accuracy, F1 score, and AUC [14] to measure all the statistics.

¹ Benchmark results of available sentence transformers: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html

Table 4: Results obtained on the balanced testing dataset. The best results have been achieved by BERT, which is based on a neural network trained on large amounts of data and requires high-power computing units to perform embedding fast. Compared with the simple Word2Vec approach, the word-type-weighted aggregation brings much better results for sentence embedding in all considered metrics.

Quality measure   Accuracy   F1 score   AUC
BERT              0.975767   0.975165   0.975767
W2Vmean           0.806442   0.837831   0.806442
W2Vweighted       0.933742   0.929273   0.933742

Figure 2: The distribution of cosine similarities from BERT, mean-aggregated Word2Vec, and our solution, grouped by ground-truth similarity (vertical axis: cosine similarity, horizontal axis: embedder, classes 0 and 1).

The results are presented in Table 4 and in the box plot in Figure 2. The word-type-weighted solution brings much better results than simple averaging. The weighted solution has a higher margin between similar and dissimilar sentences, although not as high as BERT. BERT's high performance is probably caused by its architecture, training data, and contextual processing of the whole input.

The differences between the considered embedders were tested for significance by the Friedman test. The basic null hypothesis that the results for all three embedders coincide was strongly rejected, with an achieved significance of p = 1.39 × 10^-297. For the post-hoc analysis, we employed the Wilcoxon signed rank test with the two-sided alternative for all pairs of the compared embedders, because of the inconsistency of the more common mean-ranks post-hoc test with the missing closed-world assumption in machine learning, as pointed out in [15]. For the correction for multiple hypotheses testing, we used the Holm method (a sketch of this testing procedure is given after the list), which yielded the following corrected results:

• BERT vs. W2Vmean: p = 4.01 × 10^-156
• BERT vs. W2Vweighted: p = 2.12 × 10^-16
• W2Vmean vs. W2Vweighted: p = 1.46 × 10^-183
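The significance testing described above maps directly onto standard library routines. The sketch below assumes one score per test sentence pair for each embedder and runs the Friedman test, the pairwise two-sided Wilcoxon signed-rank tests, and the Holm correction; the toy data at the end is only a placeholder for the real per-pair results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_embedders(scores):
    """scores: dict mapping embedder name -> array of per-pair scores on the same test pairs."""
    names = list(scores)
    # Global null hypothesis: the results of all embedders coincide.
    _, p_friedman = friedmanchisquare(*(scores[n] for n in names))
    print(f"Friedman test: p = {p_friedman:.3e}")

    # Post-hoc analysis: two-sided Wilcoxon signed-rank test for every pair of embedders.
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    raw_p = [wilcoxon(scores[a], scores[b], alternative="two-sided").pvalue
             for a, b in pairs]

    # Holm correction for multiple hypotheses testing.
    reject, p_corrected, _, _ = multipletests(raw_p, method="holm")
    for (a, b), p, rej in zip(pairs, p_corrected, reject):
        print(f"{a} vs. {b}: corrected p = {p:.3e}, reject H0: {rej}")

# Placeholder data; replace with the real per-pair scores of the three embedders.
rng = np.random.default_rng(0)
compare_embedders({"BERT": rng.normal(0.9, 0.05, 200),
                   "W2Vmean": rng.normal(0.6, 0.20, 200),
                   "W2Vweighted": rng.normal(0.8, 0.10, 200)})
```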
5. Conclusion

This paper introduces word-type-weighted Word2Vec for sentence embeddings. It is based on Word2Vec and aggregates the words of a given sentence into one numeric vector using the pre-trained weights. Our weighted Word2Vec embedder was compared on testing data with average aggregation and with BERT. The tested task was to recognize whether a given pair of sentences are paraphrases with the same meaning or sentences with different meanings. The complex neural network architecture of BERT outperformed our solution, while simple averaging without weighting had a much narrower gap between the target classes in our testing case. The advantage of our solution is that it uses the simple Word2Vec model.

In future research, we would like to extend our solution to embed whole paragraphs. We also want to consider other word-based embedders.

Acknowledgments

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS23/205/OHK3/3T/18, and by the German Research Foundation (DFG) funded project 467401796.

References

[1] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Oxford, 2006.
[2] T. M. Mitchell, Machine Learning, McGraw-Hill Series in Computer Science, international ed., McGraw-Hill, New York, 1997.
[3] S. Marsland, Machine Learning: An Algorithmic Perspective, Chapman & Hall/CRC Machine Learning & Pattern Recognition Series, second ed., Chapman & Hall/CRC, Boca Raton, FL, 2014.
[4] O. Suissa, A. Elmalech, M. Zhitomirsky-Geffet, Text analysis using deep neural networks in digital humanities and information science, Journal of the Association for Information Science and Technology 73 (2022) 268–287. URL: https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.24544. doi:10.1002/asi.24544.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[6] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad Khasmakhi, M. Asgari-Chenaghlu, J. Gao, Deep learning based text classification: A comprehensive review, 2020.
[7] M. Zhou, D. Liu, Y. Zheng, Q. Zhu, P. Guo, A text sentiment classification model using double word embedding methods, Multimedia Tools and Applications 81 (2022) 18993–19012. URL: https://doi.org/10.1007/s11042-020-09846-x. doi:10.1007/s11042-020-09846-x.
[8] W. B. Dolan, C. Brockett, Microsoft Research Paraphrase Corpus, Microsoft Research, 2005. URL: https://www.microsoft.com/en-us/download/details.aspx?id=52398, accessed: August 13, 2024.
[9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013). URL: https://github.com/mmihaltz/word2vec-GoogleNews-vectors, accessed: August 13, 2024.
[10] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python (2020). doi:10.5281/zenodo.1212303.
[11] C. T. Kelley, Iterative Methods for Optimization, SIAM, 1999, pp. 71–86. URL: https://epubs.siam.org/doi/abs/10.1137/1.9781611970920.ch4. doi:10.1137/1.9781611970920.ch4.
[12] J. A. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (1965) 308–313. URL: https://academic.oup.com/comjnl/article/7/4/308/354237. doi:10.1093/comjnl/7.4.308.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[14] C. Ferri, J. Hernández-Orallo, R. Modroiu, Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation, Springer, 2009.
[15] A. Benavoli, G. Corani, F. Mangili, Should we really use post-hoc tests based on mean-ranks?, Journal of Machine Learning Research 17 (2016) 1–10. URL: http://jmlr.org/papers/v17/benavoli16a.html.