Textual embeddings with word-type-weighted word2vec

Theodor Ladin¹, Lukáš Korel² and Martin Holeňa²,³

¹ Gymnázium Nad Štolou, Prague, Czech Republic
² Faculty of Information Technology, CTU, Prague, Czech Republic
³ Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic

Abstract
The increasing use of artificial neural networks for knowledge processing often lacks precise knowledge representation. To address this issue, we propose using a word-type-weighted Word2Vec model to achieve more accurate representations of individual words within sentences. Our approach weights the vector embeddings of words based on part-of-speech predictions generated by the spaCy library. Experimental results demonstrate that, compared to simple Word2Vec, our model enhances the accuracy of recognizing the semantics of a sentence, while maintaining significantly lower computational requirements than large language models and various Transformer variants.

Keywords
text representation learning, text embedding, text preprocessing, word2vec

ITAT’24: Information Technologies – Applications and Theory, September 20–24, 2024, Javorna, Slovakia
theodor.lagin@gmail.com (T. Ladin); lukas.korel@fit.cvut.cz (L. Korel); martin@cs.cas.cz (M. Holeňa)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Recently, artificial intelligence (AI) and machine learning (ML) have proved to be extremely useful in most scientific fields [1, 2, 3]. Neural networks have been shown to be a very powerful tool in text analysis, predictive analytics, image recognition, and many other areas, but they lack in one respect: processing accessibility, with most neural networks for text analysis needing supercomputers for their training [4]. This creates a problem if we want to use a program with low processing cost to determine the semantic similarity of sentences. For such situations, we propose the solution explained in this paper.

The main objective of this research is to develop a lightweight algorithm for correctly predicting sentence similarity that relies on text representation only at the word level, i.e., word embeddings and part-of-speech (POS) information. By integrating a word-type-weighted Word2Vec (W2V) [5] model with POS tagging, our approach aims to provide a low-cost alternative to large text embedding models based on Transformers, which often require high-performance accelerators. In our test case with an i7-12650H processor and 2 × 16 GB DDR5 memory at 4800 MHz, we achieved approximately 170 sentences/s with a sentence transformer and 15 500 sentences/s with W2V.

The following section explains the concept of sentence embeddings and its applicability. Section 3 describes the methodology used to find the optimal weights and introduces the tools used in this task, the text preprocessing, and the weighting approach. Finally, Section 4 presents experimental results in comparison with other existing approaches.

2. Applicability of Sentence Embeddings

Textual embedding is a useful tool in natural language processing (NLP). It is a vector representation of text that helps to capture the meaning of sentences [6]. This makes it valuable for many tasks. For example, in text classification, such as sentiment analysis, sentence embeddings help determine whether a sentence is positive, negative, or neutral. They are also useful in topic classification, where they help to sort text into categories like sports, politics, or technology.

Sentence embeddings are naturally suitable for finding semantic similarities between sentences. They help in tasks such as paraphrase detection, where the goal is to find sentences with essentially the same meaning. Another important application is information retrieval, where sentence embeddings improve search results by finding documents that match a query more accurately [7]. They are also used in text summarization by picking out the most important sentences. Overall, sentence embeddings make working with text easier and more effective in many applications.
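As a small illustration of how such similarity comparisons are typically made (independent of the specific embedder introduced later in this paper), a paraphrase score can be obtained as the cosine similarity of two sentence vectors. The toy vectors below are placeholders standing in for real 300-dimensional sentence embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "sentence embeddings" standing in for real sentence vectors.
paraphrase_a = np.array([0.9, 0.1, 0.4, 0.2])
paraphrase_b = np.array([0.8, 0.2, 0.5, 0.1])
unrelated    = np.array([-0.3, 0.9, -0.2, 0.6])

print(cosine_similarity(paraphrase_a, paraphrase_b))  # high value: likely paraphrases
print(cosine_similarity(paraphrase_a, unrelated))     # low value: likely different meaning
```

The same similarity measure is used later in Section 3.4 when fitting the word-type weights.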
3. Methodology

3.1. Overview

This section outlines the methodology used to develop the word-type-weighted Word2Vec model, which we use to predict the semantic similarity of sentences. Our approach integrates word embeddings with part-of-speech information to improve accuracy without large processing costs.

3.2. Employed Tools

The corpus we used was the Microsoft Research Paraphrase Corpus [8]. It contains around 5 800 pairs of sentences. We trained the algorithm on the training set of this corpus and tested it on its test set.

We used the publicly available GoogleNews-vectors-negative300 [9] Word2Vec model; because this model is so widespread, the results are more objective and easier to compare. The model uses 300-dimensional vectors and has been trained on around 3 million different English words. Its size is around 1.6 GB.

We used the spaCy library [10] for part-of-speech (POS) tagging because of its efficiency and precision, which is crucial for fine-tuning the weights correctly.

3.3. Text Preprocessing

3.3.1. Standard Preprocessing

As the first step of preprocessing, we use the spaCy library to tag each word in a sentence, which also tokenizes the given sentence. SpaCy assigns the tags automatically, using a neural network. Then we delete all the symbols. After deleting the symbols, we apply a standard spell-checking algorithm to correct the mistakes created by deleting the symbols. After that, we employ our embedding algorithm.

This embedding algorithm starts by verifying that the word is not a stop word. If it passes this check, we check whether the word is present in our model. If the word is absent, we proceed to lemmatization and check again, followed by stemming and another check. If all of these steps are unsuccessful, we assign to the token an embedding based on its assigned tag. For instance, the embedding of John is assigned to every first name tagged as a proper noun, because embeddings for such names are missing from the model.
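A minimal sketch of this lookup cascade is given below. It assumes the GoogleNews vectors are loaded with gensim, that an English spaCy pipeline provides the tags, lemmas, and stop-word flags, and that NLTK's Porter stemmer handles the stemming step; the tag-based fallback table is illustrative (the text above only names John as the stand-in for proper nouns), and the spell-checking step is omitted.

```python
import spacy
from gensim.models import KeyedVectors
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")   # any English spaCy pipeline with a POS tagger
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
stemmer = PorterStemmer()

# Illustrative stand-in words per POS tag for tokens with no embedding of their own.
FALLBACK_BY_TAG = {"PROPN": "John"}

def token_embedding(token):
    """Return a word vector for a spaCy token, or None if the token is skipped."""
    if token.is_stop or not token.is_alpha:           # drop stop words and symbols
        return None
    for candidate in (token.text, token.lemma_, stemmer.stem(token.text)):
        if candidate in w2v:                          # surface form, then lemma, then stem
            return w2v[candidate]
    fallback = FALLBACK_BY_TAG.get(token.pos_)        # last resort: tag-based stand-in
    return w2v[fallback] if fallback and fallback in w2v else None

doc = nlp("Amrozi accused his brother of distorting the evidence.")
word_vectors = [v for v in (token_embedding(t) for t in doc) if v is not None]
```

How these word vectors are aggregated into a single sentence vector is described in Section 3.4.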
3.4. Weights

In this study, we consider a weight for each word type, denoted as w_wt, where wt is the index of the word type. For each w_wt, we assume that w_wt ∈ ℚ.

3.4.1. Text Preparation

We first divided the training text into two parts, the first containing 60 percent of the text and the second containing 40 percent. We then used our text preprocessor to vectorize these parts. Both parts were made up of pairs of sentences, where half of the pairs had the same meaning and half did not.

3.4.2. Initial Weight Optimization

Initially, we needed to make a sufficiently accurate guess close to the global minimum. To achieve this, we used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm to minimize the mean squared error [11]. We opted for this method because, when tested, it was shown to be the most accurate for this specific type of problem.

The BFGS algorithm is an iterative method for solving unconstrained nonlinear optimization problems. It belongs to the family of quasi-Newton methods, which are used to find local maxima or minima of functions. The key idea behind BFGS is to update an approximation of the Hessian matrix (or its inverse) at each iteration to improve the convergence rate.

The BFGS update formula for the inverse Hessian matrix H_{k+1} is given by:

H_{k+1} = \left( I - \frac{s_k y_k^T}{y_k^T s_k} \right) H_k \left( I - \frac{y_k s_k^T}{y_k^T s_k} \right) + \frac{s_k s_k^T}{y_k^T s_k}    (1)

where:

• H_k is the approximation of the inverse Hessian matrix at iteration k,
• s_k = x_{k+1} − x_k is the change in the vector of variables,
• y_k = ∇f(x_{k+1}) − ∇f(x_k) is the change in the gradient of the objective function,
• I is the identity matrix.

The BFGS algorithm uses this updating formula iteratively to improve the approximation of the inverse Hessian matrix, ultimately aiding in the efficient optimization of the objective function. By preconditioning the gradient, it determines the descent direction towards the local minimum for each weight. The error, or loss function, was computed as the difference between the target similarity, which could be either −1 or 1, and the cosine similarity between the embeddings. This process polarized the weights, making them highly effective as an initial guess. We also tried iterative weight adaptation without an initial guess, but it would have taken too many iterations to produce a meaningful estimate, and fewer iterations did not yield any results.
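In practice, this initialization can be run with an off-the-shelf optimizer. The sketch below is an illustration under stated assumptions rather than the exact implementation: it treats the sentence embedding as a plain POS-weighted sum of word vectors and uses scipy.optimize.minimize with method="BFGS" to fit one weight per POS tag by minimizing the mean squared error between the cosine similarity of a sentence pair and its ±1 target.

```python
import numpy as np
from scipy.optimize import minimize

POS_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ",
            "NOUN", "NUM", "PART", "PRON", "PROPN", "SCONJ", "VERB"]

def sentence_vector(tagged_vectors, weights):
    """Weighted sum of word vectors; tagged_vectors is a list of (tag index, word vector)."""
    return sum(weights[tag] * np.asarray(vec) for tag, vec in tagged_vectors)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def mse_loss(weights, pairs, targets):
    """Mean squared error between cosine similarities and the +1/-1 similarity targets."""
    sims = [cosine(sentence_vector(a, weights), sentence_vector(b, weights))
            for a, b in pairs]
    return float(np.mean((np.asarray(sims) - np.asarray(targets)) ** 2))

def fit_initial_weights(pairs, targets):
    """pairs: preprocessed sentence pairs; targets: +1 (same meaning) or -1 labels."""
    x0 = np.ones(len(POS_TAGS))                  # neutral starting point (an assumption)
    result = minimize(mse_loss, x0, args=(pairs, targets), method="BFGS")
    return result.x                              # one weight per POS tag
```

The same minimize interface with method="Nelder-Mead" also covers the gradient-free refinement described in Section 3.4.4.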
Table 1: Example output from iterations for each word type

Word type                   Abbreviation   Weight (1st iteration)   Weight (2nd iteration)   Example word
Adjective                   ADJ             1.000                    1.000                   last
Adposition                  ADP             0.210                    0.238                   across
Adverb                      ADV             0.903                    1.066                   separately
Auxiliary                   AUX             0.415                    0.396                   would
Coordinating conjunction    CCONJ           0.020                    0.007                   either
Determiner                  DET             0.071                    0.080                   every
Interjection                INTJ            0.020                   −0.006                   oh
Noun                        NOUN           −6.150                   −6.651                   brother
Numeral                     NUM             3.470                    4.467                   five
Particle                    PART            0.095                    0.037                   nt
Pronoun                     PRON            0.085                    0.100                   somebody
Proper noun                 PROPN          −0.011                   −0.585                   Amrozi
Subordinating conjunction   SCONJ           0.119                    0.112                   since
Verb                        VERB            3.514                    4.204                   reported

Figure 1: Fitted Gaussian distribution of samples (box plots of the sampled weights for each POS tag; vertical axis: weight, horizontal axis: POS tag).

3.4.3. Gaussian Distribution

The BFGS method was quite dependent on the initial conditions, so we ran this optimization a number of times while changing which sentence pairs were supposed to be similar or not. Afterward, we fitted a Gaussian distribution to the resulting ratios between weights, because we considered the ratios more important than the finalized weights themselves. Figure 1 depicts the weight ratios obtained in Table 1. We normalized the overall distribution around zero.

3.4.4. Final Weights Optimization

Subsequently, we generated random samples from the obtained Gaussian distribution. These samples were generally similar (Figure 1), although there were a few exceptions, such as nouns, caused by the larger ratio differences.

Although some estimates were worse than others, all the differences could be rectified with the method we employed last. We refined the weights that differed through an iterative process, comparing them with weights derived from the Gaussian distribution until the differences were small enough. The refinement was achieved by minimizing the logistic loss using the Nelder-Mead method. The logistic loss was calculated based on the prediction accuracy.

The logistic loss for a binary classification problem, also known as log-loss or binary cross-entropy loss, is given by:

L(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \sigma(\mathbf{x}_i \cdot \mathbf{w}) + (1 - y_i) \log\left(1 - \sigma(\mathbf{x}_i \cdot \mathbf{w})\right) \right]    (2)

where:

• N is the number of samples,
• y_i is the true label (0 or 1) for the i-th sample,
• x_i is the feature vector for the i-th sample,
• w is the weight vector,
• σ(z) is the sigmoid function defined as σ(z) = 1 / (1 + e^(−z)).

The Nelder-Mead algorithm minimizes the logistic loss function by iteratively refining a simplex with n + 1 vertices in an n-dimensional space [12]. The Nelder-Mead method is particularly effective for optimizing the logistic loss in logistic regression, especially in cases where the gradient is unavailable or the function is non-smooth. Through successive adjustments of the simplex vertices via reflection, expansion, contraction, and shrinkage, the algorithm steadily progresses toward the minimum of the logistic loss function.

3.4.5. Embedding Correction

Embeddings were too dependent on the length of their sentences. We therefore created a gradient-based corrector that modifies the embedding: it adds a corrector weight multiplied by the count of tokens in the sentence. We chose the additive weighted token count because, after many tests with different corrections, such as modification by the counts of particular word types or multiplication with a weighted token count, it was shown to be the most differentiating factor between different sentences.
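To make the final embedder concrete, the sketch below combines the per-tag weights reported later in Table 3 with the token-count corrector w_ec. It reflects one plausible reading of Sections 3.3 to 3.4.5 rather than the exact implementation: in particular, the use of a weighted sum as the aggregation and the uniform addition of w_ec times the token count to every component of the sentence vector are assumptions.

```python
import numpy as np

# Final per-tag weights reported in Table 3; other tags default to 0.060,
# and w_ec is the token-count corrector.
POS_WEIGHTS = {"ADJ": -1.330, "ADP": 0.341, "ADV": -0.616, "AUX": -0.334,
               "CCONJ": 0.126, "DET": 0.308, "INTJ": -0.143, "NOUN": 4.970,
               "NUM": -2.829, "PART": -0.396, "PRON": -0.060, "PROPN": 0.068,
               "SCONJ": -0.011, "VERB": -2.656}
DEFAULT_WEIGHT = 0.060
W_EC = -0.028

def weighted_sentence_embedding(tagged_vectors):
    """tagged_vectors: list of (POS tag, word vector) pairs for one sentence."""
    dim = len(tagged_vectors[0][1])
    sentence_vec = np.zeros(dim)
    for tag, vec in tagged_vectors:                      # POS-weighted aggregation
        sentence_vec += POS_WEIGHTS.get(tag, DEFAULT_WEIGHT) * np.asarray(vec)
    return sentence_vec + W_EC * len(tagged_vectors)     # additive length correction
```

Two sentence embeddings produced this way are then compared with the cosine similarity, as in the experiments of Section 4.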
3.5. Full Experimental Setup

Table 2 summarizes the full experimental setup of the methodology.

Table 2: Summary of experimental setup

Category                     Our choice
Dataset                      Microsoft Research Paraphrase Corpus
Minimization algorithms      BFGS, Nelder-Mead
Error functions              Logistic loss, mean squared error
Assumed distribution         Gaussian distribution
Evaluation metrics           Accuracy, F1-score, AUC
Number of executions         10
Training-testing set ratio   60 % : 40 %

4. Results

4.1. Final Weights

The resulting final weights were in some cases negative, with nouns being strongly positive. Adjectives, nouns, numerals, and verbs had the largest weights in absolute value, while other parts of speech, for instance determiners or adpositions, had weights close to zero. This most likely happened because these parts of speech have such a large impact on sentences. The final weights are shown in Table 3.

Table 3: Example of final word-type weights. The other types were equal to 0.060, but this value has almost no effect on the results because the other types occur very rarely. The final token-based embedding corrector is w_ec = −0.028.

Word type abbreviation   Weight
ADJ                      −1.330
ADP                       0.341
ADV                      −0.616
AUX                      −0.334
CCONJ                     0.126
DET                       0.308
INTJ                     −0.143
NOUN                      4.970
NUM                      −2.829
PART                     −0.396
PRON                     −0.060
PROPN                     0.068
SCONJ                    −0.011
VERB                     −2.656

4.2. Classification

We compared our approach with BERT (Bidirectional Encoder Representations from Transformers) [13] fine-tuned for sentence embeddings, namely all-MiniLM-L12-v2, which has good benchmark results¹, and with simple averaging of Word2Vec vectors without weighting. All results in this test have been obtained on the independent testing dataset. The testing dataset is balanced to contain the same number of records for each class (the same and different meanings). We used Accuracy, F1 score, and AUC [14] to measure all the statistics.

¹ Benchmark results of available sentence transformers: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html

Table 4: Results obtained on the balanced testing dataset. The best results have been achieved by BERT, which is based on a neural network trained on large amounts of data and requires high-power computing units to perform embedding fast. Compared with the simple Word2Vec approach, the word-type-weighted aggregation brings much better results for sentence embedding in all considered metrics.

Quality measure   Accuracy   F1 score   AUC
BERT              0.975767   0.975165   0.975767
W2Vmean           0.806442   0.837831   0.806442
W2Vweighted       0.933742   0.929273   0.933742

Figure 2: The distribution of cosine similarities from BERT, mean-aggregated Word2Vec, and our solution, grouped by ground-truth similarity (vertical axis: cosine similarity, horizontal axis: embedder, classes 0 and 1).

The results are presented in Table 4 and in the box plot in Figure 2. The word-type-weighted solution brings much better results than simple averaging. The weighted solution has a higher margin between similar and dissimilar sentences, although not as high as BERT. BERT's high performance is probably caused by its architecture, training data, and contextual processing of the whole input.

The differences between the considered embedders were tested for significance by the Friedman test. The basic null hypothesis that the results for all three embedders coincide was strongly rejected, with an achieved significance of p = 1.39 × 10^-297. For the post-hoc analysis, we employed the Wilcoxon signed rank test with the two-sided alternative for all pairs of the compared embedders, because of the inconsistency of the more common mean-ranks post-hoc test with the missing closed-world assumption in machine learning, as pointed out in [15]. For the correction for multiple hypotheses testing, we used the Holm method (a sketch of this testing procedure is given after the list), which yielded the following corrected results:

• BERT vs. W2Vmean: p = 4.01 × 10^-156
• BERT vs. W2Vweighted: p = 2.12 × 10^-16
• W2Vmean vs. W2Vweighted: p = 1.46 × 10^-183
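The significance testing described above maps directly onto standard library routines. The sketch below assumes one score per test sentence pair for each embedder and runs the Friedman test, the pairwise two-sided Wilcoxon signed-rank tests, and the Holm correction; the toy data at the end is only a placeholder for the real per-pair results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_embedders(scores):
    """scores: dict mapping embedder name -> array of per-pair scores on the same test pairs."""
    names = list(scores)
    # Global null hypothesis: the results of all embedders coincide.
    _, p_friedman = friedmanchisquare(*(scores[n] for n in names))
    print(f"Friedman test: p = {p_friedman:.3e}")

    # Post-hoc analysis: two-sided Wilcoxon signed-rank test for every pair of embedders.
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    raw_p = [wilcoxon(scores[a], scores[b], alternative="two-sided").pvalue
             for a, b in pairs]

    # Holm correction for multiple hypotheses testing.
    reject, p_corrected, _, _ = multipletests(raw_p, method="holm")
    for (a, b), p, rej in zip(pairs, p_corrected, reject):
        print(f"{a} vs. {b}: corrected p = {p:.3e}, reject H0: {rej}")

# Placeholder data; replace with the real per-pair scores of the three embedders.
rng = np.random.default_rng(0)
compare_embedders({"BERT": rng.normal(0.9, 0.05, 200),
                   "W2Vmean": rng.normal(0.6, 0.20, 200),
                   "W2Vweighted": rng.normal(0.8, 0.10, 200)})
```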
5. Conclusion

This paper introduces word-type-weighted Word2Vec for sentence embeddings. It is based on Word2Vec and aggregates the words of a given sentence into one numeric vector using the pre-trained weights. Our weighted Word2Vec embedder was compared on testing data with average aggregation and with BERT. The tested task was to recognize whether a given pair of sentences are paraphrases with the same meaning or sentences with different meanings. The complex neural network architecture of BERT outperformed our solution, while simple averaging without weighting had a much narrower gap between the target classes in our testing case. The advantage of our solution is that it uses the simple Word2Vec model.

In future research, we would like to extend our solution to embed whole paragraphs. We also want to consider other word-based embedders.

Acknowledgments

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS23/205/OHK3/3T/18, and by the German Research Foundation (DFG) funded project 467401796.

References

[1] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, Oxford, 2006.
[2] T. M. Mitchell, Machine Learning, McGraw-Hill Series in Computer Science, international ed., McGraw-Hill, New York, 1997.
[3] S. Marsland, Machine Learning: An Algorithmic Perspective, Chapman & Hall/CRC Machine Learning & Pattern Recognition Series, second ed., Chapman & Hall/CRC, Boca Raton, FL, 2014.
[4] O. Suissa, A. Elmalech, M. Zhitomirsky-Geffet, Text analysis using deep neural networks in digital humanities and information science, Journal of the Association for Information Science and Technology 73 (2022) 268–287. URL: https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.24544. doi:10.1002/asi.24544.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[6] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad Khasmakhi, M. Asgari-Chenaghlu, J. Gao, Deep learning based text classification: A comprehensive review, 2020.
[7] M. Zhou, D. Liu, Y. Zheng, Q. Zhu, P. Guo, A text sentiment classification model using double word embedding methods, Multimedia Tools and Applications 81 (2022) 18993–19012. URL: https://doi.org/10.1007/s11042-020-09846-x. doi:10.1007/s11042-020-09846-x.
[8] W. B. Dolan, C. Brockett, Microsoft Research Paraphrase Corpus, Microsoft Research, 2005. URL: https://www.microsoft.com/en-us/download/details.aspx?id=52398, accessed: August 13, 2024.
[9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013). URL: https://github.com/mmihaltz/word2vec-GoogleNews-vectors, accessed: August 13, 2024.
[10] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python (2020). doi:10.5281/zenodo.1212303.
[11] C. T. Kelley, Iterative Methods for Optimization, SIAM, 1999, pp. 71–86. URL: https://epubs.siam.org/doi/abs/10.1137/1.9781611970920.ch4. doi:10.1137/1.9781611970920.ch4.
[12] J. A. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (1965) 308–313. URL: https://academic.oup.com/comjnl/article/7/4/308/354237. doi:10.1093/comjnl/7.4.308.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[14] C. Ferri, J. Hernández-Orallo, R. Modroiu, Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation, Springer, 2009.
[15] A. Benavoli, G. Corani, F. Mangili, Should we really use post-hoc tests based on mean-ranks?, Journal of Machine Learning Research 17 (2016) 1–10. URL: http://jmlr.org/papers/v17/benavoli16a.html.