Boosting Word Frequencies in Authorship
Attribution
Maciej Eder
Institute of Polish Language, Polish Academy of Sciences, al. Mickiewicza 31, 31-120 Kraków, Poland


                                      Abstract
                                      In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words in some ways semantically similar to the word in question. To determine such a semantic background, one of the word embedding models can be used. The proposed method outperforms classical most-frequent-word approaches substantially, usually by a few percentage points depending on the input settings.

                                      Keywords
                                      authorship attribution, stylometry, relative word frequencies, word vectors, semantic neighbors




1. Introduction
In a vast majority of text classification studies aimed at distinguishing a unique authorial signal – these include authorship attribution investigations, authorship profiling, verification, and similar tasks – relative frequencies of the most frequent words (MFWs) are routinely used as the language features to betray the authorial “fingerprint”. A vector of such relative word frequencies is then passed to one of the multidimensional machine-learning classification techniques, ranging from simple distance-based lazy learners, such as Delta [2, 5], to sophisticated deep learning neural network setups [6].
   Recent advances in machine learning methodology – unheard-of and unprecedented – have massively reshaped the field of text classification. Three main methodological directions are actively researched: firstly, new classifiers emerge on the horizon that clearly outperform classical solutions; secondly, feature engineering and dimensionality reduction techniques are introduced to overcome the curse of high dimensionality; and thirdly, alternative style-markers that can betray authorial idiosyncrasies are being introduced. The present paper explores none of the above directions, though. Instead, I argue that a reasonable amount of overlooked stylistic information resides in the time-proven, standard bag-of-words representation of textual data, which is routinely used in dozens of stylometric studies.

CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium
Email: maciej.eder@ijp.pan.pl (M. Eder)
Web: https://maciejeder.org/ (M. Eder)
ORCID: 0000-0002-1429-5036 (M. Eder)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)




   Certainly, there exist alternative features that prove to be efficient style-markers in authorship attribution setups. Most notably, letter n-grams have been suggested as a strong authorial indicator [16]. Also, grammatical features, such as POS-tag n-grams, turned out to retain information about authorial uniqueness [8]. Other intriguing ideas include observing the immediate lexical context around proper nouns [12]. Even if such alternative textual features exhibit a great deal of potential to enhance text classification [3], the standard approach relying on word frequencies continues to be predominant in the field [7, 18]. In this paper, word frequencies will be used as well, yet the step of normalizing them into relative frequencies will be somewhat enhanced. Specifically, all the other words used to normalize the frequencies will be evaluated and then reduced, so that a given word in question is normalized by its actual semantic background. However, the general idea of enhancing the frequencies can be extended, I believe, to other style-markers, including extra-lexical ones.


2. Word frequencies
The notion of relative word frequencies is fairly simple. We count all the tokens belonging to particular types (e.g. all the attestations of the word “the”, followed by the attestations of “in”, “for”, “of”, etc.), and for each word, we divide the number of its occurrences by the total number of words in the document. Consequently, each word frequency is equal to its percentage within the document (e.g. “the” = 0.0382), and all the frequencies sum up to 1. The reason for converting occurrences to relative frequencies is obvious: by doing so, one is able to reliably compare texts that differ in length. The notion of relative word frequencies is so natural and intuitive that one might very easily overlook its methodological implications, as if it were nothing else than a simple normalization procedure.
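Restating the above in symbols: with $c(w)$ denoting the number of occurrences of a word $w$ and $N$ the total number of tokens in the document, the classical relative frequency is

$$f_{\mathrm{rel}}(w) = \frac{c(w)}{N}, \qquad \sum_{w} f_{\mathrm{rel}}(w) = 1.$$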
   For the sake of this paper, however, it is important to realize that relative frequencies are relative to all the other words in the document in question. Convenient as they are, these values are at the same time very small and – importantly – they are affected by hundreds of other word frequencies. Consequently, the final values might not be sufficiently precise to capture minute differences between word occurrences, because the normalization factor evens them out to some extent. Now, what if we disregard thousands of other words in a text, and instead compute the frequencies in relation to a small number of words that are relevant? An obvious example is the mutual relation between the words “on” and “upon” in one document [15]; essentially, more attestations of “upon” come at the cost of the occurrences of the word “on” – and vice versa. While the classical relative frequency of the word “on” in Emily Brontë’s Wuthering Heights is 0.00687, the proportion of “on” relative exclusively to “upon” is 0.9762. It is assumed in this paper that the latter frequency can betray the authorial signal to a greater extent than the classical approach, because the myriads of other words are not blurring the final value.
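To make the contrast concrete, consider a minimal sketch in R. The token counts below are invented for illustration and merely chosen to reproduce the reported proportions; they are not the actual counts from the novel.

counts = c(on = 2460, upon = 60)   # hypothetical raw occurrences
total_tokens = 358000              # hypothetical document length

# classical relative frequency: normalized by all the tokens
counts["on"] / total_tokens        # ~0.00687

# subset frequency: normalized by the relevant word(s) only
counts["on"] / sum(counts)         # ~0.9762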
   The idea of looking into semantics is not entirely new, since thesaurus-based approaches have already been proposed in the context of authorship attribution [11, 10]. It has been suggested that a list of words organized into near-synonymous sets (“synsets”) and/or into larger hierarchies can be used to extract the authorial signal [9]; it has also been demonstrated that pairs of synonyms might contain valuable authorial information [1]. However, the above approaches are focused on identifying meaningful words beyond the usual MFWs, whereas the present study aims to show that there is still some room to enhance the very MFWs.


3. Method
Given the above “on” and “upon” example, it would be tempting to identify one synonym for each of the words, and to compute the relative proportions in each of the synonym pairs, as suggested in the already cited study [1]. Linguistically speaking, however, such an approach would hardly be feasible. Firstly, only a fraction of words have proper synonyms. Secondly, some semantic fields are rather rich and cannot be reduced to a mere pair of synonyms. Thirdly, in the case of the most frequent words (articles, particles, prepositions) identifying synonyms doesn’t make much sense; yet still, relevant counterparts for these frequent words obviously exist. On theoretical grounds, however, it is difficult to speculate whether the number of relevant counterparts should be restricted to a single word – as in the example of “on” defined by its relation to “upon” – or include, say, a dozen related words. E.g., to determine the relative frequency of the word “make”, one would probably measure its proportion against the sum of occurrences of “do”, “prepare”, “create”, “turn”, “craft”, “invent”, etc. The effective size of the semantic background is, however, very difficult to conceptualize – not only the actual number of related words, but even the order of magnitude is unknown. Take the above example: should the word “make” be calculated against its 10 similar words, or would a semantic background of 100 words be better?
   Another nontrivial question is related to the very method of extracting synonyms and other semantically related words from a corpus. While a thesaurus-based search might prove feasible for single words, it will certainly become more demanding when dozens of seed words are concerned. There exist, however, at least two strategies to approach the issue computationally. One strategy involves WordNet, a manually compiled database of thousands of words with their semantic and syntactic relations [14], while the other relies on distributional semantics methods. In particular, the word2vec algorithm should be mentioned in this context [13], as it provides a vector representation of words that allows for identifying their semantic similarities. Even if these inferred similarities do not comply with any formal grammar (rather, the relations are known to be fuzzy at times), they usually look convincing to a human observer. In the present study, a GloVe word vector model [17] was used to capture word similarities: it was trained on the benchmark corpus of 99 English novels (as described below), with 100 target dimensions. A semantic background for a given seed word was defined as its n nearest neighboring vectors, i.e. the vectors most similar to that of the seed word. E.g., the neighbors for the word “person” were: “woman”, “gentleman”, “man”, “one”, “sort”, “whom”, “thing”, “young”, etc., whereas the neighbors for the word “the” were as follows: “of”, “this”, “in”, “there”, “on”, “one”, “which”, “its”, “was”, “a”, “and”, etc. For each target word, a relative frequency was calculated as the number of its occurrences divided by the summed occurrences of the word itself and its n semantic neighbors (n being the size of the semantic space to be tested).
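In other words, the enhanced frequency of a word $w$ with semantic background $N_n(w)$ is

$$\hat{f}(w) = \frac{c(w)}{c(w) + \sum_{v \in N_n(w)} c(v)}.$$

The neighbors themselves can be retrieved from the trained model by cosine similarity. The following minimal sketch in base R assumes a hypothetical matrix embeddings with one row per word (rownames being the words) and 100 columns; the object name and shape are illustrative assumptions, not the exact code used in the study.

get_semantic_neighbors = function(seed_word, embeddings, n = 10) {
  seed_vec = embeddings[seed_word, ]
  # cosine similarity between the seed word and every word in the model
  sims = drop(embeddings %*% seed_vec) /
         (sqrt(rowSums(embeddings ^ 2)) * sqrt(sum(seed_vec ^ 2)))
  # exclude the seed word itself, sort by decreasing similarity
  sims = sims[names(sims) != seed_word]
  names(sort(sims, decreasing = TRUE))[1:n]
}

# e.g. get_semantic_neighbors("person", embeddings, n = 10)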
Figure 1: The performance (F1 scores) for a benchmark corpus of 99 English novels, and the Delta classifier. Distance measures involve Cosine Delta (top left), Classic Delta (top right), Eder’s Delta (bottom left), and Manhattan (bottom right). The results depend on the MFW vector (y axis) and the size of the semantic space expressed in the number of most similar words in a vector model (x axis).

   In order to corroborate the above intuitions, a controlled authorship attribution experiment was designed. A benchmark corpus of 99 English novels was used: it consists of 33 authorial classes with 3 novels per author, and is freely available in a GitHub repository: https://github.com/computationalstylistics/100_english_novels. A corpus of (naturally long) novels might be considered inferior for authorship benchmarks; the high number of authorial classes, however, makes the task difficult enough to sufficiently stress-test the classifier. To make the task even harder, the amount of training data was restricted to 1 text per author, whereas the remaining 2 texts per author were used as the validation set (the proportion of 33 vs. 66 texts was kept in each iteration).
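The split itself can be sketched as follows; the code assumes a character vector texts of the 99 file names in the form "author_title" (as in the 100_english_novels dataset), and is illustrative rather than the exact replication code.

authors = sapply(strsplit(texts, "_"), `[`, 1)
# pick 1 text per author for training, at random
training_idx = sapply(unique(authors),
                      function(a) sample(which(authors == a), 1))
training_set = texts[training_idx]
# the remaining 2 texts per author form the validation set
validation_set = texts[-training_idx]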
   Since the size of the semantic background is unknown, a grid-search framework was designed to systematically assess tighter (1 relevant counterpart) and broader semantic spaces (up to 10,000 words, inevitably going far beyond synonyms). The tests were performed using the package stylo for R [4]. Different classifiers, MFW vectors and, most importantly, different sizes of the semantic space were tested systematically, in a supervised setup with stratified cross-validation. On theoretical grounds, a semantic space of size n = 80,000 (roughly the total number of word types in the benchmark corpus) would be equivalent to classical relative frequencies, whereas a space of size n = 1 means that the frequencies are relative to exactly one other word (e.g. the frequency of the word “the” would be the number of occurrences of “the” divided by the total number of “the” and “of”).

Figure 2: The gain in performance (baseline F1 scores subtracted from the obtained F1 scores) for a benchmark corpus of 99 English novels.
   Independently, an alternative set of tests was performed using regular relative frequencies. The outcomes of these tests served as a baseline. In each test reported in this paper, the F1 scores are used as a compact and reliable measure of performance.
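For reference, the F1 score combines precision P and recall R as their harmonic mean; with 33 authorial classes the scores are presumably averaged across classes, though the exact averaging scheme is an assumption here:

$$F_1 = \frac{2PR}{P + R}.$$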


4. Results
The obtained results (Table 1 and Fig. 1) clearly suggest that the new method outperforms the classical relative frequencies solution substantially, no matter which distance measure is used. In agreement with several previous studies, longer MFW vectors worked better than, say, 100 MFWs. Also in line with expectations, Cosine Delta proves to be the undisputed winner among the classifiers. Counter-intuitive, however, was the behavior of different classifiers with the enhanced word frequencies. As evidenced in the top left panel of Fig. 1, Cosine Delta works best with frequencies computed against 5–50 semantically similar words, whereas Burrows’s Delta (top right) exhibits its sweet spot at 50–100 neighboring words, and so does Eder’s Delta (bottom left). When the semantic background is further increased, the behavior of particular classifiers becomes uniform across the board: the performance slowly but surely decreases, to ultimately reach the baseline level.




Figure 3: The absolute performance (top) and the performance gain (bottom) for the corpus of 99 English novels. The distance measures are Cosine Delta (left) and Burrows’s Delta (right). The semantic space is defined as the words within a given cosine distance from respective source words. The results depend on the MFW vector (y axis) and the size of the semantic space (x axis).


   Since the introduction of Burrows’s Delta, practitioners have been aware that scaling (z-scoring) the features is the very factor responsible for the performance boost observed in Delta and its derivatives. Even if Manhattan distance does not scale the features (hence its unpopularity in text classification), the improved word frequencies behave differently than standard approaches, which in turn might favor simple distances such as Manhattan. And indeed, the scores obtained for the Manhattan distance are radically better than the respective baseline (Fig. 1, bottom right); yet still, Manhattan cannot compete with z-scored distances.
   According to the above results, a recipe for a successful authorship attribution setup seems to be as follows: take roughly 800–900 MFWs, and compute their frequencies using, for each word, the occurrences of its 5–10 semantic neighbors; then use the Cosine Delta classifier.
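A minimal sketch of this recipe in base R is given below. It re-implements the Cosine Delta decision rule (z-scoring followed by nearest-neighbor classification under cosine distance) rather than calling stylo directly; freqs_train and freqs_test are assumed to be matrices of enhanced frequencies (rows = texts, columns = MFWs), e.g. as produced by the function in Appendix A.

cosine_delta_classify = function(freqs_train, freqs_test, labels_train) {
  # z-score both sets with the means and standard deviations
  # of the training set, as Delta and its derivatives do
  mu = colMeans(freqs_train)
  sigma = apply(freqs_train, 2, sd)
  z_train = scale(freqs_train, center = mu, scale = sigma)
  z_test = scale(freqs_test, center = mu, scale = sigma)
  # cosine similarity between each test text and each training text
  sims = (z_test %*% t(z_train)) /
         (sqrt(rowSums(z_test ^ 2)) %o% sqrt(rowSums(z_train ^ 2)))
  # the most similar training text provides the predicted author
  labels_train[apply(sims, 1, which.max)]
}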
   Since in authorship attribution the results are proven to be unevenly distributed across different MFW vectors, let alone different classifiers, Fig. 2 presents the same outcomes as previously, yet this time defined as the improvement (in percentage points) over the baseline F1 scores. While the overall best performance is obtained for ca. 850 MFWs computed against 5–10 words, the biggest gain over the baseline (more than 10 percentage points!) is provided by the following scenario: 300 MFW frequencies computed against a tight semantic background of 3 neighboring words. Other reasonable improvements are generally associated with short MFW vectors and a semantic background of 5–100 words. In the case of Burrows’s Delta, which performed best with 900 MFWs computed against 60 neighboring words (Fig. 2, top right), the improvement over the baseline is nonetheless biggest for short vectors of MFWs. Interestingly, for Burrows’s Delta the new method proves to be worse than the baseline for long MFW vectors and tight semantic spaces of 1–10 neighboring words. The picture for Eder’s Delta (bottom left) is similar to that for Burrows’s method, even if its hot spot is slightly moved towards longer MFW vectors. Surprisingly enough, the results for the Manhattan distance turned out to be substantially different from the other methods, and much less predictable. A large and pronounced hot spot of radically improved performance forms for tight semantic spaces, across different MFW vectors. On the right-hand side, the mountain of performance is followed by a deep valley of no improvement at all, and then, counter-intuitively, another hill emerges, indicating a boost of performance for semantic spaces of 50–100 words. This behavior is difficult to explain.

Table 1
The best performance (F1 scores) obtained in each tested scenario.

                      Relative frequencies   Enhanced frequencies
   Cosine Delta              0.908                  0.959
   Burrows’s Delta           0.823                  0.838
   Eder’s Delta              0.812                  0.830
   Manhattan                 0.679                  0.771
   The proposed way of identifying an arbitrarily chosen number of semantic neighbors might suffer from an uneven distribution of semantic neighbors in a given model (GloVe, word2vec, fastText, etc.). E.g., 50 neighboring lexemes might point to a semantically coherent area around a function word, or indicate only vague associations around a very specific technical term. To account for this factor, a second experiment was conducted, in which I defined the semantic background as all the words located within a specific cosine distance from a given reference word. Consequently, rather than extracting n neighboring words, I extracted all the words within the radius of 0.9 cosine similarity in the first iteration, then 0.85, 0.8, etc., all the way to –0.9. The results for Cosine Delta and Burrows’s Delta are shown in Fig. 3. As can be seen, a clear hot spot forms in the area of 0.7–0.5 cosine similarity, regardless of the number of MFWs or the classifier, and after the distance of 0.3 the performance hits the baseline level. The results confirm the general picture obtained in the previous experiment (Fig. 1), yet the sweet spot area seems to be more difficult to generalize.
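The radius-based variant of the semantic background can be sketched analogously to the neighbor extraction shown in Section 3, again assuming the same hypothetical embeddings matrix:

get_neighbors_within_radius = function(seed_word, embeddings, radius = 0.7) {
  seed_vec = embeddings[seed_word, ]
  # cosine similarity between the seed word and every word in the model
  sims = drop(embeddings %*% seed_vec) /
         (sqrt(rowSums(embeddings ^ 2)) * sqrt(sum(seed_vec ^ 2)))
  sims = sims[names(sims) != seed_word]
  # keep every word at least `radius` similar to the seed word;
  # the result may be empty for very tight radii
  names(sims[sims >= radius])
}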


5. Discussion
The results presented in the previous section call for further exploration and, above all, for a concise discussion. A few general remarks can be formulated here:

   1. No matter which classification method was used, the performance improvement turned out to be large, clearly suggesting that bare word occurrences retain much more authorial signal than the time-proven relative frequencies are able to betray. It can be safely hypothesized that the method introduced in this paper has barely opened a new perspective, rather than offered an ultimate solution to the problem.
   2. In order to identify the words that matter, a word embedding model was used – and this, again, was far from an optimal solution. As a rough proxy, it nevertheless was able to improve the word frequencies in the range of 5–50 neighboring words. On theoretical grounds, a further improvement should be possible with a more precise method of identifying the relevant semantic background.
   3. While the new method improves the performance across all the MFW strata, short MFW vectors seem to benefit more. Interesting from a theoretical point of view, this phenomenon also has a practical implication. Namely, while several studies suggest that larger numbers of MFWs should be preferred as they generally exhibit better performance, it is also believed that longer vectors are more likely to be affected by the genre, topic, and content of the analyzed texts. With this in mind, some practitioners choose to conduct authorship attribution on shorter MFW vectors. The method introduced in this paper can greatly improve the performance in such setups.

   An observation that requires further investigation is the discrepancy between classifiers in how they react to the same semantic background. Contrary to intuition, for Burrows’s Delta the improvement in performance was not simply correlated with the size of the semantic background. A tight neighborhood – fewer than 20 synonyms and/or other related words – did not outperform standard relative frequencies, whereas broader contextual information of ca. 50–100 related words showed a significant improvement over the baseline. In the case of Cosine Delta, a tight semantic background of ca. 5–10 words proved optimal, whereas broader spaces of 50–100 neighboring words were only marginally worse, still outperforming the baseline to a significant degree.


6. Conclusion
The paper presented a simple method to improve the performance in different stylometric setups. The method is conceptually straightforward and does not require any NLP tooling. The only external piece of information that is required is a list of semantically related words for each of the most frequent words in the corpus. A controlled experiment showed a significant improvement of classification accuracy in a supervised multi-class authorship attribution setup.


Acknowledgments
This research is part of the project Large-Scale Text Analysis and Methodological Foundations of Computational Stylistics (2017/26/E/HS2/01019), supported by Poland’s National Science Centre. The code and the datasets to replicate the experiments presented in this study are posted in a GitHub repository: https://github.com/computationalstylistics/word_frequencies.




References
 [1] G. Borski and M. Kokowski. “Copernicus, his Latin style and comments to Commentariolus”. In: Studia Historiae Scientiarum 20 (2021), pp. 339–438. url: https://www.ejournals.eu/Studia-Historiae-Scientiarum/2021/20-2021/art/19754/.
 [2] J. Burrows. “‘Delta’: a measure of stylistic difference and a guide to likely authorship”. In: Literary and Linguistic Computing 17.3 (2002), pp. 267–287.
 [3] M. Eder. “Style-markers in authorship attribution: a cross-language study of the authorial fingerprint”. In: Studies in Polish Linguistics 6 (2011), pp. 99–114. url: http://www.ejournals.eu/SPL/2011/SPL-vol-6-2011.
 [4] M. Eder, J. Rybicki, and M. Kestemont. “Stylometry with R: a package for computational
     text analysis”. In: R Journal 8.1 (2016), pp. 107–121. doi: 10.32614/rj-2016-007.
 [5] S. Evert, T. Proisl, F. Jannidis, I. Reger, S. Pielström, C. Schöch, and T. Vitt. “Understanding
     and explaining Delta measures for authorship attribution”. In: Digital Scholarship in the
     Humanities 32 (suppl. 2 2017), pp. 4–16. doi: 10.1093/llc/fqx023.
 [6] H. Gómez-Adorno, J.-P. Posadas-Durán, G. Sidorov, and D. Pinto. “Document embed-
     dings learned on various types of n-grams for cross-topic authorship attribution”. In:
     Computing 100.7 (2018), pp. 741–756. doi: 10.1007/s00607-018-0587-8.
 [7] J. W. Grieve. “Quantitative authorship attribution: An evaluation of techniques”. In: Lit-
     erary and Linguistic Computing 22.3 (2007), pp. 251–270. doi: 10.1093/llc/fqm020.
 [8] G. Hirst and O. Feiguina. “Bigrams of syntactic labels for authorship discrimination of
     short texts”. In: Literary and Linguistic Computing 22.4 (2007), pp. 405–417.
 [9] P. Juola. “Thesaurus-based semantic similarity judgments”. In: Drawing Elena Ferrante’s profile. Ed. by A. Tuzzi and M. A. Cortelazzo. Padova: Padova University Press, 2018, pp. 47–59.
[10]   M. Koppel, N. Akiva, and I. Dagan. “Feature instability as a criterion for selecting po-
       tential style markers”. In: Journal of the American Society for Information Science and
       Technology 57.11 (2006), pp. 1519–1525.
[11]   H. Love. Attributing authorship: An introduction. Cambridge: Cambridge University Press,
       2002.
[12]   A. Lučić and C. L. Blake. “A syntactic characterization of authorship style surrounding proper names”. In: Digital Scholarship in the Humanities 30.1 (2013), p. 53. doi: 10.1093/llc/fqt033.
[13]   T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. “Distributed representations
       of words and phrases and their compositionality”. In: Advances in neural information
       processing systems. 2013, pp. 3111–3119.
[14]   G. A. Miller. “WordNet: A lexical database for English”. In: Communications of the ACM
       38.11 (1995), pp. 39–41.
[15]   F. Mosteller and D. Wallace. Inference and disputed authorship: The Federalist. Stanford:
       CSLI Publications, 1964.




[16]   F. Peng, D. Schuurmans, V. Keselj, and S. Wang. “Language independent authorship at-
       tribution using character level language models”. In: Proceedings of the 10th Conference of
       the European Chapter of the Association for Computational Linguistics. 2003, pp. 267–274.
[17]   J. Pennington, R. Socher, and C. D. Manning. “GloVe: Global vectors for word represen-
       tation”. In: Empirical Methods in Natural Language Processing (EMNLP). 2014, pp. 1532–
       1543.
[18]   E. Stamatatos. “A survey of modern authorship attribution methods”. In: Journal of the
       American Society for Information Science and Technology 60.3 (2009), pp. 538–556.


A. Function to compute enhanced word frequencies
The following code defines a function to compute the word frequencies as discussed in this paper. The code is written in generic R and does not require any external R library to run. The function takes three arguments: (i) dtm_matrix is a document-term matrix, i.e. a table with raw frequencies (occurrences) of words in a given dataset; unlike in typical stylometric applications, where one usually takes a subset of n most frequent words, here all the information about infrequent words is equally important; (ii) word_vector_similarities is a table containing, for each word, its nearest neighbors in a semantic space, e.g. the row for the word “person” contains the following words: “woman”, “gentleman”, “man”, “one”, “sort”, “whom”, “thing”, “young”, etc.; it is sufficient to compute the neighbors for the 1000 most frequent words or so, and the semantic depth can be reduced to, say, 100 semantically related words in each case (for the sake of the present study, a set of the 1000 most frequent words with their 10,000 semantic neighbors was used); (iii) no_of_similar_words is a number (integer) specifying how many semantic neighbors one wants to take into consideration.

compute_subset_frequencies = function(dtm_matrix,
                                    word_vector_similarities,
                                    no_of_similar_words) {

  semantic_space = word_vector_similarities[ , 1:no_of_similar_words,
                     drop = FALSE]
  no_of_words = dim(semantic_space)[1]
  final_frequency_matrix = matrix(nrow = dim(dtm_matrix)[1],
                     ncol = no_of_words)

  for(i in 1:no_of_words) {

       # check if the required word(s) appears in the corpus
       words_sanitize = semantic_space[i,] %in% colnames(dtm_matrix)
       words_to_compute = semantic_space[i, words_sanitize]
       # if the corpus doesn't contain any of the words required
       # by the model, then grab the most frequent word
       # for reference (it should not happen often, though)
        if(length(words_to_compute) == 0) {
          words_to_compute = colnames(dtm_matrix)[1]
        }
        # add the occurrences of the current word being computed;
        # e.g. for the word "of", add "of" to the equation
        # (the current word is the i-th row of the neighbor table)
        words_to_compute = c(rownames(semantic_space)[i], words_to_compute)
        # getting the occurrences of the relevant words from
        # the input matrix of word occurrences:
        f = dtm_matrix[, words_to_compute]
        # finally, computing new relative frequencies
        final_frequency_matrix[, i] = f[, 1] / rowSums(f)

  }

 # sanitizing again, by replacing NaN values with 0s
 final_frequency_matrix[is.nan(final_frequency_matrix)] = 0
 # tweaking the names of the rows and columns
 rownames(final_frequency_matrix) = rownames(dtm_matrix)
 colnames(final_frequency_matrix) = rownames(semantic_space)
 class(final_frequency_matrix) = "stylo.data"

return(final_frequency_matrix)
}
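For illustration, a hypothetical invocation might look as follows; dtm and neighbors are placeholder names for the document-term matrix and the neighbor table described above, not objects shipped with the replication code.

enhanced_freqs = compute_subset_frequencies(dtm_matrix = dtm,
                                    word_vector_similarities = neighbors,
                                    no_of_similar_words = 10)
# the resulting matrix of enhanced frequencies can then be passed
# to any stylometric classifier, e.g. the Delta family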



