Applying VSM to Identify the Criminal Meaning of Texts

Nina Khairova1 [0000-0002-9826-0286], Anastasiia Kolesnyk1 [0000-0001-5817-0844], Orken Mamyrbayev2 [0000-0001-8318-3794] and Svitlana Petrasova1 [0000-0001-6011-135X]

1 National Technical University “Kharkiv Polytechnic Institute”, 2, Kyrpychova str., 61002, Kharkiv, Ukraine
2 Institute of Information and Computational Technologies, 125, Pushkin str., 050010, Almaty, Republic of Kazakhstan

khairova@kpi.kharkov.ua, kolesniknastya20@gmail.com, morkenj@mail.ru, svetapetrasova@gmail.com

Abstract. Generally, to determine whether a text belongs to a specific theme or domain, we can use approaches to text classification. However, the task becomes more complicated when there is no training corpus in which the set of classes and the set of documents belonging to these classes are predetermined. We suggest using the semantic similarity of texts to determine their belonging to a specific domain. Our training corpus includes news articles containing criminal information. In order to determine whether the theme of an input document is close to the theme of the training corpus, we propose calculating the cosine similarity between the documents of the corpus and the input document. We have empirically established the average value of the cosine similarity coefficient at which a document can be attributed to the highly specialized documents containing criminal information. We evaluate our approach on a test corpus of articles from the news sites of Kharkiv. The F-measure of classifying documents containing criminal information reaches 96%.

Keywords: semantic similarity of texts, VSM, criminal information, news sites, cosine similarity, PPMI

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

One of the main tasks of NLP and, accordingly, of computational linguistics in general is determining the semantic similarity of different elements of texts (words, phrases, collocations, sentences and documents). This task is directly related to information retrieval, ranking of documents, topic modeling of texts, sentiment analysis and more. The task of identifying the semantic similarity of documents arises in all approaches that utilize semantic analysis and semantic technologies, including the monitoring of public information and telecommunication networks. However, this task, which was originally considered by Salton [1], is mostly applied to information retrieval. In considering the issue of document similarity identification, Salton et al. (1975) focused on measuring document similarity, treating a query to a search engine as a pseudo-document.

We suggest using the semantic similarity of texts to determine their belonging to a specific domain. Usually, solutions to this task are based on methods and approaches to text classification. Frequently used methods of text classification are decision trees, neural networks [2], Random Forest and Support Vector Machine [3], the Bayesian method, K-means [4] and others [5]. Nevertheless, all these methods require a training corpus in which the set of classes and the set of documents belonging to these classes are predetermined. In our case there are no predefined classes; we have only a text corpus of a specific domain, which includes news articles containing criminal information.
In this study, in order to determine whether a text belongs to a specific domain when classification cannot be used because there are no predefined classes, we suggest measuring the similarity of texts.

The remainder of the paper is organized as follows. Section 2 gives an overview of the related work covering methods and approaches to the semantic similarity of texts. Section 3 describes the application of VSM for semantic analysis. Section 4 presents the usage of our method for identifying the criminal meaning of texts. Section 5 introduces our corpus comprising texts containing criminal information and describes its usage in our experiment. In the last Section 6, the scientific and practical contributions of the research, its limitations and future work are discussed.

2 Related Work

The search for the semantic similarity of text information is becoming more and more popular in various fields. For instance, [6] utilized semantic similarity as a very effective method to identify links between medical objects such as a drug and a diagnosis. Their approach is based on transforming embeddings into a drug-prescription model and assesses similarities using a vector representation of the link between drug and prescription. This approach has been empirically studied and shows good results in biomedicine.

Evaluation of the semantic similarity of texts also helps in market analysis, banking and marketing. In paper [7], the authors used this approach to determine the similarities between various press releases of a bank and to assess their impact on potential clients and the financial market. This was done by calculating the distance between fixed-length vectors of pairs of press releases (bag-of-words model). The method assigned more weight to words that were rare and less weight to words that were frequent. Testing also showed that the results were not sensitive to the weighting.

A technique that calculates the semantic similarity between text words and a lexical dictionary is also exploited in the field of sentiment analysis, especially for processing various lexical resources. The authors of study [8] applied this metric to a sentiment classification model using a measure of semantic proximity and embedding representations. However, the results of the study indicate that the choice of the vocabulary influences cross-dataset analysis.

Study [9] used a new method to determine the semantic similarity of large documents, namely academic articles. These articles contained topic events that presented the same information about research objectives, methodologies, etc. To calculate the degree of semantic similarity, the authors of the study exploited a domain ontology, which was used to calculate similarities between semantic events. Experiments showed that such an ontology-based method obtained more accurate results compared to others. Generally, the methods that utilized ontologies showed overall advantages in correlation, accuracy and F1-score.

Overall, many studies have proposed different ways to improve the computation of semantic similarity. For instance, study [10] reported good results for a new vector space model based on a random-walk algorithm. The peculiarity of this approach was the comparison of the distribution that each text induces when used as the seed of a random walk over a graph derived from WordNet. This algorithm achieved a relative decrease in the error rate in comparison to a conventional vector model.
Traditionally, there are two groups of approaches to the task of semantic similarity identification. The first group is based on ontologies; for instance, an ontology-based approach is utilized by Resnik's method [11] or the extended Lesk algorithm [12]. However, in the vast majority of cases, such approaches are applied to identify the semantic similarity of short text fragments or words. The second group of approaches is based on statistical methods of distributional similarity. These methods are applied to measure the similarity of words as well as the similarity of documents and even the similarity of relations [13]. Therefore, the semantic similarity of large documents is still computed only from statistical information, which is clearly insufficient for determining global semantic values [14].

3 The Application of VSM for Semantic Analysis

The purpose of our study is to find a universal method for solving the problem of identifying texts with criminal meaning. When determining the thematic orientation of texts, namely for the task of classification or clustering, the vector space model is an adequate and well-developed method.

The vector space model (VSM) represents a collection of documents by vectors from one vector space, which is common to the whole collection [13]. The use of VSM is based on two cognitive hypotheses. The first, the statistical semantics hypothesis, states that statistical patterns of word usage in natural language can be used to find out what people mean; in other words, human intellect can understand words depending on their environment [15]. The second hypothesis, formulated by G. Salton for information retrieval [1], is based on the representation of a text as a "bag of words" and suggests that the frequency of words in a document often determines the relevance of the document to a query.

The main idea of VSM is to represent each document of the collection as a point in a multidimensional space (a vector in a vector space). Points lying close to each other correspond to semantically similar documents. Therefore, in the vector space model, text representation mainly focuses on two tasks: firstly, how to build a vector and, secondly, how to assign weights to vector elements.

Towards the first objective, each document in the vector model is treated as an unordered set of terms. The terms can be any words, including numbers and proper names. With a large collection of documents under study, as in our case, and correspondingly a large number of vectors, it is reasonable to place the data in a matrix. Each row of the matrix corresponds to a separate term, and each column corresponds to a document.

The second primary task of VSM is to determine the weight of the terms in a document. Weight reflects the importance of a word, that is, its semantic ability to identify a given text. The simplest approach is to use the raw frequency of a term. Nevertheless, usually the tf*idf index, which stands for "term frequency * inverse document frequency", is used to determine the weight of a term in a term-document matrix. The purpose of weighting the terms is to determine how fully they reflect the semantic content of the document. However, frequency-based and probabilistic tf*idf weighting has a number of disadvantages: often the result may contain irrelevant documents or miss truly relevant ones.
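To make these two steps concrete, the following sketch builds a raw term-document frequency matrix and its tf*idf-weighted counterpart for a tiny collection. It is an illustration only, not the authors' implementation; the toy documents and the use of scikit-learn are assumptions.

```python
# A minimal sketch of the two VSM steps: building a term-document frequency
# matrix and weighting it with tf*idf. The toy documents and the use of
# scikit-learn are illustrative assumptions, not part of the described system.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "police detained a suspect after the robbery",
    "the city council discussed the new park project",
    "two cars collided on the highway and one driver was injured",
]

vectorizer = CountVectorizer()
doc_term_counts = vectorizer.fit_transform(docs)      # shape: (n_docs, n_terms)

# scikit-learn produces a document-term matrix, so we transpose it to follow
# the term-document convention used in the paper (rows = terms, columns = documents).
term_doc_freq = doc_term_counts.T.toarray()

tfidf = TfidfTransformer()
term_doc_tfidf = tfidf.fit_transform(doc_term_counts).T.toarray()

for term, counts in zip(vectorizer.get_feature_names_out(), term_doc_freq):
    print(f"{term:>10}: {counts}")
```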
The problems with tf*idf noted above are related to the fact that such methods do not take into account that the frequencies of occurrence of different terms depend on each other, since terms can be combined into word combinations. In addition, the synonymy and polysemy of the language also affect the result. In order to solve some of these problems, we use PMI (Pointwise Mutual Information) as a weight function [16].

Formally, PMI can be defined in the following way. Let F be a traditional term-document matrix with n_r rows and n_c columns; the i-th row of matrix F is the row vector f_{i:} and the j-th column is the column vector f_{:j}. Row f_{i:} corresponds to term w_i and column f_{:j} corresponds to document d_j. The value of element f_{ij} is the number of times that w_i appears in document d_j.

Let X be the matrix obtained by applying the PMI weight function to the elements of matrix F. Matrix X has as many rows and columns as the frequency matrix F. The value of element x_{ij} of matrix X is defined by the following equations:

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{n_r}\sum_{j=1}^{n_c} f_{ij}} \qquad (1)$$

$$p_{i*} = \frac{\sum_{j=1}^{n_c} f_{ij}}{\sum_{i=1}^{n_r}\sum_{j=1}^{n_c} f_{ij}} \qquad (2)$$

$$p_{*j} = \frac{\sum_{i=1}^{n_r} f_{ij}}{\sum_{i=1}^{n_r}\sum_{j=1}^{n_c} f_{ij}} \qquad (3)$$

where p_{ij} is the estimated probability that term w_i appears in document d_j; p_{i*} is the estimated probability of term w_i, i.e. the probability that the term appears in any document of the collection; p_{*j} is the estimated probability of document d_j, i.e. the probability that the document appears with any term.

To determine PMI, we calculate the logarithm of the estimated probability p_{ij} (1) divided by the product of the two probabilities p_{i*} (2) and p_{*j} (3):

$$\mathrm{pmi}_{ij} = \log\left(\frac{p_{ij}}{p_{i*}\, p_{*j}}\right) \qquad (4)$$

If w_i and d_j are statistically independent, then, according to the definition of the probability of the product of independent events, p_{i*} p_{*j} = p_{ij}. In this case, the value of the logarithm in the definition of pmi_{ij} (4) is zero. However, if there is some semantic interrelation between w_i and d_j, p_{ij} should be expected to be larger than it would be if w_i and d_j were statistically independent. Therefore, we use the PPMI (Positive Pointwise Mutual Information) weight function, which is defined as:

$$x_{ij} = \mathrm{ppmi}_{ij} = \begin{cases} \mathrm{pmi}_{ij}, & \text{if } \mathrm{pmi}_{ij} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

To measure the similarity of two weighted frequency vectors, we determine their cosine similarity [17]. Let x and y be two vectors of n elements. Then the cosine of the angle θ between vectors x and y can be calculated as the inner product of the vectors normalized by their lengths:

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (6)$$

To calculate the cosine between vectors x and y, we sum the products of their coordinates x_1 y_1 + x_2 y_2 + … + x_n y_n and then divide this sum by the product of the square roots of the sums of the squares of their coordinates (6).

According to formula (6), the value of the cosine can vary from minus one, if the vectors have opposite directions (θ = 180 degrees), to plus one, if the directions of the vectors coincide (θ = 0 degrees). When the vectors are perpendicular (θ = 90 degrees), the cosine of the angle between them is equal to zero. Since, by definition, PPMI weights cannot be negative, cosine values between vectors that use PPMI as coordinates always lie in the range [0, 1].
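The weighting and similarity computations of formulas (1)-(6) can be expressed directly in NumPy. The sketch below is an illustrative re-implementation under our reading of the formulas, not the authors' original code; zero counts, for which the logarithm is undefined, are simply treated as zero, which is consistent with the PPMI definition in (5).

```python
import numpy as np

def ppmi_matrix(F: np.ndarray) -> np.ndarray:
    """Apply the PPMI weight function, formulas (1)-(5), to a
    term-document frequency matrix F (rows = terms, columns = documents)."""
    total = F.sum()
    p_ij = F / total                             # (1) joint probabilities
    p_i = F.sum(axis=1, keepdims=True) / total   # (2) term probabilities
    p_j = F.sum(axis=0, keepdims=True) / total   # (3) document probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))         # (4) pointwise mutual information
    pmi[~np.isfinite(pmi)] = 0.0                 # zero counts give log(0); treat as 0
    return np.maximum(pmi, 0.0)                  # (5) keep only positive PMI

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two weighted vectors, formula (6)."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

# Toy term-document frequency matrix: 4 terms x 3 documents (values are illustrative).
F = np.array([[3, 0, 1],
              [0, 2, 0],
              [1, 1, 4],
              [2, 0, 0]], dtype=float)
X = ppmi_matrix(F)
print(cosine_similarity(X[:, 0], X[:, 2]))       # similarity of documents 1 and 3
```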
4 Our Method for Identifying the Criminal Meaning of Texts

The main objective that we seek to achieve as a result of this research is to find a universal method for determining the semantic similarity of an input text to a particular thematic focus. Namely, we determine the thematic closeness of texts to those that contain criminal information.

To determine whether an arbitrary text belongs to the criminal subject, we use a self-created corpus of news articles with criminal content. In fact, the concept of "the criminal meaning of texts" is blurry and subjective. We put the following meaning into it: a text belongs to the category "Criminal" if it contains information about emergency news, war, terrorism, accidents, extremism, criminal offences, etc. Within this topic, a corpus of criminal texts and a corpus of materials that do not correspond to the investigated subject were collected.

Our method for determining whether a document belongs to a highly specialized area includes three main steps, as shown in Figure 1: (1) linguistic processing of the raw corpus and the input document; (2) the machine learning phase; (3) the cosine similarity phase.

Fig. 1. The general scheme of the applied method.

At the first stage of creating the term-document matrix, the linguistic processing of the raw texts, consisting of tokenization and text normalization, is performed. The second stage is the machine learning stage, which consists of building the PPMI term-document matrix and the PPMI vector of the input document. This stage is based on the method described in the previous section. At the third stage, the minimum, maximum and average values of the semantic similarity coefficient are calculated, defined as the cosine similarity between the input document vector and the document vectors of the PPMI term-document matrix.

Figure 2 shows an example of the developed application (MainWindow), which identifies the criminal meaning of an incoming news text.

5 Description of the Corpus and Results of the Experiment

Our dataset includes the corpus of criminal texts and the corpus of articles that do not correspond to this topic. To create the first corpus, named "criminal", articles were taken from the news sites of Kharkiv, such as 057.ua, Mediaport, ATN, Vechirniy Kharkiv and Misto, from the categories of war, terrorism, accidents, extremism, criminal offences, etc. in the period 2007-2018 [18]. For the second corpus, named "non-criminal", texts were collected automatically from the same news sites, but from other categories. In total, the two corpora comprise more than 195,000 text files.

Fig. 2. The example of the running program MainWindow

In order to determine the value of the coefficient simcos that allows attributing an input document to the criminal theme, we carried out the following experiment. We allocated 90% of the texts of the "criminal" corpus to the training corpus. The test corpus consisted of the remaining texts of the "criminal" corpus and texts from the "non-criminal" corpus. During the experiment, the cosine similarity simcos(doctest, doctrain) between each tested text doctest from the "criminal" part of the test corpus and each text doctrain of the training corpus was calculated. Table 1 shows the maximum, minimum and average values of cosine similarity between each "criminal" text of the test corpus and the documents of the training corpus, which also belong to the criminal theme.
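The three stages can be outlined as follows. This is a sketch under the assumptions of the earlier snippets: the tokenizer is deliberately crude, the helper names are hypothetical, and projecting the input document by appending it to the collection before PPMI weighting is one simple way to obtain its PPMI vector, not necessarily the authors' exact procedure.

```python
import re
import numpy as np

def tokenize(text: str) -> list[str]:
    """Stage 1: a deliberately rough normalization - lowercase, word tokens only."""
    return re.findall(r"\w+", text.lower())

def ppmi(F: np.ndarray) -> np.ndarray:
    """PPMI weighting of a term-document count matrix, formulas (1)-(5)."""
    total = F.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((F / total) /
                     ((F.sum(1, keepdims=True) / total) *
                      (F.sum(0, keepdims=True) / total)))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def similarity_profile(train_texts: list[str], input_text: str):
    """Stages 2-3: build a PPMI term-document matrix over the training corpus
    plus the input document, then return the (min, max, average) cosine
    similarity between the input document and every training document."""
    docs = [tokenize(t) for t in train_texts] + [tokenize(input_text)]
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    F = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc:
            F[vocab[w], j] += 1
    X = ppmi(F)
    q = X[:, -1]                                   # PPMI vector of the input document

    def cos(a, b):
        d = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / d) if d else 0.0

    sims = [cos(X[:, j], q) for j in range(len(train_texts))]
    return min(sims), max(sims), sum(sims) / len(sims)
```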
Table 1. The fragment of the analysis results of simcos between criminal texts of the test corpus and criminal texts of the training corpus

File                                Max(simcos)   Min(simcos)   Average(simcos)
ATN_2018-08-31_11.33_1.txt          0.92          0.26          0.72
ATN_2018-08-31_12.27_2.txt          0.77          0.35          0.57
ATN_2018-08-31_12.33_3.txt          0.79          0.38          0.67
ATN_2018-08-31_16.24_4.txt          0.78          0.38          0.67
Misto X_2018-08-31_16.39_5.txt      0.82          0.36          0.69
Misto X_2018-09-01_17.44_6.txt      0.80          0.39          0.68
Misto X_2018-09-02_10.26_7.txt      0.88          0.27          0.69
Misto X_2018-09-02_12.58_8.txt      0.80          0.27          0.60
Misto X_2018-09-02_17.00_9.txt      0.87          0.31          0.71
VKh_2018-01-14_13.50_2.txt          0.77          0.50          0.66
VKh_2018-01-14_14.10_3.txt          0.82          0.48          0.68
VKh_2018-01-14_14.40_4.txt          0.77          0.49          0.67
Mediaport_2018-01-15_15.10_5.txt    0.73          0.28          0.51
Mediaport_2018-01-16_11.10_6.txt    0.81          0.36          0.68
Mediaport_2018-01-16_15.00_7.txt    0.88          0.31          0.71
Mediaport_2018-01-16_18.40_8.txt    0.85          0.33          0.69
057.ua_2018-10-16_00.09_1.txt       0.79          0.42          0.70
057.ua_2018-10-17_00.13_2.txt       0.67          0.41          0.59
057.ua_2018-10-17_00.18_3.txt       0.89          0.37          0.67
057.ua_2018-10-18_00.18_4.txt       0.74          0.38          0.61
057.ua_2018-10-19_00.02_5.txt       0.77          0.43          0.62
057.ua_2018-10-22_00.19_6.txt       0.84          0.43          0.72

The analysis of the obtained results allows concluding that, for the texts of the test corpus containing criminal information, the minimum value of the cosine similarity coefficient with the documents of the training corpus is not less than 0.3 (min(simcos) > 0.3), the maximum value is max(simcos) > 0.7, and the average value is average(simcos) > 0.55.

Table 2 shows the values of min(simcos), max(simcos) and average(simcos) of the cosine similarity between documents of the training corpus and documents of the test corpus that can relate to any subject except the criminally specialized one.

Table 2. The fragment of the simcos results for the test corpus of a random theme

File                           Max(simcos)   Min(simcos)   Average(simcos)
ATN_2018-08-31_1.txt           0.63          0.24          0.41
ATN_2018-08-31_2.txt           0.66          0.25          0.47
ATN_2018-08-31_3.txt           0.69          0.19          0.45
ATN_2018-08-31_4.txt           0.67          0.21          0.41
ATN_2018-08-31_5.txt           0.73          0.24          0.48
ATN_2018-09-6.txt              0.60          0.25          0.44
Misto X_2018-09-7.txt          0.51          0.15          0.36
Misto X_2018-09-8.txt          0.74          0.20          0.53
Misto X_2018-09-9.txt          0.67          0.22          0.47
Misto X_2018-09-10.txt         0.62          0.23          0.43
Misto X_2018-09-11.txt         0.76          0.19          0.55
Misto X_2018-09-12.txt         0.75          0.22          0.60
Misto X_2018-10-19_13.txt      0.72          0.25          0.51
Mediaport_2018-10-19_14.txt    0.65          0.19          0.47
Mediaport_2018-10-19_15.txt    0.66          0.19          0.50
Mediaport_2018-10-19_16.txt    0.62          0.21          0.38
Mediaport_2018-10-19_17.txt    0.56          0.28          0.45
Mediaport_2018-10-19_18.txt    0.62          0.20          0.44
Mediaport_2018-10-19_19.txt    0.60          0.18          0.39
VKh_2018-10-19_20.txt          0.65          0.24          0.43
VKh_2018-10-19_21.txt          0.73          0.25          0.50
VKh_2018-10-19_22.txt          0.65          0.18          0.44
057.ua_2018-10-19_23.txt       0.59          0.23          0.47
057.ua_2018-10-19_24.txt       0.56          0.27          0.45
057.ua_2018-10-19_25.txt       0.61          0.25          0.48
057.ua_2018-10-19_26.txt       0.55          0.20          0.39
057.ua_2018-10-19_27.txt       0.70          0.28          0.41

Having analyzed the obtained results, we conclude that the average value of the cosine similarity between the texts of the training corpus and texts of arbitrary subjects usually lies within 0.35 < average(simcos) < 0.50. The maximum and minimum values are below 0.76 and 0.30, respectively: max(simcos) < 0.76 and min(simcos) < 0.30.
On the basis of the experimental research, we formulated the hypothesis that if the average value of the cosine similarity coefficient between the input document and the documents of the training corpus is more than 0.50, the document can be attributed to the highly specialized documents that contain criminal information.

In order to evaluate the correctness and reliability of the obtained borderline value of the semantic similarity coefficient, we used the metrics of recall, precision and F-measure. As a result of the experiment, we analyzed 1064 documents from the test corpus that were not used earlier, 520 of which were defined in advance as containing criminally significant information and 544 as belonging to other thematic areas. Table 3 shows a fragment of the quality assessment of the proposed technology for determining the semantic similarity to highly specialized texts.

Table 3. The fragment of the quality assessment of our technology (where NC is a non-criminal text and C is a criminal text)

File                           Apriori   Max(simcos)   Min(simcos)   Average(simcos)   System conclusion
057.ua_2018-11-09_51.txt       NC        0.65          0.19          0.42              NC
057.ua_2018-11-09_52.txt       C         0.71          0.12          0.49              NC
057.ua_2018-10-02_17.txt       C         0.81          0.22          0.59              C
057.ua_2018-11-09_53.txt       NC        0.78          0.14          0.52              NC
057.ua_2018-10-02_19.txt       C         0.82          0.25          0.67              C
Misto X_2018-10-02_20.txt      C         0.83          0.23          0.65              C
Misto X_2018-11-09_54.txt      NC        0.72          0.21          0.53              C
Misto X_2018-11-09_55.txt      NC        0.61          0.17          0.41              NC
Misto X_2018-11-09_56.txt      NC        0.79          0.22          0.43              NC
Mediaport_2018-08-31_20.txt    C         0.87          0.36          0.71              C
Mediaport_2018-08-31_16.txt    C         0.80          0.47          0.69              C
Mediaport_2018-08-31_16.txt    C         0.77          0.41          0.67              C
Mediaport_2018-08-31_18.txt    C         0.71          0.41          0.61              C
Mediaport_2018-07-01_19.txt    C         0.65          0.47          0.58              C
Mediaport_2018-11-08_20.txt    C         0.73          0.38          0.59              C
ATN_2018-11-09_57.txt          NC        0.61          0.15          0.44              NC
ATN_2018-11-09_58.txt          C         0.66          0.25          0.48              NC
ATN_2018-11-09_59.txt          NC        0.61          0.26          0.46              NC
ATN_2018-11-09_60.txt          C         0.79          0.40          0.62              C
ATN_2018-09-04_16.txt          C         0.88          0.32          0.70              C
VKh_2018-11-09_17.txt          C         0.85          0.41          0.72              C
VKh_2018-11-09_18.txt          C         0.78          0.49          0.69              C
VKh_2018-11-09_19.txt          C         0.84          0.43          0.71              C
VKh_2018-11-09_15.txt          C         0.87          0.41          0.71              C
VKh_2018-09-04_74.txt          NC        0.57          0.21          0.37              NC
VKh_2018-11-09_62.txt          NC        0.70          0.24          0.54              C
VKh_2018-11-09_63.txt          NC        0.51          0.17          0.39              NC
VKh_2018-11-09_64.txt          NC        0.64          0.23          0.43              NC
VKh_2018-11-09_65.txt          NC        0.53          0.19          0.37              NC

The result of the experiment is presented in Table 4, where:
tp (true positives) - texts which are correctly automatically defined as semantically close to the "criminal" theme;
fp (false positives) - texts which are incorrectly automatically defined as semantically close to the "criminal" theme;
fn (false negatives) - texts which are incorrectly automatically defined as semantically not close to the documents of the "criminal" theme;
tn (true negatives) - texts which are correctly automatically defined as not close to the "criminal" theme.

Table 4. The results of the experiment to determine the semantic similarity of documents to the "criminal" theme

tp = 512    fp = 36
fn = 8      tn = 518

Based on the above-mentioned values, we calculated the recall, precision and F-measure of the developed technology for determining the semantic similarity of a document to a highly specialized area (on the example of the criminal texts corpus).

Table 5. The recall, precision and F-measure of the developed technology

precision    recall    F1-measure
93.4%        98.5%     95.95%

The recall obtained in the experiments is slightly higher than the precision.
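For reference, a short sketch using the confusion-matrix counts from Table 4 reproduces, up to rounding, the precision, recall and F-measure values reported in Table 5.

```python
# A quick check of the evaluation metrics using the counts from Table 4.
tp, fp, fn, tn = 512, 36, 8, 518

precision = tp / (tp + fp)                            # 512 / 548, about 0.934
recall = tp / (tp + fn)                               # 512 / 520, about 0.985
f1 = 2 * precision * recall / (precision + recall)    # about 0.959

print(f"precision = {precision:.1%}, recall = {recall:.1%}, F1 = {f1:.1%}")
```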
The practical significance of these results lies in the fact that, when solving this specific task of identifying criminally significant texts, it is better to allow errors of the first type, which create redundancy of criminally significant documents, than errors of the second type, which miss texts with criminal content.

6 Conclusions

The evaluation of the semantic similarity of texts is a rather capacious and extensive task, which is an integral part of most linguistic tasks, for example, summarization, classification, creation of question-answering systems, information retrieval, etc. Most modern research still focuses on the development of this area specifically for the English language. There are a few available applications for semantic comparison of texts, such as WordNet::Similarity or AlchemyAPI. All of them have achieved good enough results, but despite this, algorithms for other languages are still underdeveloped. That is why the purpose of our research is to determine the thematic domain of Ukrainian and Russian texts in the absence of predefined classes by estimating semantic similarity. We consider the thematic field of texts with criminal content and, accordingly, everything that does not fall under this topic. For the study, a special news corpus was created, all texts of which were automatically collected from the news sites of Kharkiv into a large collection of documents (195,000 texts).

Machine learning offers many ways to determine the semantic similarity of texts. VSM is one of the most common and frequently used methods, but it has its disadvantages: the result may contain irrelevant documents or miss truly relevant ones due to inappropriate weighting of terms. In our study, we use PPMI as a weight function to mitigate this problem and obtain better results. In contrast to MI (mutual information), which refers to the average over all possible events, PMI refers to single events.

Standard measures were used to evaluate the results: precision, recall and F-measure. The recall of classifying documents as belonging to the highly specialized subject is above 98%, while precision is around 93% and the F-measure is 96%. These values indicate good and accurate results for the application of PPMI in the task of evaluating the semantic similarity of texts.

References

1. Salton, G., Wong, A., Yang, C.-S.: A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620 (1975).
2. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar (2014).
3. Zaidi, N.A.S., Mustapha, A., Mostafa, S.A., Razali, M.N.: A Classification Approach for Crime Prediction. In: Applied Computing to Support Industry: Innovation and Technology, pp. 68–78. Springer, Heidelberg (2019).
4. Jadhav, S.D., Channe, H.: Comparative Study of K-NN, Naive Bayes and Decision Tree Classification Techniques. In: Journal of Physics Conference Series, 1142(1):012011 (2016).
5. Rizun, N., Taranenko, Y., Waloszek, W.: Improving the accuracy in sentiment classification in the light of modelling the latent semantic relations. Information (MDPI), 9, 307 (2018).
6. Bajwa, A.M., Collarana, D., Vidal, M.-E.: Interaction Network Analysis Using Semantic Similarity Based on Translation Embeddings. In: International Conference on Semantic Systems, SEMANTiCS 2019: Semantic Systems. The Power of AI and Knowledge Graphs, pp. 249–255 (2019).
7. Ehrmann, M., Talmi, J.: Starting from a blank page? Semantic similarity in central bank communication and market volatility. ECB Working Paper, No. 2023 (2017).
8. Araque, O., Zhu, G., Iglesias, C.A.: A semantic similarity-based perspective of affect lexicons for sentiment analysis. Knowledge-Based Systems, 165, pp. 346–359 (2019).
9. Liu, M., Lang, B., Gu, Z.: Calculating Semantic Similarity between Academic Articles using Topic Event and Ontology. Published in arXiv (2017).
10. Ramage, D., Rafferty, A.N., Manning, C.D.: Random walks for text semantic similarity. In: Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, Association for Computational Linguistics, pp. 23–31 (2009).
11. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453, San Mateo, CA (1995).
12. Gad, W.K., Kamel, M.S.: New semantic similarity based model for text clustering using extended gloss overlaps. In: International Workshop on Machine Learning and Data Mining in Pattern Recognition, V. 7, 23, pp. 663–677 (2009).
13. Turney, P.D., Pantel, P.: From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, pp. 141–188 (2010).
14. Majumder, G., Pakray, P., Gelbukh, A., Pinto, D.: Semantic Textual Similarity Methods, Tools, and Applications: A Survey. Comp. Sist., vol. 20, no. 4, México (2016).
15. Furnas, G.W., Landauer, T.K., Gomez, L.M., Dumais, S.T.: Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal, 62(6), pp. 1753–1806 (1983).
16. Pantel, P., Lin, D.: Document clustering with committees. In: Proceedings of the 25th Annual International ACM SIGIR Conference, pp. 199–206 (2002).
17. Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas, 18(3), pp. 491–504 (2014).
18. Khairova, N., Kolesnyk, A., Mamyrbayev, O., Mukhsina, K.: The aligned Kazakh-Russian parallel corpus focused on the criminal theme. In: Proceedings of COLINS 2019, pp. 116–125, Kharkiv, Ukraine (2019).