<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Classification using Term Co-occurrence Matrix</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetiana Kovaliuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Yurchuk</string-name>
          <email>i.a.yurchuk@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kseniia Dukhnovska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oksana Kovtun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Nikolaienko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Bohdan Hawrylyshyn str. 24, Kyiv, UA-04116</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>44</fpage>
      <lpage>53</lpage>
      <abstract>
        <p>Among modern classification methods, the support vector machine (SVM) occupies a leading place due to its strict theoretical validity. It is used in pattern recognition and data mining, and is widely applied in search engines at the stage of classifying text documents. This article considers the support vector machine with a kernel based on the term co-occurrence matrix of a corpus of text documents. Using fuzzy set theory, the relationship between two terms in a collection of text documents can be defined, and from this relationship a kernel for the support vector machine can be built. The study proves that the term co-occurrence matrix can serve as such a kernel, and shows that the classification quality of the SVM with this kernel exceeds that of the standard SVM method.</p>
      </abstract>
      <kwd-group>
        <kwd>Document classification</kwd>
        <kwd>text corpus</kwd>
        <kwd>text documents</kwd>
        <kwd>term co-occurrence matrix</kwd>
        <kwd>kernel function</kwd>
        <kwd>support vector machines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Text documents stored in electronic repositories of global or corporate networks are the basis for decision-making in government, scientific, and educational institutions, and largely determine the success of their work. The intensive growth in the amount of text information, its universal accessibility, and its high dynamism lead to information overload. To overcome these problems, integrated banks of text documents are being built. During the formation of such repositories, documents undergo preliminary processing for the purpose of intellectual analysis. One of the most important stages of this preliminary processing is classification.</p>
      <p>Among modern classification methods, the support vector machine occupies a leading place due to its strict theoretical validity. This method is used in pattern recognition and data mining, and is also widely used to build search engines at the stage of classifying text documents. However, algorithms of this family suffer from a scalability problem: high memory consumption and long computation time at the training stage. This paper presents a strategy for improving the performance of support vector machines by applying a kernel based on a term co-occurrence matrix across multiple text documents. As a result of this strategy, the weight of more informative features and of more informative feature combinations increases, which makes the classifier faster and less resource-intensive.</p>
      <sec id="sec-1-1">
        <title>The purpose of the research is to improve the performance of the support vector machine method by using a kernel based on a term co-occurrence matrix in a corpus of text documents.</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <sec id="sec-2-1">
        <title>Artificial intelligence systems are based on research in text mining. However, the linguistic topic is inherently complex, necessitating the development of new models of text documents and innovative approaches to classification and clustering of text documents. In the scientific study [1], it is shown that a text document can be represented as a vector.</title>
      </sec>
      <sec id="sec-2-2">
        <title>A new model for text documents is introduced in [2]. This model is based on the concept of fractals to deepen the complexity of language. This approach reduces potential noise. Additionally, the authors introduce an innovative activation function to enhance the performance of the neural network. The results of this study are validated through real technical reports.</title>
        <p>Text analysis is the process of extracting and interpreting information from a collection of text documents to identify and describe their key characteristics and features. It is a crucial tool in the field of linguistics, enabling the comprehension of text structure, content, features, and main ideas. The purpose of text analysis is to uncover the content, context, and linguistic and stylistic features of the text. A central step of text analysis is the identification of entities (terms). As stated in [3], the purpose of entity recognition is to automatically identify expected knowledge from text. For example, in [4, 5], algorithms for detecting technical terms are considered.</p>
      </sec>
      <sec id="sec-2-3">
        <title>In [6], a text document is considered as a vector of features:</title>
        <p>T = [w_1, w_2, … , w_n], (1)
and entities are recognized in the form:</p>
        <p>[e, s, t, c], (2)
where e denotes a specific entity; s and t are indexes that mark the span from w_s to w_t, thus specifying the location of the entity; c represents the type of the entity.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Another task of text analysis is the classification of text documents. The classification of text documents can be performed using various algorithms and models. For example, in [7, 8], it is proposed to assign a category label to the text document T, such as "acceptable, tolerable, investigated, and corrected" for risk assessment.</title>
      </sec>
      <sec id="sec-2-5">
        <title>In the work [9], an ensemble classification method is proposed for multi-label classification of text documents.</title>
        <p>The method combines the random forest algorithm and the semantic vector space of hidden semantic core co-occurrence. The work shows that random word segmentation increases the diversity of the ensemble and yields another orthogonal projection of the low-dimensional space of latent semantics.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Latent Dirichlet Allocation (LDA) is often used to define classes. In the article [10], an algorithm for clustering short emotional texts based on the LDA algorithm is proposed.</title>
        <p>The text document is represented in TF-IDF format. In addition, thematic word pairs and topic-related words are extracted and fed into the LDA model for clustering. This approach allows more accurate semantic information to be found. The results of this work can be successfully applied to the analysis of texts published on social media.</p>
      </sec>
      <sec id="sec-2-7">
        <title>The Support Vector Machine (SVM) method is widely used for classification.</title>
        <p>The work [11] proposes a novel hybrid approach that leverages a gray-level co-occurrence matrix and an SVM classifier to achieve highly accurate segmentation and classification of malignant and benign cells in breast cytology images under severe noise conditions. In [12], SVM is investigated for sentiment analysis; to improve the classification accuracy of the SVM method, that work uses particle swarm optimization and a genetic algorithm.</p>
      </sec>
      <sec id="sec-2-8">
        <title>In [13], the author of a text document is determined using classification. Here, several methods for classifying text documents written by several authors are compared.</title>
        <p>The work compares the results of classification based on the following methods: artificial neural networks, multi-expression programming, k-nearest neighbors, support vector machines, and C5.0 decision trees.</p>
        <p>
          Multi-view support vector machines are investigated in [14] to address the problems of multi-view image classification. The work proposes to introduce a fuzzy assessment to assign a weight to each sample from multiple images. This assessment combines membership and non-membership functions, which provides an efficient mechanism for assigning weight coefficients to collections of images with multiple representations.
        </p>
        <p>Recently, researchers have been investigating term-document matrices. For example, in [15, 16], the relationships between terms and their use in text documents are studied. Here, the input data is a stream of term-document co-occurrence events. Taking this stream into account, latent vectors of terms and documents are learned. The goal of this study is search optimization; for this purpose, the work proposes a dimensionality reduction algorithm for adaptively learning the latent semantic index of terms and documents in a collection. The results of this study demonstrate improvements in search performance compared to the baseline method.</p>
      </sec>
      <sec id="sec-2-9">
        <title>The application of network analysis to corpus linguistics is introduced in [17].</title>
        <p>The authors conducted a comprehensive initial study involving various practical analyses, including frequency, keyword, collocation, and cluster analysis. The work proposes a novel procedure capable of extracting diverse intertextual and intratextual aspects from the analyzed documents. This procedure captures the existing connections between different elements of the corpus, enabling a deeper understanding of the relationships and dynamics within the sets of texts.</p>
      </sec>
      <sec id="sec-2-10">
        <title>The term-document matrix was used in [18] to detect and track topics in a collection of text documents.</title>
        <p>In this work, a hierarchical non-negative matrix factorization method was proposed for creating topic hierarchies from text collections. The proposed method can dynamically adjust the topic hierarchy to adapt to emerging, developing, and fading processes [19]. The work proves that such an approach can achieve better performance with competitive time savings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem definitions</title>
      <p>
        Any text document can be described as a tuple:
d_i = (w_i1, w_i2, … , w_im, f_i1, f_i2, … , f_ih),  i = 1, n, (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
where n is the number of text documents in the corpus, w_ij is a statistical measure of the importance of the j-th term of the i-th document, m is the power of the dictionary, f_ik are additional attributes in the description of the text document, and h is the number of additional attributes.
      </p>
      <sec id="sec-3-1">
        <title>Corpus is a collection of text documents. Terms (concepts) are the names of mental images that are transferred in the process of information exchange.</title>
        <p>The terms are contained in the dictionary. A statistical measure of the importance of a term is the ratio of the number of occurrences of the term in a text document to the number of all terms in that document. Additional features (f_ik) can include the creation date of the text document, its author, address, links to other text documents, etc.</p>
        <p>
          The vector d_i represented by the tuple (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) is called the profile of the text document. The classification of text documents consists in dividing a set of these resources into non-intersecting groups so as to ensure the minimum difference between the resources of one group, which corresponds to a certain content topic, and the maximum difference from the resources of other groups.
        </p>
        <p>Let a set of text documents D = { d_i | i = 1, n } and a set of classes C = { c_j | j = 1, k } be given. Each class c_j is described by some structure S_j = { s_j1, s_j2, … , s_jr }. The classification procedure consists of performing some transformations on the profile of a text document, after which a conclusion is made about the correspondence of the resource d_i to one of the structures S_j.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.1. The relationship between terms in a document</title>
      <p>If we neglect the additional parameters, then the set of text document profiles can be represented as a term-document matrix. Following fuzzy set theory, the relationship coefficient between two terms can be defined as:
k_ij = ñ_ij / (ñ_i + ñ_j − ñ_ij), (5)
where ñ_i is the number of text documents that contain the i-th term, ñ_j is the number of text documents that contain the j-th term, and ñ_ij is the number of text documents that contain both terms.</p>
      <p>
        However, (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) neglects the frequency of term occurrence in the document. The relationship coefficient can have the same value for terms that carry the main content of the document and for unimportant terms. To overcome this drawback in (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ), we introduce the normalized term frequency of a term in the document, w_ij = n_ij / ∑_k n_ik, and define the relationship coefficient through the resulting frequencies:
k_ij = w̃_ij / (w̃_i + w̃_j − w̃_ij), (6)
where w̃_i, w̃_j, and w̃_ij are the quantities of (5) computed from the normalized term frequencies instead of raw document counts.
      </p>
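      <p>As an illustration, the document-count coefficients above can be computed directly from a document-term incidence matrix; the toy corpus and its values below are invented for the example, not taken from the paper's collection.

```python
import numpy as np

# Toy corpus: rows are documents, columns are terms; an entry of 1 means
# the term occurs in the document (the data is illustrative only).
incidence = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

def cooccurrence_matrix(incidence):
    """Pairwise coefficients k_ij = n_ij / (n_i + n_j - n_ij), where n_i
    counts documents containing term i and n_ij counts documents
    containing both terms (the Jaccard-style measure of formula (5))."""
    n_both = incidence.T @ incidence          # n_ij off-diagonal, n_i on diagonal
    n_single = np.diag(n_both)                # n_i
    denom = n_single[:, None] + n_single[None, :] - n_both
    return n_both / denom

K = cooccurrence_matrix(incidence)
print(np.round(K, 2))                          # symmetric, ones on the diagonal
```

The resulting matrix is symmetric with unit diagonal, matching the interpretation of k_ij as a correlation-like measure between two terms.</p>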
      <p>As a result, the matrix of relationships between terms takes the form:
K = ( k_11  k_12  ⋯  k_1m
      k_21  k_22  ⋯  k_2m
      ⋮     ⋮     ⋱   ⋮
      k_m1  k_m2  ⋯  k_mm ). (7)</p>
      <p>
        If both sides of expression (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ) are multiplied on the left by S⁻¹ and on the right by S (this is possible since SS⁻¹ = S⁻¹S = E, the identity matrix), then:
      </p>
      <p>Λ = S⁻¹ ∙ K ∙ S. (9)</p>
      <p>Λ^(1/2) = ( √λ_1  0  ⋯  0
             0  √λ_2  ⋯  0
             ⋮   ⋮   ⋱   ⋮
             0  0  ⋯  √λ_m ). (10)</p>
      <p>
        All eigenvalues of a positive definite matrix are positive. Hence, all non-zero elements of the diagonal matrix Λ are greater than 0, which means that the matrix Λ^(1/2) exists.
      </p>
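      <p>The existence of this square root can be checked numerically; the sketch below builds it from the eigendecomposition, using an illustrative symmetric matrix rather than a real co-occurrence matrix.

```python
import numpy as np

# For a symmetric positive definite relationship matrix K, the
# eigendecomposition K = S @ L @ S^-1 gives a square root
# B = S @ sqrt(L) @ S^-1 with B @ B = K. K below is a stand-in.
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

eigvals, S = np.linalg.eigh(K)         # columns of S are eigenvectors of K
assert np.all(eigvals > 0)             # positive definiteness check
sqrt_L = np.diag(np.sqrt(eigvals))     # the diagonal matrix Lambda^(1/2)
B = S @ sqrt_L @ S.T                   # S is orthogonal, so S^-1 = S^T
print(np.allclose(B @ B, K))           # True: B is a square root of K
```

Because S is orthogonal here, the inverse in the decomposition is simply the transpose.</p>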
      <sec id="sec-4-1">
        <title>In essence, it is a correlation matrix of two terms in a corpus, also known as a term co-occurrence matrix. Its coefficients indicate a statistical dependence of the frequencies of two terms, and changes in the values of one or more of these quantities lead to a systematic change in the values of another or other quantities.</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2. Property of term co-occurrence matrix</title>
      <sec id="sec-5-1">
        <title>Property 1. The term co-occurrence matrix K (7) in a corpus is positive definite.</title>
      </sec>
      <sec id="sec-5-2">
        <title>Proof: According to Sylvester's criterion: a symmetric matrix is positive definite if and only if its</title>
        <p>
          leading minors are positive. The proof is based on using the Gauss method to reduce matrix (
          <xref ref-type="bibr" rid="ref7">7</xref>
          )
to triangular form (taking into account that the matrix is symmetric and its elements are positive numbers). Under such transformations, the values of the leading minors do not change and are equal to the products of the diagonal elements.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>Property 2. The term co-occurrence matrix in a corpus is square.</title>
      </sec>
      <sec id="sec-5-4">
        <title>Statement 1. For the term co-occurrence matrix, there is a matrix B, which is represented as</title>
        <p>K = B², that is, B = K^(1/2).</p>
      </sec>
      <sec id="sec-5-5">
        <title>Proof: From Schur's lemma it follows that if a matrix  is symmetric, then there is an orthogonal</title>
        <p>matrix S, the columns of which are the eigenvectors of the matrix K, and a diagonal matrix Λ, the elements of which are the eigenvalues of the matrix K, such that:</p>
        <p>K = S ∙ Λ ∙ S⁻¹. (8)</p>
        <p>A function k: X × X → R is called a kernel if it can be represented in the form k(x, x′) = [φ(x), φ(x′)] under some mapping φ: X → H, where H is a space with a scalar product.</p>
        <p>Let φ(x) = K^(1/2) ∙ x; then the corresponding kernel is:</p>
        <p>k(x, x₁) = xᵀ ∙ K ∙ x₁. (11)</p>
      </sec>
      <sec id="sec-5-6">
        <title>Statement 2. The function defined by expression (11) is a kernel.</title>
      </sec>
      <sec id="sec-5-7">
        <title>Proof:</title>
        <p>k(x, x₁) = xᵀ ∙ K ∙ x₁ = xᵀ ∙ √K ∙ √K ∙ x₁ = [φ(x), φ(x₁)].</p>
      </sec>
      <sec id="sec-5-8">
        <title>The term co-occurrence matrix is the kernel.</title>
        <p>K = S ∙ (S⁻¹ ∙ K ∙ S) ∙ S⁻¹ = S ∙ Λ ∙ S⁻¹ = S ∙ Λ^(1/2) ∙ Λ^(1/2) ∙ S⁻¹ = (S ∙ Λ^(1/2) ∙ S⁻¹) ∙ (S ∙ Λ^(1/2) ∙ S⁻¹) = B ∙ B = B². (12)</p>
        <p>3.3. Support vector machines</p>
        <p>There are many approaches to solving classification problems: probabilistic approaches (for example, the naive Bayes method and its modifications) and algebraic approaches (through various measures of proximity of text document profiles: Euclidean distance and its modifications, Manhattan distance, Mahalanobis distance, etc.). Today, the support vector machine is rated among the most effective classification methods. It should be noted that the support vector machine solves a binary classification problem. If there are many classes in a corpus, the classification problem can be solved by separating each class from all the others. In this case, each binary problem does not depend on the others, and they can be solved in parallel on different machines.</p>
      </sec>
      <sec id="sec-5-9">
        <title>Support vector machines are a set of supervised learning classification algorithms.</title>
        <p>To implement this method, each text document is represented as a point in N-dimensional space. The classes are determined by the clusters of these points, and a separating hyperplane is drawn between them. That is, a hyperplane is constructed such that the distance between it and the closest points of the two classes is maximal. If such a hyperplane exists, it is called the optimal separating hyperplane.</p>
      </sec>
      <sec id="sec-5-10">
        <title>The equation of the hyperplane has the form:</title>
        <p>w ∙ x − b = 0, (13)
where w is the normal vector to the separating hyperplane and b determines the distance from the hyperplane to the origin (Figure 1).</p>
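        <p>The role of equation (13) can be shown with a small numeric sketch; the values of w, b, and the points are illustrative, not learned from data.

```python
import numpy as np

# Which side of the hyperplane w . x - b = 0 each point falls on.
w = np.array([1.0, 1.0])                     # illustrative normal vector
b = 1.0                                      # illustrative offset

points = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 0.0]])
sides = np.sign(points @ w - b)
print(sides)   # one point on each side and one exactly on the hyperplane
```

Points with positive sign lie on one side of the hyperplane, negative on the other; a zero means the point lies on the hyperplane itself.</p>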
      </sec>
      <sec id="sec-5-11">
        <title>With respect to this hyperplane, all points of the same class lie on the same side. If we construct a hyperplane parallel to the given one and passing through the class point closest to the optimal separating hyperplane, then its equation will be w ∙ x − b = 1; the equation of the analogous hyperplane for the other class is w ∙ x − b = −1.</title>
      </sec>
      <sec id="sec-5-12">
        <title>To find a solution to such a problem, the Lagrange function is composed:</title>
        <p>L(x, λ) = f(x) + ∑ λ_i ∙ g_i(x), (16)
where λ_i are the Lagrange multipliers.</p>
        <p>
          According to the Kuhn-Tucker theorem, problem (
          <xref ref-type="bibr" rid="ref14">14</xref>
          ) will take the form:
L(w, b, λ) = (1/2)‖w‖² − ∑ λ_i ∙ (y_i ∙ (w ∙ x_i − b) − 1) → min over w, b; max over λ. (17)
Moreover, it reduces to an equivalent problem that contains only the dual variables.
        </p>
      </sec>
      <sec id="sec-5-13">
        <title>Between these hyperplanes a strip is formed, which must be free of points of either class. To exclude all points from the strip, the following condition must be checked:</title>
      </sec>
      <sec id="sec-5-14">
        <title>The problem of constructing an optimal separating hyperplane is reduced to the problem of minimizing the length of the vector w. This is a quadratic optimization problem (14); rewritten in the general form of a mathematical programming problem, it becomes (15):</title>
        <p>{ w ∙ x_i − b ≥ 1,  if y_i = 1,
  w ∙ x_i − b ≤ −1,  if y_i = −1;</p>
        <p>{ (1/2)‖w‖² → min,
  y_i ∙ (w ∙ x_i − b) ≥ 1; (14)</p>
        <p>{ f(x) → min,
  g_i(x) ≥ 0. (15)</p>
      </sec>
      <sec id="sec-5-15">
        <title>If the problem is solved, then w and b can be found using the formulas:</title>
      </sec>
      <sec id="sec-5-16">
        <title>As a result, the classification algorithm can be written as:</title>
        <p>w = ∑ λ_i ∙ y_i ∙ x_i, (18)
b = w ∙ x_i − y_i, (19)
a(x) = sign ( ∑ λ_i ∙ y_i ∙ [x_i, x] − b ). (20)</p>
      </sec>
      <sec id="sec-5-17">
        <title>Modified support vector machines contain arbitrary kernels instead of scalar products, which removes the linearity restriction. Replacing the scalar product in (20) with an arbitrary kernel k, we get: a(x) = sign ( ∑ λ_i ∙ y_i ∙ k[x_i, x] − b ). (21)</title>
      </sec>
      <sec id="sec-5-18">
        <title>For a more confident and effective classification in formula (21), matrix (7) was used as the kernel. Problem (21) with kernel (7) takes the form:</title>
        <p>{ (1/2) ∑ ∑ λ_i ∙ λ_j ∙ y_i ∙ y_j ∙ (x_i ∙ K ∙ x_j) − ∑ λ_i → min over λ,
  λ_i ≥ 0,  1 ≤ i ≤ n,
  ∑ λ_i ∙ y_i = 0. (22)</p>
        <p>The objective function of this problem is quadratic and the constraints are linear functions, so the problem belongs to the class of quadratic programming problems. The most popular methods for solving problems of this class are gradient methods; in the general case, they find a local extremum point. Starting from a certain point, a sequential transition is carried out to other points in the direction of the antigradient until an admissible solution to the original problem is found. The iterative process continues until the gradient of the function at the next point becomes equal to zero or smaller than some preset small value (the accuracy of the resulting solution).</p>
        <p>The practical implementation of training this text document classification method can be described by the following steps:
1. Digitization of a text document: removal of various control characters, tags, and stop words, and representation of the text document in vector form.</p>
      </sec>
      <sec id="sec-5-19">
        <title>2. Compilation of matrix coefficients (7) for the incoming set of text documents.</title>
      </sec>
      <sec id="sec-5-20">
        <title>3. Initial approximation: an arbitrary vector representing a document of one class is selected, and the closest vector of the other class is found for it.</title>
        <p>For that vector, in turn, the closest vector from the first class is found, and so on.</p>
      </sec>
      <sec id="sec-5-21">
        <title>4. Solving problem (22) using the gradient descent method.</title>
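        <p>Step 1 of this procedure can be sketched as follows; the stop-word list, dictionary, and sample sentence are illustrative assumptions, not the paper's actual resources.

```python
import re
from collections import Counter

# Minimal sketch of digitization: normalize a document, drop stop words,
# and represent it as a normalized term-frequency vector over a fixed
# dictionary (all names and data here are illustrative).
STOP_WORDS = {"the", "a", "of", "is", "and", "in"}

def digitize(text, dictionary):
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    # normalized term frequency: occurrences divided by all terms kept
    return [counts[term] / total for term in dictionary]

dictionary = ["vector", "kernel", "matrix"]
print(digitize("The kernel of the matrix is a kernel matrix.", dictionary))
# -> [0.0, 0.5, 0.5]
```

The resulting vectors are the document profiles on which the co-occurrence matrix of step 2 is built.</p>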
      </sec>
      <sec id="sec-5-22">
        <title>The use of support vector machines compares favorably with other methods in that the task can be</title>
        <p>parallelized. Considering the gigantic capacity of modern data banks, the size of the training sample should be estimated in the hundreds of thousands. This dimension makes the use of standard numerical methods of quadratic programming impossible. To date, several algorithms have been proposed for optimizing such problems. One of them is the sequential minimal optimization (SMO) method, in which the smallest possible subtask is solved at each iteration. The result of this partitioning is many simple and independent subtasks, which can therefore be computed in parallel on different machines.</p>
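        <p>A minimal sketch of training on the proposed kernel, assuming scikit-learn is available: the kernel k(x, x′) = xᵀ ∙ K ∙ x′ is passed to the solver as a precomputed Gram matrix. The document profiles, labels, and co-occurrence matrix below are random stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((40, 5))                  # 40 illustrative document profiles
y = np.repeat([0, 1], 20)                # two classes
K = np.eye(5) + 0.2                      # illustrative co-occurrence matrix

gram = X @ K @ X.T                       # pairwise kernel values x_i^T K x_j
clf = SVC(kernel="precomputed").fit(gram, y)

x_new = rng.random((1, 5))
pred = clf.predict(x_new @ K @ X.T)      # kernel of x_new against training set
print(pred)
```

With a precomputed kernel, prediction requires the kernel values between the new document and every training document, which is why the test-time matrix has shape (1, 40) here.</p>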
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Classification quality characteristics</title>
      <p>Classification quality characteristics are divided into two error levels. A first-level error occurs when a text document is mistakenly not assigned to the required class. A second-level error occurs when a document is mistakenly assigned to the class. Let the number of documents in the test set be N, of which N_c is the number of documents that belong to the class and N_n is the number of documents that do not belong to the class; then N = N_c + N_n. The metric accuracy (A) is used to evaluate the overall accuracy of a classifier. It is calculated by dividing the number of correctly classified instances by the total number of instances in the dataset:</p>
      <p>A = (T_p + T_n) / N ∙ 100%. (23)</p>
      <p>Let the number of false omissions be F_n and the number of false detections be F_p; then the number of correct detections is T_p = N_c − F_n and the number of correct rejections is T_n = N_n − F_p. The precision (P) and recall (R), which are often used in information retrieval tasks, are calculated from the T_p, F_p, and F_n characteristics:</p>
      <p>P = T_p / (T_p + F_p) ∙ 100%,  R = T_p / (T_p + F_n) ∙ 100%. (24)</p>
      <sec id="sec-6-1">
        <title>Completeness (recall) measures the proportion of correctly classified documents among all documents of a given class.</title>
        <p>Precision measures the proportion of correct detections among all detected resources. Completeness and precision are mutually dependent quantities. When developing the architecture of a text document classifier, one of the two characteristics usually has to be chosen as dominant. If the choice falls on precision, completeness decreases because more documents of the class are missed; an increase in completeness causes a simultaneous decrease in precision. Therefore, it is convenient to characterize the classifier by a single value, the so-called F1-score, or van Rijsbergen measure:</p>
        <p>F1 = 2 ∙ P ∙ R / (P + R).</p>
        <p>The F1-score is one of the most common characteristics for systems of this type. There are two main approaches to calculating it for classification problems: the total F1-score (results for all classes are summarized in one table, from which the F1-score is then calculated) and the average F1-score (an F1-score is formed for each class, and then the arithmetic mean over all classes is taken).</p>
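        <p>The characteristics above can be expressed compactly in code; the counts in the example (correct detections, false detections, false omissions) are invented for illustration.

```python
# Precision, recall, and F1-score for a single class from raw counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)      # share of detections that are correct
    recall = tp / (tp + fn)         # share of the class that was detected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.8 0.8 0.8
```

Averaging per-class F1 values as in the second approach above gives the macro-averaged F1-score.</p>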
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Research materials and results</title>
      <p>As working material for the experiments, a test sample of text documents in two scientific disciplines was taken: information retrieval and continuum mechanics. That is, the test collection needed to be divided into two classes. Each class had approximately the same number of text documents (200 and 220), which ensured uniformity of results: no class stood out solely because of the number of documents in it. It should be noted that the accuracy of assigning documents to a particular class can depend heavily on the quality of the resources of that class. The total number of text documents in the sample was 420. They were divided randomly into two equal parts of 210 documents each, maintaining approximately equal numbers of resources per class.</p>
      <p>Training was carried out on one of these two sets, and testing on the other. Next, all documents from the training set were divided into 5 parts. Removing the first part from the training set, the classifier was trained on the remaining 80% of documents, and the quality indicators of the classifier were determined on the test sample. Then the second part was removed from the training set, other values of the performance indicators were calculated, and so on. As a result, 5 values of the classifier performance indicators were obtained. After this, the sets were swapped and the runs were repeated. The arithmetic mean of these values was taken as the final result. This averaging smoothed the results, making them more objective.</p>
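      <p>The evaluation protocol described above can be sketched as follows; score() is a stand-in for training and testing a real classifier, and the document set here is just a list of identifiers.

```python
import random
import statistics

def five_fold_scores(documents, score):
    """Split into 5 parts, hold out each part once, average the scores."""
    documents = list(documents)
    random.shuffle(documents)
    folds = [documents[i::5] for i in range(5)]
    results = []
    for i in range(5):
        held_out = folds[i]
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        results.append(score(training, held_out))
    return statistics.mean(results)

# With 210 documents, each run trains on 168 documents, i.e. 80%:
avg = five_fold_scores(list(range(210)),
                       lambda training, held_out: len(training) / 210)
print(avg)   # -> 0.8
```

Running the same procedure with the two sets swapped and averaging again reproduces the smoothing described above.</p>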
      <sec id="sec-7-1">
        <title>Software implementation of these classification methods was carried out in the Python environment.</title>
      </sec>
      <sec id="sec-7-2">
        <title>As a result of the work done, the following results were obtained (Figure 2, Table 1).</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusions</title>
      <sec id="sec-8-1">
        <title>Testing of the support vector method and its proposed modification was carried out on the same collection; the documents underwent the same pre-processing and digitization, which justifies comparing the performance indicators of the classification methods.</title>
      </sec>
      <sec id="sec-8-2">
        <title>As seen from the calculation results, the classification quality of the support vector machine using the kernel of the term co-occurrence matrix in a collection of text documents surpasses the quality achieved by the standard SVM method. Furthermore, the modified algorithm demonstrates a 13% higher precision and a 6% higher F1-score.</title>
        <p>The assessment of classification quality is based on detecting regularities for each class:
attribute values that are shared by most objects of the class and that differ from the attribute values
of other classes. The absence of such regularities indicates that the class is not a homogeneous set of
text-document profiles. Classification quality is considered higher the closer the document profiles lie
within a class. To analyze the dispersion of the classified documents, a qualitative measure called
condensation is introduced: it shows how closely the document profiles lie within a class compared
with the spread of objects across the entire original population. The classification procedure is
considered complete when the resulting division into classes satisfies this condensation condition.
Classes satisfy the condensation condition when the maximum spread between the document profiles
of a single class is less than the root-mean-square spread of objects within the entire original
population. This result is achieved thanks to the term co-occurrence matrix.</p>
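The condensation condition described above can be made concrete in code. The following is a minimal illustrative sketch (the function and variable names are our assumptions, not taken from the paper): it declares the condition satisfied when, for every class, the maximum pairwise Euclidean distance between document profiles of that class is below the root-mean-square spread of all profiles around the global mean.

```python
from itertools import combinations
from math import dist, sqrt


def condensation_holds(profiles, labels):
    """Illustrative check of the condensation condition.

    profiles: list of equal-length numeric vectors (document profiles).
    labels: class label for each profile.
    Returns True when every class's maximum pairwise profile distance is
    less than the RMS spread of all profiles around the global mean.
    """
    n, d = len(profiles), len(profiles[0])
    # Global mean profile of the whole population
    mean = [sum(p[j] for p in profiles) / n for j in range(d)]
    # Root-mean-square spread of all profiles around the global mean
    rms = sqrt(sum(dist(p, mean) ** 2 for p in profiles) / n)
    for c in set(labels):
        cls = [p for p, lab in zip(profiles, labels) if lab == c]
        # Maximum spread = largest pairwise distance within the class
        max_spread = max((dist(a, b) for a, b in combinations(cls, 2)),
                         default=0.0)
        if max_spread >= rms:
            return False
    return True
```

Two tight, well-separated clusters satisfy the condition, while a class whose profiles are spread as widely as the whole population does not.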
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>K. K.</given-names> <surname>Dukhnovska</surname></string-name>, <article-title>Formation of the research dynamic vector space</article-title>. <source>Artificial Intelligence</source> <issue>3-4</issue> (<year>2015</year>), pp. <fpage>28</fpage>-<lpage>36</lpage>. http://nbuv.gov.ua/UJRN/II_2015_3-4_5.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Gao</surname></string-name>, <article-title>A new multifractal-based deep learning model for text mining</article-title>. <source>Information Processing &amp; Management</source> (<year>2024</year>), <volume>61</volume>(<issue>1</issue>): <fpage>103561</fpage>. doi: 10.1016/j.ipm.2023.103561.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Gao</surname></string-name>, <article-title>Why KDAC? A general activation function for knowledge discovery</article-title>. <source>Neurocomputing</source> (<year>2022</year>), <volume>501</volume>, pp. <fpage>343</fpage>-<lpage>358</lpage>. doi: 10.1016/j.neucom.2022.06.019.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Gao</surname></string-name>, <article-title>Text mining of hazard and operability analysis reports based on active learning</article-title>. <source>Processes</source> (<year>2021</year>), <volume>9</volume>(<issue>7</issue>): <fpage>1178</fpage>. doi: 10.3390/pr9071178.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><given-names>F.</given-names> <surname>Simone</surname></string-name>, <string-name><given-names>S. M.</given-names> <surname>Ansaldi</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Agnello</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Patriarca</surname></string-name>, <article-title>Industrial safety management in the digital era: Constructing a knowledge graph from near misses</article-title>. <source>Computers in Industry</source> (<year>2023</year>), <volume>146</volume>: <fpage>103849</fpage>. doi: 10.1016/j.compind.2022.103849.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><given-names>J.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>S. S.</given-names> <surname>Chouhan</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Raychoudhury</surname></string-name>, <article-title>A machine learning based framework to identify unseen classes in open-world text classification</article-title>. <source>Information Processing &amp; Management</source> (<year>2023</year>), <volume>60</volume>(<issue>2</issue>): <fpage>103214</fpage>. doi: 10.1016/j.ipm.2022.103214.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>X.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Ji</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Dang</surname></string-name>, <article-title>Application of natural language processing in HAZOP reports</article-title>. <source>Process Safety and Environmental Protection</source> <volume>155</volume> (<year>2021</year>), pp. <fpage>41</fpage>-<lpage>48</lpage>. doi: 10.1016/j.psep.2021.09.001.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <string-name><given-names>A.</given-names> <surname>Deloose</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Gysels</surname></string-name>, <string-name><given-names>B.</given-names> <surname>De Baets</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Verwaeren</surname></string-name>, <article-title>Combining natural language processing and multidimensional classifiers to predict and correct CMMS metadata</article-title>. <source>Computers in Industry</source> <volume>145</volume> (<year>2023</year>): <fpage>103830</fpage>. doi: 10.1016/j.compind.2022.103830.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><given-names>R.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Sui</surname></string-name>, <article-title>Multi-label text classification method based on co-occurrence latent semantic vector space</article-title>. <source>Procedia Computer Science</source> <volume>131</volume> (<year>2018</year>), pp. <fpage>756</fpage>-<lpage>764</lpage>. doi: 10.1016/j.procs.2018.04.321.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] <string-name><given-names>D.</given-names> <surname>Wu</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Shen</surname></string-name>, <article-title>Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm</article-title>. <source>Journal of Intelligent Information Systems</source> <volume>56</volume> (<year>2021</year>), pp. <fpage>1</fpage>-<lpage>23</lpage>. doi: 10.1007/s10844-020-00597-7.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] <string-name><given-names>S. U.</given-names> <surname>Khan</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Islam</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Jan</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Haseeb</surname></string-name>, <string-name><given-names>S. I. A.</given-names> <surname>Shah</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hanif</surname></string-name>, <article-title>A machine learning-based approach for the segmentation and classification of malignant cells in breast cytology images using Gray Level Co-occurrence Matrix (GLCM) and Support Vector Machine (SVM)</article-title>. <source>Neural Computing and Applications</source> (<year>2022</year>), <volume>34</volume>: <fpage>8365</fpage>-<lpage>8372</lpage>. doi: 10.1007/s00521-021-05697-1.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] <string-name><given-names>Y. T.</given-names> <surname>Arifin</surname></string-name>, <article-title>Komparasi fitur seleksi pada algoritma support vector machine untuk analisis sentimen review [Comparison of feature selection in the support vector machine algorithm for review sentiment analysis]</article-title>. <source>Jurnal Informatika</source> (<year>2016</year>), <volume>3</volume>(<issue>2</issue>), pp. <fpage>191</fpage>-<lpage>199</lpage>. https://ejournal.bsi.ac.id/ejurnal/index.php/ji/article/view/868.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] <string-name><given-names>S. M.</given-names> <surname>Avram</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Oltean</surname></string-name>, <article-title>A comparison of several AI techniques for authorship attribution on Romanian texts</article-title>. <source>Mathematics</source> (<year>2022</year>), <volume>10</volume>(<issue>23</issue>): <fpage>4589</fpage>. doi: 10.3390/math10234589.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] <string-name><given-names>C.</given-names> <surname>Lou</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Xie</surname></string-name>, <article-title>Multi-view intuitionistic fuzzy support vector machines with insensitive pinball loss for classification of noisy data</article-title>. <source>Neurocomputing</source> (<year>2023</year>), <volume>549</volume>(<issue>7</issue>): <fpage>126458</fpage>. doi: 10.1016/j.neucom.2023.126458.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] <string-name><given-names>S.-H.</given-names> <surname>Na</surname></string-name>, <string-name><given-names>J.-H.</given-names> <surname>Lee</surname></string-name>, <article-title>Memory-restricted latent semantic analysis to accumulate term-document co-occurrence events</article-title>. <source>Pattern Recognition Letters</source> (<year>2012</year>), <volume>33</volume>(<issue>12</issue>), pp. <fpage>1623</fpage>-<lpage>1631</lpage>. doi: 10.1016/j.patrec.2012.05.002.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] <string-name><given-names>D.</given-names> <surname>Tu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lv</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Shi</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Chen</surname></string-name>, <article-title>Hierarchical online NMF for detecting and tracking topic hierarchies in a text stream</article-title>. <source>Pattern Recognition</source> <volume>76</volume> (<year>2018</year>), pp. <fpage>203</fpage>-<lpage>214</lpage>. doi: 10.1016/j.patcog.2017.11.002.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] <string-name><given-names>K.</given-names> <surname>Stuart</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Botella</surname></string-name>, <article-title>Corpus linguistics, network analysis and co-occurrence matrices</article-title>. <source>International Journal of English Studies</source> (<year>2009</year>), <volume>9</volume>(<issue>3</issue>), pp. <fpage>1</fpage>-<lpage>20</lpage>. https://revistas.um.es/ijes/article/view/99481.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] <string-name><given-names>K.</given-names> <surname>Dukhnovska</surname></string-name>, <article-title>Search for regularities for classes of text documents</article-title>. <source>Scientific and practical conference &#8220;Current problems of the theory of control systems in computer sciences&#8221;</source> (<year>2021</year>), Slovyansk, Ukraine, December 21-24, pp. <fpage>32</fpage>-<lpage>34</lpage>.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] <string-name><given-names>N.</given-names> <surname>Kiktev</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Rozorinov</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Masoud</surname></string-name>, <article-title>Information model of traction ability analysis of underground conveyors drives</article-title>. <source>2017 XIIIth International Conference on Perspective Technologies and Methods in MEMS Design (MEMSTECH)</source>, Lviv, Ukraine, <year>2017</year>, pp. <fpage>143</fpage>-<lpage>145</lpage>. doi: 10.1109/MEMSTECH.2017.7937552.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>