<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Rena R. Timirgaleeva</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Clustering Method for Control Problems Based on a Genetic Algorithm with K-means Mutation Operator*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yaroslav A. Bondarev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ayshe N. Ilyasova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lomonosov Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>V.I. Vernadsky Crimean Federal University</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1858</year>
      </pub-date>
      <volume>1</volume>
      <issue>2</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Information retrieval is an integral part of the life of the majority of the world's population and one of the indispensable tools in solving various management tasks in business, technology, complex systems. However, meeting modern information needs, using existing technologies, often requires a lot of time. In this paper, we present a system the purpose of which is to search the Internet and group search results taking into account the semantics of the documents found. It is worth noting that the system is designed for the mass user and has an intuitive interface. The proposed algorithm can be used to solve the problems of clustering objects in the process of selecting the optimal impact on a technical object, such as an aircraft or an unmanned aerial vehicle. Methods of cluster analysis are used in the work to obtain the desired results, in particular, a modification of the K-means genetic algorithm is proposed and implemented.</p>
      </abstract>
      <kwd-group>
        <kwd>information retrieval</kwd>
        <kwd>clustering</kwd>
        <kwd>genetic algorithm</kwd>
        <kwd>management task</kwd>
        <kwd>semantic analysis</kwd>
        <kwd>clustering analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In the modern world, information is one of the most important resources, and
information retrieval is an integral part of the life of billions of people. However, the amount
of information stored in digital format is so large that the use of classical search
methods can be quite time-consuming. The part of managers' work can be an example:
searching for clients using the Internet, navigation through an unstructured document
base.</p>
      <p>An information search began to form as a problem in the 19th century. At the same
time, despite modern technical support, the problem has not yet been completely
resolved, especially when it comes to the need for large amounts of information. The
presence of the problem in itself indicates the relevance of research in this area.
*</p>
      <p>
        Today, the problems of processing, storage, and use of information are being solved
at the state level, as evidenced by the adoption and implementation of the national
program “Digital Economy of the Russian Federation” [
        <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
        ]. The program is aimed at
implementing a comprehensive digital transformation of the economy and social sphere
of Russia, in connection with which the volumes of processed data will increase many
times. This thesis formulates another argument in support of the relevance of the work.
      </p>
      <p>Thus, the object of research is information retrieval. One of the subtasks of
information retrieval is the problem of clustering a collection of text documents, which acts
as the subject of research.</p>
      <p>The purpose of the work is to build a system that allows us to simplify the search on
the Internet by grouping the results of search results using the genetic algorithm.</p>
      <p>The theoretical value of the work lies in the synthesis of a modified genetic algorithm
in which the mutation operator is based on the K-means method. The study found that
there is only one freely distributed system that works similarly - "Yippy" at the moment.
However, it does not always work correctly with the Russian language. The study is
aimed at building a Russian-language system, which is the practical value.</p>
      <p>
        An analysis of the literature showed that there are many data clustering algorithms
[
        <xref ref-type="bibr" rid="ref1 ref12 ref15 ref4 ref7 ref8">1, 4, 7, 8, 12, 15</xref>
        ] at the moment, but each of them has its drawback (the need to specify
the number of clusters, the complexity/quality ratio, non-deterministic reaction to
various data topology). According to the authors, the genetic algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as a heuristic
method, will help smooth out some of them. A hybrid of the genetic algorithm and the
K-means method was chosen as the basis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        One of the modern directions of information retrieval is the clustering of text
documents, the purpose of which is to automatically split a collection of documents into
semantically similar groups [
        <xref ref-type="bibr" rid="ref13 ref7">7, 13</xref>
        ]. Unlike classification, no signs of these groups are
known in advance. On the other hand, clustering, also known as the task of cluster
analysis belongs to the class of unsupervised learning problems.
      </p>
      <p>
        Using the methods of cluster analysis, it is possible to solve problems such as
building typologies or classifications, investigating data dependencies for grouping by
common features, verifying the truth of statements regarding selected groups in the data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        There are various typologies of clustering methods. It is possible to distinguish
algorithms that take a characteristic description of objects, a similarity matrix, or a matrix
of distances between objects by the type of input data. Regarding the methods used,
there are algorithms based on a probabilistic approach (K-means, EM-algorithm,
FOREL family, discriminant analysis), using artificial intelligence (neural networks,
genetic algorithm, fuzzy clustering of C-means) as a basis or logical approach (decision
trees). Hierarchical and graph-theoretic approaches are also known [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Possible goals of clustering include problems of data compression, detection of
atypical objects or outliers, and understanding of data by highlighting the cluster structure.</p>
      <p>Thus, since almost all known methods work with some formal structures, it is
necessary to redefine the concept of text. By the term “text” we will mean the correct
structures of the natural language, which are unambiguously understood in the context.
The text, drawn up as a completed sequence of sentences, will be called a document.
In our understanding, a class of documents may include, for example, an advertisement,
a company website, etc.</p>
      <p>We clarify the objectives of the study. It is required to filter out the search results
based on the user's request, taking into account the semantics of the original phrase.
For example, for the query “Building stores in N city”, various search engines will offer
resources such as directories, online stores, maps, and promotional offers. Since the
purpose of the work is to facilitate the work of various kinds of managers, it is necessary
to clear the issuance of irrelevant requests (advertising, maps, directories).</p>
      <p>To solve this issue, we formulate the main stages of the solution. These include:
obtaining multiple search results in various systems, conducting clustering, and
automating the determination of the most suitable clusters.</p>
      <p>
        The paper suggests using machine learning methods without a teacher, in particular,
a genetic algorithm for clustering text documents. The classical genetic algorithm (GA)
is a powerful optimization tool, therefore, it is necessary to present the clustering
problem in the form of a global optimum search for some objective function. This method
consists of the iterative use of genetic operators. The initial population - the set of
proposed solutions is formed as an initial approximation. Like the Monte Carlo method,
the original population is formed randomly. Genetic operators are selection,
crossbreeding, and mutation operators. The process stops when some stopping criterion is
met [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ].
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        Since the basis of information retrieval by the statement of the problem is the clustering
algorithm for text documents, we consider the key approaches to solving the clustering
problem [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ].
      </p>
      <p>
        We note the main criteria for assessing the suitability of methods for the problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
From the end-user, the ratio of speed and accuracy should be noted first. These
parameters are competing quantities. The ideal option is the ability to choose the ratio of
speed and accuracy. The question of “intersectivity” may also arise - the possibility of
getting one document in different categories on similar topics. An important condition
is the amount of preliminary information. The fewer input parameters you need for
clustering, the better. For example, the need to indicate the number of clusters.
      </p>
      <p>On the other hand, it is necessary to take into account the features of the
implementation of the algorithms. We will pay attention to the possibility of using input data of
various types and the need for training algorithms.</p>
      <p>
        It is advisable to give a retrospective of methods such as CustomSearchFolders,
LSA / LSI, STC, K-means [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref9">9, 12, 13, 14</xref>
        ].
      </p>
      <p>Before describing these methods, their advantages and disadvantages, we define
some concepts. The differences between classification and clustering need to be
understood. Classification is the assignment of each object to a class with previously known
characteristics obtained at the training stage, moreover, the number of classes is strictly
limited. Clustering is splitting multiple documents into clusters - some subsets of the
original set of objects, the number, and properties of which are not known in advance.
Of the above algorithms, STC splits a collection of documents into an indefinite number
of clusters, the rest require setting the number of clusters.</p>
      <p>The next property, according to which we will distinguish between algorithms, is the
type of text characteristics used. Methods can be numerical or non-numerical. The
former uses the numerical characteristics of documents (for example, the adjacency
matrix or measure tf-idf), and the latter use directly the words and phrases of the text.
Of the methods considered, STC is non-numerical, the rest are numerical.</p>
      <p>
        For further discussion, we introduce several definitions. We will call meaningful
words, that is, words that directly affect which cluster the document will be assigned
to, terms. And lexical tokens that do not affect semantics will not be terms. Each term
is an elementary sign, and all of them together form a space of terms. A set of
documents (in a term space) is a set of points or vectors of a given space. The coordinates
of the point are the degrees of the significance of each term for a particular document
[
        <xref ref-type="bibr" rid="ref15 ref2">2, 15</xref>
        ]. Here are some ways to calculate the significance of terms. These are metrics
such as:
      </p>
      <p>Binary (1 indicates that the term appears in the document, 0 - otherwise);
TF (TermFrequency) is the number of occurrences of the term in the document;
TF-IDF (TermFrequency – InversedDocumentFrequency) is an integral
characteristic, which can be computed as follows:
tf(t, d) = nt ,</p>
      <p>∑k nk
where nt is the number of occurrences of term t in document d, and the denominator
represents the total number of words in the document;
| |
ⅈdf(t, D) = ln ,</p>
      <p>|{di ∈ D|t ∈ di}|
where | | is the number of documents in the collection;
|{di ∈ D|t ∈ di}| is the number of documents from the collection D that contain t (if
nt ≠ 0),</p>
      <p>− ⅈ ( ,  ,  ) =  ( ,  ) ⋅ ⅈ ( ,  ).</p>
      <p>It is worth noting that with this method of calculating measures, words with a high
frequency within a specific document and with a low frequency of use in others will
gain a lot of weight. The coordinates of the documents are written to the tf-idf matrix.</p>
      <p>
        All the methods except STC work with tf-idf or proximity matrices. By the
proximity of documents, we mean the value of the semantic similarity of two documents,
which is calculated like the Euclidean distance between points or the cosine of the angle
between vectors. All proximity values are placed in a triangular proximity matrix [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>We also introduce the concept of the centroid of a cluster which is a vector that is
calculated as the arithmetic mean of the vectors of all documents in the cluster.</p>
      <p>Currently, several methods are most often used to solve this problem:
CustomSearchFolders, LSA / LSI, SuffixTreeClustering, and K-means.</p>
      <p>The idea behind the CustomSearchFolders method is to narrow down your search
results to folders. By selecting one of the directories, the user gradually narrows the
scope of the search. In this method, directories are centroids of clusters. Custom search
folders technology is implemented in the NothernLight search server. The system does
a good job of finding information on a common topic but works exclusively with
English.</p>
      <p>The LSA / LSI method has long been known as a way to search for latent connections
and is used in various fields of science. It is based on the principles of factor analysis
and can help identify the latent structure of phenomena or objects. The advantages of
LSA / LSI include unnecessary training. The disadvantages are significant
computational complexity and, in the general case, the absence of the names of the main factors,
i.e. the names of the clusters.</p>
      <p>The SuffixTreeClustering method was developed primarily for finding a substring.
These structures consist of vertices, branches, and suffix pointers - special pointers that
allow you to search in O(n) time and with the same memory usage. A letter or letter
combination is assigned to each branch. To get a suffix in a tree node, you need to
combine the contents of all branches from the root to this node. The advantages of the
SuffixTreeClustering method are high speed, interpretable results, and no training
needs. The weak points of the approach include vulnerability to homonymy and the
need for multiple word processing.</p>
      <p>
        To date, K-means is the most popular algorithm [
        <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
        ]. It is based on the sequential
stabilization of centroid clusters. The method consists of several steps: the selection of
initial centroids, distribution of all documents in clusters depending on the nearest
centroid, recalculation of cluster centroids according to the new partition. The algorithm
requires the time of the order On, where n is the number of documents. This is the main
advantage of the K-means method. Also, the algorithm does not need training and is
quite universal. The disadvantage is the need to specify the number of clusters.
      </p>
      <p>Thus, it is possible to formulate the basic requirements for the clustering algorithm
for search problems in the interests of solving business problems:
─ difficulty no higher than O(n);
─ no need to predetermine the number of clusters;
─ ability to work without prior training;
─ interpretability of results.</p>
      <p>
        It has been shown [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that the problem cannot be solved in polynomial time; therefore,
the application of classical algorithms is not advisable. At the same time, the
complexity of the problem allows you to resort to machine learning methods. In this paper, it
is proposed to use a genetic algorithm to solve the clustering problem in the above
formulation.
      </p>
      <p>
        A genetic algorithm is a heuristic search algorithm that is effectively used to solve
optimization and modeling problems by randomly selecting, combining, and varying
the desired parameters using mechanisms similar to natural selection in nature. This
method was proposed by J. Holland [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as a powerful optimization tool. The genetic
algorithm belongs to the class of machine learning methods without a teacher.
      </p>
      <p>To apply the genetic algorithm, the task must be set so that it is possible to present
the solution in the form of a vector of genes - a genotype. The classic genetic algorithm
usually works with fixed length genotypes.</p>
      <p>There are various methods for constructing the initial population (several solutions).
These include the so-called blanket strategy, shotgun strategy, and focusing. The
“blanket” method is the formation of a population that contains all possible solutions. To use
the shotgun strategy, it is necessary to consider a sufficiently large random subset of
solutions. The focus is on varying one of the most likely solutions.</p>
      <p>
        The degree of fitness of each genotype also called an individual, is assessed using a
fitness function. This mechanism shows how well the object described by the genotype
solves the proposed problem. Thus, the genetic algorithm is aimed at optimizing the
fitness function (target function) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Further, the population transforms using genetic operators. First of all, the selection
operator is used to select the most adapted individuals. There are various variations,
such as roulette selection, tournament selection, ranking. To use the roulette method, it
is necessary to build the probability distribution of the choice of a particular individual
for selection. The ratio of the fitness function of the selected individual to the total
value of the fitness function for the entire population is usually used.</p>
      <p>The next step in the classical genetic algorithm is the use of the mutation operator.
The idea behind this step is to prevent the algorithm from converging to a local
optimum. By analogy with the animal world, the probability of mutation is usually quite
low. The most common variant of the described operator is a variation of a random
gene of an individual. For example, the inverse of a random bit in binary coding.</p>
      <p>The final step is to check the stopping criteria. As such a condition, you can choose,
for example, the number of iterations, or generations. If any information about the
object under study is known, a working option is to compare the fitness function with
some preliminary assessment.</p>
      <p>The most difficult part of the genetic algorithm in terms of the amount of
computation is finding a fitness function. However, taking into account the fact of independence
of the calculation of the fitness function on different individuals, it is worth noting that
the use of parallel computing at this stage is rather rational.</p>
      <p>So, since one of the objectives of this work is to optimize the classical genetic
algorithm, let us turn to the consideration of the modifications made and the synthesis of
the algorithm that meets the basic requirements presented above.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Results and discussion</title>
      <p>For a clear understanding of the need to change the classical structure of the genetic
algorithm, we consider the positive and negative aspects of GA and compare it with the
requirements put forward above.</p>
      <p>
        The advantages of GA include the use of a combination of probabilistic and
deterministic approaches, consideration of several points in the search space at once, as well
as robustness and resistance to local optima [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ].
      </p>
      <p>The main disadvantages are the high complexity in the case of using a non-trivial
fitness function and the possibility of the absence of a critically accurate result.</p>
      <p>
        It is easy to see compliance with the requirements for the clustering algorithm
specified above. Among the considered algorithms, the On complexity has the K-means
algorithm, therefore, to reduce the complexity of the genetic algorithm, we will build a
hybrid of the classical genetic algorithm and the K-means method - the K-means
genetic algorithm (GCA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Let {xi, ⅈ = 1, 2, … , n} be the set of objects and xij the j-th feature of the object xi.
For ⅈ = 1,2, … , n and k = 1, 2, … , K we define
wik = {
1, ⅈf the objectⅈ belongsto the clusterk;</p>
      <p>0, otherwⅈse.</p>
      <p>Thus, the matrix W = ||wik|| has the following property:</p>
      <p>Let ck = (ck1, ck2, … , ckd) be the centroid of the k-th cluster (d is the dimension of
space), and</p>
      <p>Next, we introduce such concepts as an intracluster distance (3) and cumulative
intracluster distance (4):
wik ∈ {0,1} и ∑kK=1 wik = 1
сkj =
∑in=1 wikxij
∑in=1 wik</p>
      <p>2
S(k)(W) = ∑in=1 wik ∑jd=1(xij − ckj) ,</p>
      <p>S(W) = ∑kK=1 S(k)(W).</p>
      <p>W∗ = argmⅈnS(W).</p>
      <p>W
(1)
(2)
(3)
(4)
(5)</p>
      <p>
        The value (4) is also known as the quadratic error. According to the construction,
the main task is to find the matrix W∗ = ||wi∗k|| that minimizes S(W), i.e.
The algorithm does not guarantee convergence to the global optimum. The result may
depend on the initial clusters. As the algorithm is usually fast, it is common to run it
multiple times with different starting conditions. However, worst-case performance can
be slow: in particular certain point sets, even in two dimensions, converge in
exponential time, which is 2Ω(n).[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] These point sets do not seem to arise in practice: this is
corroborated by the fact that the smoothed running time of k-means is polynomial.[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
      </p>
      <p>The "assignment" step is referred to as the "expectation step", while the "update step"
is a maximization step, making this algorithm a variant of the generalized
expectationmaximization algorithm. This choice of measure is because the K-means algorithm is
the most popular method of minimizing exactly the quadratic error.</p>
      <p>Now we will consider a method of coding objects, the formation of an initial
population, and genetic operators for GCA.</p>
      <p>Coding. In our case, the search space is all matrices W meeting the condition (1).
We will use the K-ary code, that is, a representation in the form of a string sW of length
n containing numbers from the set {1,2, ..., K}. According to the construction, each
character of the string assigns a cluster label to the object xi. Such a code is uniquely
decodable thanks to (1).</p>
      <p>Initialization. It is proposed that the initial population of SW is built randomly. That
is, for each individual, each gene is randomly selected from {1,2, ..., K}. However, it
is necessary to consider the correctness of the received lines.</p>
      <p>For example, the code "11111222233333" for K = 4, that is, the cluster with the label
"4" remained empty. In this case, it is necessary to correct invalid lines - replace
random characters with missing cluster labels.</p>
      <p>Selection. For selection, we will use the roulette wheel strategy. Formally, the
probability distribution is as follows:</p>
      <p>P(si) =</p>
      <p>F(si)
∑jN=1 F(sj)
,
where F(si) is the value of the fitness function on the individual si.</p>
      <p>Since the task is to minimize S (W), and the implementation of the roulette method
involves maximizing the objective function, we define some auxiliary functions.</p>
      <p>
        Let f(sW) = −S(W), g(sW) = f(sW) − (f̅ − c ⋅ σ), where f̅ and σ are the mean and
standard deviation of f(sW) in the current population, respectively, and c ∈ [
        <xref ref-type="bibr" rid="ref1 ref3">1,3</xref>
        ] is a
constant. Thus, the measure of the fitness of an individual sW is expressed as
F(sW) = {
g(sw), ⅈf g(sw) ≥ 0,
      </p>
      <p>0, otherwⅈse.</p>
      <p>Mutation. In the general case, a mutation is an exclusively stochastic process,
however, given the features of the problem, it is possible to increase the degree of
convergence of the algorithm by adding a certain amount of determinism. Based on this idea,
we define the mutation operator so that the probability of assigning a label to a specific
cluster gene is the higher, the closer the object described by this gene to the centroid of
the cluster is:
(6)
(7)
(8)
pj = P{sW(ⅈ) = j} =</p>
      <p>cmdmax−dj ,
∑iK=1(cmdmax−di)
where, dj = d(xi, cj) is the Euclidean distance from the object xi to the centroid of
the j-го кластера, dmax = maxdj, and cm ≥ 1 is a constant, the purpose of which we
j
will consider later</p>
      <p>During the operation of the algorithm, situations may arise when a cluster consists
of one and only one object. In such cases, there is a non-zero probability that the
mutation method described above will reassign the cluster label to this object, and the old
cluster, as a result, will remain empty. Such situations can be quickly recognized by the
distance from the object to the centroid of the cluster. If dSW(i) = 0, then, to avoid the
appearance of empty clusters, the mutation operator cannot be applied to the current
gene.</p>
      <p>
        The K-means Operator. An algorithm using the selection and mutation operators
described above requires more generations than the classical genetic algorithm.
Moreover, a high degree of mutability promotes the acquisition of the oscillatory nature of
the algorithm behavior. To improve the situation, instead of the recombination operator,
it is proposed to use one step of the K-means method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This stage consists of two
steps:
      </p>
      <p>Calculation of cluster centroids for W using (3);</p>
      <p>Overriding the ownership of the object to the cluster by assigning the object the label
of the cluster the centroid of which is closest. As a result, the matrix W̃ is formed.</p>
      <p>However, due to the simplicity of the K-means operator, empty clusters can be
created. Let us take the cluster with the maximum intracluster distance and assign the
object farthest from the centroid to the empty cluster, thus solving the problem.</p>
      <p>Stop criterion. Empirically, it was found that 8 to 15 generations are needed for the
convergence of the constructed algorithm. Therefore, it is proposed to use the number
of generations that have passed since the start of the algorithm as a stopping criterion.
This will not reduce the quality of clustering at a critical scale but will reduce the
required number of calculations.</p>
      <p>So, the model of the classical genetic algorithm is considered and its modification is
proposed for the maximum approximation to the properties of the ideal clustering
algorithm.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>To implement the constructed system, the Python 3.7 programming language and the
PyQt framework for developing a graphical user interface have been used.</p>
      <p>The program was developed in two stages: the first solved the problem of collecting
and preprocessing data, the second stage involved the construction of an algorithm for
clustering a collection of text documents, and its integration into a graphical interface.</p>
      <p>It is worth noting that the technologies used are aimed at implementing a
cross-platform program.</p>
      <p>The first part of the data acquisition phase is to receive the text of the search query
from the user. After that, links are formed for search services.</p>
      <p>Figure 1 shows the link headers for obtaining search results on Google, Yandex,
Bing, and Yahoo. A full-fledged working link consists of both the title and, in fact, the
text of the search.</p>
      <p>To insert a search into a link, all “white spaces” must be replaced with a “+” sign.
At this stage, such a problem as the requirement of the search engine to confirm that
the request was entered by a person using the CAPTCHA service (Completely
Automated Public Turing test to tell Computers and Humans Apart) may occur. One of the
possible solutions may be to use browser emulation tools, which is implemented using
a separate framework (Phantom.js, Casper.js). As an alternative, you can inform the
search engine of the user agent string, which includes the type and version of the
browser, the type, version, and language of the operating system, as well as the type of
user device. This solution works because CAPTCHA is issued by the search service
only in case of a suspicious user agent. In this context, suspicious refers to the user
agent of the robot. Here is an example of such a line: "Mozilla / 5.0 (Windows NT 10.0;
Win64; x64) AppleWebKit / 537.36 (KHTML, like gecko) Chrome / 73.0.3683.103
Safari / 537.36." The key information is that Chrome version 73 and the Windows 10
operating system are used.</p>
      <p>Another obstacle to receiving search results may be that requests are sent from the
same IP address. You can change the address using a proxy server. That is, sending a
request through an intermediate network node.</p>
      <p>For smooth operation, you need a set of user-agent strings for several proxy servers.
To get links, you need to analyze the layout of pages with search results. For sending
HTTP searches, the search package was used, and for working with markup, the
Beautiful Soup 4 module was used.</p>
      <p>Next, you need to get the text of the websites located at the links found. The
uploaded documents will be a collection; however, it is still necessary to do the cleaning
and preprocessing of the received data.</p>
      <p>First of all, you need to clear the text from HTML tags. A feature of the procedure
is that tags can be paired and unpaired. Paired ones include, for example, body, i, ul,
ol tags, and unpaired (single) tags include meta, br, link, source.</p>
      <p>During clustering, the syntactic structure of sentences is not taken into account (only
their vocabulary is important). The next step is to remove all punctuation marks,
translate all letters into lower case and, in general, transform the text into a set of words.</p>
      <p>Note that the conversion of text into a matrix of feature objects will be based on the
number of unique words in the collection and their occurrence. The quality of such a
measure can be interfered with by finding the same word in different forms, that is, in
different cases, tenses. Thus, it is necessary to normalize the text.</p>
      <p>There are two most popular ways to normalize a text. The first of them is the
statement of each word in a dictionary form: nominative, singular, masculine, for verbs
the infinitive. The second method, stemming, consists of highlighting the basis of each
word, for example, the word "building" will be replaced by "build".</p>
      <p>An important feature of the described approaches to normalizing text is the use of a
pre-marked set of texts - the corpus. The work uses the case for the Russian language
from the Natural LanguageToolkit (nltk) tool and the SnowBall stemmizer, which is
also distributed with the nltk package.</p>
      <p>Also, noise is added to the frequency characteristics, the so-called stop words - the
words that do not carry a certain semantic load themselves, that is, various prepositions,
conjunctions, interjections, particles, free-standing numbers. The work uses a set of
stop words included in the nltk tool.</p>
      <p>
        Since there is no dependence on the data during the implementation of these
procedures, the optimal solution will be to resort to parallel computing methods. So, it is
possible to distribute the loading of documents and text preprocessing between the
processor cores of the machine [
        <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
        ].
      </p>
      <p>The final step in the data collection phase is to build a matrix of feature objects. This
is done using the TF-IDF statistical measure:
tf(t, d) =</p>
      <p>nt ,
∑k nk
where nt is the number of occurrences of the term t in document d, and the
denominator is the total number of words in the document;
ⅈdf(t, D) = ln
|D|</p>
      <p>,
|{di ∈ D|t ∈ di}|
where |D| is the number of documents in the collection, |{di ∈ D|t ∈ di}| is the number
of documents from the collection D in which t occurs (for nt ≠ 0),</p>
      <p>tf − ⅈdf(t, d, D) = tf(t, d) ⋅ ⅈdf(t, D).</p>
      <p>It is worth noting that with this method of calculating measures, words with a high
frequency within a particular document and with a low frequency of use in others will
gain a lot of weight. The coordinates of the documents are recorded in TF-IDF matrix.
Consider the next phase of the program. As noted above, it consists primarily of
clustering a collection of documents, which is represented by a matrix of feature objects.</p>
      <p>According to the description of the K-means genetic algorithm, the search space is
all matrices W satisfying condition (1). It was also indicated that a string consisting of
cluster labels is used for encoding.</p>
      <p>Let us consider the selection operator, although this requires a fitness function. As
already mentioned, the total intracluster distance was chosen as the basis of the fitness
function. Moreover, it was noted that this stage is the most difficult for computing, so
the use of parallel computing technologies will be a rational approach.</p>
      <p>The two most common parallel computing models are the use of multiple threads
and several separate processes. Since the Python 3 programming language was chosen
for implementation, it would be reasonable to consider the peculiarity of the interpreter
when executing parallel programs.</p>
      <p>The classic Python 3 interpreter, CPython, has a mechanism called Global Interpreter
Lock (GIL). GIL is a synchronization method, which is the easiest cure for conflicts
while simultaneously accessing different threads to the same memory locations. When
one thread captures an area of memory, the GIL blocks the rest. The lock itself occurs
according to the mutex principle. Thus, when using streams, there is a rather strong
restriction on the parallelism of calculations, so we will use streams using the Pool
object of the multiprocessing package of Python 3. This tool allows you to execute a
higher-order map function in parallel.</p>
      <p>After finding the fitness function, you can begin to implement the selection operator.
The selection is based on the roulette method.</p>
      <p>Implementation of the mutation operator. This step of the algorithm was modified
so that in each mutating individual, each gene changes depending on the distance
between the object corresponding to the gene and the nearest centroid of the cluster. This
increases the number of calculations. It was decided to carry out the mutation using
several processes. Let dsW(i) be the distance from the i-th object (xi) to the centroid of
the cluster sW(ⅈ), where sW is the decisive line (individual of the population). Then the
line is correct if dsW(i) &gt; 0 and, therefore, the mutation operator is used. Generally
speaking, incorrect lines can occur at every step of the algorithm.</p>
      <p>Now we will consider the K-means operator. It consists of two steps: finding cluster
centroids and reassigning object labels if the centroid of the current cluster is not the
closest to the selected point. At this stage, invalid lines occur most frequently.
Moreover, the K-means operator is final in the iteration of the calculations. Therefore, we will
correct incorrect lines after applying the operator. To do this, we define the set of
clusters that remain empty and assign the objects from the cluster with the largest
intracluster distance to the labels of empty clusters.</p>
      <p>
        Let us evaluate the results of clustering. An experiment was conducted to compare
the convergence rate of classical GA and GCA. As a result, the following dependence
of the quadratic error on the number of iterations was established (Fig. 2) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The accuracy of clustering was directly tested on two well-known data sets: Fisher
irises and 20 newsgroups. The Fisher Iris dataset consists of 150 objects. Each object
is described by four attributes: sepal length, sepal width, and petal length, petal width
(sizes of sepals and petals, respectively). The data are quite simple to visualize,
therefore, as a result of the experiment, the following scattering diagrams were constructed
(Fig. 3) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The 20 newsgroups set contains approximately 20,000 documents, divided into 20
classes. The set contains texts on science (cryptography, electronics, medicine, space),
information technologies (operating systems, graphics, machine components), politics,
religion, etc. For the algorithm test, 3758 documents were selected from 4 categories:
atheism, hockey, computer graphics, and space.
      </p>
      <p>As a result of the experiment, a V-measure value of 0.792, which satisfies the
algorithm requirements described above, was obtained.</p>
      <p>The interface of the constructed system consists of a window where there is a field
for entering a search, the names of groups — the results of clustering, and, directly,
links — the results of searches on the Internet. Using the window menu bar, you can
set the parameters (number of clusters, probability of mutation, size of the initial
population), save the search results in JSON format.</p>
      <p>Thus, the procedure of constructing a software implementation of the proposed
Valgorithm and integrating it into a single system is considered. The program has a
graphical interface, easy to configure and use. The results of clustering are evaluated and
comply with the requirements of the investigated problem.</p>
      <p>As the most important task for further research in this direction, it is advisable to
propose a methodology for the formation of the optimal number of clusters depending
on the specific business task being solved.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In the process of performing applied scientific research, the goals and objectives of the
research were identified, a review of the scientific and technical literature on this topic
was carried out. The problem of information retrieval was formulated, and possible
solutions were considered.</p>
      <p>A review and comparative analysis of the existing methods of clustering a collection
of text documents was carried out.</p>
      <p>The K-means genetic algorithm was modified to cluster the collection of text
documents, and a system that allowed us to group search results on the Internet-based on the
content of search results was implemented.</p>
      <p>The program is written in Python 3 using the PyQt 5 framework and libraries for
mathematical processing of matrices, using the parallel computing model, working with
the network, and the natural language. At the beginning of the session, the user is
prompted with a dialog box that contains a string for entering a search query, a category
of results, and, directly, the results of a search on the Internet in the form of a list of
web page names.</p>
      <p>Thus, an information retrieval system that meets the goals and objectives of the study
has been built.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>The reported study was funded by RFBR, projects number 19-29-06081.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ya</surname>
            .
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bondarev</surname>
          </string-name>
          .
          <article-title>Clustering of Text Documents Based on a Genetic Algorithm // Materials of the XXVI International scientific conference of students, graduate students and young scientists "Lomonosov - 2019"</article-title>
          . Sevastopol:
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Gladkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.V.</given-names>
            <surname>Kureichik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.M.</given-names>
            <surname>Kureichik</surname>
          </string-name>
          .
          <source>Genetic Algorithms. Moscow, LLC Publishing Company "Physics and Mathematics"</source>
          ,
          <year>2010</year>
          . 366 p.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Зенкина</surname>
            <given-names>О</given-names>
          </string-name>
          .Н.,
          <string-name>
            <given-names>O.N.</given-names>
            <surname>Zenkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.A.</given-names>
            <surname>Simonov</surname>
          </string-name>
          .
          <article-title>The Use of Genetic Algorithms in the Optimization of Information Processes // Actual problems of the hydrolytosphere (diagnostics, prognosis, management, optimization</article-title>
          and automation).
          <source>Collection of reports</source>
          .
          <year>2015</year>
          . pp.
          <fpage>315</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.B.</given-names>
            <surname>Kartiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.M.</given-names>
            <surname>Kureichik</surname>
          </string-name>
          .
          <article-title>Development and Research of an Algorithm for Solving the Clustering Problem for the Implementation of Question-Answer Search in the InformationAnalytical Forecasting System // Izvestiya SFU</article-title>
          .
          <source>Technical science</source>
          .
          <year>2016</year>
          . No.
          <volume>7</volume>
          (
          <issue>180</issue>
          ). pp.
          <fpage>18</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.V.</given-names>
            <surname>Semenikhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Denisova</surname>
          </string-name>
          .
          <source>Automation of Information Retrieval Based on Multicriteria Optimization and Genetic Algorithms // Dynamics of systems, mechanisms and machines</source>
          .
          <source>2014. No. 3</source>
          . pp.
          <fpage>224</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.R.</given-names>
            <surname>Timirgaleyeva</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.Yu. Grishin. Digital</surname>
          </string-name>
          <article-title>Technologies in the Organization of Effective Activities of Financial and Credit Institutions // Development of finance, accounting and audit in modern management concepts</article-title>
          .
          <source>Materials of the I International Scientific-Practical Conference</source>
          .
          <year>2018</year>
          . pp.
          <fpage>86</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Chekina</surname>
          </string-name>
          .
          <source>Genetic Clustering of Technical Documentation in the CAD Project Repository // Thirteenth National Conference on Artificial Intelligence with international participation. Conference proceedings. 2012</source>
          . pp.
          <fpage>82</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>I.A.</given-names>
            <surname>Shcherbatov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.O.</given-names>
            <surname>Belyayev</surname>
          </string-name>
          .
          <article-title>The Use of Cluster Analysis for Processing Documents in the Information Retrieval System // Bulletin of the Astrakhan State Technical University</article-title>
          . Series: Management,
          <source>Computing and Informatics</source>
          .
          <source>2012. No. 2</source>
          . pp.
          <fpage>161</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Chen</surname>
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>K.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            <given-names>J.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            <given-names>W.L.</given-names>
          </string-name>
          <string-name>
            <surname>Renewable Power Output Forecasting Using Least-Squares Support</surname>
          </string-name>
          Vector Regression and Google Data // Sustainability.
          <year>2019</year>
          . V.
          <volume>11</volume>
          ,
          <string-name>
            <surname>Iss</surname>
          </string-name>
          .11. Paper # UNSP 3009. DOI:
          <volume>10</volume>
          .3390/su11113009.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Das</surname>
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pratihar D.K. A Directional</surname>
          </string-name>
          <article-title>Crossover (DX) Operator for Real Parameter Optimization Using Genetic Algorithm /</article-title>
          / Applied Intelligence.
          <year>2019</year>
          . V.
          <volume>49</volume>
          ,
          <string-name>
            <surname>Iss</surname>
            . 5,
            <given-names>P.</given-names>
          </string-name>
          <year>1841</year>
          -
          <fpage>1865</fpage>
          . DOI:
          <volume>10</volume>
          .1007/s10489-018-1364-2.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Grishin</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timirgaleeva</surname>
            <given-names>R.</given-names>
          </string-name>
          <article-title>The Digital Economy of the Region: a Distributed Infrastructure of the Industry Ecosystem</article-title>
          // Conference of Open Innovations Association, FRUCT.
          <year>2019</year>
          . V. 24. P.
          <volume>624</volume>
          -
          <fpage>631</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Krishna</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narasimha</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Murty Genetic</surname>
          </string-name>
          K-means
          <source>Algorithm // IEEE Transactions on Systems, Man, and Cybernetics</source>
          .
          <year>1999</year>
          . V. 3
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Luo</surname>
            <given-names>T.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>G.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>N.J.</given-names>
          </string-name>
          <article-title>Research on manufacturing productivity based on improved genetic algorithms under internet information technology // Concurrency and ComputationPractice</article-title>
          &amp; Experience.
          <year>2019</year>
          . V.
          <volume>31</volume>
          ,
          <string-name>
            <surname>Iss</surname>
          </string-name>
          . 10, SI. Paper # e4859. DOI:
          <volume>10</volume>
          .1002/cpe.4859.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ma</surname>
            <given-names>X.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>X.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Q.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            <given-names>Z.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            <given-names>W.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu Z.X. A Survey on Cooperative</surname>
          </string-name>
          Co-Evolutionary Algorithms // IEEE Transactions on Evolutionary Computation.
          <year>2019</year>
          . V.
          <volume>23</volume>
          ,
          <string-name>
            <surname>Iss</surname>
          </string-name>
          . 3. P.
          <volume>421</volume>
          -
          <fpage>441</fpage>
          . DOI:
          <volume>10</volume>
          .1109/TEVC.
          <year>2018</year>
          .
          <volume>2868770</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wang</surname>
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suo</surname>
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>L.X.</given-names>
          </string-name>
          <article-title>DDmap: a MATLAB package for the double digest problem using multiple genetic operators /</article-title>
          / BMC Bioinformatics.
          <year>2019</year>
          . V.
          <volume>20</volume>
          . Paper #348. DOI:
          <volume>10</volume>
          .1186/s12859-019-2862-x.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>