<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Parallel Text Document Clustering Based on Genetic Algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Madina Mansurova</string-name>
          <email>mansurova01@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir Barakhnin</string-name>
          <email>bar@ict.nsc.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanzhar Aubakirov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yerzhan Khibatkhanuly</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aigerim Mussina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Al-Farabi Kazakh National University</institution>
          ,
          <addr-line>Almaty</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computational Technologies SB RAS</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Novosibirsk State University</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>218</fpage>
      <lpage>232</lpage>
      <abstract>
<p>This work describes a parallel implementation of a text document clustering algorithm. The algorithm is based on evaluating the similarity between objects in a competitive situation, which leads to the notion of the function of rival similarity. Attributes of the bibliographic descriptions of scientific articles were chosen as the scales for determining the similarity measure. To find the weighting coefficients used in the formula of the similarity measure, a genetic algorithm was developed. To speed up the algorithm, parallel computing technologies are used. Parallelization is carried out in two stages: in the genetic algorithm, as well as directly in the clustering. The parallel genetic algorithm is implemented with the help of the MPJ Express library, and the parallel clustering algorithm using the Java 8 Streams library. The results of computational experiments showing the benefits of the parallel implementation of the algorithm are presented.</p>
      </abstract>
      <kwd-group>
        <kwd>clustering algorithm</kwd>
        <kwd>genetic algorithm</kwd>
        <kwd>parallel computing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The volume of digital content increases every day. This impedes the selection of the most appropriate material when searching for necessary information. Clustering is one of the instruments that helps users cope with large volumes of information. Clustering is the process of dividing a set of text documents of an electronic database into classes such that the elements united into one class (called a cluster) have a greater similarity to each other than elements belonging to different classes. Text document clustering is resource-intensive, and the problem becomes more complicated as the volume of the processed data grows. To solve this problem, researchers apply various parallel computing technologies.</p>
<p>The aim of this work is the development of a parallel FRiS-Tax algorithm for clustering scientific articles. The measure of rival similarity was taken as the proximity measure for clustering. To automate the search for the most appropriate weighting coefficients in the formula of the similarity measure, a genetic algorithm was developed.</p>
<p>The paper is organized as follows. The relevance of the research is substantiated in Section 1. Section 2 describes the problem of text document clustering and the accepted proximity measure, and also presents the FRiS-Tax clustering algorithm. Section 3 describes a genetic algorithm for finding the most appropriate weighting coefficients in the formula of the similarity measure. Section 4 presents parallel versions of the genetic and clustering algorithms. Then the results of computational experiments and the analysis of the obtained data are given. The conclusion summarizes the results of the performed work.</p>
    </sec>
    <sec id="sec-2">
      <title>Clustering algorithm with competitive similarity function</title>
<p>
        This work deals with the problem of clustering publications from bibliographic databases, which allows automating the choice of publications for a particular researcher or a group of researchers working together. In the clustering problem, each cluster is described with the help of one or several identifiers called centroids; these are centers of gravity or central objects of clusters. The FRiS-Tax algorithm ([
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]) is chosen as the clustering algorithm. A comparison of the FRiS-Tax algorithm with existing analogs is presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the results of FRiS-Tax exceed those of its competitors. Experiments with the FRiS-Tax algorithm showed its high efficiency in solving the clustering problem and demonstrated the usefulness of rival similarity functions in various data analysis problems. To measure similarity, we propose to take the attributes of the bibliographic descriptions of documents as scales.
      </p>
<p>
        Let D = {d_1, d_2, ..., d_n} be a set of documents. A similarity measure m on the set D is defined as a function

        m : D × D → [0, 1],   (1)

        which takes the value 1 in the case of complete similarity and 0 in the case of complete difference. The similarity measure is calculated by a formula of the form

        m(d_1, d_2) = Σ_{i=1}^{k} λ_i m_i(d_1, d_2),   (2)

        where i is the index of the element (attribute) of the bibliographic description, λ_i are weighting coefficients, m_i(d_1, d_2) is the measure of similarity by the i-th element (in other words, by the i-th scale), and k is the number of considered attributes.
      </p>
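<p>Formula (2) is a plain weighted sum over per-attribute similarity scores. A minimal sketch in Java (the paper's implementation language), with illustrative scores and weights (in the paper the weights are chosen by the genetic algorithm), could look like this:</p>

```java
// Sketch of the similarity measure (2): m(d1, d2) = sum_i lambda_i * m_i(d1, d2).
// The per-attribute scores m_i and the weights lambda_i are illustrative.
public class SimilarityMeasure {
    // lambda[i] is the weight of the i-th bibliographic attribute,
    // scores[i] is the per-attribute similarity m_i(d1, d2) in [0, 1]
    public static double similarity(double[] lambda, double[] scores) {
        double m = 0.0;
        for (int i = 0; i < lambda.length; i++) {
            m += lambda[i] * scores[i];
        }
        return m;
    }

    public static void main(String[] args) {
        // two attributes, e.g. authors and keywords, with hypothetical weights 2 and 1
        double[] lambda = {2.0, 1.0};
        double[] scores = {0.5, 1.0};
        System.out.println(similarity(lambda, scores)); // prints 2.0
    }
}
```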
<p>
        The measure of rival similarity is introduced as follows. Given the absolute values of similarity m(a, b) and m(a, c) between objects, the rival similarity of object a with object b in competition with object c is calculated by the formula

        F_{b/c}(a) = (m(a, b) − m(a, c)) / (m(a, b) + m(a, c)),   (3)

        where F is called the function of rival similarity, or FRiS-function. The values of F vary within the range from +1 to −1. The function F agrees well with the mechanism of perceiving similarity and difference that a person uses when comparing a certain object with two other objects.
      </p>
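<p>The FRiS-function (3) is a one-line computation; a minimal Java sketch with illustrative similarity values:</p>

```java
// Sketch of the FRiS-function (3): rival similarity of object a with b
// in competition with c, computed from the absolute similarities.
public class Fris {
    // mAB = m(a, b), mAC = m(a, c); the result lies in [-1, +1]
    public static double fris(double mAB, double mAC) {
        return (mAB - mAC) / (mAB + mAC);
    }

    public static void main(String[] args) {
        System.out.println(fris(0.75, 0.25)); // prints 0.5  (a is closer to b than to c)
        System.out.println(fris(0.25, 0.75)); // prints -0.5 (a is closer to c)
    }
}
```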
<p>
        Similarity between an object and a cluster is defined by the same principle. To evaluate the rival similarity of object a with the first cluster, the absolute similarity m(a, S_1) of a with this cluster and the similarity m(a, S_2) with the cluster-competitor are taken into account. As the value of similarity of object a with a cluster we use the similarity of a with the nearest or typical representative of that cluster. In this case, the value of the rival similarity is calculated by the formula

        F_{1/2}(a) = (m(a, S_1) − m(a, S_2)) / (m(a, S_1) + m(a, S_2)).
      </p>
<p>
        The clustering method can be described as follows. Let a set A of sampled objects be given. The similarity of objects united into a cluster is taken as the rival similarity with the central object of the cluster. Such objects are called pillars of clusters [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The peculiarity of the problem of dividing a set into clusters is that at the initial stage the assignment of the sampled objects to one cluster or another is unknown: all the objects of the set A may belong to a single cluster.
      </p>
<p>
        If we fix a set of pillars S = {s_1, s_2, ..., s_k}, then for each object a ∈ A it is possible to find the similarity m(a, s_1) of the object with the nearest pillar from the set S. But the absence of a cluster-competitor does not allow determining the similarity of the object to the nearest pillar of the cluster-competitor. In this regard, in the first stage a virtual cluster-competitor is introduced, whose pillar is placed at a fixed distance m* from every object of the sampling. Then the value of rival similarity of object a with its nearest pillar s_1 from S in comparison with the virtual competitor is written as

        F_{s_1/*}(a) = (m(a, s_1) − m*) / (m(a, s_1) + m*).
      </p>
<p>
        The number of pillars in the clustering problem is chosen in such a way that the value of competitive similarity of each sampled object a with its nearest pillar from S is maximal:

        F(S) = Σ_{a ∈ A} F_{s_1/*}(a) → max.   (4)
      </p>
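<p>The quality criterion for a fixed pillar set can be sketched as follows; the similarity matrix and the virtual-competitor value m* are illustrative, not the paper's actual data:</p>

```java
// Sketch of the quality criterion (4): F(S) = sum over all sampled objects
// of the rival similarity of each object with its nearest pillar against a
// virtual competitor placed at the fixed similarity mStar. sim[a][s] is an
// illustrative precomputed similarity m(a, s) between objects and pillars.
public class PillarCriterion {
    public static double criterion(double[][] sim, double mStar) {
        double total = 0.0;
        for (double[] row : sim) {
            // similarity of object a with its nearest pillar from S
            double best = 0.0;
            for (double v : row) best = Math.max(best, v);
            total += (best - mStar) / (best + mStar);
        }
        return total;
    }

    public static void main(String[] args) {
        // two objects, two candidate pillars, virtual competitor at mStar = 0.3
        double[][] sim = { {0.9, 0.1}, {0.2, 0.8} };
        System.out.println(criterion(sim, 0.3));
    }
}
```

<p>The pillar set S maximizing this total would then be preferred, as required by criterion (4).</p>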
<p>The proposed algorithm chooses the number of clusters automatically. The user only specifies the limit K on the number of clusters, among which the best variant of clustering is sought. The algorithm successively solves the clustering problem for all values k = 1, 2, ..., K in order to choose the best of the solutions (Fig. 1).</p>
<p>
        The advantages of using the FRiS-Tax algorithm are shown in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Firstly, the use of FRiS-compactness as a criterion of the informativeness of features at random distributions of images showed a significant advantage in comparison with the widely used criterion of the minimum of errors when recognizing the test sampling by the Cross Validation or Leave One Out methods. Secondly, at normal distributions, the FRiS-algorithm first chooses the pillars located in the area of the mathematical expectation, and, if the distributions are polymodal and the images are linearly inseparable, the pillars will be in the centres of the modes. In the process of recognition, the decision is made in favor of the image whose pillar is most similar to the control object, and the value of the function of similarity of the object with the chosen image allows one to judge the reliability of the decision taken. And finally, the use of the FRiS-function for the solution (at the international contest Data Mining Cup 2009) of the problem of predicting the values of variables measured in an absolute scale allowed its creator to take 4th place among 321 teams. Thus, the efficiency of using the FRiS-function in algorithms for predicting quantitative variables is demonstrated.
      </p>
    </sec>
    <sec id="sec-3">
<title>A genetic algorithm for adjustment of coefficients in the formula of similarity measure</title>
<p>
        In this work, we have chosen:
– the year of issue;
– the UDC code;
– key words;
– authors;
– series;
– annotation;
– title
as the attributes for dividing articles from bibliographic databases into clusters. To choose the weighting coefficients used in the formula of the similarity measure (2), a genetic algorithm was developed. The genetic algorithm belongs to the heuristic search algorithms used for solving optimization and modeling problems by random selection, combination and variation of the sought-for parameters using mechanisms similar to natural selection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It should be noted that earlier the weighting coefficients were adjusted manually, by experiment (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), and a change of the problem domain of the documents required a new series of experiments. The use of the genetic algorithm allows automating the search for the most acceptable weighting coefficients in the formula of the similarity measure.
      </p>
      <p>
        The genetic algorithm is executed in the following stages ([
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]):
1) Creation of initial population.
2) Selection.
3) Choice of parents.
4) Crossover.
5) Mutations.
      </p>
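<p>The stages above can be sketched as a driver loop. This is only a schematic illustration: the fitness here is a toy stand-in (fraction of genes equal to 1), not the paper's clustering fitness, and the evolution step is a placeholder for the selection, crossover and mutation operators described below.</p>

```java
// Sketch of the overall genetic-algorithm loop: create an initial population,
// then iterate until some individual reaches the quality threshold
// (0.8 in the paper). Fitness and the evolution step are placeholders.
import java.util.Random;

public class GaDriver {
    // toy fitness-function: fraction of genes equal to 1 (illustrative only)
    public static double fitness(int[] chromosome) {
        int ones = 0;
        for (int g : chromosome) if (g == 1) ones++;
        return ones / (double) chromosome.length;
    }

    public static int[] run(int[][] population, double threshold,
                            int maxGenerations, Random rnd) {
        for (int gen = 0; gen < maxGenerations; gen++) {
            for (int[] individual : population) {
                if (fitness(individual) >= threshold) return individual; // quality reached
            }
            // placeholder evolution step: mutate one random gene of each individual
            for (int[] individual : population) {
                individual[rnd.nextInt(individual.length)] = rnd.nextInt(2);
            }
        }
        return population[0]; // best effort if the threshold is never reached
    }

    public static void main(String[] args) {
        int[][] population = { {1, 1, 1, 0}, {0, 0, 1, 0} };
        int[] best = run(population, 0.7, 100, new Random(42));
        System.out.println(fitness(best) >= 0.7); // prints true
    }
}
```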
<p>The description of the realization of the genetic algorithm stages as applied to the clustering problem is presented below (Fig. 2).</p>
      <sec id="sec-3-1">
        <title>Creation of initial population</title>
<p>
          To create the initial population and support its further evolution, it is necessary to have an ordered chain of genes, or genotype. According to [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], a set of genotypes of the initial population is created in some, usually random, way. These genotypes are estimated using a "fitness-function", as a result of which each genotype is associated with a definite value ("fitness") that determines how well that genotype solves the set task. For this task, a chain of genes has a fixed length equal to 13 and represents a set of parameters made up on the basis of the attributes of the bibliographic description of documents.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>The structure of a chromosome</title>
<p>In genetic algorithms, the individuals entering the population are represented by ordered sequences of genes, or chromosomes, with the sets of the problem parameters coded in them. Figure 3 presents the structure of a chromosome consisting of 13 genes. The abbreviations in Figure 3 are formed from the first letters of each gene's name, for example, UseAbstract = UAb.</p>
<p>The values which can be taken by the genes are presented in the right column of Table 1. The genes of the given genotype are used as follows. Consider the genes whose values vary within the range from 0 to 3. If the value of such a gene equals 0, the corresponding attribute is not used in the creation of the population. If the value is greater than 0, it is the corresponding weight of the gene: authorsWeight, keywordsWeight, titleWeight, abstractTextWeight. These weights are used further when calculating the proximity measure m according to formula (2).</p>
<p>
          The genes from Table 1 whose names end in the word Equality (AuthorEquality, TitleEquality, KeywordsEquality, AbstractEquality) define the way of comparing the attributes of documents and are used in the creation of the population only if the corresponding values of the genes UseAuthors, UseTitle, UseKeyWords, UseAbstract are positive. If the values of AuthorEquality, TitleEquality, KeywordsEquality, AbstractEquality are equal to 0, we use the usual comparison by the Equals method to compare lists of authors, titles of articles and keywords. If the values of AuthorEquality, TitleEquality, KeywordsEquality are equal to 1, we use the Levenshtein distance to evaluate the proximity measure of the attributes [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. If the value of the gene AbstractEquality is equal to 1, the shingles algorithm [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is used to evaluate the proximity measure of the annotations.
        </p>
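<p>The Levenshtein-based comparison can be sketched as follows. The standard dynamic-programming edit distance is shown; the helper attributesMatch and the threshold interpretation (inclusive, so that a threshold of 0 requires exact coincidence) are our reading of the text, not code from the paper:</p>

```java
// Sketch of attribute comparison via Levenshtein distance, as used when an
// *Equality gene equals 1: two attribute values are considered equal when
// their edit distance does not exceed the POSSIBLE_DIFFERENCES threshold.
public class LevenshteinCompare {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // inclusive threshold: possibleDifferences = 0 demands exact coincidence
    public static boolean attributesMatch(String a, String b, int possibleDifferences) {
        return distance(a, b) <= possibleDifferences;
    }

    public static void main(String[] args) {
        System.out.println(distance("Ivanov", "Ivanof"));           // prints 1
        System.out.println(attributesMatch("Ivanov", "Ivanof", 1)); // prints true
    }
}
```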
<p>The values of the genes UseUdk, UseJournaSeria, UseYear are binary, i.e. depending on the value the gene is either used or not; if it is used, +1 is added to the measure m. The gene POSSIBLE_DIFFERENCES is a threshold value used when evaluating proximity by the Levenshtein distance; its value varies from 0 to 3. If POSSIBLE_DIFFERENCES = 0, the compared titles, authors and keywords must coincide completely. If the calculated Levenshtein distance does not exceed the threshold value, the corresponding weight (authorsWeight, titleWeight or keywordsWeight) is added to the proximity measure m. If the Levenshtein distance exceeds the threshold value, the attributes are concluded to be different.</p>
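<p>A chromosome of this kind can be represented as a fixed-length array of integer genes with named accessors. The gene positions below are hypothetical (the actual order is fixed in Fig. 3, not reproduced here); the value ranges follow the text:</p>

```java
// Sketch of decoding a 13-gene chromosome into clustering parameters.
// Gene positions are hypothetical; ranges follow the text: weight genes take
// 0..3 (0 = attribute unused), the UDC/series/year genes are binary, and
// POSSIBLE_DIFFERENCES is a Levenshtein threshold in 0..3.
public class Chromosome {
    private final int[] genes;

    public Chromosome(int[] genes) {
        this.genes = genes.clone();
    }

    // hypothetical positions of a few of the 13 genes
    public int authorsWeight()       { return genes[0]; }
    public boolean useAuthors()      { return genes[0] > 0; }
    public int authorEquality()      { return genes[1]; }  // 0 = Equals, 1 = Levenshtein
    public boolean useYear()         { return genes[11] != 0; }
    public int possibleDifferences() { return genes[12]; }

    public static void main(String[] args) {
        Chromosome c = new Chromosome(new int[]{2, 1, 0, 0, 3, 0, 1, 0, 0, 1, 0, 1, 2});
        System.out.println(c.useAuthors());          // prints true
        System.out.println(c.possibleDifferences()); // prints 2
    }
}
```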
      </sec>
      <sec id="sec-3-3">
        <title>Selection</title>
<p>In the genetic algorithm, a set of individuals, each with its own genotype, constitutes a certain solution of the clustering problem. Let us suppose that we have generated an individual, that is, a set of weighting coefficients for determining the measure of similarity is given.</p>
<p>
          At the stage of selection, the parents of the future individual are determined with the help of the Roulette Selection [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Tournament Selection [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and Elitism Selection [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] methods. Roulette Selection proceeds as follows. The fitness values of all individuals are summed to obtain a certain value sum; then a random number between 0 and sum is chosen. A cycle over the individuals is started, their fitnesses are summed, and as soon as the running sum exceeds the random number, the index of the individual which was the last to take part in the summing is returned. When using Tournament Selection, n tournaments are held to choose n individuals. When using Elitism Selection, the individuals with the greatest fitness pass directly into the new population. The use of elitism usually accelerates the convergence of the genetic algorithm; its disadvantage is that the probability of getting stuck in a local minimum increases.
        </p>
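<p>The roulette procedure described above can be sketched as follows. The random draw is passed in as a parameter so that the behavior is deterministic; in real use it would be a uniform random number in [0, sum):</p>

```java
// Sketch of Roulette Selection: sum the fitness values, draw a point
// in [0, sum), and return the first individual whose cumulative fitness
// exceeds the point.
public class Roulette {
    public static int select(double[] fitness, double point) {
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (cumulative > point) return i;
        }
        return fitness.length - 1; // point == sum edge case
    }

    public static void main(String[] args) {
        double[] fitness = {1.0, 2.0, 3.0};       // sum = 6
        System.out.println(select(fitness, 0.5)); // prints 0
        System.out.println(select(fitness, 3.5)); // prints 2
    }
}
```

<p>Individuals with higher fitness occupy a larger slice of [0, sum) and are therefore chosen more often, which is exactly the roulette-wheel idea.</p>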
      </sec>
      <sec id="sec-3-4">
        <title>Crossover</title>
<p>The surviving individuals take part in reproduction. The crossover operator combines two chromosomes (parents) to produce a new chromosome. The new chromosome may be better than both of the parents if it takes the best characteristics from each of them. For this, the following methods are used: one-point crossover, two-point crossover, uniform crossover, and variable-to-variable crossover. Figure 4 presents the crossover stage of the genetic algorithm. The stage of mutation is necessary to keep the solution of the problem from getting stuck in a local extremum. It is supposed that, after the crossover is completed, part of the new individuals undergo mutations. The essence of the mutation operator is as follows: in the chromosome under study, a random number of genes is selected at random. The coefficient of mutation determines the intensity of mutations: it determines the fraction of genes subjected to mutation on the current iteration relative to their total amount. In our case, 25% of all individuals are selected to be subjected to mutation (Fig. 5).</p>
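<p>One-point crossover (the first of the methods listed above) and mutation can be sketched as follows; the gene value range is illustrative:</p>

```java
// Sketch of the crossover and mutation operators. One-point crossover copies
// genes before the cut point from the first parent and the rest from the
// second; mutation reassigns one randomly chosen gene to a random value.
import java.util.Arrays;
import java.util.Random;

public class Operators {
    public static int[] onePointCrossover(int[] p1, int[] p2, int point) {
        int[] child = new int[p1.length];
        for (int i = 0; i < child.length; i++) {
            child[i] = (i < point) ? p1[i] : p2[i];
        }
        return child;
    }

    // mutate one random gene to a random value in [0, maxValue]
    public static int[] mutate(int[] chromosome, int maxValue, Random rnd) {
        int[] mutated = chromosome.clone();
        mutated[rnd.nextInt(mutated.length)] = rnd.nextInt(maxValue + 1);
        return mutated;
    }

    public static void main(String[] args) {
        int[] child = onePointCrossover(new int[]{1, 1, 1, 1}, new int[]{2, 2, 2, 2}, 2);
        System.out.println(Arrays.toString(child)); // prints [1, 1, 2, 2]
    }
}
```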
        <p>Thus, the genetic algorithm for the clustering problem is executed in two
stages: a stage of initialization and a stage of iterations.
The stage of initialization:
The first generation is formed.</p>
<p>The stage of iterations:
1) Clustering is performed by the FRiS-Tax algorithm.
2) The value of the fitness-function is calculated.</p>
<p>3) The value of the fitness-function is compared with the threshold value of quality. For this problem, the threshold value is equal to 0.8. If the predetermined value of clustering quality is reached, the algorithm stops.</p>
<p>4) If not, a new generation is formed: selection of individuals, reproduction, and mutations are performed.</p>
<p>
          5) Transition to step 1 is carried out.
In the algorithm, a fitness-function is given which allows determining how well the clustering problem is solved. In this work, the quality of the obtained clusters is evaluated using two measures: Purity [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and the root mean square deviation. The measure Purity is an external criterion of the quality of clustering which is calculated by the formula

          Purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|,

          where Ω = {ω_1, ω_2, ..., ω_K} is the result of clustering performed by an expert, C = {c_1, c_2, ..., c_J} is the result of clustering performed by the program, and N is the number of documents. Then, in order to determine which of the individuals is not selected and dies, and which of them survives and will take part in reproduction, we take a threshold for the values of the fitness-function: an individual dies if the function returns a value less than the taken threshold. The root mean square deviation is also used for evaluating the quality of clustering: the lower its value, the better the quality of clustering.
        </p>
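<p>The Purity computation can be sketched as follows, with clusters represented as illustrative arrays of document ids:</p>

```java
// Sketch of the Purity measure: for each expert cluster, take the size of
// its largest intersection with a program cluster, sum these sizes, and
// divide by the total number of documents N.
public class PurityMeasure {
    private static int overlap(int[] a, int[] b) {
        int count = 0;
        for (int x : a) for (int y : b) if (x == y) count++;
        return count;
    }

    public static double purity(int[][] expert, int[][] program, int n) {
        int sum = 0;
        for (int[] omega : expert) {
            int best = 0;
            for (int[] c : program) best = Math.max(best, overlap(omega, c));
            sum += best;
        }
        return (double) sum / n;
    }

    public static void main(String[] args) {
        int[][] expert  = { {1, 2, 3}, {4, 5} };
        int[][] program = { {1, 2}, {3, 4, 5} };
        System.out.println(purity(expert, program, 5)); // prints 0.8
    }
}
```

<p>A value of 1.0 would mean every expert cluster is contained in some program cluster; here the misplaced document 3 lowers the score to 0.8, which is exactly the paper's survival threshold.</p>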
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Development of the parallel clustering algorithm</title>
<p>
        As the number of documents to be processed grows, the time required for clustering increases exponentially; therefore, the development of a parallel clustering algorithm is justified. Parallelization is carried out in two stages of the clustering algorithm. Firstly, during the selection of individuals in the genetic algorithm, when clustering is performed with different sets of weighting coefficients. The program is written in Java, and this stage of the parallel algorithm is performed using MPJ Express, an implementation of an MPI-like API which can execute on a variety of parallel platforms ranging from multicore processors to compute clusters [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Secondly, directly in the course of performing the clustering algorithm. In the FRiS-Tax algorithm, the most computationally complex process is the traversal of all objects of the sampling, testing each of them for the role of a pillar. Parallelization of this stage is implemented with the help of the Java 8 Streams technology [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>Parallelization of the genetic algorithm</title>
        <p>The steps of a parallel version of the genetic algorithm are presented below (Fig.
6).</p>
<p>1) N processes are started using MPJ Express. The number N depends on the number of individuals in the first generation; i.e., if we increase the number of individuals to 64, then 64 processes are started. Each process is started on a separate computing node.</p>
        <p>2) Each process reads a file with the articles which are to be divided into clusters.</p>
        <p>3) The master-process generates N random chromosomes and sends them to the remaining processes.</p>
        <p>4) Each process takes one chromosome, creates an individual, computes the value of the fitness-function and sends it to the master-process.</p>
        <p>5) The master-process checks whether there is an individual with a fitness-function value greater than or equal to the given threshold (0.8). If such an individual is found, the master-process informs all the remaining processes that the individual is found and the work can be stopped.</p>
        <p>6) If not, the master-process starts selection. Crossover takes place within selection.</p>
        <p>7) After the parents are determined, a new individual is born. The old generation does not take part in the further work.</p>
        <p>8) After that, mutation is performed: random values are assigned to random genes. The mutation coefficient is 25%. While the master-process performs selection, crossover and mutation, the remaining processes wait.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Parallelization of the clustering algorithm</title>
<p>The load test revealed the two slowest stages in the FRiS-Tax clustering algorithm: the methods of finding the first pillar and of finding the next pillar. These traversals were parallelized using Java 8 Streams.</p>
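<p>The pillar-search traversal maps naturally onto Java 8 parallel streams: each candidate object is evaluated for the role of a pillar independently. A minimal sketch, with an illustrative similarity matrix and a simple "total similarity" score standing in for the FRiS criterion:</p>

```java
// Sketch of parallelizing the pillar search with Java 8 Streams: every
// object is tested for the role of a pillar in parallel, and the candidate
// with the highest total similarity to all objects wins. The similarity
// matrix and the scoring rule are illustrative.
import java.util.stream.IntStream;

public class ParallelPillar {
    public static int findPillar(double[][] sim) {
        return IntStream.range(0, sim.length)
                .parallel()  // candidates are scored concurrently
                .boxed()
                .max((i, j) -> Double.compare(total(sim, i), total(sim, j)))
                .orElse(-1);
    }

    private static double total(double[][] sim, int candidate) {
        double sum = 0.0;
        for (double v : sim[candidate]) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {1.0, 0.2, 0.3},
            {0.2, 1.0, 0.9},
            {0.3, 0.9, 1.0}
        };
        System.out.println(findPillar(sim)); // prints 2
    }
}
```

<p>Because each candidate's score depends only on read-only data, the parallel evaluation is safe without any locking, which is what makes this stage a good fit for streams.</p>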
      </sec>
    </sec>
    <sec id="sec-5">
      <title>The results of the computing experiment</title>
<p>To study the efficiency of the developed parallel algorithm, we carried out computing experiments in the Laboratory of Computer Sciences of RIMM at al-Farabi KazNU on a cluster comprising 16 computing nodes.</p>
<p>As initial data for analysis and clustering, the journal "Vestnik KazNU" for 2008-2015 was used. The sampling includes 95 pdf documents; the total number of articles is 2837. The choice of the initial data is conditioned by the fact that all documents were already divided into series (mathematics, biology, philosophy, etc.), so further division causes no difficulties when using similarity measures based only on the bibliographic descriptions or titles of the articles. In order to evaluate the quality of the division of the sampling, this corpus was divided into clusters by an expert in the problem domain. The execution time was determined as follows: we measured the time of the clustering processes for the clusters being formed on one computing node and on several computing nodes for the parallel realization. Figure 7 presents the dependence of the execution time of the clustering algorithm on the number of processes. Figures 8-9 present the speedup and efficiency of the parallel realization. As seen in the constructed diagrams, with the increase in the number of processes the speedup grows up to a certain value, which is related to the communication overhead. The optimum number of processes proved to be 8, at which the maximum speedup was observed, while the highest efficiency was achieved with 4 processes. Figures 10-11 present the distribution of the documents over the resulting clusters.</p>
<p>
        The second initial sampling consists of 522 scientific articles from the "Siberian Mathematical Journal". Each article has a code of the MSC2010 classifier, which was taken as a reference. This allowed objectively evaluating the quality of clustering when using the Purity fitness: the initial sampling was divided in advance by the classifier code into 8 large clusters. The computing experiment found the following best gene, [8, 2, 4, 4, 2, 2, 1, 0, 1, 2, 0], at which the value of the fitness-function was equal to 80%.
      </p>
<p>
        When using the root mean square deviation fitness, the best result was shown by the gene [45, 4, 0, 1, 0, 4, 1, 1, 0, 0, 1], and the value of the fitness-function was equal to 0.0043489306387577773.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
<p>The proposed methods for clustering documents in electronic form make it possible to carry out processing on systems consisting of more than one computing node. The attributes of the bibliographic description of documents were chosen as scales for determining the similarity measure. Parallel processes are realized at the stage of preliminary analysis of the documents, including the calculation of similarity measures between the documents, as well as directly at the stage of clustering. The use of the genetic algorithm allowed determining the values of the attributes at which the clustering of documents gives the best results.</p>
<p>Acknowledgments. This work was supported in part by a grant of the Foundation of the Ministry of Education and Science of the Republic of Kazakhstan, "Development of an intellectual high-performance information-analytical search system for processing of semi-structured data" (2015-2017).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Borisova</surname>
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zagoruiko</surname>
            <given-names>N.G.</given-names>
          </string-name>
          :
          <article-title>Functions rival similarity in the problem of taxonomy</article-title>
          .
          <source>In: Proceedings of Conference with international participation ”Knowledge - Ontology - Theory”. Novosibirsk</source>
          , Vol.
          <volume>2</volume>
          . pp.
          <fpage>67</fpage>
          -
          <lpage>76</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Borisova</surname>
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zagoruyko</surname>
            <given-names>N.G.</given-names>
          </string-name>
          :
          <article-title>Using FRiS-functions to solve the problem</article-title>
          SDX // Proceedings of the International Conference ”Classification, Forecasting,
          <string-name>
            <surname>Data Mining” CFDM</surname>
          </string-name>
          <year>2009</year>
          . Varna. pp.
          <fpage>110</fpage>
          -
          <lpage>116</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barakhnin</surname>
            <given-names>V.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nekhaeva</surname>
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedotov</surname>
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>On the statement of the similarity measure for the clustering</article-title>
          of text documents // Bulletin of Novosibirsk state University. Series:
          <article-title>Information technology</article-title>
          . Vol.
          <volume>6</volume>
          , No. 1. pp.
          <fpage>3</fpage>
          -
          <lpage>9</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zagoruiko</surname>
            <given-names>N.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borisova</surname>
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyubanov</surname>
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kutnenko</surname>
            <given-names>O.A.</given-names>
          </string-name>
          :
          <article-title>Functions of rival similarity in algorithms of recognition of combined type /</article-title>
          / Bulletin of Siberian State Aerospace University named after
          <string-name>
            <given-names>M.F.</given-names>
            <surname>Reshetnev</surname>
          </string-name>
          . Vol.
          <volume>5</volume>
          . pp.
          <fpage>19</fpage>
          -
          <lpage>21</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gladkov</surname>
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kureichik</surname>
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kureichik</surname>
            <given-names>V.M.</given-names>
          </string-name>
          :
          <source>Genetic Algorithms</source>
          . Ed. V.M. Kureichik. 2nd ed. Moscow, FIZMATLIT (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Navarro</surname>
            <given-names>Gonzalo</given-names>
          </string-name>
          :
          <article-title>A guided tour to approximate string matching</article-title>
          //
          <source>ACM Computing Surveys</source>
          . Vol.
          <volume>33</volume>
          (
          <issue>1</issue>
          ). pp.
          <fpage>31</fpage>
          -
          <lpage>88</lpage>
          . (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Andrei Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          :
          <article-title>Identifying and Filtering Near-Duplicate Documents</article-title>
          .
          <source>In: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM '00)</source>
          . Springer-Verlag London. pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Back</surname>
            <given-names>Thomas</given-names>
          </string-name>
          :
          <source>Evolutionary Algorithms in Theory and Practice</source>
          . Oxford Univ. Press. P.
          <fpage>120</fpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Chetan</given-names>
            <surname>Chudasama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mahesh</given-names>
            <surname>Panchal</surname>
          </string-name>
          :
          <article-title>Comparison of Parents Selection Methods of Genetic Algorithm for TSP</article-title>
          //
          <source>International Conference on Computer Communication and Networks CSI-COMNET-2011</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <article-title>Evaluation of clustering</article-title>
          . http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Piyatida</given-names>
            <surname>Rujasiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Boonorm</given-names>
            <surname>Chomtee</surname>
          </string-name>
          :
          <article-title>Comparison of Clustering Techniques for Cluster Analysis</article-title>
          //
          <source>Kasetsart J. (Nat. Sci.)</source>
          . Vol.
          <volume>43</volume>
          . pp.
          <fpage>378</fpage>
          -
          <lpage>388</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <source>MPJ-Express</source>
          . http://mpj-express.org/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <article-title>Processing Data with Java SE 8 Streams</article-title>
          . http://www.oracle.com/technetwork/articles/java/ma14-java-se-8-streams-2177646.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>