<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Full-Text Clustering Methods for Current Research Directions Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dmitry Devyatkin</string-name>
          <email>devyatkin@isa.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Tikhomirov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Shvets</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantin Popov</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015)</institution>
          ,
          <addr-line>Obninsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Systems Analysis of RAS</institution>
          ,
          <addr-line>Moscow</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Engelhardt Institute of Molecular Biology of RAS</institution>
          ,
          <addr-line>Moscow</addr-line>
        </aff>
      </contrib-group>
      <fpage>152</fpage>
      <lpage>156</lpage>
      <abstract>
        <p>The paper gives a brief overview of full-text clustering methods for current research direction detection, and a novel full-text clustering method is proposed. A dataset was created, and the experimental results were verified by experts with PhD degrees in the problem domain “Regenerative medicine”. According to the experimental results, the proposed method is well suited for research direction detection. Finally, the prospects and drawbacks of the proposed method are discussed. The research is supported by the Russian Foundation for Basic Research, project 14-29-05008-ofi_m.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Methods for research direction detection are commonly based on various clustering methods [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The problem is that all these methods require tuning to a research area and dataset. Common clustering evaluation metrics are not applicable to direction detection: they are often based on the degree of insularity of clusters [
        <xref ref-type="bibr" rid="ref1 ref12">1, 12</xref>
        ], while the criteria for cluster building are weakly formalized in the direction detection task. Several approaches solve this problem with empirical estimates determined by experts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, obtaining these estimates is a complicated and nondeterministic process, which requires interaction between data-mining experts and experts of the analyzed research area. Moreover, the criteria for including a scientific paper in a research direction depend on the research area of this paper, so the opinion of experts in this research area should be taken into consideration. Therefore, owing to this specificity of the problem, this paper proposes a semi-automatic approach that uses a small labelled part of the analyzed dataset for training clustering methods. Let D = {d<sub>1</sub>, d<sub>2</sub>, ..., d<sub>n</sub>} be a corpus of scientific papers, where n is the number of documents in this collection, and let C = {c<sub>1</sub>, c<sub>2</sub>, ..., c<sub>k</sub>} be a set of previously known research directions in D. Suppose D<sub>c</sub> ⊂ D is a small incomplete set of papers that belong to the directions of C. The goal is to obtain the full set of research directions C<sub>f</sub> and the allocation of papers to this set. The approach consists in applying a quality function that estimates the difference between the clustering results and the given distribution C; methods of constrained optimization are then used to tune the parameters of the clustering method. Unlike the well-known classifier training task, the complete set of directions C<sub>f</sub> is a priori unknown, so it is necessary to find a method which can be tuned properly in accordance with this approach.
      </p>
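      <p>The tuning idea can be sketched in code. This is an illustrative sketch only: the pairwise quality function, the toy one-dimensional clustering routine and the parameter grid are assumptions made for the example, not the authors' exact formulation.</p>

```python
# Sketch: choose a clustering parameter by maximizing a quality function
# computed on a small labelled subset (hypothetical toy data below).

def quality(pred, true):
    # Agreement with the known directions C, measured over labelled pairs:
    # a pair of papers should share a cluster iff it shares a direction.
    keys = sorted(true)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    ok = sum((true[a] == true[b]) == (pred[a] == pred[b]) for a, b in pairs)
    return ok / len(pairs)

def tune(cluster_fn, true, grid):
    # Constrained search over candidate parameter values (a crude stand-in
    # for the constrained optimization mentioned in the text).
    return max(grid, key=lambda h: quality(cluster_fn(h), true))

# Toy labelled subset: four "papers" on a line, two known directions.
points = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 1.0}
true = {"a": 0, "b": 0, "c": 1, "d": 1}

def cluster_fn(h):
    # Trivial stand-in clustering: split the papers at threshold h.
    return {k: int(v > h) for k, v in points.items()}

best_h = tune(cluster_fn, true, [0.05, 0.15, 0.5, 0.95])
print(best_h)  # 0.5 recovers the labelled grouping exactly
```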
      <p>
        The primary goal of this paper is to present an improved method for research direction detection which can process datasets semi-automatically. Besides, we present the results of a comparison between the proposed method and state-of-the-art clustering methods such as Birch [19], affinity propagation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and their combinations with the topical latent Dirichlet allocation model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>
        Consider some methods for research direction detection. There are three groups of such methods. All of them use clustering but differ in the applied similarity measure: the co-citation measure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the text measure, or a hybrid measure. The last is usually based on an assessment of co-citations of papers, on intelligent analysis of their texts, and on allocation of significant papers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] detection of prospective research directions is carried out using co-word analysis, i.e., paper similarity. This approach is close to generative topic-distribution clustering methods. The application of this method to the detection of regularities and trends in the area of information security is shown. The SCI database [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is used as the data source; keywords are collected from the lists of terms provided by the authors of the papers and then normalized. To track dynamic changes in the area of information security, the authors offer a whole set of methods for marking papers. They conclude that in the area of information security there is a constant set of directions, while at the same time new directions emerge regularly. The main disadvantage of this method is that it uses only the authors' structured keywords. This decreases the method's objectivity, because authors' keywords often differ from the real terms of a paper.
      </p>
      <p>In this work we use full-text clustering approaches because of their universality: they do not need citation databases to produce proper results. Let us discuss several clustering methods.</p>
      <p>
        For example, affinity propagation is one of the common clustering methods. In this method a set of papers is considered as a network of connected nodes, and the similarity of papers corresponds to the weights of edges in this network. The method is based on a mechanism of “passing messages” between the nodes of the network [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Affinity propagation detects “exemplars”: papers in the input dataset that are representative of clusters. The network nodes send messages to each other until a set of exemplars and clusters gradually emerges. A potential drawback of this method is the initial selection of exemplars: one cannot provide an initial distribution of exemplars for all directions, so the exemplars can be selected inaccurately. In addition, the approach can lead to inaccuracies when it is impossible to identify an exemplar that describes all the papers in a cluster properly [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Unlike affinity propagation, the proposed method uses term descriptors to describe clusters. These terms correspond to all papers of a cluster (due to the use of the topic importance measure for term weighting) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which leads to better clustering quality. Another common clustering method is Birch [19]. This algorithm builds a weighted balanced tree (CF-tree) in a single pass through the data; information about sub-clusters is stored in the leaves of the tree. During clustering, each paper is added to an existing leaf, or a new leaf is created. Birch can also be used in stream clustering, where the number of documents to be processed can be theoretically infinite.
      </p>
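      <p>For readers who want to try these two baselines, scikit-learn ships both algorithms. The toy two-dimensional “papers” below are an assumption made for the demonstration and have nothing to do with the experimental dataset.</p>

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Six toy "papers" as 2-D feature vectors forming two obvious groups.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])

ap = AffinityPropagation(random_state=0).fit(X)
print(ap.cluster_centers_indices_)  # indices of the chosen exemplars
print(ap.labels_)                   # cluster label of every "paper"
```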
      <p>
        Specialized methods for text clustering are worth mentioning. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] a well-scalable full-text clustering approach for detecting research directions was proposed. Hash functions are widely used in this method in order to achieve good performance. However, the method has insufficient coverage of a dataset: a large number of papers do not belong to any of the clusters. In this study we improve the classification decision rule for better coverage. We also refine the extraction of cluster cores for better clustering quality.
      </p>
      <p>
        Topic modelling methods based on generative models are often applied for full-text clustering. For example, Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is applied for clustering large amounts of papers. The method is an improvement over Latent Semantic Indexing. It assumes that each object is represented in several classes whose distribution belongs to the parametric family of Dirichlet distributions. A drawback of the method is its instability with respect to the input data, which can make the results uninterpretable [
        <xref ref-type="bibr" rid="ref18 ref8">8, 18</xref>
        ]. We use LDA as a preliminary step before clustering with well-known methods. This allows significantly reducing the dimensionality of the feature space and thus increasing the performance and quality of common clustering methods.
      </p>
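      <p>The LDA-plus-clustering combination can be sketched with scikit-learn. The four-document corpus below is made up, and the parameter values are illustrative assumptions.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import Birch

# Tiny made-up corpus with two thematically disjoint groups of "papers".
docs = [
    "stem cell tissue regeneration scaffold",
    "stem cell scaffold tissue engineering",
    "stroke brain neuron ischemia recovery",
    "brain stroke ischemia neural recovery",
]

counts = CountVectorizer().fit_transform(docs)
# LDA shrinks the bag-of-words space to a few topic weights per paper ...
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(counts)
# ... and a common clustering method then runs in the reduced space.
labels = Birch(n_clusters=2, threshold=0.05).fit_predict(topics)
print(labels)
```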
    </sec>
    <sec id="sec-3">
      <title>3 Method description</title>
      <sec id="sec-3-1">
        <title>3.1 Fast algorithm for similar document search</title>
        <p>
          The cornerstone of the proposed full-text clustering method is similar document search, previously discussed in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This algorithm uses inverted index structures for the input data; most full-text search algorithms apply similar structures to improve search performance [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Let w be a term (a word or a phrase) from a paper and t the weight of this term. We use the well-known tf-idf measure for weighting terms in papers. It is the product of the term frequency function tf and the inverse document frequency function idf.
        </p>
        <p>tf(w, d) = n<sub>i</sub> / Σ<sub>k</sub> n<sub>k</sub>, where n<sub>i</sub> is the number of times term w occurs in paper d, and Σ<sub>k</sub> n<sub>k</sub> is the number of all term occurrences in d.</p>
        <p>idf(w, D) = log(|D| / |{d<sub>i</sub> ∈ D : w ∈ d<sub>i</sub>}|), where |D| is the number of papers in corpus D, |{d<sub>i</sub> ∈ D : w ∈ d<sub>i</sub>}| is the number of papers from D that contain term w, and the logarithm base does not matter.</p>
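        <p>The two formulas translate directly into code; the three-paper corpus below is a made-up example.</p>

```python
import math
from collections import Counter

def tf(term, doc_terms):
    # Term frequency: occurrences of `term` over all term occurrences in the paper.
    counts = Counter(doc_terms)
    return counts[term] / sum(counts.values())

def idf(term, corpus):
    # Inverse document frequency over the corpus; the log base is irrelevant.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

corpus = [["stem", "cell", "therapy"],
          ["stem", "cell", "scaffold"],
          ["stroke", "brain", "neuron"]]

# tf-idf weight of "stem" in the first paper: (1/3) * log(3/2).
weight = tf("stem", corpus[0]) * idf("stem", corpus)
```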
        <p>Define the direct index of papers as a function that returns the set of different terms and their weights from paper d: Didx(d) = {⟨w<sub>1</sub>, t<sub>1</sub>⟩, ⟨w<sub>2</sub>, t<sub>2</sub>⟩, ..., ⟨w<sub>n</sub>, t<sub>n</sub>⟩}, where n is the number of different terms in the document.</p>
        <p>
          Define the inverted index of papers as a function that returns the set of papers in which a given term w occurs, together with the weights of this term in these papers: Iidx(w) = {⟨d<sub>1</sub>, t<sub>1</sub>⟩, ⟨d<sub>2</sub>, t<sub>2</sub>⟩, ..., ⟨d<sub>n</sub>, t<sub>n</sub>⟩}. The fast similar document search algorithm is the following.
          1. Get the terms of paper d from its direct index Didx(d).
          2. Retrieve from the inverted index Iidx(w) the list of papers containing the terms obtained in step 1.
          3. Filter out papers that are only weakly intersected by the terms of the input paper.
          4. Get the lists of terms of the retrieved papers from Didx.
          5. Calculate the Manhattan [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] distance between the term lists obtained in step 4 and the terms of the input paper. We chose this distance because it is fast to calculate and provides quality comparable to more complicated metrics such as the cosine distance [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
          6. Filter out papers whose distance to the input paper exceeds a predefined threshold H.
        </p>
        <p>In step 3 of this algorithm we cut off weakly similar papers without calculating a distance measure, which significantly speeds up the algorithm. Steps 1–4 can be performed in parallel, which improves performance further.</p>
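        <p>The six steps can be condensed into a small sketch; the miniature direct index and the min_shared cut-off used for step 3 are illustrative assumptions.</p>

```python
from collections import defaultdict

# Hypothetical miniature direct index: paper id -> {term: tf-idf weight}.
didx = {
    "p1": {"stem": 0.4, "cell": 0.3, "therapy": 0.2},
    "p2": {"stem": 0.5, "cell": 0.2, "scaffold": 0.3},
    "p3": {"stroke": 0.6, "brain": 0.4},
}

# Build the inverted index: term -> {paper id: weight}.
iidx = defaultdict(dict)
for doc, terms in didx.items():
    for term, w in terms.items():
        iidx[term][doc] = w

def similar(query, H, min_shared=2):
    """Steps 1-6: candidate retrieval via the indexes, then Manhattan filtering."""
    q = didx[query]                       # step 1: terms of the input paper
    shared = defaultdict(int)             # step 2: candidates via the inverted index
    for term in q:
        for doc in iidx[term]:
            if doc != query:
                shared[doc] += 1
    # Step 3: drop weakly intersected candidates before any distance computation.
    candidates = [d for d, k in shared.items() if k >= min_shared]
    result = []
    for doc in candidates:                # step 4: term lists of the candidates
        terms = set(q) | set(didx[doc])
        # Step 5: Manhattan distance between the weight vectors.
        dist = sum(abs(q.get(t, 0.0) - didx[doc].get(t, 0.0)) for t in terms)
        if dist <= H:                     # step 6: distance threshold H
            result.append(doc)
    return result

print(similar("p1", H=1.0))  # ['p2']
```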
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Clustering algorithm</title>
        <p>Suppose a descriptor of a cluster is a set of terms relevant to this cluster. Let an initial core of a cluster be a subset of highly similar papers, which is used to generate a descriptor; the descriptor is built from the full texts of these papers. Let Core(d) be a function that returns a tuple ⟨c, m⟩ for a given paper d, where c is an identifier of the initial core of a cluster and m is a numeric characteristic of paper d in initial core c. The characteristic m is used to decrease the distance threshold H while an initial core is being built, which prevents initial cores from merging. If paper d is not present in any initial core, Core(d) returns an empty tuple. The direct index of descriptors is a function that returns the set of different terms and their weights from the descriptor of cluster c:</p>
        <p>CDidx(c) = {⟨w<sub>1</sub>, t<sub>1</sub>⟩, ⟨w<sub>2</sub>, t<sub>2</sub>⟩, ..., ⟨w<sub>n</sub>, t<sub>n</sub>⟩}, where n is the number of different terms in the descriptor. The inverted index of descriptors is a function that returns the set of descriptors in which a given term w occurs, together with the weights of this term in these descriptors:</p>
        <p>CIidx(w) = {⟨c<sub>1</sub>, t<sub>1</sub>⟩, ⟨c<sub>2</sub>, t<sub>2</sub>⟩, ..., ⟨c<sub>n</sub>, t<sub>n</sub>⟩}.</p>
        <p>We use the topic importance measure I for weighting the terms of descriptors. For a term w from corpus D and cluster c it is calculated as follows: Δ(w, c, D) = idf(w, D) - idf(w, D<sub>c</sub>), I(w, c, D) = Δ(w, c, D) · X(Δ(w, c, D)), where D<sub>c</sub> is the set of papers from cluster c and X(·) is the Heaviside step function. Due to the topic importance measure, the descriptor of a cluster consists mostly of terms specific to the papers of that cluster.</p>
        <p>Let H<sub>a</sub> and ε be clustering thresholds that affect the insularity of clusters.</p>
        <p>
          The proposed full-text clustering method consists of two parts. The first part can be performed independently on different nodes of a computer network; conventionally it can be called “detection of initial cores of clusters”. The steps are the following.
          1. Index the input corpus D.
          2. While D is not empty, exclude a paper d from D. If D is empty, turn to step 6. If Core(d) returns an empty tuple, turn to step 3, else turn to step 4.
          3. Generate a new cluster core identifier c. Add a new tuple ⟨c, 1⟩ with m = 1 to Core. Then turn to step 5.
          4. Get the identifier c of the current initial core and m of the current paper. Reduce m: m = ε · m.
          5. Search for similar papers with the fast algorithm, using threshold H = H<sub>a</sub> · m; add them to Core with the current value of m. Then turn to step 2.
          6. Build descriptors CDidx(c) and CIidx(w) for each initial core c.
        </p>
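        <p>The topic importance measure can be sketched in code, assuming it equals the positive part of idf(w, D) - idf(w, D<sub>c</sub>); the four-document corpus below is a made-up example.</p>

```python
import math

def idf(term, docs):
    # Inverse document frequency over a set of documents.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def topic_importance(term, cluster_docs, corpus_docs):
    # I(w, c, D): idf difference kept only when positive (Heaviside step X).
    if not any(term in d for d in cluster_docs):
        return 0.0  # the term does not occur in the cluster at all
    delta = idf(term, corpus_docs) - idf(term, cluster_docs)
    return max(delta, 0.0)

corpus = [{"stem", "cell"}, {"stem", "scaffold"},
          {"stroke", "brain"}, {"brain", "neuron"}]
cluster = corpus[:2]  # a cluster about stem cells

print(topic_importance("stem", cluster, corpus))   # log 2: specific to the cluster
print(topic_importance("brain", cluster, corpus))  # 0.0: absent from the cluster
```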
        <p>
          The second part is performed on a selected “supervisor” node and consists of the following steps.
          1. Search for similar descriptors with the fast algorithm, using CDidx and CIidx, and merge them. In practice this step is useful if the first part executes on several computer nodes.
          2. Classify all papers. For each retrieved descriptor, use steps 2–6 of the fast similar document search algorithm with the corresponding CDidx as the list of target terms. To prevent fuzzy clustering, each paper is labelled with the identifier of its cluster and with the value of its similarity to the descriptor of this cluster; the classifier uses these labels to put each paper into the most similar cluster.
        </p>
        <p>The advantage of the proposed method is its distributed nature, which makes full-text clustering of large amounts of papers feasible. That is necessary for detecting current research directions in a wide research area. Another benefit is the use of indexes with an incremental structure, so adding new papers to the corpus D does not require full re-clustering.</p>
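        <p>The first part of the method (detection of initial cores) can be sketched as a loop over the corpus. The three-paper index, the stand-in similarity search, and the threshold values H_a and eps are illustrative assumptions.</p>

```python
def manhattan(a, b):
    # Manhattan distance between two {term: weight} vectors.
    terms = set(a) | set(b)
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in terms)

# Hypothetical miniature corpus: paper id -> {term: tf-idf weight}.
papers = {
    "p1": {"stem": 0.5, "cell": 0.5},
    "p2": {"stem": 0.45, "cell": 0.55},
    "p3": {"stroke": 0.6, "brain": 0.4},
}

def similar(doc, H):
    # Stand-in for the fast similar document search with threshold H.
    return [d for d in papers if d != doc and manhattan(papers[doc], papers[d]) <= H]

def detect_cores(corpus, H_a=0.8, eps=0.9):
    core = {}                      # paper -> (core id, m), i.e. Core(d)
    next_id = 0
    queue = list(corpus)
    while queue:
        d = queue.pop()            # step 2: exclude a paper from D
        if d not in core:
            next_id += 1           # step 3: start a new initial core, m = 1
            c, m = next_id, 1.0
        else:
            c, m = core[d]
            m *= eps               # step 4: tighten the threshold for this core
        core[d] = (c, m)
        for s in similar(d, H_a * m):   # step 5: attach similar papers
            if s not in core:
                core[s] = (c, m)
    return core

cores = detect_cores(papers)  # p1 and p2 end up in one core, p3 in another
```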
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiment</title>
      <sec id="sec-4-1">
        <title>4.1 Dataset description</title>
        <p>
          For the experiment, a dataset of papers from the research area “Regenerative medicine” was created and verified by experts with PhD degrees. The dataset contains 112 well-cited, publicly accessible papers from 2000 to 2014 distributed over 4 research directions. We apply the widely used linguistic analysis library Freeling [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to retrieve normalized terms (words and noun phrases without stop words) from the texts of the papers. These terms are included in the dataset. The dataset is available on demand and can be extended and used according to the BSD license.
        </p>
        <table-wrap id="table-1">
          <label>Table 1</label>
          <caption>
            <p>Cluster descriptors and their terms</p>
          </caption>
          <table>
            <thead>
              <tr><th>Descriptor</th><th>Terms</th></tr>
            </thead>
            <tbody>
              <tr><td>Brain stroke</td><td>stroke, brdu test, behavioral, endothelial cell, cell brdu, psa-ncam, cortical cell, reactive, regeneration neuronal, endogenous, neurogenesis, endostatin, cortical, expansion nonhematopoietic, ischemic neuron</td></tr>
              <tr><td>Chimeric antigen receptor cell therapy</td><td>pbls, receptor chimeric, cd19-speci, immunity, antitumor, adoptive ifn, malignancy b-cell, aapcs, protein fusion, infusion cell</td></tr>
              <tr><td>Induced pluripotent stem cells</td><td>gene signature, photoreceptor, epithelium retinal, retinal, suppressor, tumor, cell ips, technology ips, retinal cell, ips-derived, ipscs pigmented</td></tr>
              <tr><td>Wound healing burns</td><td>wound chronic, wound heal, keratinocyte, epidermal cell, fusenig keratin, epidermis region,</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We use the semi-automatic approach for tuning the clustering methods, which allows avoiding empirical parameters. The parameter β sets the priority of precision over recall; in our experiment, β = 1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Experiment setup</title>
        <table-wrap id="table-2">
          <label>Table 2</label>
          <caption>
            <p>Clustering quality</p>
          </caption>
          <table>
            <thead>
              <tr><th>Measure</th><th>Affinity propagation</th></tr>
            </thead>
            <tbody>
              <tr><td>Precision</td><td>0.85</td></tr>
              <tr><td>Recall</td><td>0.28</td></tr>
              <tr><td>F-measure</td><td>0.42</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Clustering parameters are calculated by maximization of the quality function with the boundaries 0 ≤ H<sub>a</sub> ≤ 1 and 0.7 ≤ ε ≤ 0.99 (for the proposed method). For the other methods we set boundaries according to the constraints of these methods. We apply the commonly used F-measure for the assessment of classification methods. This function is based on precision and recall.</p>
        <p>Recall R is defined as the ratio of the number of correctly classified papers to the size of the class in the train set: R = t / (t + c), where t is the number of correctly classified papers and c is the number of papers which the method assigned to other classes.</p>
        <p>Precision P is defined as the ratio of the number of correctly classified papers to the total number of classified papers: P = t / (t + i), where t is the number of correctly classified papers and i is the number of incorrectly classified papers.</p>
        <p>The F-measure is then calculated as follows: F = (β<sup>2</sup> + 1) · P · R / (β<sup>2</sup> · P + R). We use the controlled random search method [7] for the optimization of the quality function.</p>
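        <p>The F-measure formula is easy to check numerically; with β = 1 it reproduces, for instance, the affinity propagation figures above (P = 0.85 and R = 0.28 give F ≈ 0.42).</p>

```python
def f_beta(p, r, beta=1.0):
    # F-measure from the precision/recall definitions above.
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(round(f_beta(0.85, 0.28), 2))  # 0.42
```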
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Experiment result</title>
        <p>Thus, we show that the proposed method can be tuned properly using a part of the analyzed dataset. The results show that the proposed method achieves sufficient clustering quality (Table 2), and the generated terms represent the research directions from the dataset. Table 3 shows the experiment results obtained with the cross-validation technique. Both qualitative and quantitative assessment shows that the proposed method produces relatively good results. In contrast, the affinity propagation method returns unsatisfactory results; evidently, this method is not suitable for the research direction detection problem. Possibly this is related to incorrect exemplar selection caused by the missing initial distribution of exemplars.</p>
        <p>We suggest that our method outperforms the other examined methods due to applying the Manhattan distance for paper similarity estimation, using the topic importance measure to determine the terms of clusters, and implementing linguistic analysis for term extraction. This leads to a more precise estimation of the similarity between papers and, consequently, to better quality of research direction detection.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>In this study we investigated various clustering-based methods for the detection of current research directions. We found that it is necessary to create a method that can be tuned successfully on a small labelled part of the analyzed dataset.</p>
      <p>In this paper we presented an improved full-text clustering method for the detection of research directions. The experiment results show that the proposed method is better suited for research direction detection than the other widely used methods. The semi-automatic approach for tuning the clustering method also demonstrated its performance. Besides, it was shown that full-text clustering methods work imprecisely for thematically heterogeneous research directions. This paper is an initial study that will be extended in further research. In particular, the further development of the proposed method consists in the implementation of hybrid metrics for clustering; such metrics can use not only text similarity but also co-citations and co-authorships of the papers in a dataset. It is also necessary to continue working with experts to extend our experimental dataset, because hybrid clustering methods cannot process small datasets properly.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Aletras</surname>
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stevenson</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>Evaluating topic coherence using distributional semantics //</article-title>
          <source>Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) - Long Papers</source>
          . -
          <year>2013</year>
          . - pp.
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Black</surname>
            <given-names>P. E.</given-names>
          </string-name>
          <article-title>Manhattan distance //</article-title>
          <source>Dictionary of Algorithms and Data Structures</source>
          . -
          <year>2006</year>
          . - Vol.
          <volume>18</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Blei</surname>
            <given-names>D. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            <given-names>A. Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jordan</surname>
            <given-names>M. I.</given-names>
          </string-name>
          <article-title>Latent dirichlet allocation //</article-title>
          <source>The Journal of machine Learning research. - 2003</source>
          . - Vol.
          <volume>3</volume>
          . - pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cobo</surname>
            <given-names>M. J.</given-names>
          </string-name>
          et al.
          <article-title>Science mapping software tools: Review, analysis, and cooperative study among tools</article-title>
          .
          <source>// Journal of the American Society for Information Science and Technology. - 2011</source>
          . - Vol.
          <volume>62</volume>
          (
          <issue>7</issue>
          ). - pp.
          <fpage>1382</fpage>
          -
          <lpage>1402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Frey</surname>
            <given-names>B. J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dueck</surname>
            <given-names>D.</given-names>
          </string-name>
          <article-title>Clustering by passing messages between data points</article-title>
          . // Science. -
          <year>2007</year>
          . - Vol.
          <volume>315</volume>
          (
          <issue>5814</issue>
          ). - pp.
          <fpage>972</fpage>
          -
          <lpage>976</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Glanzel</surname>
            <given-names>W.</given-names>
          </string-name>
          <article-title>Bibliometric methods for detecting and analysing emerging research topics</article-title>
          . // El profesional de la informacion.
          <source>- 2012</source>
          . - Vol.
          <volume>21</volume>
          (
          <issue>2</issue>
          ). - pp.
          <fpage>194</fpage>
          -
          <lpage>201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          optimization // J. Optim.
          <source>Theory Appl</source>
          . -
          <year>2006</year>
          . - Vol.
          <volume>130</volume>
          (
          <issue>2</issue>
          ). - pp.
          <fpage>253</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Koltcov</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koltsova</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolenko</surname>
            <given-names>S.</given-names>
          </string-name>
          <article-title>Latent dirichlet allocation: stability and applications to studies of user-generated content //</article-title>
          <source>Proceedings of the 2014 ACM conference on Web science. - ACM</source>
          ,
          <year>2014</year>
          . - pp.
          <fpage>161</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Lee</surname>
            <given-names>W. H.</given-names>
          </string-name>
          <article-title>How to identify emerging research fields using scientometrics: An example in the field</article-title>
          of information security // Scientometrics. -
          <year>2008</year>
          . - Vol.
          <volume>76</volume>
          (
          <issue>3</issue>
          ). - pp.
          <fpage>503</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Leone</surname>
            <given-names>M.</given-names>
          </string-name>
          et al.
          <article-title>Clustering by soft-constraint affinity propagation: applications to geneexpression data //Bioinformatics</article-title>
          . - 2007. - Vol.
          <volume>23</volume>
          (
          <issue>20</issue>
          ). - pp.
          <fpage>2708</fpage>
          -
          <lpage>2715</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Padró</surname>
            <given-names>Lluís</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stanilovsky</surname>
            <given-names>Evgeny</given-names>
          </string-name>
          .
          <article-title>FreeLing 3.0: Towards Wider Multilinguality //</article-title>
          <source>Proceedings of the Language Resources and Evaluation Conference (LREC 2012)</source>
          . - Istanbul,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Rousseeuw</surname>
            <given-names>P. J.</given-names>
          </string-name>
          <article-title>Silhouettes: a graphical aid to the interpretation and validation of cluster analysis //</article-title>
          <source>Journal of computational and applied mathematics. - 1987</source>
          . - Vol.
          <volume>20</volume>
          . - pp.
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Science Citation Index by Thomson Reuters. http://thomsonreuters.com/science-citation-indexexpand,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Shvets</surname>
            <given-names>A.</given-names>
          </string-name>
          et al.
          <article-title>Detection of Current Research Directions Based on Full-Text Clustering //</article-title>
          <source>Proceedings of the Science and Information Conference. - London</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sochenkov</surname>
            <given-names>I.</given-names>
          </string-name>
          <article-title>Relational-situational data structures, algorithms and methods for search and analytical tasks solving [in Russian]</article-title>
          .
          <source>PhD thesis</source>
          .
          <source>Institute for systems analysis of RAS</source>
          , Moscow,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Suvorov</surname>
            <given-names>R. E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sochenkov</surname>
            <given-names>I. V.</given-names>
          </string-name>
          <article-title>Method for detecting relationships between sci-tech documents based on topic importance characteristic</article-title>
          .
          <source>[In Russian] ISA RAS</source>
          , Moscow,
          <year>2013</year>
          . - Vol.
          <volume>1</volume>
          . - pp.
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Stein</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>Principles of hash-based text retrieval //Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</article-title>
          .
          <source>- ACM</source>
          ,
          <year>2007</year>
          . - pp.
          <fpage>527</fpage>
          -
          <lpage>534</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Vorontsov</surname>
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Potapenko</surname>
            <given-names>A.</given-names>
          </string-name>
          <article-title>Additive regularization of topic models //</article-title>
          <source>Machine Learning. - 2014</source>
          . - pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Zhang</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Livny</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>BIRCH: an efficient data clustering method for very large databases //</article-title>
          <source>ACM SIGMOD Record. - ACM, 1996</source>
          . - Vol.
          <volume>25</volume>
          (
          <issue>2</issue>
          ). - pp.
          <fpage>103</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>