<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Similarity Metric of Categorical Distributions for Topic Modeling Problems with Akin Categories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serhiy Shtovba</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mykola Petrychko</string-name>
          <email>petrychko.myckola@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Shtovba</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vasyl' Stus Donetsk National University</institution>
          ,
          <addr-line>600-richchia str., 21, Vinnytsia, 21021</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vinnytsia National Technical University</institution>
          ,
          <addr-line>Khmelnytske Shose, 95, Vinnytsia, 21021</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Estimating the level of similarity of two objects is a common problem in pattern recognition, clustering, and classification. Among such problems are reviewer recommendation, analysis of similar text documents, human pose detection in video, species distribution clustering, recommendations in online shops, etc. In the case of categorical attributes, an object is described as a distribution of membership degrees over categories. Similarity metrics for such distributions are usually defined as a superposition of the objects' similarities over each category; most often it is the sum of the similarities in the separate categories. In addition, each category is considered independently and in isolation from the others. Some practical problems have categories that are akin. Therefore, it is expedient to consider the objects' similarity not only directly, as the similarity between equivalent categories, but also indirectly, as a cross-similarity through akin categories. This paper proposes such a similarity metric of two categorical distributions that accounts for the kinship of different categories. The metric has two components. The first component is defined as the Czekanowski metric; it defines the direct similarity of the categorical distributions as the sum of the intersection of the membership degrees of the two objects. The residuals remaining after the intersection are accounted for in the second component of the metric, which is defined through the element-wise product of two matrices: the matrix of the residuals' composition from the memberships of the two categorical distributions and the matrix of the categories' pairwise kinship. It is assumed that the kinship indices for each pair of categories are known. With a large number of categories, the overall noisy contribution from weakly akin categories becomes prominent; therefore, it is proposed to filter the noise and account only for the contribution from strongly akin categories.</p>
      </abstract>
      <kwd-group>
        <kwd>Categorical distribution</kwd>
        <kwd>akin categories</kwd>
        <kwd>kinship coefficient</kwd>
        <kwd>similarity metric</kwd>
        <kwd>Czekanowski metric</kwd>
        <kwd>topic modeling</kwd>
        <kwd>reviewer recommendation</kwd>
        <kwd>generalized Pareto distribution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction </title>
      <p>Topic modeling is a machine learning technology for the summarization of huge volumes of
information [1]. The popularity of topic modeling has increased drastically over the last decade: the
annual number of research papers doubles every 3-4 years.</p>
      <p>Estimating the level of similarity between two objects is a common task in topic modeling. For
estimating the level of similarity it is necessary to describe each object as a vector of attributes. If
objects are defined in a metric space, then each attribute is defined in a numerical scale. For example,
in Fisher’s Iris dataset, each flower is described with four attributes, namely the petal width, the petal
length, the sepal width, and the sepal length. An object's attributes can be categorical as well; in that
case they are defined as a distribution of membership degrees over categories. Such a categorical
representation of objects is often used in topic modeling problems. For the Fisher’s Iris dataset, the
result of flower classification can be represented as a categorical distribution, for example, some
flower is classified as Iris Setosa with a membership degree of 0.7, as Iris Virginica with a
membership degree of 0.1, and as Iris Versicolor with a membership degree of 0.2.</p>
      <p>Depending on the type of object representation, different similarity metrics are used. For objects
defined in a metric space, similarity is defined as a quantity inverse to the distance between two
points, where the coordinates of each point are the numerical values of the object's attributes. The
smaller the distance between the analyzed objects, the more similar they are. In paper [2] around 50
different metrics are analyzed; the most popular among them are particular cases of the Minkowski
metric: the Euclidean distance, the City Block distance, and the Chebyshev metric. The cosine
similarity metric is often used as well; it calculates the cosine of the angle between two vectors that
start at the origin and end at the analyzed objects.</p>
      <p>In a categorical space, the similarity between two objects is usually defined as a superposition of
the objects' similarities over each category. Most often it is the sum of the similarities over each
separate category. In this approach, each category is considered independently and in isolation from
the others. There is also an inverse approach, in which the divergence of the objects is first found for
each category and then aggregated to find the overall similarity. One popular variant of such a metric
is proposed in [3] to find the similarity of fuzzy sets; its authors define the divergence of objects as the
absolute difference of the membership degrees. All the metrics from the survey paper [2] and other
relevant publications, for example, [4, 5, 6, 7], do not consider the kinship between categories. But in
some practical problems the categories are akin. In such cases it is better to calculate the similarity
between objects not only directly, as the similarity between equivalent categories, but also to consider
the indirect cross-similarity through akin categories. Developing such a metric, which additionally
considers the similarity of objects through akin categories, is therefore the purpose of this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Representation of objects in the space of akin categories</title>
      <p>We present here an example of problems in which objects are described in the space of akin
categories.</p>
      <p>Let us consider the task of finding similar researchers, for example, for a reviewer recommendation
system. Based on research papers, each researcher can be categorized into a few research specialties
(research fields) according to some research classification system. For example, researcher A is
categorized to System Analysis with a membership degree of 0.4 and to Information Systems and
Information Technologies with a membership degree of 0.6. Researcher B is categorized to System
Analysis with a membership degree of 0.7 and to Computer Science with a membership degree of 0.2.
Researcher C is categorized to System Analysis with a membership degree of 0.4 and to Marketing
with a membership degree of 0.6.</p>
      <p>Finding the similarity between each pair of these researchers using the known metrics will only take
into account their membership degrees to System Analysis. Membership degrees in the other categories
are not taken into account because they differ for each researcher. The similarity between researchers
A and B is defined only by their membership degrees to the System Analysis category, which equal
0.4 and 0.7, respectively. If the similarity is defined as the common part of the membership degrees
using the min operation, then between researchers A and B it equals Fit(A, B) = min(0.4, 0.7) = 0.4.
By the same reasoning, the similarity between researchers A and C equals
Fit(A, C) = min(0.4, 0.4) = 0.4, and between researchers B and C it equals
Fit(B, C) = min(0.7, 0.4) = 0.4. Thus the similarity of all pairs of researchers is the same.
But the research domains are such that Information Systems and Information Technologies is
significantly closer to Computer Science than to Marketing. Also, Computer Science is significantly
closer to System Analysis than to Marketing. Therefore, the similarity between researchers A and B
has to be stronger than the similarity between researchers A and C, or between researchers B and C.
The well-known similarity metrics do not take the kinship of categories into account; therefore, they
cannot capture such peculiarities.</p>
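      <p>The min-based comparison above can be sketched in a few lines; this is a minimal illustration of the known metric's limitation, not the authors' code, and the category names follow the example:</p>

```python
# Researchers described as membership distributions over research fields.
A = {"System Analysis": 0.4, "Information Systems and IT": 0.6}
B = {"System Analysis": 0.7, "Computer Science": 0.2}
C = {"System Analysis": 0.4, "Marketing": 0.6}

def fit_min(x, y):
    """Similarity as the common part of membership degrees:
    sum of min over the categories shared by both objects."""
    return sum(min(x[c], y[c]) for c in x.keys() & y.keys())

# All three pairs score 0.4: the kinship between different
# categories (e.g. IS&IT vs Computer Science) is ignored.
print(fit_min(A, B), fit_min(A, C), fit_min(B, C))
```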
    </sec>
    <sec id="sec-3">
      <title>3. The proposed metric for akin categories </title>
      <p>Let us denote the number of categories as m. Then objects X and Y, whose similarity is to be
estimated, are represented by the following distributions of membership degrees over the categories:
(μ1(X), μ2(X), ..., μm(X)) and (μ1(Y), μ2(Y), ..., μm(Y)). The distributions are assumed to satisfy
the following conditions:
μi(X) ∈ [0; 1], i = 1, ..., m;
μi(Y) ∈ [0; 1], i = 1, ..., m;
Σ_{i=1..m} μi(X) = 1;
Σ_{i=1..m} μi(Y) = 1.</p>
      <p>The problem is to calculate the level of similarity of objects X and Y. The peculiarity of the
research domain is that some categories are akin. Therefore, it is necessary to consider not only the
similarity over identical categories but also the similarity over akin categories. Below we propose a
metric that accounts for the semantic kinship of categories.</p>
      <p>The similarity between two objects X and Y is proposed to be calculated in the following way:</p>
      <p>Fit(X, Y) = F′(X, Y) + F″(X, Y),   (1)
where F′(X, Y) denotes the addend that assesses the direct similarity between the objects X and Y over
identical categories, and F″(X, Y) denotes the addend that accounts for the similarity of the objects
X and Y over akin categories.</p>
      <p>We calculate the first addend of formula (1) by the simplified version of the Czekanowski metric
for the case when the membership degrees lie in [0; 1] and their sum equals 1. The resulting form of
the first addend in (1) is as follows:</p>
      <p>F′(X, Y) = Σ_{i=1..m} min(μi(X), μi(Y)),   (2)
where μi(X) denotes the membership degree of object X to the i-th category, i = 1, ..., m, and
μi(Y) denotes the membership degree of object Y to the i-th category, i = 1, ..., m.</p>
      <p>Formula (2) can be interpreted as the sum of the membership degrees of the intersection of the
fuzzy sets X and Y. Formula (2) implies that the overall similarity of two objects is the sum of their
similarities over each category. The similarity over a category is defined as the minimum of both
objects' memberships to the category. Thereby, one of the objects contributes the entire value of its
membership degree to a category, and the other one only a part of it.</p>
      <p>After applying formula (2) we get the following residuals of the membership degrees:
ri(X) = max(0, μi(X) − μi(Y));
ri(Y) = max(0, μi(Y) − μi(X)), i = 1, ..., m.</p>
      <p>Let us consider the contribution of the residuals to the similarity of the two objects using the
kinship of the categories. We assume that the information about the pairwise kinship of the categories
is known and denote it with the following relation: K = (kij), where kij ∈ [0; 1] denotes the kinship
coefficient of the i-th and j-th categories, i = 1, ..., m, j = 1, ..., m.</p>
      <p>The higher the kinship coefficient, the more akin the categories. The kinship relation is symmetric
and reflexive; therefore, kij = kji and kii = 1.</p>
      <p>We denote the composition of the residuals in the form of a matrix E = (eij), where
eij = min(ri(X), rj(Y)), i = 1, ..., m, j = 1, ..., m.</p>
      <p>The contribution of the residuals to metric (1), using the pairwise kinship of the categories, is
calculated as follows:</p>
      <p>F″(X, Y) = Σ_{i=1..m} Σ_{j=1..m} eij · kij.   (3)</p>
      <p>Example. Two objects are defined with the following membership degrees to the categories
A, B, C, D: X = (0.4, 0.3, 0.2, 0.1) and Y = (0.7, 0.1, 0.1, 0.1). The kinship of the categories is
described by the following matrix:
K = [1.0 0.5 0.0 0.0; 0.5 1.0 0.1 0.0; 0.0 0.1 1.0 0.3; 0.0 0.0 0.3 1.0].</p>
      <p>
        Let us calculate the similarity between objects X and Y using the proposed metric (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
      <p>
        To calculate the first addend of the similarity metric (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), let us find the intersection of the two distributions. The intersection is shown hatched in
Figure 1.
      </p>
      <p>The numerical value of the first addend is as follows:</p>
      <p>F′(X, Y) = min(0.4, 0.7) + min(0.3, 0.1) + min(0.2, 0.1) + min(0.1, 0.1) = 0.4 + 0.1 + 0.1 + 0.1 = 0.7.
The residuals after the intersection are r(X) = (0, 0.2, 0.1, 0) and r(Y) = (0.3, 0, 0, 0). Hence the
only nonzero elements of the residual composition matrix are e21 = min(0.2, 0.3) = 0.2 and
e31 = min(0.1, 0.3) = 0.1. From this matrix it is observed that the contribution of the akin categories
equals F″(X, Y) = 0.2 · 0.5 + 0.1 · 0.0 = 0.1, so Fit(X, Y) = 0.7 + 0.1 = 0.8.</p>
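      <p>The whole computation can be sketched as follows; the code is a minimal illustration of the two-component metric on the worked example, with illustrative function names, not the authors' implementation:</p>

```python
# Proposed two-component similarity metric for categorical distributions
# with akin categories: direct part (simplified Czekanowski metric) plus
# the residuals' cross-similarity weighted by the kinship matrix K.

def fit(x, y, K):
    m = len(x)
    # First addend: direct similarity over identical categories.
    direct = sum(min(x[i], y[i]) for i in range(m))
    # Residuals of membership degrees left after the intersection.
    rx = [max(0.0, x[i] - y[i]) for i in range(m)]
    ry = [max(0.0, y[i] - x[i]) for i in range(m)]
    # Second addend: cross-similarity through akin categories.
    akin = sum(min(rx[i], ry[j]) * K[i][j] for i in range(m) for j in range(m))
    return direct + akin

X = [0.4, 0.3, 0.2, 0.1]
Y = [0.7, 0.1, 0.1, 0.1]
K = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.1, 0.0],
     [0.0, 0.1, 1.0, 0.3],
     [0.0, 0.0, 0.3, 1.0]]
print(fit(X, Y, K))  # ≈ 0.8 (0.7 direct + 0.1 via akin categories)
```

Note that the second addend vanishes whenever K is the identity matrix, in which case the metric reduces to the plain Czekanowski similarity.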
    </sec>
    <sec id="sec-4">
      <title>4. Computational Experiments </title>
      <p>In the above example, taking into account the kinship of categories increased the similarity metric
by 0.1, which is 14% of the initial value obtained by the Czekanowski metric. Let us conduct
computational experiments to establish how sensitive the proposed metric is to taking the kinship of
categories into account.</p>
      <p>We perform 18 series of experiments. All the experiments within a series are carried out with the
same number of categories. The number of categories (m) increases from series to series from 2 to 70
with a step of 4. In each series, the similarity of 5000 pairs of objects X and Y is calculated. The
attributes of objects X and Y, as well as the kinship matrix of the categories, are randomly generated
using the generalized Pareto law [8]:
f(x) = (1/σ) · (1 + k·x/σ)^(−1/k − 1).</p>
      <p>
        The choice of this law is due to the fact that in topic modeling the distribution of the categories'
similarities is often close to a Pareto distribution (see, for example, [9, 10]). For each experiment,
we first randomly generate the parameters of the distributions: the shape parameter k is drawn from
the range [0.15; 0.5], the scale parameter σ is drawn from the range [0.001; 0.01], and the location
parameter is set to zero: μ = 0. In each experiment, using the synthesized distributions, we randomly
generate the coordinates of the vectors X and Y, as well as the kinship matrix K. In the matrix K the
maximum value of the elements is limited to 0.4; the vectors X and Y are normalized so that their
elements sum to 1. Then, we calculate the
similarity between X and Y using the proposed metric (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
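      <p>One experiment of a series might be generated as sketched below. This is an assumption-laden illustration: it samples the generalized Pareto law via the standard inverse CDF with μ = 0, and the function names and the exact way the kinship matrix is filled (symmetric, off-diagonal values capped at 0.4, kii = 1) are illustrative choices consistent with the description above, not the authors' code:</p>

```python
import random

def gpd_sample(k, sigma):
    """Inverse-CDF sample from the generalized Pareto law, location mu = 0."""
    u = random.random()
    return sigma / k * ((1.0 - u) ** (-k) - 1.0)

def make_pair(m):
    """One experiment: membership vectors X, Y and a kinship matrix K."""
    k = random.uniform(0.15, 0.5)        # shape parameter
    sigma = random.uniform(0.001, 0.01)  # scale parameter
    x = [gpd_sample(k, sigma) for _ in range(m)]
    y = [gpd_sample(k, sigma) for _ in range(m)]
    sx, sy = sum(x), sum(y)
    x = [v / sx for v in x]              # normalize memberships to sum to 1
    y = [v / sy for v in y]
    # Symmetric kinship matrix: k_ii = 1, off-diagonal values capped at 0.4.
    K = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            K[i][j] = K[j][i] = min(0.4, gpd_sample(k, sigma))
    return x, y, K
```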
      <p>The results of the experiments in the form of box-plot diagrams are shown in Figure 2. It can be
seen from the figure that the similarity values lie in the range [0; 1]. As the number of categories
increases, the spread of the results decreases. For m ≥ 6, the median of the distributions does not
depend on the number of categories.</p>
      <p>Figure 3 shows box-plot diagrams of the distribution of the term F″(X, Y), which takes into
account the similarity of objects X and Y over akin categories. The median of this value increases
from 0.001 to 0.066 as the number of categories grows. The spread of the distributions also increases,
while the number of outliers decreases from 550 to 7. The increase in the median probably indicates
that, as the number of categories increases, a large number of weak ties between akin categories make
a significant contribution to F″(X, Y). As the number of categories increases, the number of pairs of
akin categories grows quadratically. The contribution of many pairs of akin categories is tiny; it looks
like noise. But the sum of a huge number of noise contributions turns out to be large. The decrease in
the number of outliers is also a negative factor: the need for a new metric is primarily driven by the
need to identify specific cases with strong cross-over effects due to akin categories, and the decrease
in the number of outliers indicates that noise kinships make it difficult to detect such pairs of
objects.</p>
      <p>Let us reduce the noise contribution of categories with weak kinship. We assume the kinship to be
noise if the kinship coefficient is less than 0.05. Box-plot diagrams of the distribution of the noise
kinship contribution are shown in Figure 4. They show that the median noise contribution increases
linearly and reaches a value of 0.056 for 70 categories. The maximum value of the contribution from
noise kinship reaches 0.122.</p>
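      <p>The filtering step amounts to zeroing the weak off-diagonal kinship coefficients before the second addend of the metric is computed; the 0.05 threshold is the one used in the experiments, while the function name is illustrative:</p>

```python
NOISE_THRESHOLD = 0.05  # kinship below this level is treated as noise

def filter_kinship(K, threshold=NOISE_THRESHOLD):
    """Zero out weak off-diagonal kinship coefficients; keep k_ii = 1."""
    m = len(K)
    return [[K[i][j] if i == j or K[i][j] >= threshold else 0.0
             for j in range(m)]
            for i in range(m)]

K = [[1.0, 0.5, 0.04],
     [0.5, 1.0, 0.01],
     [0.04, 0.01, 1.0]]
print(filter_kinship(K))  # the weak links 0.04 and 0.01 are removed
```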
      <p>Figure 4: Distributions of noise contribution from consideration of akin categories </p>
      <p>As for the relative noise contribution (shown in Figure 5), its median exceeds 10% in the series of
experiments with a large number of categories. In each series of experiments there are numerous cases
when the noise contribution exceeds 15%. The noise contribution exceeds 25% in 194 cases out of
90,000 (Figure 6).</p>
      <p>Figure 5: Distributions of the relative value of the noise contribution from akin categories </p>
      <p>After eliminating the noise component, the box-plot diagrams of the distribution of the akin
categories' contribution are shown in Figure 7. The median of this value increases only from 0 to
0.005 as the number of categories increases. This is 13 times less than without noise elimination. At
the same time, a large number of outliers are observed: their share ranges from 5% to 20%. This
indicates that the metric is able to identify cases of strong interaction across akin categories. The
box-plots of the proposed metric after removing the contribution from the noise kinship of the
categories are shown in Figure 8. The median of the refined metric lies in the range [0.43; 0.69]. It is
still high, especially for cases with a large number of categories. Probably, this is caused by the noisy
memberships to many categories in the source distributions for X and Y. Hence, it is better to
eliminate the noisy memberships before assessing the similarity of two categorical distributions. It is
reasonable to do so for topic modeling problems in which it is known that any object may belong to
just a small number of categories.
Figure 8: Similarity distributions according to the proposed metric with noise filtering </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions </title>
      <p>A new similarity metric for categorical distributions is proposed. The distinguishing feature of the
proposed metric is that it takes the kinship of categories into account. The proposed metric has two
components. The first component is defined as the Czekanowski metric; it calculates the direct
similarity of the distributions over the categories as the sum of the intersection of the membership
degrees of the two objects. The second component of the metric takes into account the similarity of
the objects through akin categories. It is assumed that the kinship coefficients of each pair of
categories are known.</p>
      <p>Computational experiments have shown that, in the case of Pareto distributions of the objects'
memberships to categories and Pareto distributions of the kinship coefficients of the categories, the
proposed metric takes values from the interval [0; 1]. It was established that, with an increase in the
number of categories, the contribution of the term that takes the kinship of the categories into account
strongly increases. This is due to the fact that the number of weak ties between akin categories grows
quadratically, each of which adds some contribution to the value of the metric. Although the
contribution from each of the many pairs of akin categories is tiny, essentially noise, the sum of the
contributions turns out to be large.</p>
      <p>To eliminate the noise impact, we suggest ignoring the noise kinship of the categories. A simple
filter with a kinship coefficient threshold at the level of 0.05 eliminated this drawback. After such
noise filtering, the distributions of the second component of the metric, which takes the kinship of the
categories into account, still have a significant number of outliers, observed in all series of
experiments, both with a small and with a large number of categories. This fact indicates that the
proposed metric makes it easy to identify pairs of objects whose similarity is largely determined by
membership to akin categories.</p>
      <p>The proposed metric can be used for topic modeling tasks, in which, when evaluating the
similarity of two objects, it is necessary to take into account their membership to akin categories.
Such tasks can be the selection of reviewers of research papers and theses, the analysis of the
similarity of text documents, the identification of poses of people in a video stream, the clustering of
species distribution, the formation of recommendations in online shops, etc.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Abdelrazek, Y. Eid, E. Gaber, W. Medhat, A. Hassan, Topic modeling algorithms and applications: A survey, Information Systems 112 (2023) 102131. doi:10.1016/j.is.2022.102131.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] N. Sebe, J. Yu, Q. Tian, J. Amores, A new study on distance metrics as similarity measurement, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Toronto, 2006, pp. 533-536. doi:10.1109/ICME.2006.262443.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] W.-J. Wang, New similarity measures on fuzzy sets and on elements, Fuzzy Sets and Systems 85 (1997) 305-309. doi:10.1016/0165-0114(95)00365-7.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S.-H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, International Journal of Mathematical Models and Methods in Applied Sciences 1 (2007) 300-307.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Yu, Q. Tian, J. Amores, N. Sebe, Toward robust distance metric analysis for similarity estimation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, 2006, pp. 316-322. doi:10.1109/CVPR.2006.310.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Kontoghiorghes, A. Colubi, New metrics and tests for subject prevalence in documents based on topic modeling, International Journal of Approximate Reasoning 157 (2023) 49-69. doi:10.1016/j.ijar.2023.02.009.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. J. Weller-Fahy, B. J. Borghetti, A. A. Sodemann, A survey of distance and similarity measures used within network intrusion anomaly detection, IEEE Communications Surveys &amp; Tutorials 17 (2015) 70-91. doi:10.1109/COMST.2014.2336610.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] H. T. Davis, M. L. Feldstein, The generalized Pareto law as a model for progressively censored survival data, Biometrika 66 (1979) 299-306. doi:10.1093/biomet/66.2.299.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Shtovba, M. Petrychko, An algorithm for topic modeling of researchers taking into account their interests in Google Scholar profiles, in: Proceedings of the Fourth International Workshop on Computer Modeling and Intelligent Systems, Zaporizhzhia, 2021, CEUR Workshop Proceedings 2864 (2021) 299-311. URL: https://ceur-ws.org/Vol-2864/paper26.pdf.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Shtovba, M. Petrychko, Jaccard index-based assessing the similarity of research fields in Dimensions, in: Proceedings of the First International Workshop on Digital Content &amp; Smart Multimedia, Lviv, 2019, CEUR Workshop Proceedings 2533 (2019) 117-128. URL: https://ceur-ws.org/Vol-2533/paper11.pdf.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>