<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Obtaining the Minimal Terminologically Saturated Document Set with Controlled Snowball Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hennadii Dobrovolskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliya K</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Zaporizhzhya National University</institution>
          ,
          <addr-line>Zaporizhzhya, Zhukovskogo st. 66, 69600</addr-line>
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Collecting scientific papers to write a Related Work section, to keep expertise in a topic of interest up to date, or to study a new scientific direction is an ill-defined information need that does not allow certainty about the completeness of the search results. The controlled snowball method suggested by the authors in previous papers is extended with an objective criterion of result completeness that allows stopping the search. The criterion is based on the assumption that a complete document set contains all terms describing the topic of interest, so appending a new document to the complete collection does not extend the list of terms. In the experiments, we compare our method of gathering scientific papers on the topic "Ontologies (computer science)" with three other common approaches: search by automatically detected topic in the "Microsoft Academic" database, keyword search in the Google Scholar database, and querying the ACM digital library with author</p>
      </abstract>
      <kwd-group>
        <kwd>terminological saturation</kwd>
        <kwd>minimal saturated document set</kwd>
        <kwd>citation network</kwd>
        <kwd>controlled snowball sampling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Research on the search behaviour of scientists [2] shows that, in addition to
the related work review, typical tasks are the study of new trends, maintaining
awareness, and the search for reviewers and/or colleagues for joint scientific projects. All
the aforementioned tasks are characterized by low specificity, a high volume
of results and, consequently, long search time. For example, a scientific search
for the task of studying a new theory can last months or even years. An analysis
of modern search engines showed a lack of tools to increase the specificity and
reduce the volume of results [25].</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>The specificity of a search task is a characteristic of its definiteness. For
example, a task with high specificity is to search for the meaning of a known word
in a dictionary: the search engine user can say precisely when the search
is successful. The low specificity of a task, such as the study of a new theory,
does not make it possible to state with certainty that the search is completed
and does not need to be continued to refine the results. Therefore, having a
criterion for stopping the search is an important way of handling low specificity. Specificity
can also be increased through diversity, the ability of the information
system to discover relevant documents that are significantly different from those
already known to the user. For example, in the case of keyword search, a
high-diversity system should include in the search results relevant documents that do not contain the
words listed in the search query or their synonyms.</p>
      <p>Previously, the authors of this paper proposed the controlled
snowball method [9], in which low specificity is overcome by building and analyzing a citation
network. The purpose of this work is to complement that method with a
criterion for stopping the search, which
reduces the number of documents found while maintaining a sufficient level of
completeness.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Systematic review [14, 28]</title>
        <p>Keyword search [31]
Content filtration method [29]
Systems of collaborative filtering [32]
Neighbor-based recommendations [30]
Graph-based recommendations [20]
Citation network analysis, Ahad et al. [1]
Citation network analysis, Lecy et al. [21]</p>
      </sec>
      <sec id="sec-2-2">
        <title>Stop</title>
        <p>criterion Diversity</p>
        <p>Minimal volume detection
- - - - - - -</p>
        <p>Depends on expert
- - - - + + +
- - - - - - -</p>
        <p>The systematic review method has a criterion for stopping the search as well as
criteria of result completeness. In [28] it is proposed to stop at the moment when the
researcher understands that incorporating new publications does not influence
the conclusions made. Focusing on concepts rather than on publications [14] allows
selecting the most important ones and helps to decide how to group and analyze
the selected publications. The disadvantage of the systematic review method is
its informality and lack of automation: such methods do not offer automatic
search or numerical quality measures.</p>
        <p>Keyword search [31] is supported by well-developed search tools, but it has
been shown that the keyword set is often inaccurate and/or incomplete [28].
To improve the keywords, Petticrew and Gilbody [28] recommend interviewing
researchers working in the chosen field of study; if the interview cannot be conducted,
it is recommended that the researcher [16, 8] carefully examine the documents found
and change the set of keywords based on that knowledge. Another
disadvantage of keyword search is the low variety of search results: the search engine
does not include in the search results relevant documents that do not contain
the keywords specified in the query or their synonyms.</p>
        <p>Recommender systems [6] attempt to overcome the insufficient variety of search
results: content filtration methods, collaborative filtering,
neighbour- and graph-based recommendations.</p>
        <p>Content-based filtering (CBF) systems [29] offer the user documents similar
to those that the user has already viewed, but they have low diversity and ignore
the quality and popularity of documents [12].</p>
        <p>Collaborative filtering (CF) is based on the assumption that the user will find
useful the documents that similar users select [32]. The recommendations obtained
are varied because they are based not on the similarity of documents but on the
similarity of preferences [27]. However, collaborative filtering of scientific
publications is complicated by their large number compared to the number of
readers [35], which does not allow reliable statistical estimates.</p>
        <p>Neighbourhood recommendations include documents that are often found
alongside some specified documents [30]. The advantage of such
recommendations is that they concentrate on relationships instead of similarities. Neighbourhood
recommendations offer related but dissimilar documents and thus come close to
collaborative filtering.</p>
        <p>Graph-based recommendation systems use existing links, or assume their
existence, and build a graph. For example, a citation network is a graph in which document
nodes are connected by directed citation relationships [3]. Depending on the
objects modeled, edges represent citations [3, 20], the relationship
"published in" [3, 20, 39], or authorship [3, 39]. Some authors build graphs by creating
artificial links [39]. To identify the most relevant recommendations,
numerical properties of the nodes are calculated on the constructed graph. Most often,
a random walk starting from one or more
random nodes is used to search for popular objects [20].</p>
        <p>Building a citation network with the snowball method and analyzing it [34,
22, 21, 1] is close to graph-based recommender systems. The essence of the
approach lies in the creation and analysis of a directed graph, the citation network,
where nodes are scientific publications and an edge linking a node A with a
node B means that A references B. The advantage of the approach is that
the references in each publication are carefully selected by its authors. The
disadvantage of a list of references is its incompleteness and systematic bias: due to
restrictions on publication size, authors have to provide only a general and
limited description of the publications most relevant to their research [14]. It was
shown [17] that citation analysis allows creating more complete publication sets
than keyword-based search, makes a formal description possible, and also smooths
out the individual weaknesses of the researcher.</p>
        <p>High search speed is ensured by the presence of hubs [18], the most cited
publications. Their number is small because about 90% of scientific publications
are never cited [24]. Additionally, high search speed and search completeness
are ensured by the "small world" property, a proven property of citation
networks [4]: the average length of a path between any two random
nodes is much less than the size of the whole network. Simulation of P2P networks of a
similar structure shows [26] that in most cases it is enough to perform 2-3
iterations of the controlled snowball [1, 21, 9]: for each publication in the current queue,
all referenced documents that belong to the selected topic are added to
the next-level queue. To select the documents for a given topic, Ahad and
colleagues [1] use the vector document model and the cosine similarity measure, while Lecy et
al. [21] use PageRank from Google Scholar to select important publications. In
the previous work of the authors [9], a probabilistic topic model was used.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Method of Collection Gathering</title>
      <p>The goal of the presented method is to retrieve from all available publications
D the subset B ⊂ D that contains the elements matching the user's information
need, where the information need is an informal and sometimes implicit set of
requirements for the search results [31] and the user is a person who performs
one of the scientific search activities [2, 25]. Following common practice [31], a
publication is considered relevant if, from the user's point of view, it matches
the user's information need.</p>
      <p>The method used in the presented study is based on several assumptions.
Assumption 1. The information need consists of several informal requirements:
1. all the publications in B belong to a given subject area [2];
2. all the publications in B are important in the given subject area [2];
3. the size of B allows detailed study in an acceptable time [2];
4. all the important terms of the subject area are present in B [13].</p>
      <p>Assumption 2. In what follows we assume that the information need is
partially represented by a set of publications, each of which is related to the given
subject area [38].</p>
      <p>Assumption 3. We assume that, due to the low specificity of the information
need [2, 25], the user may know some keywords of the subject area and can
select relevant publications, but is not sufficiently qualified to
evaluate their importance or the completeness of the collected publication set [2].</p>
      <p>Assumption 4. Each publication d ∈ D can be mapped to a set of sentences
S(d), and each sentence s ∈ S(d) to a set of collocations C(s), which is a subset
of the set C of all collocations that can be found in D. Here a collocation c is a word or
a tuple of words, a sentence s is an ordered set of collocations, a document is
an ordered set of sentences, and a publication is a structure consisting of texts,
keywords, metainformation and a list of references. A term is a collocation
labelling a concept in a given subject area.</p>
      <p>Definition 1. [33, 5] The citation mapping is defined over a set of publications D
as</p>
      <sec id="sec-3-1">
        <disp-formula><label>(1)</label><tex-math>\mathrm{REF} : \{v\} \to \{u \in D \mid v \text{ cites } u\}, \quad v \in D.</tex-math></disp-formula>
        <p>By applying the citation mapping to a certain publication v, one obtains the set of
publications referenced by v. By applying the inverse citation mapping</p>
        <disp-formula><label>(2)</label><tex-math>\mathrm{REF}^{-1} : \{u\} \to \{v \in D \mid v \text{ cites } u\}, \quad u \in D,</tex-math></disp-formula>
        <p>to a certain publication u, one obtains the set of publications that reference u.</p>
        <p>The repeated citation mapping REF<sup>k</sup> is defined as the result of applying
REF k times.</p>
        <p>The mapping (1) defines a directed graph, the citation network [33, 5]
          <disp-formula><label>(3)</label><tex-math>N = (D, E)</tex-math></disp-formula>
with edges E = {vu; ∀v ∈ D, u ∈ REF({v})} and nodes d ∈ D.</p>
        <p>Assumption 5. [5] The citation network (3) is almost acyclic:
          <disp-formula><label>(4)</label><tex-math>\left|\{d \in D \mid \exists k \in \mathbb{N},\ d \in \mathrm{REF}^{k}(\{d\})\}\right| \ll \left|\{d \in D \mid \forall k \in \mathbb{N},\ d \notin \mathrm{REF}^{k}(\{d\})\}\right|,</tex-math></disp-formula>
where k ∈ ℕ is a path length in the citation network.</p>
        <p>Assumption 6. [19] A necessary condition for the presence in B of all the
important terms of a given subject area is the terminological saturation of the ordered
set of publications.</p>
        <p>Assumption 7. The full text of a publication is not available for automatic access.
Copyright restrictions often make it difficult to access the full
text of a publication automatically. For example, the search system Scopus requires registration,
several steps involving e-mail, and buying access to a publication,
an operation that may not be automated. That is why in the proposed
information technology the full texts of publications are used only at the very last steps,
when the list of selected publications is minimal.</p>
        <p>The formal hybrid mathematical model of the process of bibliographic detection
and selection is a tuple</p>
        <disp-formula><label>(5)</label><tex-math>M = \langle D, \mathrm{REF}, \mathrm{PTM}, \mathrm{DocDiff}, \theta, \mathrm{Snowball}, B_0, \mathrm{DocListDiff}, \omega, \mathrm{SPC}, \mathrm{MaxRank}, \mathrm{Terms}, \mathrm{Cvalue}, \mathrm{thd} \rangle,</tex-math></disp-formula>
        <p>where
- D: publications available for analysis;
- REF: citation mapping;
- PTM: representation of the content of a publication;
- DocDiff: publication difference measure;
- θ: marginal difference between publications;
- Snowball: snowball iteration mapping;
- B<sub>0</sub>: starting point of the snowball iterations;
- SPC: publication weight in a subject area;
- DocListDiff: closeness measure of ordered sets of publications;
- ω: marginal closeness measure of ordered sets of publications;
- MaxRank: maximal rank of a publication;
- Terms: mapping of D into the set of terms T;
- Cvalue(τ): weight of a term τ;
- thd: difference measure of term sets.</p>
        <p>A subject area is described in the model by a set of seed publications
B<sub>0</sub> (B<sub>0</sub> ⊂ D, |B<sub>0</sub>| ~ O(10)), which at the same time is the starting point of the snowball
iterations. Seed publications should satisfy the following conditions:
- the publication theme is relevant;
- the publication age is 2-14 years;
- the publication is often cited in relevant publications.</p>
        <p>It is important to note that the last item differs from typical recommendations
[34] on how to select the seed publications for a snowball; it provides a better start
for the snowball iterations but requires more effort from the user.</p>
        <p>Document relevance to the subject area is calculated with the help of a
probabilistic topic model of text documents (PTM) [37, 40, 36]. The PTM represents the
content of each publication d ∈ D as conditional probabilities</p>
        <disp-formula><label>(6)</label><tex-math>p(t \mid d) = \mathrm{PTM}(d),</tex-math></disp-formula>
        <p>which give the probability that publication d belongs to a topic t. Each topic t
is defined by the probabilities p(c<sub>i</sub>|t) that collocation c<sub>i</sub> belongs to the topic
t, and by the a priori probability p(t). The presented model uses a modified PTM,
which is based on restoring the distributions p(c<sub>i</sub>|t) and p(t) from the collocation
co-occurrence frequencies</p>
        <disp-formula><label>(7)</label><tex-math>p(c_i, c_k) = \sum_{t} p(c_i \mid t)\, p(t)\, p(c_k \mid t),</tex-math></disp-formula>
        <p>which are calculated by counting the sentences s in which both c<sub>i</sub> and c<sub>k</sub> are found.</p>
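        <p>The counting of collocation co-occurrences can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes that each sentence is already given as a list of collocations, that a pair is counted once per sentence, and that frequencies are normalized by the total number of counted pairs. The function name cooccurrence_frequencies and the toy sentences are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequencies(sentences):
    """Estimate co-occurrence frequencies from sentences given as lists of collocations."""
    pair_counts = Counter()
    for collocations in sentences:
        # Count every unordered pair once per sentence in which both members occur.
        for pair in combinations(sorted(set(collocations)), 2):
            pair_counts[pair] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    # Assumption: normalize by the total number of counted pairs.
    return {pair: count / total for pair, count in pair_counts.items()}


# Toy input: three sentences already split into collocations.
sentences = [
    ["ontology", "citation network"],
    ["ontology", "topic model", "citation network"],
    ["topic model"],
]
frequencies = cooccurrence_frequencies(sentences)
```
]]></preformat>
        <p>Pairs are stored with their members in alphabetical order, so each unordered pair has a single key. Restoring the topic distributions from these frequencies is a separate step that is not shown here.</p>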
        <p>Mapping publications to conditional probabilities allows the application of
statistical measures [7] to calculate the difference DocDiff between
publications.</p>
        <p>In our experiment, we use the Kullback-Leibler divergence, and its threshold θ
is chosen to keep the top 30% of the relevant publications during the first
controlled snowball iterations.</p>
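        <p>The thresholding step can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiment: the difference of a publication from the seed set is taken here as the Kullback-Leibler divergence between the mean topic distribution of the seed publications and the topic distribution of the candidate, and "top 30%" is read as the 30% of candidates with the lowest divergence. The direction of the divergence, the smoothing constant and all function names are illustrative choices.</p>
        <preformat><![CDATA[
```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_distribution(distributions):
    """Component-wise mean of several topic distributions p(t|d)."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


def threshold_for_top_share(divergences, share=0.3):
    """Smallest threshold that keeps the given share of candidates with the lowest divergence."""
    ordered = sorted(divergences)
    keep = max(1, math.ceil(share * len(ordered)))
    return ordered[keep - 1]


# Toy topic distributions over three topics (made-up numbers).
seed_topics = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
candidates = {
    "a": [0.65, 0.25, 0.10],
    "b": [0.10, 0.10, 0.80],
    "c": [0.30, 0.40, 0.30],
    "d": [0.50, 0.30, 0.20],
}
seed_profile = mean_distribution(seed_topics)
doc_diff = {doc: kl_divergence(seed_profile, topics) for doc, topics in candidates.items()}
theta = threshold_for_top_share(list(doc_diff.values()), share=0.3)
# "<=" is used because theta is itself one of the observed divergences.
selected = [doc for doc, diff in doc_diff.items() if diff <= theta]
```
]]></preformat>
        <p>With four candidates and a share of 0.3 the sketch keeps the two candidates closest to the seed profile.</p>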
        <p>The snowball iteration mapping is defined as
          <disp-formula><label>(8)</label><tex-math>B_{i+1} = \mathrm{Snowball}(B_i) = \bigcup_{v \in B_i} \left. \left( \{v\} \cup \mathrm{REF}(\{v\}) \cup \mathrm{REF}^{-1}(\{v\}) \right) \right|_{\mathrm{DocDiff}(v, B_0) &lt; \theta},</tex-math></disp-formula>
where B<sub>i</sub> ⊂ D. Equation (8) differs from others [1, 21] (i) by the use of a topic
model of text documents for calculating the difference between publications and
(ii) by traversing the citation graph both in the direction given by the references
and in the inverse direction.</p>
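        <p>One snowball iteration can be sketched as follows. The sketch assumes that the citation mapping is available as a dictionary from a publication to the publications it cites and that the difference from the seed set is available as a function; it applies the threshold to every candidate of the union, which is one possible reading of the restriction in (8). The names snowball_step, refs, doc_diff and theta are illustrative.</p>
        <preformat><![CDATA[
```python
def snowball_step(current, refs, doc_diff, theta):
    """Expand `current` along citations in both directions and keep close candidates."""
    # Inverse citation mapping: for each publication, the publications that cite it.
    cited_by = {}
    for v, cited in refs.items():
        for u in cited:
            cited_by.setdefault(u, set()).add(v)

    candidates = set()
    for v in current:
        candidates |= {v} | set(refs.get(v, ())) | cited_by.get(v, set())
    # Keep only candidates whose difference from the seed set is below the threshold.
    return {d for d in candidates if doc_diff(d) < theta}


# Toy citation data: "A" cites "B" and "C", "D" cites "A", and so on.
refs = {"A": ["B", "C"], "B": ["C"], "D": ["A"], "E": ["X"]}
diff = {"A": 0.1, "B": 0.2, "C": 0.9, "D": 0.3, "E": 0.1, "X": 0.1}

b1 = snowball_step({"A"}, refs, diff.get, theta=0.5)
b2 = snowball_step(b1, refs, diff.get, theta=0.5)
```
]]></preformat>
        <p>In this toy example the second iteration returns the same set as the first, so the iterations have reached a fixed point. Publication "C" is cited by the seed but rejected by the threshold, and "E" is never reached because it is not connected to the seed.</p>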
        <p>The publication weight in the subject area,</p>
        <disp-formula><label>(9)</label><tex-math>\mathrm{SPC}_i : v \to \mathbb{N}, \quad v \in B_i,</tex-math></disp-formula>
        <p>is defined as the search path count (SPC) measure [23, 5] calculated in the subgraph N<sub>i</sub> of
the citation network N (3), built on the edges E = {vu; ∀v ∈ B<sub>i</sub>, u ∈ B<sub>i</sub> ∩ REF({v})}
and nodes d ∈ B<sub>i</sub> after the transformation of cycles into acyclic fragments using the
preprint transformation [23, 5].</p>
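        <p>A node-level search path count can be sketched as follows. The sketch uses one common formulation, in which the weight of a node is the number of paths from any source to the node multiplied by the number of paths from the node to any sink; whether this matches the exact variant used in the paper is an assumption. It also assumes that cycles have already been removed, so the preprint transformation is not shown. The function name search_path_count is illustrative.</p>
        <preformat><![CDATA[
```python
from functools import lru_cache


def search_path_count(edges):
    """Node SPC on an acyclic graph given as (citing, cited) pairs."""
    nodes, succ, pred = set(), {}, {}
    for v, u in edges:
        nodes.update((v, u))
        succ.setdefault(v, []).append(u)
        pred.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def paths_from_sources(n):
        # A node without incoming edges is a source and starts exactly one path.
        return 1 if n not in pred else sum(paths_from_sources(p) for p in pred[n])

    @lru_cache(maxsize=None)
    def paths_to_sinks(n):
        # A node without outgoing edges is a sink and ends exactly one path.
        return 1 if n not in succ else sum(paths_to_sinks(s) for s in succ[n])

    return {n: paths_from_sources(n) * paths_to_sinks(n) for n in nodes}


# Toy network: "D" cites "A"; "A" cites "B" and "C"; "B" cites "C".
spc = search_path_count([("D", "A"), ("A", "B"), ("A", "C"), ("B", "C")])
ranking = sorted(spc, key=lambda n: (-spc[n], n))
```
]]></preformat>
        <p>Sorting by descending weight gives an ordered list from which ranks can be read. The recursive formulation is adequate for small graphs; a large network would need an iterative traversal in topological order.</p>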
        <p>SPC<sub>i</sub> allows finding the rank Rank<sub>i</sub>(v) of each publication and defining the ordered
publication set we look for:</p>
        <disp-formula><label>(10)</label><tex-math>L_i(\mathrm{MaxRank}) = (v_k)_{k=1}^{|B_i|}, \quad \mathrm{Rank}_i(v_k) &lt; \mathrm{MaxRank}, \quad \mathrm{Rank}_i(v_k) \le \mathrm{Rank}_i(v_{k+1}),</tex-math></disp-formula>
        <p>where the maximal publication rank MaxRank restricts the number of items in the
ordered publication set and is defined by the requirement that the fixed point of
iterations (8) is reached and terminological saturation is achieved.</p>
        <p>Within the framework of the developed model, the degree of closeness of
ordered sets of publications DocListDiff is calculated with the Spearman rank
correlation ρ(L<sub>i</sub>, L<sub>i+1</sub>), and the fixed point of iterations (8) is
          <disp-formula><label>(11)</label><tex-math>\left| \rho(L_i, L_{i+1}) - 1 \right| &lt; \omega, \quad i &gt; i_0,</tex-math></disp-formula>
where ω, the marginal closeness measure of ordered sets of publications (10), is a
parameter setting the level of variability of the ordered publication set.</p>
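        <p>The fixed-point test can be sketched as follows. The sketch computes the Spearman rank correlation with the textbook formula for rankings without ties. Since two consecutive lists may contain different publications, it compares only the publications present in both lists, which is an assumption of this sketch and is not stated in the paper. The names spearman_rho and converged are illustrative.</p>
        <preformat><![CDATA[
```python
def spearman_rho(list_a, list_b):
    """Spearman rank correlation of two orderings over their shared items (no ties)."""
    shared = set(list_a) & set(list_b)
    a = [x for x in list_a if x in shared]
    b = [x for x in list_b if x in shared]
    n = len(a)
    if n < 2:
        raise ValueError("need at least two shared publications")
    position_in_b = {x: rank for rank, x in enumerate(b)}
    d_squared = sum((rank - position_in_b[x]) ** 2 for rank, x in enumerate(a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def converged(list_a, list_b, omega):
    """True when the correlation of consecutive lists is within omega of 1."""
    return abs(spearman_rho(list_a, list_b) - 1) < omega


# Two consecutive ordered lists (toy data): two items swap places, one is new.
previous = ["p1", "p2", "p3", "p4", "p5"]
current = ["p1", "p3", "p2", "p4", "p5", "p6"]
rho = spearman_rho(previous, current)
```
]]></preformat>
        <p>A smaller ω demands more stable rankings before the iterations are considered converged.</p>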
        <p>Terminological saturation of the ordered publication set is defined by the
following condition: adding publications to the end of the list (10) leaves the
term list almost unchanged:</p>
        <disp-formula><label>(12)</label><tex-math>\frac{\mathrm{thd}\left(T_i(\mathrm{MaxRank}),\ T_i(\mathrm{MaxRank} + \delta)\right)}{\varepsilon_i} &lt; 1, \quad \delta &gt; 0.</tex-math></disp-formula>
        <sec id="sec-3-1-1">
          <p>The mapping of publications L<sub>i</sub> into the set of terms T<sub>i</sub>,
            <disp-formula><label>(13)</label><tex-math>T_i = \mathrm{Terms}(L_i),</tex-math></disp-formula>
is performed by applying to the combined text of the publications the procedure of
automatic term extraction proposed by K. Frantzi, S. Ananiadou and H. Mima [15]
and improved by V. Ermolayev et al. [19], which defines the term weight Cvalue<sub>i</sub>(τ)
in a publication set L<sub>i</sub>(MaxRank), the marginal value ε<sub>i</sub> of the term weight and the
measure of term set difference [13]</p>
          <disp-formula><label>(14)</label><tex-math>\mathrm{thd}(T_i, T_j) = \sum_{\tau \in T_i \cap T_j} \left| \mathrm{Cvalue}_i(\tau) - \mathrm{Cvalue}_j(\tau) \right| + \sum_{\tau \in T_i \setminus T_j} \left| \mathrm{Cvalue}_i(\tau) \right|.</tex-math></disp-formula>
        </sec>
      </sec>
      <sec id="sec-3-2">
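        <p>The term set difference can be sketched as follows. The sketch reads (14) as the sum of absolute weight differences over the terms shared by both sets plus the weights of the terms that occur only in the first set; this reading, the dictionary representation of a term set and the toy weights are assumptions of the sketch.</p>
        <preformat><![CDATA[
```python
def thd(cvalue_i, cvalue_j):
    """Difference of two term sets given as dictionaries {term: weight}."""
    shared = cvalue_i.keys() & cvalue_j.keys()
    only_in_i = cvalue_i.keys() - cvalue_j.keys()
    return (
        sum(abs(cvalue_i[t] - cvalue_j[t]) for t in shared)
        + sum(abs(cvalue_i[t]) for t in only_in_i)
    )


# Toy term sets with made-up weights.
terms_small = {"ontology": 5.0, "citation network": 2.0}
terms_large = {"ontology": 5.5, "citation network": 2.0, "topic model": 1.5}

diff_large_small = thd(terms_large, terms_small)
diff_small_large = thd(terms_small, terms_large)
```
]]></preformat>
        <p>Under this reading the measure is not symmetric: a term that appears only in the second argument does not contribute, so the order of the arguments matters.</p>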
        <p>The minimal terminologically saturated publication set is described by
equation (10), where</p>
        <disp-formula><label>(15)</label><tex-math>\mathrm{MaxRank} = \min \left\{ M : \frac{\mathrm{thd}\left(T_i(M),\ T_i(M + \delta)\right)}{\varepsilon_i} &lt; 1 \right\}.</tex-math></disp-formula>
        <p>The overall quality measure of the model (5) is the number of publications |L<sub>i</sub>| in the
final ordered publication set, restricted by (10), (11), (12) and (15).</p>
        <p>Figure 1 shows the general workflow of the controlled snowball
implementation as a UML activity diagram.</p>
        <p>The general workflow was introduced in [10], and details of the restricted
snowball sampling and probabilistic topic model construction are discussed in [11].</p>
        <p>Terminological saturation of the ordered publication
set obtained with the controlled snowball method</p>
        <p>The Spearman rank correlation coefficient mentioned above allows simple
detection of the convergence of controlled snowball iterations [10]; however, it does
not address the completeness of the collected publication set.</p>
        <p>The main idea of the presented experiment is to compare the minimal
terminologically saturated ordered publication sets produced with different search
methods and in different scientific databases, and to answer the following questions:
1. Do all common search methods produce terminologically saturated
ordered publication sets?
2. Which of the common search methods produces the smallest terminologically
saturated ordered publication set?
3. Is the suggested controlled snowball method more effective than selection by
topic?</p>
        <p>In our experiments we concentrated on the existence of terminological
saturation and on the size of the minimal terminologically saturated ordered
document set, which is defined as the minimal value of MaxRank for which (12) becomes
true.</p>
        <p>Starting from the uncertain information need for seminal scientific
publications on the topic "Ontologies (computer science)", four collections were
considered:
1. abstracts of the seminal publications selected from the ONTO-KL citation
network that was gathered from the "Microsoft Academic Search" database
using the controlled snowball method described above, starting from seed
publications on the topic "Ontologies (computer science)";
2. abstracts of the publications indexed by the "Microsoft Academic Search"
service that have the automatically assigned category "ontologies", arranged
in descending order of citation index;
3. abstracts of the publications stored in the "ACM digital library" electronic
library that have the "ontologies" label assigned by the authors, arranged
in descending order of citation index;
4. abstracts of the publications found in Google Scholar by the keyword
"ontologies" and ranked by descending relevance calculated with Google's
internal algorithms.</p>
        <p>The 2nd, 3rd and 4th collections represent common and widespread search
approaches that do not provide formal criteria for stopping the search and
thus can produce huge sets of publications. To enable comparison, we have
extended them with automatic term extraction and with the method of terminological
saturation detection.</p>
        <p>For each of the publications found we searched for the full text in PDF format.
The PDF files were downloaded from different sources: "ACM digital library"
provides full publication texts to registered users; "Microsoft Academic Search"
and "Google Scholar" often provide links to full-text PDF publications that
can be found and saved automatically<sup>1</sup>. The PDF files were also searched for in the
SemanticScholar and ResearchGate databases. Publications for which the full
text was not found were excluded from consideration, and the text of the next
publication was searched for.</p>
        <p>To study the terminological saturation of an ordered document set D we follow the
work of Kosa et al. [19]. First, a finite sequence of texts D<sub>i</sub> (i = 1, 2, ..., 11)
is composed, where each text D<sub>i</sub> contains the concatenated full texts of the
first 20·i documents of D. Then all D<sub>i</sub> are processed with the automatic term
extraction method. The corresponding sets of terms T<sub>i</sub> are compared using thd,
defined by (14). The saturation criterion used is thd(T<sub>i</sub>, T<sub>i+1</sub>)/ε &lt; 1 for the texts D<sub>i</sub>
that contain at least MaxRank documents. Thus we can calculate the minimal MaxRank for any of the collections
of publications used.</p>
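        <p>The search for the minimal MaxRank can be sketched as follows. The sketch assumes that the thd values for consecutive texts have already been computed, that each text adds 20 documents, and that saturation requires the normalized difference to stay below 1 for all later steps as well; the last requirement is an assumption of the sketch. The thd values below are made up for illustration and are not experimental results.</p>
        <preformat><![CDATA[
```python
def minimal_max_rank(thd_values, eps, step=20):
    """Smallest document count after which thd/eps stays below 1, or None.

    thd_values[k] is the difference between the term sets of texts k+1 and k+2,
    where text number i contains the first step*i documents.
    """
    first_saturated = None
    for k, value in enumerate(thd_values):
        if value / eps < 1:
            if first_saturated is None:
                first_saturated = k
        else:
            # A later violation cancels the earlier candidate.
            first_saturated = None
    return None if first_saturated is None else step * (first_saturated + 1)


# Made-up differences: one early dip below the threshold is followed by a violation.
toy_thd = [9.0, 6.0, 4.0, 2.5, 0.8, 1.2, 0.9, 0.7, 0.6, 0.5]
toy_max_rank = minimal_max_rank(toy_thd, eps=1.0)
```
]]></preformat>
        <p>A collection whose differences never fall below the threshold yields None, which corresponds to a collection that does not reach saturation within the documents examined.</p>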
        <p>The obtained values of the minimal MaxRank shown in Table 2 are the
quality measure of the proposed model. Figure 2 shows the dependence of
thd(T<sub>i</sub>, T<sub>i+1</sub>)/ε on the number of publications included in D<sub>i</sub>.</p>
        <p>We can see that terminological saturation is observed for the collection
gathered with the controlled snowball and for the one selected from "Microsoft Academic".
Publications gathered from "ACM digital library" do not provide saturation, and the set
of publications taken from "Google Scholar" may exhibit saturation when
extended.</p>
        <p>Table 2 shows that the controlled snowball method used in the paper leads
to a smaller terminologically saturated publication set than the studied analogues.</p>
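        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Source</th><th>Search method</th><th>MaxRank</th></tr>
            </thead>
            <tbody>
              <tr><td>"Microsoft Academic"</td><td>"Snowball"</td><td>160</td></tr>
              <tr><td>"Microsoft Academic"</td><td>automatic label</td><td>180</td></tr>
              <tr><td>"ACM digital library"</td><td>author label</td><td>&gt; 200</td></tr>
              <tr><td>"Google Scholar"</td><td>keyword</td><td>220</td></tr>
            </tbody>
          </table>
        </table-wrap>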
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The presented study introduces a formal criterion for stopping the search and
for search result completeness to overcome the common issue of scientific
information retrieval when the information need has low certainty. The suggested formal
criterion is based on automatic term extraction and terminological saturation
detection.</p>
      <p>1 Software library Puppeteer for NodeJS, https://github.com/GoogleChrome/puppeteer</p>
      <p>Figure 2. Dependence of thd/ε on the number of publications, with the threshold marked. Panels: a. Controlled Snowball; b. ACM Digital Library; c. Google Scholar; d. Microsoft Academic.</p>
      <p>In our experiment we have extended the following search approaches with
terminological saturation detection: the controlled snowball method, search by
automatically assigned topic, keyword search, and search by author keywords.</p>
      <p>The objectives of the experiment were to check the existence of terminological
saturation and to find the size of the minimal terminologically saturated ordered document
set for each of the search approaches.</p>
      <p>Starting from the uncertain information need for seminal scientific
publications on the topic "Ontologies (computer science)", four collections were
considered:
1. publications gathered from the "Microsoft Academic Search" database using
the controlled snowball method suggested by the authors;
2. publications indexed by the "Microsoft Academic Search" service that have
the automatically assigned category "ontologies", arranged in descending
order of citation index;
3. publications stored in the "ACM digital library" electronic library that have
the "ontologies" label assigned by the authors, arranged in descending
order of citation index;
4. publications found in Google Scholar by the keyword "ontologies" and
ranked by descending relevance calculated with Google's internal algorithms.</p>
      <p>The experiment has shown that terminological saturation of the
ordered publication set created from "Microsoft Academic" with the controlled
snowball method is achieved at 160 publications, about 11% fewer than for the ordered
publication set created from "Microsoft Academic" with the automatically assigned
category "ontology" (180 publications). Sets consisting of 200 publications with
the keyword "ontology" from "Google Scholar" and with the label "ontology" from
"ACM digital library" do not possess terminological saturation.</p>
      <p>So we can conclude that both the controlled snowball method and topic
search in "Microsoft Academic" produce small terminologically saturated
publication sets of almost equal size. However, this conclusion must be supported
by searches on other topics. Also, in future studies, term-based precision
and recall should be calculated, which in turn requires the creation of a dataset
of terms evaluated by experts.</p>
      <p>19. Kosa, V., Chaves-Fraga, D., Dobrovolskyi, H., Ermolayev, V.: Optimized term extraction method based on computing merged partial c-values. In: Ermolayev, V., Mallet, F., Yakovyna, V., Mayr, H., Spivakovsky, A. (eds.) Information and Communication Technologies in Education, Research, and Industrial Applications. ICTERI 2019. Communications in Computer and Information Science, vol. 1175, pp. 24–49. Springer Berlin Heidelberg (2020). https://doi.org/10.1007/978-3-030-39459-2_2, https://link.springer.com/chapter/10.1007/978-3-030-39459-2_2
20. Lao, N., Cohen, W.W.: Relational retrieval using a combination of path-constrained random walks. Machine Learning 81(1), 53–67 (2010). https://doi.org/10.1007/s10994-010-5205-8
21. Lecy, J.D., Beatty, K.E.: Representative literature reviews using constrained snowball sampling and citation network analysis. Available at SSRN 1992601 (2012). https://doi.org/10.2139/ssrn.1992601
22. Liu, J.S., Lu, L.Y., Lu, W.M., Lin, B.J.: Data envelopment analysis 1978–2010: A citation-based literature survey. Omega 41(1), 3–15 (2013). https://doi.org/10.1016/j.omega.2010.12.006
23. Lucio-Arias, D., Leydesdorff, L.: Main-path analysis and path-dependent transitions in HistCite-based historiograms. Journal of the American Society for Information Science and Technology 59(12), 1948–1962 (2008)
24. Meho, L.I.: The rise and rise of citation analysis. Physics World 20(1), 32 (2007).</p>
      <p>
        https://doi.org/10.1088/2058-7058/20/1/33
25. Nedumov, Y., Kuznetsov, S.: Exploratory search for scienti c
articles. Programming and Computer Software 45(
        <xref ref-type="bibr" rid="ref7">7</xref>
        ), 405{416 (2019).
https://doi.org/10.15514/ISPRAS-2018-30(
        <xref ref-type="bibr" rid="ref6">6</xref>
        )-10
26. Nicolini, A.L., Lorenzetti, C.M., Maguitman, A.G., Chesñevar, C.I.:
Intelligent algorithms for improving communication patterns in thematic
p2p search. Information Processing &amp; Management 53(2), 388–404 (2017).
https://doi.org/10.1016/j.ipm.2016.12.001
27. Palopoli, L., Rosaci, D., Sarné, G.M.: A multi-tiered recommender system
architecture for supporting e-commerce. In: Intelligent Distributed Computing VI, pp.
71–81. Springer (2013). https://doi.org/10.1007/978-3-642-32524-3_10
28. Petticrew, M., Gilbody, S.: Planning and conducting systematic reviews. Health
psychology in practice pp. 150–179 (2004)
29. Ricci, F., Rokach, L., Shapira, B.: Introduction to recommender systems
handbook. In: Recommender systems handbook, pp. 1–35. Springer (2011).
https://doi.org/10.1007/978-0-387-85820-3_1
30. Rodriguez-Prieto, O., Araujo, L., Martinez-Romo, J.: Discovering related scientific
literature beyond semantic similarity: a new co-citation approach. Scientometrics
120(1), 105–127 (2019). https://doi.org/10.1007/s11192-019-03125-9
31. Schutze, H., Manning, C.D., Raghavan, P.: Introduction to information retrieval,
vol. 39. Cambridge University Press (2008)
32. Shi, Y., Larson, M., Hanjalic, A.: Collaborative filtering beyond the user-item
matrix: A survey of the state of the art and future challenges. ACM Computing
Surveys (CSUR) 47(1), 3 (2014). https://doi.org/10.1145/2556270
33. de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515
(1965)
34. Varela, A.R., Pratt, M., Harris, J., Lecy, J., Salvo, D., Brownson, R.C., Hallal,
P.C.: Mapping the historical development of physical activity and health research:
A structured literature review and citation network analysis. Preventive medicine
111, 466–472 (2018). https://doi.org/10.1016/j.ypmed.2017.10.020
35. Vellino, A.: Usage-based vs. citation-based methods for recommending scholarly
research articles. arXiv preprint arXiv:1303.7149 (2013)
36. Vorontsov, K., Potapenko, A.: Tutorial on probabilistic topic modeling: Additive
regularization for stochastic matrix factorization. In: International Conference on
Analysis of Images, Social Networks and Texts. pp. 29–46. Springer (2014).
https://doi.org/10.1007/978-3-319-12580-0_3
37. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In:
Proceedings of the 22nd international conference on World Wide Web. pp. 1445–1456.
ACM (2013). https://doi.org/10.1145/2488388.2488514
38. Zarrinkalam, F., Kahani, M.: Semcir: A citation recommendation system
based on a novel semantic distance measure. Program 47(1), 92–112 (2013).
https://doi.org/10.1108/00330331311296320
39. Zhou, M., Zhao, S.: Learning question paraphrases from log data (Feb 14 2008),
US Patent App. 11/500,224
40. Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution
for short and imbalanced texts. Knowledge and Information Systems 48(2), 379–398
(2016). https://doi.org/10.1007/s10115-015-0882-z
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ahad</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fayaz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>A.S.:</given-names>
          </string-name>
          <article-title>Navigation through citation network based on content similarity using cosine similarity algorithm</article-title>
          .
          <source>Int. J. Database Theory Appl</source>
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <volume>9</volume>
          –
          <fpage>20</fpage>
          (
          <year>2016</year>
          ). https://doi.org/10.14257/ijdta.2016.9.5.02
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Athukorala</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoggan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehtiö</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruotsalo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacucci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Information-seeking behaviors of computer scientists: Challenges for electronic literature search tools</article-title>
          .
          <source>In: Proceedings of the 76th ASIS&amp;T Annual Meeting: Beyond the Cloud: Rethinking Information Boundaries</source>
          . p.
          <fpage>20</fpage>
          . American Society for Information Science (
          <year>2013</year>
          ). https://doi.org/10.1002/meet.14505001041
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirylenka</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parra</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Understanding and supporting search for scholarly knowledge</article-title>
          .
          <source>Proceeding of the 7th European Computer Science Summit</source>
          pp.
          <volume>1</volume>
          –
          <issue>8</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Barabasi</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Scale-free networks: a decade and beyond</article-title>
          .
          <source>Science</source>
          <volume>325</volume>
          (
          <issue>5939</issue>
          ),
          <volume>412</volume>
          –
          <fpage>413</fpage>
          (
          <year>2009</year>
          ). https://doi.org/10.1126/science.1173299
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Batagelj</surname>
          </string-name>
          , V.:
          <article-title>Efficient algorithms for citation network analysis</article-title>
          .
          <source>arXiv preprint cs/0309023</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Beel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gipp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breitinger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Paper recommender systems: a literature survey</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <volume>305</volume>
          –
          <fpage>338</fpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1007/s00799-015-0156-0
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tappert</surname>
          </string-name>
          , C.C.:
          <article-title>A survey of binary similarity and distance measures</article-title>
          .
          <source>Journal of Systemics, Cybernetics and Informatics</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ),
          <volume>43</volume>
          –
          <fpage>48</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Colicchia</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strozzi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Supply chain risk management: a new methodology for a systematic literature review</article-title>
          .
          <source>Supply Chain Management: An International Journal</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <volume>403</volume>
          –
          <fpage>418</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Dobrovolskyi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keberle</surname>
          </string-name>
          , N.:
          <article-title>Collecting the seminal scientific abstracts with topic modelling, snowball sampling and citation analysis</article-title>
          .
          <source>In: Proceedings of the 14th International Conference on ICT in Education, Research and Industrial Applications. Integration, Harmonization and Knowledge Transfer</source>
          . vol.
          <volume>1</volume>
          , pp.
          <volume>179</volume>
          –
          <fpage>192</fpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dobrovolskyi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keberle</surname>
          </string-name>
          , N.:
          <article-title>On convergence of controlled snowball sampling for scientific abstracts collection</article-title>
          .
          <source>In: International Conference on Information and Communication Technologies in Education, Research, and Industrial Applications</source>
          . vol.
          <volume>1007</volume>
          , pp.
          <volume>18</volume>
          –
          <fpage>42</fpage>
          . Springer (
          <year>2018</year>
          ). https://doi.org/10.1007/978-3-030-13929-2_2
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Dobrovolskyi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keberle</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Todoriko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Probabilistic topic modelling for controlled snowball sampling in citation network collection</article-title>
          .
          <source>In: International Conference on Knowledge Engineering and the Semantic Web</source>
          . pp.
          <volume>85</volume>
          –
          <fpage>100</fpage>
          . Springer (
          <year>2017</year>
          ). https://doi.org/10.1007/978-3-319-69548-8_7
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tokarchuk</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
          </string-name>
          , A.:
          <article-title>Digging friendship: paper recommendation in social network</article-title>
          .
          <source>In: Proceedings of Networking &amp; Electronic Commerce Research Conference (NAEC 2009)</source>
          . pp.
          <volume>21</volume>
          –
          <issue>28</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batsakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keberle</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatarintseva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antoniou</surname>
          </string-name>
          , G.:
          <article-title>Ontologies of time: Review and trends</article-title>
          .
          <source>International Journal of Computer Science &amp; Applications</source>
          <volume>11</volume>
          (
          <issue>3</issue>
          ) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Fisch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Block</surname>
          </string-name>
          , J.:
          <article-title>Six tips for your (systematic) literature review in business and management research</article-title>
          .
          <source>Management Review Quarterly</source>
          <volume>68</volume>
          (
          <issue>2</issue>
          ),
          <volume>103</volume>
          –106 (Apr
          <year>2018</year>
          ). https://doi.org/10.1007/s11301-018-0142-x
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Frantzi</surname>
            ,
            <given-names>K.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The c-value/nc-value domain-independent method for multi-word term extraction</article-title>
          .
          <source>Journal of Natural Language Processing</source>
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <volume>145</volume>
          –
          <fpage>179</fpage>
          (
          <year>1999</year>
          ). https://doi.org/10.5715/jnlp.6.3_145
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Friday</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ryan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sridharan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collins</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Collaborative risk management: a systematic literature review</article-title>
          .
          <source>International Journal of Physical Distribution &amp; Logistics Management</source>
          <volume>48</volume>
          (
          <issue>3</issue>
          ),
          <volume>231</volume>
          –
          <fpage>253</fpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1108/IJPDLM-01-2017-0035
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. Garfield, E.:
          <article-title>From computational linguistics to algorithmic historiography</article-title>
          . In: Symposium in Honor of Casimir Borkowski at the University of Pittsburgh School of Information Sciences (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beatty</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lecy</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyr</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapiro</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          :
          <article-title>Mapping the multidisciplinary field of public health services and systems research</article-title>
          .
          <source>American journal of preventive medicine 41(1)</source>
          ,
          <volume>105</volume>
          –
          <fpage>111</fpage>
          (
          <year>2011</year>
          ). https://doi.org/10.1016/j.amepre.2011.03.015
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>