=Paper= {{Paper |id=Vol-2740/20200087 |storemode=property |title=Obtaining the Minimal Terminologically Saturated Document Set with Controlled Snowball Sampling |pdfUrl=https://ceur-ws.org/Vol-2740/20200087.pdf |volume=Vol-2740 |authors=Hennadii Dobrovolskyi,Natalya Keberle |dblpUrl=https://dblp.org/rec/conf/icteri/DobrovolskyiK20 }} ==Obtaining the Minimal Terminologically Saturated Document Set with Controlled Snowball Sampling== https://ceur-ws.org/Vol-2740/20200087.pdf
                          Obtaining the Minimal Terminologically
                          Saturated Document Set with Controlled
                                    Snowball Sampling

                               Hennadii Dobrovolskyi[0000−0001−5742−104X] , and Nataliya
                                            Keberle[0000−0001−7398−3464]

                   Zaporizhzhya National University, Zaporizhzhya, Zhukovskogo st. 66, 69600 Ukraine,
                                      gen.dobr@gmail.com, nkeberle@gmail.com



                          Abstract. Collecting scientific papers to write a Related Work section,
                          keeping expertise in a topic of interest up to date, or studying a new
                          scientific direction is an ill-defined information need that does not
                          allow certainty about the completeness of search results. The controlled
                          snowball method suggested by the authors in previous papers is extended
                          with an objective criterion of result completeness that allows stopping
                          the search. The criterion is based on the assumption that a complete
                          document set contains all terms describing the topic of interest, so
                          appending a new document to a complete collection does not extend the
                          list of terms. In the experiments, we compare our method of gathering
                          scientific papers on the topic “Ontologies (computer science)” with
                          three other common approaches: search by automatically detected topic in
                          the “Microsoft Academic” database, a keyword search in the Google
                          Scholar database, and querying the ACM Digital Library with author
                          keywords. For each of the collected sets, automatic term extraction
                          was performed, and the size of the minimal saturated ordered document
                          set was found. It is shown that terminological saturation is observed
                          for the sets collected with the controlled snowball method and with
                          topic search in the “Microsoft Academic” database. Moreover, the
                          proposed controlled snowball provides a 10% smaller document set.

                          Keywords: terminological saturation, minimal saturated document set,
                          citation network, controlled snowball sampling.


                  1     Introduction
                  Research into the search behaviour of scientists [2] shows that, in addition
                  to the related work review, typical tasks are the study of new trends, the
                  support of awareness, and the search for reviewers and/or colleagues for joint
                  scientific projects. All the aforementioned tasks are characterized by low
                  specificity, a high volume of results and, consequently, long search times. For
                  example, a scientific search for the task of studying a new theory can last
                  months or even years. Analysis of modern search engines shows the lack of tools
                  to increase specificity and reduce the volume of results [25].




Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    The specificity of the search task is a characteristic of its definiteness. For
example, a task with high specificity is to search for the meaning of a known word
in the dictionary, and a search engine user can accurately say that the search
is successful. The low specificity of the task, such as the study of a new theory,
does not make it possible to state with certainty that the search is completed
and does not need to be continued to refine the results. Therefore, having a stop
search criterion is an important way of handling the low specificity. Increasing
specificity can also be achieved through diversity – the ability of the information
system to discover relevant documents that are significantly different from those
already known to the user. For example, in the case of keyword searches, a
high-diversity system should include relevant documents that do not contain the
words listed in the search query or their synonyms in the search results.
    Previously, the authors of this paper proposed the controlled snowball
method [9], in which low specificity is overcome by building and analyzing a
citation network. The purpose of this work is to complement the method
developed by the authors in previous works with a criterion for stopping the
search, which reduces the number of documents found while maintaining a
sufficient level of completeness.


2    Related Work


Table 1. The suitability of detection and selection methods to solve scientists’ search
problems.

    Method                                       Stop criterion  Diversity  Minimal volume detection
    Systematic review [14, 28]                            Depends on expert
    Keyword search [31]                               –           –          –
    Content filtration method [29]                    –           –          –
    Systems of collaborative filtering [32]           –           –          –
    Neighbor-based recommendations [30]               –           –          –
    Graph-based recommendations [20]                  –           +          –
    Citation network analysis, Ahad et al. [1]        –           +          –
    Citation network analysis, Lecy et al. [21]       –           +          –


    The method of systematic review has a search stop criterion as well as a
result completeness criterion. In [28] it is proposed to stop at the moment when
a researcher understands that incorporation of new publications does not
influence the conclusions made. The focus on concepts rather than on
publications [14] allows selection of the most important ones and helps to
decide how to group and analyze the selected publications. The disadvantage of
the systematic review method is its informality and lack of automation – such
methods do not offer automatic search or numerical quality measures.
     Keyword search [31] is supported by well-developed search tools, but it has
been shown that the keyword set is often inaccurate and/or incomplete [28].
To improve the keywords, Petticrew and Gilbody [28] recommend interviewing
researchers working in the chosen field of study; if an interview cannot be
conducted, it is recommended that the researcher [16, 8] examine the documents
found carefully and revise the keyword set based on that knowledge. Another
disadvantage of keyword search is the low diversity of search results: the
search engine does not include relevant documents that lack the keywords, or
their synonyms, specified in the query.
     Recommender systems [6] attempt to overcome the insufficient diversity of
search results: content filtration methods, collaborative filtering, and
neighbour- and graph-based recommendations.
     Content-based filtering (CBF) systems [29] offer the user documents similar
to those the user has already viewed, but they have low diversity and ignore
the quality and popularity of documents [12].
     Collaborative filtering (CF) is based on the assumption that the user will
find useful the documents that similar users select [32]. The recommendations
obtained are varied because they are based not on the similarity of the
documents but on the similarity of the preferences [27]. However, the
collaborative filtering of scientific publications is complicated by their
large number compared to the number of readers [35], which does not allow
reliable statistical estimates.
     Neighbourhood recommendations include documents that are often found
alongside some specified documents [30]. The advantage of such recommendations
is the concentration on relationships instead of similarities. Neighbourhood
recommendations offer related but not necessarily similar documents and thus
approach collaborative filtering.
     Graph-based recommendation systems use existing links, or assume their
existence and build them. For example, a citation network is a graph in which
document nodes are connected by directed citation relationships [3]. Depending
on the modeled objects, edges are treated as citations [3, 20], the «published
in» relationship [3, 20, 39], or authorship [3, 39]. Some authors build graphs
by creating artificial links [39]. To identify the most relevant
recommendations, numerical properties of the nodes are calculated on the
constructed graph. Most often, a random walk starting from one or more random
nodes is used to find popular objects [20].
     Building a citation network with a snowball method and analyzing it [34,
22, 21, 1] is close to graph-based recommender systems. The essence of the
approach lies in the creation and analysis of a directed graph – a citation
network – where nodes are scientific publications, and an edge linking a node A
to a node B means that A references B. The advantage of the approach is that
the references in each publication are carefully selected by the authors. The
disadvantage of a reference list is its incompleteness and systematic bias: due
to restrictions on publication size, authors provide only a general and limited
description of the publications most relevant to their research [14]. It was
shown [17] that citation analysis allows creating more complete publication
sets than keyword-based search, makes formal description possible, and also
smooths out the individual weaknesses of the researcher.
     High search speed is ensured by the presence of hubs [18] – the most cited
publications. Their number is small because about 90% of scientific
publications are never cited [24]. Additionally, high search speed and search
completeness are ensured by the “small world” property, a proven property of
citation networks [4]: the average length of a path between two random nodes
is much smaller than the size of the whole network. Simulation of P2P networks
with a similar structure shows [26] that in most cases it is enough to perform
2-3 iterations of the controlled snowball [1, 21, 9]: for each publication in
the current queue, all referenced documents belonging to the selected topic are
added to the next-level queue. To select documents for a given topic, Ahad and
colleagues [1] use a vector document model with the cosine similarity measure,
while Lecy et al. [21] use PageRank from Google Scholar to select important
publications. In the authors' previous work [9], a probabilistic topic model
was used.


3   The Method of Collection Gathering
The goal of the presented method is to retrieve from all available publications
D the subset B ⊆ D that contains elements matching the user's information
need, where the information need is an informal and sometimes implicit set of
requirements on search results [31], and the user is a person performing one of
the scientific search activities [2, 25]. Following common practice [31], a
publication is considered relevant if, from the user's point of view, it
matches the user's information need.
   The method used in the presented study is based on several assumptions.
Assumption 1. Information need consists of several informal requirements:
 1. all publications from B belong to a given subject area [2];
 2. all publications from B are important in the given subject area [2];
 3. the size of B allows detailed study in an acceptable time [2];
 4. B contains all the important terms of the subject area [13].
    Assumption 2. In what follows we assume that an information need is par-
tially represented with a set of publications each of which is related to the given
subject area [38].
    Assumption 3. Below we assume that due to low specificity of the information
need [2, 25] the user may know some keywords from the subject area and can
select the relevant publications, but does not have sufficient qualification to
evaluate their importance and completeness of the collected publication set [2].
    Assumption 4. Each publication d ∈ D can be mapped to a set of sentences
S(d), and each sentence s ∈ S(d) – to a set of collocations C(s), which is a
subset of all collocations C that can be found in D; here a collocation c is a
word or a tuple of words, a sentence s is an ordered set of collocations, a
document is an ordered set of sentences, and a publication is a structure
consisting of texts, keywords, metainformation and a reference list. A term τ
is a collocation labelling a concept in the given subject area.
   Definition 1. [33, 5] Citation mapping is defined over a set of publications D
as
                     REF : {v} → {u ∈ D | v cites u}, v ∈ D.                  (1)
By applying citation mapping to a certain publication u, one can obtain a set of
referenced publications. By applying inverse citation mapping

                   REF −1 : {u} → {v ∈ D | v cites u}, u ∈ D,                  (2)

to a certain publication u, one can obtain a set of publications referencing to u.
    Repeated citation mapping REF k is defined as a result of multiple application
of REF .
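As an illustration, the mappings (1) and (2) can be sketched over publication identifiers as plain Python dictionaries (a toy sketch; the identifiers and citation edges are invented for the example):

```python
from collections import defaultdict

# Toy publication set D and citation mapping REF: v -> publications that v cites
# (identifiers and edges are hypothetical).
D = {"a", "b", "c", "d"}
REF = {"a": {"b", "c"}, "b": {"c"}, "c": set(), "d": {"a"}}

def inverse_ref(ref):
    """Build the inverse citation mapping REF^-1: u -> publications citing u."""
    inv = defaultdict(set)
    for v, cited in ref.items():
        for u in cited:
            inv[u].add(v)
    return dict(inv)

def ref_k(ref, d, k):
    """Repeated citation mapping REF^k applied to a single publication d."""
    frontier = {d}
    for _ in range(k):
        frontier = set().union(*(ref.get(v, set()) for v in frontier))
    return frontier
```

Here `inverse_ref(REF)["c"]` yields the set of publications citing c, i.e. {"a", "b"}.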
    The mapping (1) defines a directed graph – citation network [33, 5]

                                   N = (D, E)                                  (3)

with edges E = {vu, ∀v ∈ D, u ∈ REF ({v})} and nodes d ∈ D.
    Assumption 5. [5] Citation network (3) is almost acyclic:

        |{d ∈ D | ∃k ∈ N : d ∈ REF^k({d})}| ≪ |{d ∈ D | ∀k ∈ N : d ∉ REF^k({d})}|,   (4)

where k ∈ N is a path length in the citation network.
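The near-acyclicity assumption can be tested on a concrete citation dictionary by checking, for each publication, whether it can reach itself along citation edges (a toy sketch with an invented network; an efficient implementation would use Tarjan's strongly connected components instead):

```python
def on_cycle(ref, d):
    """True if d ∈ REF^k({d}) for some k, i.e. d lies on a citation cycle."""
    frontier = {d}
    for _ in range(len(ref)):
        frontier = set().union(*(ref.get(v, set()) for v in frontier))
        if d in frontier:
            return True
        if not frontier:
            return False
    return False

# An "almost acyclic" toy network: one small cycle (d <-> e) among acyclic nodes.
ref = {"a": {"b"}, "b": {"c"}, "c": set(), "d": {"e"}, "e": {"d"}}
cyclic = {v for v in ref if on_cycle(ref, v)}
```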
    Assumption 6. [19] A necessary condition for the presence in B of all the
important terms of a given subject area is terminological saturation of the
ordered set of publications.
    Assumption 7. The full text of a publication is not available for automatic
access. Copyright restrictions often make it difficult to access the full text
of a publication automatically. For example, the Scopus search system requires
a registration taking several steps with the usage of e-mail, and buying access
to a publication is an operation that cannot be automated. That is why, in the
proposed information technology, full texts of publications are used only at
the very last steps, when the list of selected publications is minimal.
    The formal hybrid mathematical model of the process of bibliographic
detection and selection is the tuple

  M = ⟨D, REF, PTM, DocDiff, δ, Snowball, B0 , DocListDiff, ω,
                                        SPC, MaxRank, Terms, Cvalue, thd⟩,     (5)

where:

 – D – publications available for analysis;
 – REF – citation mapping;
 – PTM – representation of the content of a publication;
 – DocDiff – publication difference measure;
 – δ – marginal difference between publications;
 – Snowball – snowball iteration mapping;
 – B0 – snowball iteration starting point;
 – SPC – publication weight in a subject area;
 – DocListDiff – closeness measure of ordered sets of publications;
 – ω – marginal closeness measure of ordered sets of publications;
 – MaxRank – maximal rank of a publication;
 – Terms – mapping of D into the set of terms T;
 – Cvalue(τ) – weight of term τ;
 – thd – difference measure of term sets.
    A subject area description in the model is given by a set of seed
publications B0 (B0 ⊆ D, |B0 | ∼ O(10)), which at the same time is the starting
point of the snowball iterations. Seed publications should satisfy the
following conditions:
 – the publication theme is relevant;
 – the publication age is 2-14 years;
 – the publication is often cited in relevant publications.
It is important to note that the last item differs from typical recommendations
[34] on how to select seed publications for a snowball; it provides a better
start for the snowball iterations but requires more effort from the user.
    Document relevance to the subject area is calculated with the help of a
probabilistic topic model of text documents (PTM) [37, 40, 36]. PTM presents
the content of each publication d ∈ D as conditional probabilities

                                   p(t|d) = PTM (d) ,                              (6)

showing the probability of publication d belonging to topic t. Each topic t is
defined by the probabilities p(τi |t) of collocation τi belonging to topic t,
and the a-priori probability p(t). In the presented model a modified PTM is
used, which is based on restoring the distributions p(τi |t) and p(t) from the
collocation co-occurrence frequencies

                          p(τi , τk ) = Σt p(τi |t) p(t) p(τk |t),                 (7)

which are calculated by counting the sentences s where both τi and τk are found.
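The co-occurrence frequencies behind (7) can be estimated by counting the sentences that contain both collocations (a simplified sketch; the sentences, already mapped to collocation lists, are invented for the example):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count, for each unordered collocation pair, the sentences containing both."""
    pair_counts = Counter()
    for s in sentences:
        for ti, tk in combinations(sorted(set(s)), 2):
            pair_counts[(ti, tk)] += 1
    return pair_counts

# Hypothetical sentences already reduced to their collocations C(s).
sentences = [["ontology", "knowledge base"], ["ontology", "reasoning"],
             ["ontology", "knowledge base", "reasoning"]]
counts = cooccurrence_counts(sentences)
total = sum(counts.values())
p = {pair: c / total for pair, c in counts.items()}  # empirical p(tau_i, tau_k)
```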
    Mapping publications to conditional probabilities allows the application of
the statistical measures [7] to calculate the difference DocDiff between publi-
cations.
    In our experiment, we use the Kullback-Leibler divergence and a threshold δ
chosen to keep the top 30% of the relevant publications during the first
controlled snowball iterations.
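DocDiff over the topic distributions (6) can be sketched as a Kullback-Leibler divergence between two publications' topic vectors (the distributions and the threshold value here are illustrative; the model's actual δ is tuned as described above):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for topic distributions given as aligned probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Keep a candidate publication if its topic distribution is close to the seed's.
delta = 0.5  # illustrative threshold
p_doc, p_seed = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
keep = kl_divergence(p_doc, p_seed) < delta
```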
    Snowball iteration mapping is defined as:

  Bi+1 = Snowball(Bi ) = ∪v∈Bi { u ∈ {v} ∪ REF({v}) ∪ REF⁻¹({v}) | DocDiff(u, B0 ) < δ },   (8)

where Bi ⊆ D. Equation (8) differs from others [1, 21] (i) by the use of a
topic model of text documents for calculating the difference between
publications and (ii) by traversing the citation graph both in the direction
provided by the references and in the inverse direction.
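A single iteration of (8) can be sketched as a set expansion filtered by the topic-model difference (all identifiers and the toy `diff` function are hypothetical; a real run queries a bibliographic database for references and citations):

```python
def snowball_step(B_i, ref, ref_inv, doc_diff, B0, delta):
    """One iteration of the controlled snowball: expand along references in
    both directions, keeping publications close enough to the seed set B0."""
    candidates = set()
    for v in B_i:
        candidates |= {v} | ref.get(v, set()) | ref_inv.get(v, set())
    return {u for u in candidates if doc_diff(u, B0) < delta}

# Toy data: "s" cites "a" and "x"; "b" cites "s"; only s, a, b are on-topic.
ref = {"s": {"a", "x"}, "a": set(), "x": set()}
ref_inv = {"s": {"b"}, "a": {"s"}, "x": {"s"}}
relevant = {"s", "a", "b"}
diff = lambda u, B0: 0.0 if u in relevant else 1.0
B1 = snowball_step({"s"}, ref, ref_inv, diff, {"s"}, 0.5)
```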
     Publication weight in the subject area
                                   SPCi : v → N, v ∈ Bi ,                                    (9)
is defined as the search path count (SPC) measure [23, 5] calculated on the
subgraph Ni ⊆ N of the citation network (3), built on the edges
E = {vu, ∀v ∈ Bi , u ∈ Bi ∩ REF({v})} and nodes d ∈ Bi after transformation of
cycles into acyclic fragments using the preprint transformation [23, 5].
    SPCi allows finding the rank Ranki (v) of each publication and defining the
ordered publication set we look for:

        Li (MaxRank) = (vk )k=1..|Bi | ,  Ranki (vk ) < MaxRank,  Ranki (vk ) ≤ Ranki (vk+1 ),   (10)

where the maximal publication rank MaxRank restricts the number of items in the
ordered publication set and is defined by the requirements of reaching a fixed
point of iterations (8) and of terminological saturation.
    Within the framework of the developed model, the closeness of ordered sets
of publications DocListDiff is calculated with the Spearman rank correlation
ρ(Li , Li+1 ), and the fixed point of iterations (8) is

                                 |ρ(Li , Li+1 ) − 1| < ω,  i > i0 ,                          (11)

where ω, the marginal closeness measure of ordered sets of publications, is a
parameter setting the level of variability of the ordered publication set (10).
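The fixed-point test (11) can be sketched with a hand-rolled Spearman rank correlation over two successive orderings (a minimal sketch; ties are not handled, and the orderings are illustrative):

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of (mostly) the same items."""
    set_a, set_b = set(order_a), set(order_b)
    common_a = [x for x in order_a if x in set_b]
    common_b = [x for x in order_b if x in set_a]
    n = len(common_a)
    rank_a = {x: i for i, x in enumerate(common_a)}
    rank_b = {x: i for i, x in enumerate(common_b)}
    d2 = sum((rank_a[x] - rank_b[x]) ** 2 for x in common_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

def converged(L_i, L_next, omega):
    """Fixed point of the snowball iterations per (11)."""
    return abs(spearman_rho(L_i, L_next) - 1) < omega

stopped = converged(["a", "b", "c", "d"], ["a", "b", "d", "c"], omega=0.3)
```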
    Terminological saturation of the ordered publication set is defined by the
following condition: adding ∆ publications to the end of the list (10) leaves
the term list almost unchanged:

               thd(Ti (MaxRank), Ti (MaxRank + ∆)) / εi < 1,  ∆ > 0.             (12)
Mapping of publications Li into the set of terms Ti
                                        Ti = Terms(Li )                                     (13)
is performed by applying to the combined text of the publications the automatic
term extraction procedure proposed by K. Frantzi, S. Ananiadou and H. Mima [15]
and improved by V. Ermolayev et al. [19], which defines the term weight
Cvaluei (τ ) in a publication set Li (MaxRank), the marginal value εi of the
term weight, and the measure of term set difference [13].
   thd(Ti , Tj ) = Σ τ ∈Ti ∩Tj |Cvaluei (τ ) − Cvaluej (τ )| + Σ τ ∈Ti −Tj |Cvaluei (τ )|   (14)
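Measure (14) can be computed directly from two term-weight dictionaries (a minimal sketch; the terms and Cvalue weights are invented):

```python
def thd(cvalue_i, cvalue_j):
    """Terminological difference (14) between term sets T_i and T_j,
    each given as a {term: Cvalue} dictionary."""
    shared = cvalue_i.keys() & cvalue_j.keys()
    only_i = cvalue_i.keys() - cvalue_j.keys()
    return (sum(abs(cvalue_i[t] - cvalue_j[t]) for t in shared)
            + sum(abs(cvalue_i[t]) for t in only_i))

Ti = {"ontology": 3.0, "knowledge base": 2.0, "reasoner": 1.0}
Tj = {"ontology": 2.5, "knowledge base": 2.0}
d = thd(Ti, Tj)  # |3.0 - 2.5| + |2.0 - 2.0| + |1.0| = 1.5
```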
    The minimal terminologically saturated publication set is described by
equation (10), where

               MaxRank = min { M | thd(Ti (M ), Ti (M + ∆)) / εi < 1 }.          (15)

    The overall quality measure of model (5) is the number of publications |Li |
in the final ordered publication set, restricted by (10), (11), (12) and (15).
    Figure 1 shows the general workflow of the controlled snowball
implementation as a UML activity diagram.




Fig. 1. General workflow of the controlled snowball implementation as UML activity
diagram.



   The general workflow was introduced in [10] and details of the restricted
snowball sampling and probabilistic topic model construction are discussed in [11].
4   Terminological saturation of the ordered publication
    set obtained with controlled snowball method

The Spearman rank correlation coefficient mentioned above allows simple
detection of the convergence of controlled snowball iterations [10]; however,
it does not address the completeness of the collected publication set.
    The main idea of the presented experiment is to compare the minimal
terminologically saturated ordered publication sets produced with different
search methods and in different scientific databases, and to answer the
following questions:

1. Do all common search methods produce terminologically saturated ordered
   publication sets?
2. Which of the common search methods produces the smallest terminologically
   saturated ordered publication set?
3. Is the suggested controlled snowball method more effective than selection by
   topic?

    In our experiments we concentrated on the existence of terminological
saturation and on the size of the minimal terminologically saturated ordered
document set, defined as the minimal value of MaxRank at which (12) becomes
true.
    Starting from the uncertain information need for seminal scientific
publications on the topic “Ontologies (computer science)”, four collections
were considered:

1. abstracts of the seminal publications selected from the ONTO-KL citation
   network that was gathered from the “Microsoft Academic Search” database
   using the controlled snowball method described above, starting from seed
   publications on the topic “Ontologies (computer science)”;
2. abstracts of the publications indexed by the “Microsoft Academic Search”
   service having an automatically assigned category “ontologies” and arranged
   in descending order of citation index;
3. abstracts of the publications stored in the “ACM digital library” electronic
   library, having the “ontologies” label assigned by the authors and lined up
   in descending order of citation index;
4. abstracts of the publications found on Google Scholar Search by keyword
   “ontologies” and ranked by descending relevance calculated with Google’s
   internal algorithms.

The 2nd, 3rd and 4th collections represent common and widespread search
approaches that do not provide formal criteria for stopping the search and thus
can produce huge sets of publications. To enable comparison, we extended them
with automatic term extraction and with the terminological saturation detection
method.
    For each of the found publications we searched for the full text in PDF
format. The PDF files were downloaded from different sources: the “ACM digital
library” provides full publication texts to registered users; “Microsoft
Academic Search” and “Google Scholar” often provide links to full-text PDF
publications that can be found and saved automatically 1 . The PDF files were
also searched for in the SemanticScholar and ResearchGate databases.
Publications for which the full text was not found were excluded from
consideration, and the text of the next publication was searched for.
    To study the terminological saturation of an ordered document set D we
follow the work of Kosa et al. [19]. First, the finite sequence of texts Di
(i = 1, 2, ..., 11) is composed, where each text Di contains the concatenated
full texts of the first 20 · i documents of D. Then all Di are processed with
the automatic term extraction method. The corresponding sets of terms Ti are
compared with thd, defined by (14). The saturation criterion used is
thd(Ti , Ti+1 )/εi < 1, where i ≥ MaxRank. Thus we can calculate the minimal
MaxRank for any of the used collections of publications.
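The saturation procedure above can be sketched end to end, with the C-value term extraction replaced by a hypothetical word-frequency extractor (`toy_extract`); the loop returns the smallest prefix length at which thd divided by the marginal value falls below 1:

```python
from collections import Counter

def thd_value(ci, cj):
    """Terminological difference (14) between {term: weight} dictionaries."""
    shared = ci.keys() & cj.keys()
    return (sum(abs(ci[t] - cj[t]) for t in shared)
            + sum(abs(ci[t]) for t in ci.keys() - cj.keys()))

def minimal_saturated_size(docs, extract_terms, eps, step=20):
    """Smallest prefix length n such that adding `step` more documents
    leaves the extracted term list almost unchanged: thd/eps < 1."""
    for n in range(step, len(docs) - step + 1, step):
        Ti = extract_terms(docs[:n])
        Tj = extract_terms(docs[:n + step])
        if thd_value(Ti, Tj) / eps < 1:
            return n
    return None  # no saturation observed for this collection

def toy_extract(texts):
    """Hypothetical term extractor: normalized word frequencies."""
    c = Counter(w for t in texts for w in t.split())
    total = sum(c.values())
    return {w: k / total for w, k in c.items()}

docs = ["ontology reasoning"] * 30 + ["ontology knowledge"] * 70
size = minimal_saturated_size(docs, toy_extract, eps=0.1)
```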
    The obtained values of the minimal MaxRank, shown in Table 2, are the
quality measure for the proposed model. Figure 2 shows the dependence of
thd(Ti , Ti+1 )/εi on the number of publications included in Di .
    We can see that terminological saturation is observed for the collections
gathered with the controlled snowball and selected from “Microsoft Academic”.
Publications gathered from the “ACM digital library” do not show saturation,
and the set of publications taken from “Google Scholar” may exhibit saturation
when extended.
    Table 2 shows that the controlled snowball method used in this paper leads
to a smaller terminologically saturated publication set than the studied
analogues.


Table 2. The size of minimal terminologically saturated ordered sets of publications,
belonging to topic “ontologies”, obtained with different methods.

                Source                 Search method    MaxRank
                “Microsoft Academic”   “Snowball”         160
                “Microsoft Academic”   automatic label    180
                “ACM digital library”  author label      > 200
                “Google Scholar”       keyword           ≥ 220




5     Conclusions

The presented study introduces a formal criterion for stopping the search and
for search result completeness, to overcome a common issue of scientific
information retrieval when the information need has low certainty. The
suggested formal criterion is based on automatic term extraction and
terminological saturation detection.
1 Software library Puppeteer for NodeJS, https://github.com/GoogleChrome/puppeteer
[Figure: four panels – a. Controlled Snowball, b. ACM Digital Library,
c. Google Scholar, d. Microsoft Academic – each plotting thd/ε against the
number of publications (50-200); the Microsoft Academic panel also marks the
threshold.]

            Fig. 2. Terminological saturation of publications collections.
    In our experiment we extended the following search approaches with
terminological saturation detection: the controlled snowball method, search by
automatically assigned topic, keyword search, and search by author keywords.
    The objectives of the experiment were to establish the existence of
terminological saturation and the size of the minimal terminologically
saturated ordered document set for each of the search approaches.
    Starting from the uncertain information need for seminal scientific
publications on the topic “Ontologies (computer science)”, four collections
were considered:
 1. publications gathered from the “Microsoft Academic Search” database using
    the controlled snowball method suggested by authors;
 2. publications indexed by the “Microsoft Academic Search” service having
    an automatically assigned category “ontologies” and arranged in descending
    order of citation index;
 3. publications stored in the “ACM digital library” electronic library, having
    the “ontologies” label assigned by the authors and lined up in descending
    order of citation index;
 4. publications found on Google Scholar Search by keyword “ontologies” and
    ranked by descending relevance calculated with Google’s internal algorithms.
    The experiment has shown that terminological saturation for the ordered
publication set created from “Microsoft Academic” with the controlled snowball
method is achieved at 160 publications – 9% faster than for the ordered
publication set created from “Microsoft Academic” with the automatically
assigned category “ontologies” (180 publications). The sets of 200 publications
found with the keyword “ontologies” in “Google Scholar” and with the label
“ontologies” in the “ACM digital library” do not possess terminological
saturation.
    So we can conclude that both the controlled snowball method and topic
search in “Microsoft Academic” produce small terminologically saturated
publication sets of almost equal size. However, this conclusion must be
confirmed by searches on other topics. Also, in future studies, term-based
precision and recall should be calculated, which, in turn, requires the
creation of a dataset of terms evaluated by experts.

References
 1. Ahad, A., Fayaz, M., Shah, A.S.: Navigation through citation network based on
    content similarity using cosine similarity algorithm. Int. J. Database Theory Appl
    9(5), 9–20 (2016). https://doi.org/10.14257/ijdta.2016.9.5.02
 2. Athukorala, K., Hoggan, E., Lehtiö, A., Ruotsalo, T., Jacucci, G.: Information-
    seeking behaviors of computer scientists: Challenges for electronic literature search
    tools. In: Proceedings of the 76th ASIS&T Annual Meeting: Beyond the Cloud: Re-
    thinking Information Boundaries. p. 20. American Society for Information Science
    (2013). https://doi.org/10.1002/meet.14505001041
 3. Baez, M., Mirylenka, D., Parra, C.: Understanding and supporting search for schol-
    arly knowledge. Proceeding of the 7th European Computer Science Summit pp. 1–8
    (2011)
 4. Barabási, A.L.: Scale-free networks: a decade and beyond. Science 325(5939), 412–
    413 (2009). https://doi.org/10.1126/science.1173299
 5. Batagelj, V.: Efficient algorithms for citation network analysis. arXiv preprint
    cs/0309023 (2003)
 6. Beel, J., Gipp, B., Langer, S., Breitinger, C.: Paper recommender systems: a lit-
    erature survey. International Journal on Digital Libraries 17(4), 305–338 (2016).
    https://doi.org/10.1007/s00799-015-0156-0
 7. Choi, S.S., Cha, S.H., Tappert, C.C.: A survey of binary similarity and distance
    measures. Journal of Systemics, Cybernetics and Informatics 8(1), 43–48 (2010)
 8. Colicchia, C., Strozzi, F.: Supply chain risk management: a new methodology for a
    systematic literature review. Supply Chain Management: An International Journal
    17(4), 403–418 (2012)
 9. Dobrovolskyi, H., Keberle, N.: Collecting the seminal scientific abstracts with topic
    modelling, snowball sampling and citation analysis. In: Proceedings of the 14th In-
    ternational Conference on ICT in Education, Research and Industrial Applications.
    Integration, Harmonization and Knowledge Transfer. vol. 1, pp. 179–192. Springer
    (2018)
10. Dobrovolskyi, H., Keberle, N.: On convergence of controlled snowball sampling
    for scientific abstracts collection. In: International Conference on Information and
    Communication Technologies in Education, Research, and Industrial Applications.
    vol. 1007, pp. 18–42. Springer (2018). https://doi.org/10.1007/978-3-030-13929-2_2
11. Dobrovolskyi, H., Keberle, N., Todoriko, O.: Probabilistic topic modelling for con-
    trolled snowball sampling in citation network collection. In: International Con-
    ference on Knowledge Engineering and the Semantic Web. pp. 85–100. Springer
    (2017). https://doi.org/10.1007/978-3-319-69548-8_7
12. Dong, R., Tokarchuk, L., Ma, A.: Digging friendship: paper recommendation in
    social network. In: Proceedings of Networking & Electronic Commerce Research
    Conference (NAEC 2009). pp. 21–28 (2009)
13. Ermolayev, V., Batsakis, S., Keberle, N., Tatarintseva, O., Antoniou, G.: Ontolo-
    gies of time: Review and trends. International Journal of Computer Science &
    Applications 11(3) (2014)
14. Fisch, C., Block, J.: Six tips for your (systematic) literature review in business and
    management research. Management Review Quarterly 68(2), 103–106 (Apr 2018).
    https://doi.org/10.1007/s11301-018-0142-x
15. Frantzi, K.T., Ananiadou, S.: The c-value/nc-value domain-independent method
    for multi-word term extraction. Journal of Natural Language Processing 6(3), 145–
    179 (1999). https://doi.org/10.5715/jnlp.6.3_145
16. Friday, D., Ryan, S., Sridharan, R., Collins, D.: Collaborative risk management:
    a systematic literature review. International Journal of Physical Distribution &
    Logistics Management 48(3), 231–253 (2018). https://doi.org/10.1108/IJPDLM-
    01-2017-0035
17. Garfield, E.: From computational linguistics to algorithmic historiography. In: Sym-
    posium in Honor of Casimir Borkowski at the University of Pittsburgh School of
    Information Sciences (2001)
18. Harris, J.K., Beatty, K.E., Lecy, J.D., Cyr, J.M., Shapiro, R.M.: Map-
    ping the multidisciplinary field of public health services and systems re-
    search. American journal of preventive medicine 41(1), 105–111 (2011).
    https://doi.org/10.1016/j.amepre.2011.03.015
19. Kosa, V., Chaves-Fraga, D., Dobrovolskyi, H., Ermolayev, V.: Optimized term
    extraction method based on computing merged partial c-values. In: Ermolayev,
    V., Mallet, F., Yakovyna, V., Mayr, H., Spivakovsky, A. (eds.) Information and
    Communication Technologies in Education, Research, and Industrial Applications.
    ICTERI 2019. Communications in Computer and Information Science, vol. 1175,
    pp. 24–49. Springer Berlin Heidelberg (2020). https://doi.org/10.1007/978-3-030-
    39459-2_2, https://link.springer.com/chapter/10.1007/978-3-030-39459-2_2
20. Lao, N., Cohen, W.W.: Relational retrieval using a combination of
    path-constrained random walks. Machine learning 81(1), 53–67 (2010).
    https://doi.org/10.1007/s10994-010-5205-8
21. Lecy, J.D., Beatty, K.E.: Representative literature reviews using constrained snow-
    ball sampling and citation network analysis. Available at SSRN 1992601 (2012).
    https://doi.org/10.2139/ssrn.1992601
22. Liu, J.S., Lu, L.Y., Lu, W.M., Lin, B.J.: Data envelopment analysis
    1978–2010: A citation-based literature survey. Omega 41(1), 3–15 (2013).
    https://doi.org/10.1016/j.omega.2010.12.006
23. Lucio-Arias, D., Leydesdorff, L.: Main-path analysis and path-dependent transi-
    tions in histcite-based historiograms. Journal of the American Society for Informa-
    tion Science and Technology 59(12), 1948–1962 (2008)
24. Meho, L.I.: The rise and rise of citation analysis. Physics World 20(1), 32 (2007).
    https://doi.org/10.1088/2058-7058/20/1/33
25. Nedumov, Y., Kuznetsov, S.: Exploratory search for scientific arti-
    cles. Programming and Computer Software 45(7), 405–416 (2019).
    https://doi.org/10.15514/ISPRAS-2018-30(6)-10
26. Nicolini, A.L., Lorenzetti, C.M., Maguitman, A.G., Chesñevar, C.I.: In-
    telligent algorithms for improving communication patterns in thematic
    p2p search. Information Processing & Management 53(2), 388–404 (2017).
    https://doi.org/10.1016/j.ipm.2016.12.001
27. Palopoli, L., Rosaci, D., Sarné, G.M.: A multi-tiered recommender system archi-
    tecture for supporting e-commerce. In: Intelligent Distributed Computing VI, pp.
    71–81. Springer (2013). https://doi.org/10.1007/978-3-642-32524-3_10
28. Petticrew, M., Gilbody, S.: Planning and conducting systematic reviews. Health
    psychology in practice pp. 150–179 (2004)
29. Ricci, F., Rokach, L., Shapira, B.: Introduction to recommender systems
    handbook. In: Recommender systems handbook, pp. 1–35. Springer (2011).
    https://doi.org/10.1007/978-0-387-85820-3_1
30. Rodriguez-Prieto, O., Araujo, L., Martinez-Romo, J.: Discovering related scientific
    literature beyond semantic similarity: a new co-citation approach. Scientometrics
    120(1), 105–127 (2019). https://doi.org/10.1007/s11192-019-03125-9
31. Schütze, H., Manning, C.D., Raghavan, P.: Introduction to information retrieval,
    vol. 39. Cambridge University Press (2008)
32. Shi, Y., Larson, M., Hanjalic, A.: Collaborative filtering beyond the user-item
    matrix: A survey of the state of the art and future challenges. ACM Computing
    Surveys (CSUR) 47(1), 3 (2014). https://doi.org/10.1145/2556270
33. de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515
    (1965)
34. Varela, A.R., Pratt, M., Harris, J., Lecy, J., Salvo, D., Brownson, R.C., Hallal,
    P.C.: Mapping the historical development of physical activity and health research:
    A structured literature review and citation network analysis. Preventive medicine
    111, 466–472 (2018). https://doi.org/10.1016/j.ypmed.2017.10.020
35. Vellino, A.: Usage-based vs. citation-based methods for recommending scholarly
    research articles. arXiv preprint arXiv:1303.7149 (2013)
36. Vorontsov, K., Potapenko, A.: Tutorial on probabilistic topic modeling: Additive
    regularization for stochastic matrix factorization. In: International Conference on
    Analysis of Images, Social Networks and Texts. pp. 29–46. Springer (2014).
    https://doi.org/10.1007/978-3-319-12580-0_3
37. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In:
    Proceedings of the 22nd international conference on World Wide Web. pp. 1445–
    1456. ACM (2013). https://doi.org/10.1145/2488388.2488514
38. Zarrinkalam, F., Kahani, M.: Semcir: A citation recommendation system
    based on a novel semantic distance measure. Program 47(1), 92–112 (2013).
    https://doi.org/10.1108/00330331311296320
39. Zhou, M., Zhao, S.: Learning question paraphrases from log data (Feb 14 2008),
    US Patent App. 11/500,224
40. Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution
    for short and imbalanced texts. Knowledge and Information Systems 48(2), 379–
    398 (2016). https://doi.org/10.1007/s10115-015-0882-z