Collecting the Seminal Scientific Abstracts with
   Topic Modelling, Snowball Sampling and
               Citation Analysis

                    Hennadii Dobrovolskyi, Nataliya Keberle

       Department of Computer Science, Zaporizhzhya National University,
              Zhukovskogo st. 66, 69600, Zaporizhzhya, Ukraine,
                  gen.dobr@gmail.com, nkeberle@gmail.com



      Abstract. This paper presents a complete information technology for
      collecting and analysing a citation network of scientific publications,
      aimed at detecting seminal papers in a selected domain of research.
      The technology consists of seed paper selection, plain snowball sampling,
      probabilistic topic modelling, greedy restricted snowball sampling, and
      analysis of the collected citation network.
      The topic model is built from word-word co-occurrence probabilities using
      a combination of sparse symmetric nonnegative matrix factorization and
      principal component approximation. Experiments with the collection of
      High Energy Physics abstracts show that the number of topics in the model
      is determined in a natural way and that the Kullback-Leibler divergence
      correlates with the cosine similarity calculated from keywords provided
      by publication authors.
      The citation networks of the “critical thinking” and “automatic pronun-
      ciation assessment” domains are collected and analysed. The analysis
      shows that both networks are “small worlds”, and therefore the observed
      saturation of the restricted snowball sampling can provide the complete
      set of publications in the domains of interest. Multiple runs of the
      sampling confirm the hypothesis that the set of seminal publications is
      stable with respect to variations of the seed papers. The modified main
      path analysis allows distinguishing the seminal papers, including new
      publications that follow the mainstream of research.

      Keywords: text mining, short text document, topic modelling, princi-
      pal component analysis, sparse symmetric nonnegative matrix factoriza-
      tion, citation network, main path analysis.


1   Introduction

The quality of a related work review is a problem well known to every scientist.
The questions to be answered are “Do the collected scientific publications
contain all notable scientific results of the domain of interest?” and “Which of
the collected publications direct the mainstream of knowledge evolution in the
domain of interest?” Below we present a complete information technology for
answering both questions. The essence of the technology is the collection and
analysis of a citation network of scientific publications aimed at detecting
seminal papers in a selected domain of research.
    To the best of our knowledge, while the separate parts of the method have
been developed, no entire procedure is known that takes a small set of papers
on some scientific domain and produces a perfect list of references.
    The objectives of the presented work are:

 – to present the complete technology that takes manually selected seed papers
   and produces a short list of interconnected scientific publications that
   reflects the evolution of the main ideas of the selected scientific domain.
   The technology combines a restricted snowball sampling method with citation
   network analysis.
 – to test all initial assumptions, namely
     • whether the proposed snowball restriction method [6] provides an adequate
        semantic distance between publications;
     • whether the obtained restricted snowball forms a scale-free network [22];
     • whether the restricted snowball provides saturation of the publication
        dataset;
     • whether the best age of the seed papers is 5–10 years;
     • whether biased seed papers can produce an unbiased citation network.

    The distinctive features of the presented method are the application of a
probabilistic topic model to perform restricted snowball sampling and collect
the citation network, followed by main path analysis that points out both the
most influential publications and the main path of scientific knowledge
evolution. The main path allows detecting the newest publications that follow
the mainstream as well as the outliers that potentially contain completely new
ideas.
    The structure of the paper is as follows. Section 2 overviews the
publications related to the presented technology, Section 3 describes the
crucial steps of the algorithm, Section 4 presents the citation network
analysis, and Section 5 states the experiment pre-conditions and discusses the
results. The conclusion summarises the main results and discusses future work.


2     Related Work

Publications on a domain of knowledge can be collected from conference
proceedings [7], from the study of co-authorship [15, 16], through elaboration
of keywords and topics [14, 17], by querying academic search engines1 with a
set of keywords, or by snowball sampling [10, 1, 6]. However, not every research
domain has a corresponding conference, and a single author can write papers on
different topics. Building maps and ontologies of large scientific domains
provides a set of interconnected concepts rather than a list of references.
Querying with a set of keywords produces a biased set of publications [19]
because different researchers use slightly different terms to report their
results.
1
    Google Scholar, https://scholar.google.com,
    Semantic Scholar, https://www.semanticscholar.org/,
    Microsoft Academic, https://academic.microsoft.com/
    The most appropriate way to collect scientific papers is snowball sampling
[10, 1]: each publication from the current queue is considered, and all papers
it references as well as all papers referencing it are added to the next level
queue. Snowball sampling allows collecting publications on a narrow research
topic and connecting them into a citation network [22]. The high quality of
citation-based search is provided by the “small world” phenomenon, a proved
property of scale-free networks [2, 22]. Newman [15] has shown that in most
cases it is enough to do three iterations. However, the statistical properties
of the global citation network including all scientific papers are not known;
that is why we need to test whether the small world assumption holds for the
collected subset of the citation network and whether three iterations allow
collecting most of the papers.
    Another aspect of snowball sampling is its dependence on the initial queue,
called a seed collection. The general advice [10] recommends that the seed
papers should be the seminal papers of the knowledge domain pointed out by
experts, or papers selected by the researcher. Valid seed papers should be 5–10
years old and have to be widely cited. The best seeds are reviews, foundational
or framing articles on the topic of interest. However, this advice also should
be checked. Moreover, we need to test whether biased seed papers can produce an
unbiased citation network.
    Snowball sampling cannot be applied directly to publication crawling
because the list of references can contain items that are not directly related
to the investigated domain. Therefore a straightforward implementation of
snowball publication sampling causes infinite collection inflation, and some
restrictions should be introduced to accept or reject candidate publications.
It should be noted that the introduced restrictions can violate the small world
property, so we need to check whether the restricted snowball yields a
scale-free citation network.
    To filter out the most relevant papers while sampling, Ahad et al. [1] use
a vector document model and cosine similarity; however, the document vector
model relies on word spelling rather than meaning, which causes precision loss
when short texts are considered. Lecy et al. [10] apply PageRank calculated by
Google Scholar as a measure of paper significance. However, PageRank is a
property of the global citation network including all topics of knowledge, so
it cannot be calculated from a small subset of it. One of the most promising
approaches is the probabilistic topic model (PTM).
    Probabilistic topic models [24] use a large collection of documents and a
statistical approach to model words and documents as vectors in a
high-dimensional semantic space R^n, where n is much smaller than the number of
words and the number of documents. The basic idea of a PTM is to construct a
few topics, which are groups of tightly connected words; document words are
then represented as the result of two-stage random sampling. The best known
method of topic modelling is Latent Dirichlet Allocation (LDA) [5], which is
successful and simple enough. A general introduction and survey of topic
modelling can be found in [24], along with a novel approach called Additive
Regularization of Topic Models.
    However, in most scientific databases full texts are protected by
copyright. Therefore the only information we can use is the paper title, the
paper abstract, and sometimes the database-specific keywords and topics. So the
documents that we analyse are short, and common PTMs based on document-word
statistics lose their precision. This shortcoming is overcome by approaches
that utilize word co-occurrence statistics instead of counting document-word
pairs, such as the Biterm Topic Model (BTM) [25] and the Word Network Topic
Model (WNTM) [26]. Another method of word embedding, called GloVe, is proposed
in [18]. It is based on the word-word co-occurrence matrix and uses global
matrix factorization, so it is close to the BTM [25] and WNTM [26] statistical
topic models.
    Another vague part of common PTMs is that the number of topics cannot be
determined from document analysis. To overcome this weakness, handling the
word-word co-occurrences with principal component analysis (PCA) and sparse
symmetric nonnegative matrix factorization (Sparse SNMF) was proposed [6].
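    As an illustration of this factorization step, the following minimal sketch (not the implementation of [6]) shows how a symmetric nonnegative matrix factorization of a word-word co-occurrence matrix could be computed with classical damped multiplicative updates; the PCA-based choice of the number of topics and the sparsity penalty of Sparse SNMF are omitted here.

```python
import numpy as np

def symmetric_nmf(C, k, n_iter=300, beta=0.5, eps=1e-9):
    """Minimal sketch: approximate a symmetric nonnegative word-word
    co-occurrence matrix C (n x n) as C ~ W @ W.T with W >= 0 (n x k),
    using damped multiplicative updates. Each normalized column of W can
    be read as a topic, i.e. a distribution over words.
    The sparsity term of Sparse SNMF is intentionally omitted here."""
    rng = np.random.default_rng(0)
    W = rng.random((C.shape[0], k))
    for _ in range(n_iter):
        CW = C @ W
        WWtW = W @ (W.T @ W)
        W *= (1.0 - beta) + beta * CW / (WWtW + eps)   # update keeps W nonnegative
    return W / W.sum(axis=0, keepdims=True)            # columns as word distributions
```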
    The collected citation network [22] can be analysed using citation count
and other simple statistics [12], PageRank [11], information retrieval
techniques [1], a knowledge graph [17], combined supervised machine learning
approaches [23], or main path analysis [12].
    The most appropriate way to highlight the seminal papers of a small
scientific domain is main path analysis, because the method deals only with the
collected citation network and allows increasing the precision of the sampled
dataset. On the contrary, the citation count and other statistics, the
supervised machine learning applied by Valenzuela, Ha and Etzioni [23], and
PageRank cannot point out the tightly interconnected subset of the citation
network. Klink-2 [17] and similar algorithms aim to build a map of the
knowledge domain but do not seek the most influential publications.


3   Information Technology Overview
The general workflow of the restricted snowball sampling is introduced in [6]. It
contains the following steps:
1. Collect a set of seed papers and put them in the initial, 0-th, queue.
2. Run several iterations of the unrestricted snowball sampling to pick up
   baseline documents. For n ∈ {0, 1, 2, 3}:
     1 get a portion of papers from the n-th queue;
     2 download the papers referenced by the portion;
     3 download the papers referencing the portion;
     4 add all the downloaded papers to the (n + 1)-th queue.
3. Create the PTM using baseline documents:
     1 extract title and abstract from each document of the collection;
     2 split all the titles and abstracts into sentences;
     3 create the dictionary containing all the nouns and adjectives that occur
       in the sentences;
     4 combine all terms from the reduced dictionary occurring in the same
        sentence into pairs and build the joint probability matrix;
     5 detect the collection specific stop-words and exclude them from the re-
        duced dictionary;
     6 perform Sparse SNMF to create PTM;
     7 map each of the seed papers to a vector of topic probabilities.
 4. Perform the batch restricted snowball sampling: for n ∈ {0, 1, 2, 3}
     1 get a portion of papers from the n-th queue;
     2 download the papers referenced by the portion;
     3 download the papers referencing the portion;
     4 extract bag of stemmed words from each of downloaded papers;
     5 map each of the downloaded papers to a vector of topic probabilities;
     6 calculate distance from each downloaded paper to the seed papers;
     7 add to the next level queue only those of downloaded papers which are
        close to the seed papers.
 5. Analyse the citation network.

   The details of the restricted snowball sampling and probabilistic topic model
construction are discussed in [6].
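    As an illustration of step 4, a minimal sketch of the distance-based acceptance test is given below. It assumes that every paper has already been mapped to a vector of topic probabilities and that a candidate is accepted when its symmetric KL-divergence to at least one seed paper is below the threshold; the exact acceptance rule of [6] may differ in detail.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def accept_candidate(candidate_topics, seed_topics, threshold=0.18):
    """Accept a downloaded paper if it is close enough to at least one seed paper
    (0.18 is the threshold value used in the experiments of Section 5)."""
    return any(symmetric_kl(candidate_topics, s) <= threshold for s in seed_topics)
```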


4     Citation Network Analysis

4.1    Cycles Elimination

The correctly built citation network must be a directed acyclic graph. However,
errors in the publication databases can accidentally cause cycles. The problem
with cycles is that if there is a cycle in a network then there is also an
infinite number of paths between some vertices. Since a citation network is
usually almost acyclic, to transform it into an acyclic network we use the
“preprint” transformation described by Batagelj [3]. First we identify the
cycles, and then each paper from a cycle is duplicated with its “preprint”
version and the papers inside the cycle cite the “preprints”.
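    A minimal NetworkX sketch of one possible reading of this transformation is given below. It assumes that an edge u → v means “u cites v” and that every paper inside a cycle additionally cites its own “preprint” copy, which is one way to keep the duplicated nodes connected; the details may differ from Batagelj's original formulation.

```python
import networkx as nx

def preprint_transform(g: nx.DiGraph) -> nx.DiGraph:
    """Sketch of the "preprint" transformation: every paper inside a non-trivial
    strongly connected component (a cycle) gets a duplicate "preprint" node, and
    citations that stay inside the component are redirected to the preprints,
    which removes the cycles (self-loops are ignored in this sketch)."""
    h = g.copy()
    for scc in nx.strongly_connected_components(g):
        if len(scc) < 2:
            continue
        preprint = {v: f"{v}#preprint" for v in scc}
        for v in scc:
            h.add_edge(v, preprint[v])         # the paper cites its own preprint copy
        for u, v in g.edges():
            if u in scc and v in scc:
                h.remove_edge(u, v)
                h.add_edge(u, preprint[v])     # intra-cycle citation now points to a preprint
    return h
```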


4.2    Simple Citation Path Count

Our approach is similar to the Search Path Count (SPC) algorithm [12, 3]. We
introduce two pseudo-vertices, source and target. A vertex that does not
reference any other publication vertex gets an edge to the target vertex. A
vertex that is not referenced by any publication vertex gets an edge from the
source vertex, so the graph becomes connected. The next step is to calculate
all simple paths from the source to the target using the Python library
NetworkX2. The algorithm uses a modified depth-first search to generate the
paths [9]. As the result we obtain a set of paths, each of which is a sequence
of vertices. Each pair of directly neighbouring vertices in such a sequence is
an edge in the citation graph. For each edge we calculate its frequency over
all paths – the number of paths passing through it, the simple path count. Next
we calculate the edge resistance as inversely proportional to the edge simple
path count – this diminishes the difference between the most cited and the
least cited papers [21, 8]. The path resistance is then calculated as the sum
of its edge resistances. Finally, we order the paths by their path resistances.
Using path resistances is a distinguishing feature of the proposed algorithm.
2
    NetworkX, https://networkx.github.io
    The difference between the applied algorithm and the SPC algorithm is the
preservation of the citation graph connectivity. In the known algorithm [4], as
soon as the edge SPC scores are calculated, the edges having low scores are
removed and the citation network can become a disconnected graph.
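    The following sketch illustrates the computation described above with NetworkX, again assuming that an edge u → v means “u cites v”. It enumerates all simple source-to-target paths, derives the simple path count of every edge, and orders the paths by the sum of inverse counts (the path resistance); it is an illustration rather than the exact implementation used in the experiments.

```python
import networkx as nx

def rank_paths_by_resistance(g: nx.DiGraph):
    """Attach the pseudo-vertices "source" and "target", enumerate all simple
    paths between them, compute the simple path count of every edge, define the
    edge resistance as the inverse of that count, and order the paths by the sum
    of their edge resistances (smallest resistance = main path)."""
    h = g.copy()
    for v in list(g.nodes()):
        if g.out_degree(v) == 0:               # cites nothing -> edge to target
            h.add_edge(v, "target")
        if g.in_degree(v) == 0:                # cited by nobody -> edge from source
            h.add_edge("source", v)
    paths = list(nx.all_simple_paths(h, "source", "target"))
    count = {}                                 # simple path count per edge
    for p in paths:
        for e in zip(p, p[1:]):
            count[e] = count.get(e, 0) + 1
    def resistance(p):
        return sum(1.0 / count[e] for e in zip(p, p[1:]))
    return sorted(paths, key=resistance)
```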

4.3   Chasing New Ideas in Publications
Path resistance allows detecting new publications in the field that are not
yet referenced by other authors but lie in the mainstream of the domain. We can
separate all papers into mainstream research and probably new research fields
or directions. The smallest (up to a certain threshold) path resistances
correspond to the mainstream, whereas the biggest path resistances correspond
to publications that are either brand new, badly written, or published in
low impact journals or conference proceedings. We assume those publications
are the source of potentially new ideas and topics.
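    Building on the path ranking sketch above, this separation reduces to a threshold on the path resistance; the threshold value itself is a tuning parameter, not fixed by the method.

```python
def split_by_resistance(ranked_paths, resistances, threshold):
    """Paths with small total resistance form the mainstream; paths with large
    total resistance point to brand new, little-cited or low-visibility papers."""
    mainstream = [p for p, r in zip(ranked_paths, resistances) if r <= threshold]
    candidates = [p for p, r in zip(ranked_paths, resistances) if r > threshold]
    return mainstream, candidates
```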


5     Analysis of Experimental Results
5.1   Experimental Settings
We took three different corpora: “high energy physics” (HEP), “critical
thinking” (CT) and “pronunciation quality assessment” (PQA).
    HEP publications [20] are available from the European Laboratory for
Nuclear Research. The hep-ex partition of the HEP collection is composed of
2802 abstracts related to experimental high-energy physics that are indexed
with 1093 main keywords (the categories), the hep-astroph partition contains
2716 abstracts from the astrophysics section, and the hep-th partition contains
18114 abstracts on theoretical physics. Each publication is manually annotated
with keywords.
    The CT corpus is gathered with our snowball sampling software. The CT
domain is characterized by a large noisy publication corpus tightly entangled
with publications on psychology, didactics, pedagogy and philosophy. The size
of the CT corpus is 24040 publication abstracts.
    The PQA domain is very specific and narrow, with a moderate-size corpus
containing 8339 scientific abstracts collected by our snowball sampling
software.
    The sampling was run with the following parameters: percentage of stop
words to exclude – 2%; percentage of rare words to exclude – 5%; number of
components in PCA, which is the maximal number of topics – 200; threshold
KL-divergence – 0.18; sparsity parameter – 0.05; number of top citation
paths – 50; minimal number of citations – 3.
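    For reference, these settings can be collected into a single configuration object; the parameter names below are hypothetical and only mirror the values listed above.

```python
# Hypothetical parameter names; the values are the ones used in the experiments.
SAMPLING_CONFIG = {
    "stop_word_percent": 2,    # percentage of stop words to exclude
    "rare_word_percent": 5,    # percentage of rare words to exclude
    "pca_components": 200,     # number of PCA components = maximal number of topics
    "kl_threshold": 0.18,      # threshold KL-divergence
    "sparsity": 0.05,          # sparsity parameter of Sparse SNMF
    "top_paths": 50,           # number of top citation paths
    "min_citations": 3,        # minimal number of citations
}
```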
5.2   Seminal Publications for PQA domain

Figure 1 shows the results of the citation network analysis for the PQA domain.




Fig. 1. Top path in PQA citation network (a) and top 73 paths of the citation network
(b). Nodes are marked as (first author:year:MS Academic Id)



   In part (a) of Figure 1 we can see that the mainstream of pronunciation
assessment contains the following publications:

 1. “Automatic evaluation and training in English pronunciation” by Bernstein,
    Cohen, Murveit, Rtischev, and Weintraub, 1990
 2. “Automatic text-independent pronunciation scoring of foreign language stu-
    dent speech” by Neumeyer, Franco, Weintraub, and Price, 1996
 3. “Automatic pronunciation scoring for language instruction” by Franco, Neumeyer,
    Kim, and Ronen, 1997
 4. “Automatic pronunciation scoring of specific phone segments for language
    instruction” by Kim, Franco, and Neumeyer, 1997
 5. “Automatic scoring of pronunciation quality” by Neumeyer, Franco, Di-
    galakis, and Weintraub, 2000
 6. “Effects of speech recognition-based pronunciation feedback on second-language
    pronunciation ability” by Precoda, Halverson, and Franco, 2000
 7. “The pedagogy-technology interface in computer assisted pronunciation train-
    ing” by Neri, Cucchiarini, Strik, and Boves, 2002
 8. “Segmental errors in Dutch as a second language: how to establish priorities
    for CAPT” by Neri, Cucchiarini, and Strik, 2004
 9. “Automatic pronunciation error detection in Dutch as a second language: an
    acoustic-phonetic approach” by Truong, 2004
10. “Automatic detection of frequent pronunciation errors made by L2-learners”
    by Truong, Neri, Wet, Cucchiarini, and Strik, 2005
11. “Comparing classifiers for pronunciation error detection” by Strik, Truong,
    Wet, and Cucchiarini, 2007
12. “An overview of spoken language technology for education” by Eskenazi,
    2009
13. “Automatic detection of vowel pronunciation errors using multiple informa-
    tion sources” by Van Doremalen, Cucchiarini, and Strik, 2009
14. “A new neural network based logistic regression classifier for improving mis-
    pronunciation detection of L2 language learners” by Hu, Qian, and Soong,
    2014
15. “Improved mispronunciation detection with deep neural network trained
    acoustic models and transfer learning based logistic regression classifiers”
    by Hu, Qian, Soong, and Wang, 2015
16. “Improving non-native mispronunciation detection and enriching diagnos-
    tic feedback with DNN-based speech attribute modeling” by Li, Siniscalchi,
    Chen, and Lee, 2016
17. “Computer-assisted pronunciation training: From pronunciation scoring to-
    wards spoken language learning” by Chen and Li, 2016

    The mainstream evolution of pronunciation assessment starts from the
application of automatic speech recognition, pays some attention to pedagogical
aspects, moves on to simple machine learning approaches, then to neural
networks and deep learning. Some of the seminal publications are reviews
containing discussions of feature selection and of the comparison and
combination of methods. Part (b) of Figure 1 shows that the more top paths we
keep, the more detailed a knowledge map we obtain.


5.3   Assumptions Checking

PTM as a restriction criterion for the proposed restricted snowball sampling
method provides an adequate semantic distance between publications. To check
this statement we took the HEP collection, built a PTM for it, and measured
similarity using the keywords annotating each publication of the HEP
collection. For each pair of HEP publications both the symmetric KL-divergence
and the cosine similarity were calculated. The results are shown in Figure 2.
    As we can see, the PTM-based symmetric Kullback-Leibler divergence [13]
provides a reliable upper bound for the keyword-based cosine similarity. The
reason is that the cosine similarity uses only word spelling, while the PTM
uses an R^n word embedding that takes the meaning of terms into account.
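    For completeness, a minimal sketch of the keyword-based cosine similarity used in this comparison is given below; it assumes the author-provided keywords are treated as binary vectors, while the PTM-based symmetric KL-divergence is computed as in the sketch at the end of Section 3.

```python
import math

def keyword_cosine(keywords_a, keywords_b):
    """Cosine similarity of two publications represented as binary keyword
    vectors: |A ∩ B| / sqrt(|A| * |B|)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```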
Fig. 2. Symmetric PTM-based KL-divergence and cosine similarity for the HEP ab-
stracts: (a) scatter plot; (b) violin plot.



The restricted snowball sampling forms a scale-free network. Derek de Solla
Price showed in 1965 [22] that the number of references to a paper (node
degree) in a citation network has a heavy-tailed distribution following a power
law, and thus that the citation network is scale-free. One of our initial
assumptions was that the restricted snowball sampling results in a scale-free
network, so only a few snowball iterations are needed to achieve high recall.
Figure 3 was calculated on the basis of the PQA corpus and shows that for small
node degrees the logarithm of the number of nodes depends linearly on the
logarithm of the node degree, while for large node degrees the dependency has a
heavy tail. That means the restricted snowball sampling produces a scale-free
citation network, just as the classical snowball does. So we can be sure that a
few iterations of the restricted snowball sampling allow collecting most of the
relevant publications.
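    The check itself is easy to reproduce; a sketch of the degree distribution plot behind Figure 3 (assuming NetworkX and matplotlib) could look as follows.

```python
import collections
import matplotlib.pyplot as plt

def plot_degree_distribution(g):
    """Log-log plot of the citation network degree distribution; an approximately
    straight line for small degrees indicates a power law, i.e. a scale-free network."""
    freq = collections.Counter(d for _, d in g.degree())
    xs, ys = zip(*sorted(freq.items()))
    plt.loglog(xs, ys, marker="o", linestyle="none")
    plt.xlabel("node degree")
    plt.ylabel("number of nodes")
    plt.show()
```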



Saturation of the restricted snowball sampling. The restricted snowball
sampling can be modelled as a Poisson process in which the publications appear
sequentially and we either (a) accept the n-th publication and add it to the
snowball or (b) reject it. So we can calculate the confidence interval of the
Poisson distribution of event (a) and compare its upper bound with some
pre-defined acceptance probability. Figure 4 shows the 0.95 confidence interval
of the Poisson distribution of paper acceptance as a coloured strip and the
acceptance probability threshold 0.05 as a straight line. The confidence
interval was calculated on the basis of 10 snowball runs for the CT collection
starting from random subsets of the seed paper collection. After some number of
tested abstracts the upper bound of the confidence interval becomes lower than
the threshold, so the restricted snowball sampling guarantees saturation.
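    A minimal sketch of this saturation test is shown below; it uses the standard exact (chi-square based) upper bound of the Poisson confidence interval, which may differ in detail from the construction used for Figure 4.

```python
from scipy.stats import chi2

def acceptance_upper_bound(accepted, tested, alpha=0.05):
    """Upper bound of the (1 - alpha) confidence interval for the paper acceptance
    probability, treating the number of accepted papers among `tested` candidates
    as a Poisson count (exact chi-square bound)."""
    upper_count = 0.5 * chi2.ppf(1.0 - alpha / 2.0, 2 * accepted + 2)
    return upper_count / tested

def is_saturated(accepted, tested, threshold=0.05):
    """Saturation is declared once the bound drops below the acceptance threshold."""
    return acceptance_upper_bound(accepted, tested) < threshold
```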
          Fig. 3. Citation network node degree distribution in log-log scale.




Fig. 4. 0.95 confidence interval of Poisson distribution of paper acceptance as a function
of the number of already tested abstracts N and acceptance probability threshold 0.05.
The citation age. To study the influence of publication age on the probability
of a publication being cited, we attributed to each edge of the PQA citation
network an age calculated as the difference between the years of the
referencing publication and the referenced one. The number of edges as a
function of edge age is shown in Figure 5. We can see that the maximal number
of references is observed for publications that are 2–8 years old. Such
publications are still regarded as new ones but at the same time are old enough
to be read and assessed by many researchers.
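    The edge ages behind Figure 5 can be computed directly from the citation graph, assuming a mapping from paper identifiers to publication years; a sketch:

```python
import collections

def edge_age_histogram(g, year_of):
    """For every edge u -> v (u cites v) the citation age is year_of[u] - year_of[v];
    returns a histogram of ages over all edges of the citation network."""
    ages = [year_of[u] - year_of[v] for u, v in g.edges()]
    return collections.Counter(ages)
```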




               Fig. 5. Frequencies of the citation network edge ages.




The biased seed papers can produce an unbiased citation network. To estimate
the stability of the restricted snowball sampling with respect to the variation
of the seed papers, we ran the sampling starting from the full set of the PQA
seed papers and marked the relevant papers with main path analysis. Then we ran
the sampling again starting from 10 random subsets of the PQA seed papers and
counted the number of runs in which each seminal paper occurs. In our
experiments the random subsets contain 50% of the seed papers; 66% of the
relevant papers are detected every time, 14% – in 80% of the runs, 14% – in 60%
of the runs, and 6% – at least once. So we can conclude that the PQA citation
network is stable with respect to large variations of the seed papers given as
input to the restricted snowball sampling, and the result of the sampling is
unbiased.
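    The stability figures above can be reproduced by a simple counting procedure over the repeated runs; a sketch, assuming the detected paper identifiers of each run are available as sets:

```python
import collections

def detection_frequency(seminal_papers, run_results):
    """For each seminal paper found from the full seed set, compute the fraction
    of repeated snowball runs (started from random seed subsets) that detected it."""
    hits = collections.Counter()
    for detected in run_results:               # one set of detected paper ids per run
        hits.update(p for p in seminal_papers if p in detected)
    return {p: hits[p] / len(run_results) for p in seminal_papers}
```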
Fig. 6. Probability of the relevant paper detection: A – 66% detected every time; B –
14% detected in 80% of runs; C – 14% detected in 60% of runs; D – 6% detected at
least once.


6   Conclusions and Future Studies
The main objective of the paper was to present the complete information
technology that takes a set of publications on some scientific topic as input
and produces a list of seminal publications for that topic. It provides data
for future detailed analysis and serves as a good starting point for
investigation in a new domain. Additionally, we tested several initial
assumptions regarding the results of applying the technology and showed that:
 – PTM as a restriction criterion for the restricted snowball sampling method
   provides an adequate semantic distance between publications.
 – The restricted snowball sampling guarantees saturation.
 – The maximal number of references is observed for publications that are 2–8
   years old. Such publications are still regarded as new ones but at the same
   time are old enough to be read and assessed by many researchers.
 – The biased seed papers produce an unbiased citation network.
 – The collected citation network is stable with respect to large variations of
   the seed papers given as input to the restricted snowball sampling, and the
   result of the sampling is unbiased.

The presented technology is implemented as a sequence of Python scripts3.
3
    https://github.com/gendobr/snowball


References
 1. Ahad, A., Fayaz, M., Shah, A.S.: Navigation through citation network based
    on content similarity using cosine similarity algorithm. International Journal of
    Database Theory and Application 9(5), 9–20 (2016)
 2. Barabási, A.L.: Scale-free networks: a decade and beyond. Science 325(5939), 412–
    413 (2009)
 3. Batagelj, V.: Efficient algorithms for citation network analysis. arXiv preprint
    cs/0309023 (2003)
 4. Batagelj, V., Mrvar, A.: Pajek-program for large network analysis. Connections
    21(2), 47–57 (1998)
 5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of machine
    Learning research 3(Jan), 993–1022 (2003)
 6. Dobrovolskyi, H., Keberle, N., Todoriko, O.: Probabilistic topic modelling for con-
    trolled snowball sampling in citation network collection. In: International Con-
    ference on Knowledge Engineering and the Semantic Web. pp. 85–100. Springer
    (2017)
 7. Ermolayev, V., Batsakis, S., Keberle, N., Tatarintseva, O., Antoniou, G.: Ontolo-
    gies of time: Review and trends. International Journal of Computer Science &
    Applications 11(3) (2014)
 8. Even, S.: Graph algorithms. Cambridge University Press (2011)
 9. Golumbic, M.C.: Algorithmic graph theory and perfect graphs, vol. 57. Elsevier
    (2004)
10. Lecy, J.D., Beatty, K.E.: Representative literature reviews using constrained snow-
    ball sampling and citation network analysis (2012)
11. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of massive datasets. Cambridge
    university press (2014)
12. Lucio-Arias, D., Leydesdorff, L.: Main-path analysis and path-dependent transi-
    tions in histcite-based historiograms. Journal of the Association for Information
    Science and Technology 59(12), 1948–1962 (2008)
13. MacKay, D.J.: Information theory, inference and learning algorithms. Cambridge
    university press (2003)
14. Moya-Anegón, F., Vargas-Quesada, B., Herrero-Solana, V., Chinchilla-Rodríguez, Z.,
    Corera-Álvarez, E., Munoz-Fernández, F.: A new technique for building maps of
    large scientific domains based on the cocitation of classes and categories.
    Scientometrics 61(1), 129–145 (2004)
15. Newman, M.E.: The structure of scientific collaboration networks. Proceedings of
    the National Academy of Sciences 98(2), 404–409 (2001)
16. Newman, M.E.: Coauthorship networks and patterns of scientific collaboration.
    Proceedings of the national academy of sciences 101(suppl 1), 5200–5205 (2004)
17. Osborne, F., Motta, E.: Klink-2: integrating multiple web sources to generate se-
    mantic topic networks. In: International Semantic Web Conference. pp. 408–424.
    Springer (2015)
18. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre-
    sentation. In: EMNLP. vol. 14, pp. 1532–1543 (2014)
19. Petticrew, M., Gilbody, S.: Planning and conducting systematic reviews. Health
    psychology in practice pp. 150–179 (2004)
20. Ráez, A.M., López, L.A.U., Steinberger, R.: Adaptive selection of base classifiers in
    one-against-all learning for large multi-labeled collections. In: Advances in Natural
    Language Processing, pp. 1–12. Springer (2004)
21. Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations
    using respondent-driven sampling. Sociological methodology 34(1), 193–240 (2004)
22. de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515
    (1965)
23. Valenzuela, M., Ha, V., Etzioni, O.: Identifying meaningful citations. In: AAAI
    Workshop: Scholarly Big Data (2015)
24. Vorontsov, K., Potapenko, A.: Tutorial on probabilistic topic modeling: Additive
    regularization for stochastic matrix factorization. In: International Conference on
    Analysis of Images, Social Networks and Texts. pp. 29–46. Springer (2014)
25. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In:
    Proceedings of the 22nd international conference on World Wide Web. pp. 1445–
    1456. ACM (2013)
26. Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution
    for short and imbalanced texts. Knowledge and Information Systems 48(2), 379–
    398 (2016)