=Paper= {{Paper |id=Vol-2126/paper5 |storemode=property |title=A Hybrid Approach for Dynamic Topic Models with Fluctuating Number of Topics |pdfUrl=https://ceur-ws.org/Vol-2126/paper5.pdf |volume=Vol-2126 |authors=Christin Katharina Kreutz |dblpUrl=https://dblp.org/rec/conf/gvd/Kreutz18 }} ==A Hybrid Approach for Dynamic Topic Models with Fluctuating Number of Topics== https://ceur-ws.org/Vol-2126/paper5.pdf
          A Hybrid Approach for Dynamic Topic Models with
                    Fluctuating Number of Topics

                                                   Christin Katharina Kreutz
                                                            Trier University
                                                            54286 Trier, DE
                                                      kreutzch@uni-trier.de


ABSTRACT
Scientific communities are always changing and evolving. Topics
of today might split or even disappear in the future, other
topics might merge or appear at some time. Nowadays, the
closest we come to picturing these developments are dynamic
topic models, which come with a fixed number of topics k. It
would be desirable to omit k. This work outlines a research
agenda for approaching that task by using LDA as a base in
combination with the observation of state transitions in topics
at consecutive times.

Categories and Subject Descriptors
H.1 [Models and Principles]: Document Topic Models;
I.5 [Pattern Recognition]: Trend Mining

General Terms
Algorithms

Keywords
Trend Mining, Dynamic Topic Models, LDA
1.   INTRODUCTION
  With today’s publication methods, the number of papers
increases rapidly. Losing track of the evolution of the majority
of themes is common. Simultaneously, identifying important
publications is difficult but cardinal for scientists.
  Automatic detection of trends and their indicators in a
scientific community (trend mining) could benefit researchers,
politicians or entrepreneurs who are not ahead of current
developments but want to get quick insights into promising
areas.
  Our goal is to construct a system which autonomously
identifies trends and accompanying influential persons and
papers from a variety of bibliographic data. The appurtenant
research plan is partitioned into three succeeding sections:
First, the transformation of topics generated from a
bibliographic data set over time, their assigned papers, authors
and keywords should be mapped in a dynamic topic model with a
variable number of topics. Second, potential upcoming trends in
the topics across the years should automatically be detected,
predicted and extracted from this model, so they can be
evaluated. And third, influential authors, papers and venues
should be determined in these found trends. The resulting new
insights about what supports the development of a topic can be
used to enhance the identification of trends.
  The steps are relatively independent of one another; step two
would be applicable to another suitable topic model without
requiring a solution of step one. Figure 1 gives a schematic
overview of our projected line of action.

Figure 1: Simplified visualisation of our research plan.

  In this work, we focus on outlining a research direction for
the first step, present the current state of research on related
models and mark the problems at hand. We touch on trend mining,
before we close with an evaluation plan and an outlook on
possible applications for our future model.


30th GI-Workshop on Foundations of Databases (Grundlagen von
Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany.
Copyright is held by the author/owner(s).

2.     DEVELOPMENT OF TOPICS
   We assume the importance and set of topics is not static over
time. Topics might sprout, expand, diminish, split, merge or
vanish. Terms that represent the topics change as new words
appear [5]. To better understand the dynamics of topics, we
wanted to observe real bibliographical data.

2.1    Notation
   Before diving into details of our experiments or the proposed
model, some basic terms need to be set in order to formally
discuss our concepts.
   A paper has a number of fundamental, possibly latent, ideas.
They can be grouped by motive into more general topics denoted
by s_i. By observing co-occurring topics and terms in papers,
conclusions about the assignment of terms to topics can be
drawn. Topics can be term-wise alike or (partially) overlap with
other topics. Assertions on this can be derived from the term
distributions for topics.
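
   How alike two topics are term-wise can be read off their term
distributions. The following is a minimal sketch under stated
assumptions: a topic is represented as a dictionary mapping stems to
probabilities, and the two measures (shared probability mass and
total variation distance) are illustrative choices, not measures
prescribed by this paper. All names and the toy distributions are
hypothetical.

```python
from typing import Dict

# Assumption: a topic is a mapping stem -> probability that sums to 1.
TermDist = Dict[str, float]


def shared_mass(p: TermDist, q: TermDist) -> float:
    """Probability mass that two topics place on the same terms."""
    return sum(min(p[t], q[t]) for t in p.keys() & q.keys())


def total_variation(p: TermDist, q: TermDist) -> float:
    """Half the L1 distance: 0 for identical topics, 1 for disjoint ones."""
    terms = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in terms)


# Toy distributions (hypothetical values, not taken from the data set).
mobile = {"mobil": 0.5, "devic": 0.3, "network": 0.2}
wireless = {"network": 0.4, "wireless": 0.4, "devic": 0.2}
image = {"imag": 0.6, "pixel": 0.4}

print("mobile vs wireless:", shared_mass(mobile, wireless),
      total_variation(mobile, wireless))
print("mobile vs image:   ", shared_mass(mobile, image),
      total_variation(mobile, image))
```

   Partially overlapping topics share some mass and lie strictly
between the two extremes, whereas topics without common terms share
no mass at all.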
   The total time observed t can be sliced into disjoint
consecutive intervals which are called times t_0, ..., t_n. Given
two times t_x and t_y, if x < y, t_x indicates an interval (and
real period) before t_y. Given two times t_x and t_{x+1}, t_x
describes the interval immediately before t_{x+1}.
   Publications can be uniquely attached to intervals if the time
is sliced by year and their year of publication determines the
assignment. Exact publication dates are mostly not available.
This classification is an approximate observation raster as in
theory there is a time continuum and in reality we only have
rough year specifications. States of topics are regarded at
times.
   A topic s_i is said to be trending at time t_{x+y}, y ≥ 1, if
it is unpopular or not even existing at time t_x, but its
significance soars. This could be indicated by an increasing
number of publications targeting this subject or its appearance
in important journals or conferences. Essential members of the
scientific community might start to work in this direction or
the subject builds its own experts who become widely known.
   A topic that has not (yet) been assigned any publications is
described by s_∅. This case occurs before a topic is born or if
it is inactive. A topic is inactive if the number of
publications assigned to the topic does not surpass a threshold
or papers assigned to this topic only cite papers from the same
topic and are only cited by papers from this area. The topic has
hardly any influence on the rest of the corpus. The community
which works on this is very tightly connected but relatively
isolated from the rest of the scientific world. These enclaves
can be described as sects.
   Opposing inactive topics are active topics. The set of active
topics at a time t_x can be identified by k_x. The set of
inactive topics at a time t_x can be described by k̄_x.

2.2    Data Set
   The data set used in this research is an incompletely enriched
form of the dblp computer science bibliography data with part of
the data from open academic graph. The dblp data contains
bibliographic information related to publications, authors,
conferences and journals from the field of computer science and
adjacent areas [15]. As of February 2018, it holds metadata of
over 4 million publications and more than 2 million authors. The
Microsoft Academic Graph within open academic graph is used. It
contains over 166 million publications and amongst others
citation information, abstracts and details on authors [22, 21].
   In our set, data from dblp was used completely. In addition,
where publications could be matched based on DOI or title and
author matches where DOI information was not available,
information from open academic graph was included. The extension
contains author affiliations, citation data, abstracts, full
texts, keywords and topics. The structure of the data set is
depicted in Figure 2. Because we only focus on bibliographic
information, further data sources like Twitter are not
incorporated in our set.

Figure 2: Simplified depiction of the composition of the extended
dblp data set. Data is partial.

   For the experiments in this paper, only the data contained in
dblp as well as abstracts were taken into consideration. At the
moment, full texts are only available for a certain small area
in computer science so the usage of them could have distorted
the outcome of our initial trials drastically.

2.3   Methodology
   Of the enriched dblp data, only English publications whose
abstract was of considerable length (≥ 10 words, fewer words
indicate flawed data) were taken into account. The titles and
abstracts were purged and stemmed with a Porter stemmer.
Afterwards, LDA [4] with k = 100 was run on all 2.5 million of
them. We ignore terms occurring in over 50 percent of
publications (collection dependent stop words) or in under 100
papers as they are often system names.
   A visualisation of the data enabled us to draw conclusions
about the characteristics of topics.

2.4   Initial Observations
   In Figure 3, the popularity of a topic in relation to all
topics in the corpus per year is visualised for the years 1990
to 2015 for four selected topics. We assume the number of topics
is appropriate. Different settings can be observed:

   • There are subjects which are inactive and whose popularity
     rises, so they become active, like topic 12, which is about
     mobile devices.

   • There are subjects which were always active and whose
     popularity increases, as seen in topic 13, which covers
     terms like management, knowledge and business.

   • There are subjects whose popularity declines, as seen with
     topic 27, which includes papers concerning logic programming
     and reasoning.

   • There are subjects whose popularity does not really seem to
     change over the course of years, such as topic 76, which
     deals with image processing.

   In our data set, we found the case of a topic being active at
a point in time but unrepresented by publications
(a) Overview of popularity of selected topics, topic distributions of papers are
sliced by year. Size of bubble indicates relative importance of topic in all papers
from this year.

Topic   10 most important stems
 12     mobil, devic, network, commun, peer, music, ad, hoc, messag, wireless
 13     manag, knowledg, studi, inform, research, technolog, organ, busi, factor, effect
 27     program, logic, fuzzi, oper, reason, gener, comput, base, languag, execut
 76     imag, color, reconstruct, map, method, algorithm, base, render, resolut, pixel

(b) Topic number with corresponding assigned most important stems.

Figure 3: Exemplary illustration of the development of selected topics over time and their associated stems by running LDA
with k = 100 on the whole extended dblp data set.
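
The relative importance per year shown in panel (a) of Figure 3 can be
derived from per-paper topic distributions roughly as follows. This is
a minimal sketch, assuming each paper is available as a pair of
publication year and topic distribution; the function name and the toy
values are hypothetical and not taken from the experiments.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumption: a paper is (year, {topic id: share of the paper}).
Paper = Tuple[int, Dict[int, float]]


def yearly_topic_popularity(papers: Iterable[Paper]) -> Dict[int, Dict[int, float]]:
    """Sum topic shares per year, then normalise within each year."""
    totals: Dict[int, Dict[int, float]] = defaultdict(lambda: defaultdict(float))
    for year, distribution in papers:
        for topic, share in distribution.items():
            totals[year][topic] += share

    popularity: Dict[int, Dict[int, float]] = {}
    for year, per_topic in totals.items():
        year_sum = sum(per_topic.values())
        # A year without any topic mass yields an empty entry.
        popularity[year] = (
            {topic: share / year_sum for topic, share in per_topic.items()}
            if year_sum > 0 else {}
        )
    return popularity


# Toy input (hypothetical): two papers from 1990, one from 1991.
papers = [
    (1990, {12: 0.2, 76: 0.8}),
    (1990, {12: 0.4, 76: 0.6}),
    (1991, {12: 1.0}),
]
for year, shares in sorted(yearly_topic_popularity(papers).items()):
    print(year, shares)
```

Each yearly value corresponds to the size of one bubble; a topic that
receives no share in a year simply has no entry for that year.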


for a few following years. Later, it re-emerged. The topic’s top
keywords contained cloud, so early publications with a portion
of this topic might have a background in weather, whereas the
late publications which were (partly) assigned to the topic
probably pick up on cloud computing.
  The importance and number of active topics vary strongly
throughout the years.

3.    PROBLEM
   Topics can be generated from a corpus by several probabilistic
topic models. The most popular ones all have the significant
weakness of an unchangeable number of topics. Before we dive
into the problem, we present some existing methods.

3.1    Topic Models
   The assignment of topics to papers can be performed by a
number of approaches. The simplest one would be Latent Dirichlet
Allocation (LDA). Here, it is assumed that every document is a
mixture of topics and every word in the documents comes from a
specific drawn topic. There are no words that are assigned to no
topic or to a residual topic. Hidden random variables contain
information on the structure of topics in the documents. First,
topic proportions for a document are drawn. After this step, for
every position of a word in the document, a topic is drawn from
this distribution. In the last part, actual words are drawn from
the topic word distribution. LDA and the models built on it
assume that documents are interchangeable in time. The number of
topics k is fixed for a corpus and has to be chosen beforehand.
The vocabulary of the corpus is also fixed. [4]
   A lot of approaches build upon LDA, such as the Author-Topic
Model (ATM). Here, an additional dimension, the authors, is
taken into account. The individual author codetermines the topic
from which a word is drawn. [18]
   The correlation of topics was presented with Correlated Topic
Models (CTM). Here, LDA was modified so that instead of drawing
topic distributions for documents from a Dirichlet distribution,
they are taken from a logistic normal distribution. [2]
   The temporal aspect of a collection and the development of
topics had been widely disregarded until the introduction of
Dynamic Topic Models (DTM). This method extends CTM by dividing
a corpus by year so the topic distribution can change over time.
Topics in slice t_{x+1} are derived from the topics in slice
t_x. Words assigned to a subject are variable but k is still
fixed. Information relating to authors is not used but papers
are no longer interchangeable. [3]

3.2    Problem Description
   The described methods cannot fully map the dynamics in a
corpus, as the number of topics k is unchangeable. If data up
until a point in time t_x is used to generate a DTM, at time
t_{x+1} new publications can only be assigned to these already
existing k topics. If DTM were run with new publications and
k + n topics, the resulting topics would not necessarily
represent the former k and additional n new ones even closely.
Changing k slightly results in a different document topic
distribution.
   An easy way to capture the dynamics of topics would be to find
a suitable k, perform LDA on the whole corpus, slice the corpus
by year and look at topics changing over time like we did in our
experiment. Trends could be found retrospectively. If new data
is integrated, LDA could be used another time on all the
publications. Again, trends could be located in retrospect. Big
disadvantages are the determination of k and the inability to
map the topics of the first run to the topics of the subsequent
runs, especially if k is incremented. Terms which get mapped to
subjects shift and it is impossible to regain old patterns. It
would be infeasible to measure if the identification of future
trends was successful.
   Emergence, disappearance, splitting and merging of topics over
the course of time cannot be modelled with existing
probabilistic topic models. Changes in subjects are indicators
for trends and should therefore be observed.
   There are other approaches to find trends which make use of a
number of other features: Asooja et al. utilise keyword
distributions on textual information [1], Glänzel et al. work on
citations and textual information [9], Salatino et al. observe a
topic network deployed from connections between keywords,
publications, authors, venues and organisations [19].
   Current methods usually only use a small portion of the
spectrum of available data. A model which incorporates authors,
affiliations as well as scientometric measures [20, 13, 10],
publication information such as citations [17] and ve-
nues in addition to titles, abstracts, full texts, keywords and
topics has the potential to detect trends reliably.

4.        HYBRID APPROACH
  Our theoretic approach is based on the assumption that there
are different topic state transitions. They need to be
represented by our model.

4.1       Evolution of Topics over Time
   We identified possible state transitions with which the
evolution of topics can be described; they are shown in Figure
4. There are six distinguishable forms: Case a) shows a topic
which does not significantly change, b) shows the split of a
topic s_i into possibly numerous topics s_i', ..., s_i'' that
are somewhat coherent or the emergence of a topic s_i'' from an
already existing (and persisting) topic s_i, c) shows the
merging of possibly numerous disconnected topics s_i, ..., s_j
into one, d) shows a vanishing topic, e) shows the birth of a
new topic and f) shows a combination of cases d) and e) with the
anomaly of the topic s_i being inactive and re-emerging over a
span of time being the same. The different transitions can be
joined ad libitum.

 Time     t_x             t_{x+1}             t_{x+...}     t_{x+y}
   a)     s_i             s_i
   b)     s_i             s_i', ..., s_i''
   c)     s_i, ..., s_j   s_ij
   d)     s_i             s_∅
   e)     s_∅             s_i
   f)     s_i             s_∅                               s_i

Figure 4: Possible state transitions of topics s_i over time t.

   An example for a) could be the image topic we already
encountered in Figure 3. The distribution of words in the topic
surely changes over time, because the fundamental terms vary,
though the overall motive in them stays the same. As an instance
of case b), algorithms concerning depth first search could be
the base from which other algorithms, such as ones for the
computation of strongly connected components, are derived. The
original topic persisted while new ones were emerging from it. A
topic describing machine learning might be a good example of
case c). Many areas treating algorithms are collapsing into this
big one, as machine learning has the potential to outperform
even the most refined hand-crafted approaches. If a topic
describes RSA, it could fall into category d), as it is no
longer considered safe, therefore publications concerning this
subject are most likely going to decrease over the next years
until the topic is inactive. This is a good candidate for the
forming of a sect. The development of a topic for quantum
computers could be mapped to case e). It somewhat was the birth
of this topic in computer science. There certainly were
influences from different communities on the subject but in a
corpus restricted to information technology, the representation
might be fitting. As neural networks are currently experiencing
a renaissance, they are an example of f).

4.2    Hybrid Topic Model
   Our future model needs to be able to find and represent all
described transitions of topics. In the following, we explain
the core components of a hybrid model.
   The rough plan would be to split t into years and use LDA to
generate a baseline of topics for t_0. For every new year, the
topics of the prior year need to be considered when calculating
the current developments. Citations are a key part in this as
they indicate how information is being spread. At time t_{x+1},
we examine k_x as well as k̄_x and observe co-authorships, used
words and how new publications cite already classified papers.
By looking at the topic distributions and summing the
percentages for each topic, it can be calculated which topics
are cited with corresponding weights by a new paper. With for
example the Wasserstein metric [8], the distance between term
distributions of topics dist_td is calculated as their
difference. A threshold th_td describes the distance value over
which topic term distributions are considered dissimilar.
   For every topic, the following strategies decide which state
transition has occurred from t_x to t_{x+1}:

  a) With the first case, there is no major change in underlying
     motives from t_x to t_{x+1}. Publications in this topic
     reference about the same topics that were cited at t_x and
     th_td > dist_td. The content in cited publications is
     typically pretty similar to the content of the new ones.

  b) In this situation, we have the same phenomenon as in case a)
     but a clustering on publications of this topic produces
     multiple distinguishable groups which are regarded as new
     topics split from the old one, th_td < dist_td amongst the
     new topics. New words are likely to occur in the
     publications. If they solely appear in the papers from this
     area and not throughout the whole corpus, they strongly hint
     at a change or split in the topic.

  c) If a merging of topics occurs, the witnessed effects will
     resemble those of case a), although publications which
     would be assigned to prior topics harmonise their term
     distributions and citation behaviour. A clustering would
     group the topics together.

  d) A dying topic gets no or few new publications assigned. The
     number of papers in this topic might already be declining
     for a few years. A topic getting inactive all of a sudden
     is highly unlikely.
  e) If a new topic emerges, publications do not really match
     term distributions of existing ones. They usually cite a
     lot of different topics as they have no clear predecessor.
     The overlap of content from cited papers (not topics) by a
     new publication and the citing paper should be calculated,
     as it is deemed to be rather small.

  f) With the sudden re-emergence of a topic, the term
     distribution of publications matches a topic in k̄_x.

  After the topic distributions for the new publications are
computed, the then active and inactive topics are assigned to
k_{x+1} and k̄_{x+1} respectively. A run concludes with the
processing of the next year of papers in the same manner.

4.3      Topic Development Prediction and Trend Mining
   Predicting the development of a topic is directly linked to
trend mining. Topics which are about to blow up are future
trends. The upcoming number of publications in a field, the
estimation of citations a new paper is going to gain [17] and
possible collaborations between researchers can only be computed
if the underlying author-publication-graph of the past is
thoroughly analysed and influences on its evolution are
discovered.
   The computation of trends in currently active topics is a step
which follows directly from the hybrid topic model. Topics which
changed a lot from t_x to t_{x+1} are candidates for trends. Not
only is the development of topics from the last to the current
time frame going to be observed, the overall behaviour of the
term distributions and cited topics is relevant as well. The
appearance of new and popular words in the assigned terms of a
topic could signal the beginning of a trend and is worth further
investigation.
   Often, popular papers are written by well-known and highly
linked authors, they appear in journals with a lot of impact or
are presented at seminal conferences. Here, the enriched data is
going to be used. A co-author-graph with researchers’
affiliations linked to a paper-citation-graph complete with
venues and relationships between journals and conferences could
help discover core persons [7], venues and publications in
topics and trends. Sometimes, trends also develop from sects, so
they have to be steadily looked at. Topics which were active in
t_{x+1} are judged on whether they are likely going to be
trending in the future. The evolution can be predicted based on
the progress of the topic and the found influences.

5.     FUTURE PROSPECTS
  After completing the construction of our hybrid approach, an
evaluation of the proposed system needs to prove and quantify
its validity. Furthermore, several practical uses for the model
are presented.

5.1      Evaluation Plan
   The evaluation of our planned system, which includes the

researchers from different domains within computer science. A
list which contains our results is presented to them. They
should rate it against the real trends with corresponding years.
   Additionally, the trends, important researchers and venues
identified by our system will be presented to those experts.
They then should rank the correctness of the findings.
   An automatic method to quantify the accuracy of the model
would involve the observation of data up until a time t_x.
Potential trends at this time will be detected, their evolution
and future importance is going to be predicted for the
succeeding five years and the predictions will be compared to
the real development of significance of these topics. Numbers of
papers from topics and citation behaviour could be
prognosticated. If there are discrepancies in predicted and real
data, a manual step could be put in, to question experts to
explain the actual development.
   The hybrid approach also needs to be tested against the purely
incremental model which does not use LDA with a predetermined k
as first step.

5.2    Applications
   Possible applications of the dynamic topic model with varying
number of topics complete with the identification of trends are
manifold. A reviewer recommendation system for given
publications, a citation recommendation system, a keynote
speaker recommendation system or a visualisation tool for
exploring bibliographic data with special focus on trends could
be constructed.
   Some reviewer recommendation systems work on word topic and
topic citation distributions [11] or are only usable for already
established conferences as they use former program committees
[23]. Others are more refined and want to integrate the research
interest and direction of scientists into the recommendations
[16, 12]. Our model is independent of past conferences. It could
make use of the enriched author-publication-graph to find
scientists capable and willing to review new publications from
the field of their current research interest. As the available
data for this task is extensive, the results could be excellent.
   Citation recommendation systems suggest fitting publications
based on their content, but they do not focus on returning
fundamental papers which lead the way of a topic or those
written by influential authors for an area [11]. The relative
importance of a paper for an area and its development is not
considered. With our hybrid model, the identification of
influential papers and persons is a by-product and could be
easily incorporated in such a system.
   Keynote speakers for a conference from topic s_i should be
influential scientists from a different topic s_j, which is
related to s_i. A linkage of the topics could be predicted, the
term distributions of the topics harmonise or one topic adapts
words from the other area. The findings in one topic could
highly benefit the other. Our model contains this information so
it could be used for this application.
   A visualisation tool for the exploration of found topics,
trend mining part, contains multiple steps. The results need         relationships and trends in the data would be beneficial for
to be cross-validated.                                               researchers, politicians and entrepreneurs [5]. Past work on
   Our hybrid model is going to be run on a base of data             the exploration of topics or trends in bibliographic data so-
up until 1995, then topic developments are computed by               metimes lacks the support for growing and big data sets [14]
the iterative part with data for the next 10 years. For the          or base on a topic model with fixed number of topics [6]. A
following 5 years, trends are predicted. Afterwards, a manual        tool using our model and data would inherently dodge these
evaluation of our model and the found trends involves expert         weaknesses.
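
   As an illustration of the trend mining step, in which topics that
changed a lot from tx to tx+1 are treated as trend candidates, the
following minimal Python sketch compares the term distribution of each
topic at two consecutive time frames. The Hellinger distance, the
threshold of 0.3, the toy data and the function names hellinger and
trend_candidates are assumptions made for this example; the paper does
not fix a distance measure or a cut-off.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two term distributions.

    p and q map terms to probabilities; missing terms count as 0.
    The result lies in [0, 1]: 0 for identical, 1 for disjoint support.
    """
    terms = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2
        for t in terms
    )
    return math.sqrt(squared / 2.0)


def trend_candidates(topics_prev, topics_curr, threshold=0.3):
    """Return (topic id, distance) pairs for topics whose term
    distribution changed by more than `threshold` (a hypothetical
    cut-off) between two consecutive time frames, largest change first.
    Only topics present in both time frames are compared.
    """
    changed = []
    for topic_id in topics_prev.keys() & topics_curr.keys():
        distance = hellinger(topics_prev[topic_id], topics_curr[topic_id])
        if distance > threshold:
            changed.append((topic_id, distance))
    return sorted(changed, key=lambda pair: -pair[1])


# Toy term distributions for two topics at t_x and t_{x+1} (made-up data).
topics_tx = {
    "s1": {"lda": 0.5, "topic": 0.5},
    "s2": {"graph": 0.6, "citation": 0.4},
}
topics_tx1 = {
    "s1": {"lda": 0.5, "topic": 0.5},
    "s2": {"graph": 0.2, "citation": 0.3, "embedding": 0.5},
}

candidates = trend_candidates(topics_tx, topics_tx1)
for topic_id, distance in candidates:
    print(topic_id, round(distance, 3))
```

   In this toy example only s2 is reported, because a new and heavily
weighted term appears in its distribution while s1 stays unchanged. A
full system would additionally have to consider cited topics and the
enriched author and venue data described above.
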
6.   CONCLUSION
   This work proposed a hybrid approach which aims at modelling the
agile evolution of topics and trends in a growing corpus of
bibliographic data without a fixed and predefined number of topics,
with the help of an LDA base. Different state transitions were used to
describe the development of topics over time in detail. A link to
trend mining was drawn. The work concludes with the presentation of an
evaluation concept to confirm the utility of the approach and numerous
examples of use to underline the potential of our future model.

Acknowledgements

Special thanks go to my supervisor Ralf Schenkel for his invaluable
support.

7.   REFERENCES
 [1] K. Asooja, G. Bordea, G. Vulcu, and P. Buitelaar.
     Forecasting emerging trends from scientific literature.
     In Proceedings of the Tenth International Conference
     on Language Resources and Evaluation LREC 2016,
     Portorož, Slovenia, May 23-28, 2016, 2016.
 [2] D. M. Blei and J. D. Lafferty. Correlated topic
     models. In Advances in Neural Information Processing
     Systems 18 [Neural Information Processing Systems,
     NIPS 2005, December 5-8, 2005, Vancouver, British
     Columbia, Canada], pages 147–154, 2005.
 [3] D. M. Blei and J. D. Lafferty. Dynamic topic models.
     In Machine Learning, Proceedings of the Twenty-Third
     International Conference (ICML 2006), Pittsburgh,
     Pennsylvania, USA, June 25-29, 2006, pages 113–120,
     2006.
 [4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent
     dirichlet allocation. Journal of Machine Learning
     Research, 3:993–1022, 2003.
 [5] J. Boyd-Graber, Y. Hu, and D. Mimno. Applications
     of topic models. 11:143–296, 01 2017.
 [6] A. J. Chaney and D. M. Blei. Visualizing topic
     models. In Proceedings of the Sixth International
     Conference on Weblogs and Social Media, Dublin,
     Ireland, June 4-7, 2012, 2012.
 [7] A. Fiallos Ordoñez, K. Jimenes, C. Vaca, and
     X. Ochoa. Scientific communities detection and
     analysis in the bibliographic database: Scopus, 04
     2017.
 [8] A. L. Gibbs and F. E. Su. On choosing and bounding
     probability metrics. INTERNAT. STATIST. REV.,
     pages 419–435, 2002.
 [9] W. Glänzel and B. Thijs. Using ’core documents’ for
     detecting and labelling new emerging topics.
     Scientometrics, 91(2):399–416, 2012.
[10] D. Herrmannova and P. Knoth. Semantometrics:
     Towards fulltext-based research evaluation. In
     Proceedings of the 16th ACM/IEEE-CS on Joint
     Conference on Digital Libraries, JCDL 2016, Newark,
     NJ, USA, June 19 - 23, 2016, pages 235–236, 2016.
[11] W. Huang, Z. Wu, P. Mitra, and C. L. Giles. Refseer:
     A citation recommendation system. In IEEE/ACM
     Joint Conference on Digital Libraries, JCDL 2014,
     London, United Kingdom, September 8-12, 2014,
     pages 371–374, 2014.
[12] J. Jin, Q. Geng, Q. Zhao, and L. Zhang. Integrating
     the trend of research interest for reviewer assignment.
     In Proceedings of the 26th International Conference on
     World Wide Web Companion, Perth, Australia, April
     3-7, 2017, pages 1233–1241, 2017.
[13] P. Knoth and D. Herrmannova. Towards
     semantometrics: A new semantic similarity based
     measure for assessing a research publication’s
     contribution. D-Lib Magazine, 20(11/12), 2014.
[14] B. Lee, G. Smith, G. G. Robertson, M. Czerwinski,
     and D. S. Tan. Facetlens: exposing trends and
     relationships to support sensemaking within faceted
     datasets. In Proceedings of the 27th International
     Conference on Human Factors in Computing Systems,
     CHI 2009, Boston, MA, USA, April 4-9, 2009, pages
     1293–1302, 2009.
[15] M. Ley. DBLP - some lessons learned. PVLDB,
     2(2):1493–1500, 2009.
[16] X. Liu, T. Suel, and N. D. Memon. A robust model for
     paper reviewer assignment. In Eighth ACM
     Conference on Recommender Systems, RecSys ’14,
     Foster City, Silicon Valley, CA, USA - October 06 -
     10, 2014, pages 25–32, 2014.
[17] A. Livne, E. Adar, J. Teevan, and S. Dumais.
     Predicting citation counts using text and graph
     mining. February 2013.
[18] M. Rosen-Zvi, T. L. Griffiths, M. Steyvers, and
     P. Smyth. The author-topic model for authors and
     documents. In UAI ’04, Proceedings of the 20th
     Conference in Uncertainty in Artificial Intelligence,
     Banff, Canada, July 7-11, 2004, pages 487–494, 2004.
[19] A. A. Salatino and E. Motta. Detection of embryonic
     research topics by analysing semantic topic networks.
     In A. González-Beltrán, F. Osborne, and S. Peroni,
     editors, Semantics, Analytics, Visualization.
     Enhancing Scholarly Data, pages 131–146, Cham,
     2016. Springer International Publishing.
[20] S. Siebert, S. Dinesh, and S. Feyer. Extending a
     research-paper recommendation system with
     bibliometric measures. In Proceedings of the Fifth
     Workshop on Bibliometric-enhanced Information
     Retrieval (BIR) co-located with the 39th European
     Conference on Information Retrieval (ECIR 2017),
     Aberdeen, UK, April 9th, 2017, pages 112–121, 2017.
[21] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P.
     Hsu, and K. Wang. An overview of microsoft academic
     service (mas) and applications. In Proceedings of the
     24th International Conference on World Wide Web,
     WWW ’15 Companion, pages 243–246, New York,
     NY, USA, 2015. ACM.
[22] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su.
     Arnetminer: Extraction and mining of academic social
     networks. In Proceedings of the 14th ACM SIGKDD
     International Conference on Knowledge Discovery and
     Data Mining, KDD ’08, pages 990–998, New York,
     NY, USA, 2008. ACM.
[23] H. D. Tran, G. Cabanac, and G. Hubert. Expert
     suggestion for conference program committees. In 11th
     International Conference on Research Challenges in
     Information Science, RCIS 2017, Brighton, United
     Kingdom, May 10-12, 2017, pages 221–232, 2017.