A Hybrid Approach for Dynamic Topic Models with Fluctuating Number of Topics

Christin Katharina Kreutz
Trier University
54286 Trier, DE
kreutzch@uni-trier.de

ABSTRACT
Scientific communities are always changing and evolving. Topics of today might split or even disappear in the future, while other topics might merge or appear at some time. Currently, the closest we come to picturing these developments are dynamic topic models, which come with a fixed number of topics k. It would be desirable to omit k. This work outlines a research agenda for approaching that task by using LDA as a base in combination with the observation of state transitions in topics at consecutive times.

Categories and Subject Descriptors
H.1 [Models and Principles]: Document Topic Models; I.5 [Pattern Recognition]: Trend Mining

General Terms
Algorithms

Keywords
Trend Mining, Dynamic Topic Models, LDA

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany. Copyright is held by the author/owner(s).

1. INTRODUCTION
With today's publication methods, the number of papers increases rapidly. Losing track of the evolution of the majority of themes is common. At the same time, identifying important publications is difficult but essential for scientists.

Automatic detection of trends and their indicators in a scientific community (trend mining) could benefit researchers, politicians or entrepreneurs who are not ahead of current developments but want to get quick insights into promising areas.

Our goal is to construct a system which autonomously identifies trends and the accompanying influential persons and papers from a variety of bibliographic data. The associated research plan is partitioned into three successive steps: First, the transformation over time of topics generated from a bibliographic data set, together with their assigned papers, authors and keywords, should be mapped in a dynamic topic model with a variable number of topics. Second, potential upcoming trends in the topics across the years should automatically be detected, predicted and extracted from this model, so they can be evaluated. Third, influential authors, papers and venues should be determined in the trends that were found. The resulting new insights about what supports the development of a topic can be used to enhance the identification of trends.

The steps are relatively independent of one another: step two would be applicable to another suitable topic model without requiring a solution of step one. Figure 1 gives a schematic overview of our projected line of action.

Figure 1: Simplified visualisation of our research plan.

In this work, we focus on outlining a research direction for the first step, present the current state of research on related models and mark the problems at hand. We touch on trend mining before we close with an evaluation plan and an outlook on possible applications of our future model.

2. DEVELOPMENT OF TOPICS
We assume that the importance and the set of topics are not static over time. Topics might sprout, expand, diminish, split, merge or vanish. Terms that represent the topics change as new words appear [5]. To better understand the dynamics of topics, we wanted to observe real bibliographical data.

2.1 Notation
Before diving into details of our experiments or the proposed model, some basic terms need to be set in order to formally discuss our concepts.

A paper has a number of fundamental, possibly latent, ideas. They can be grouped by motive into more general topics denoted by $s_i$. By observing co-occurring topics and terms in papers, conclusions about the assignment of terms to topics can be drawn. Topics can be term-wise alike or (partially) overlap with other topics. Assertions on this can be derived from the term distributions for topics.
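To make the last point concrete, the following is a minimal sketch of how overlap between two topics could be read off their term distributions. It is an illustration only and not part of the experiments: the helper names (`top_terms`, `top_term_overlap`, `total_variation`), the choice of Jaccard overlap and total variation distance, and the toy probabilities are all assumptions made for this example.

```python
from typing import Dict, Set

# Assumed representation: a topic is a mapping stem -> probability.
TermDist = Dict[str, float]


def top_terms(dist: TermDist, n: int) -> Set[str]:
    """Return the n most probable stems of a topic."""
    ranked = sorted(dist, key=lambda term: dist[term], reverse=True)
    return set(ranked[:n])


def top_term_overlap(p: TermDist, q: TermDist, n: int = 10) -> float:
    """Jaccard overlap of the top-n stems of two topics (1.0 = same stems)."""
    a, b = top_terms(p, n), top_terms(q, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def total_variation(p: TermDist, q: TermDist) -> float:
    """Total variation distance between two term distributions, in [0, 1]."""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)


# Toy topics with made-up probabilities (hypothetical, for illustration).
mobile = {"mobil": 0.4, "devic": 0.3, "network": 0.2, "wireless": 0.1}
imaging = {"imag": 0.5, "pixel": 0.3, "network": 0.1, "render": 0.1}

overlap = top_term_overlap(mobile, imaging, n=4)  # share of common top stems
distance = total_variation(mobile, imaging)       # 0 = identical, 1 = disjoint
```

A high overlap together with a small distance would suggest that two topics are term-wise alike; a moderate overlap with a large distance would suggest that they only partially overlap.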
The total time observed $t$ can be sliced into disjoint consecutive intervals which are called times $t_0, \dots, t_n$. Given two times $t_x$ and $t_y$, if $x < y$, $t_x$ indicates an interval (and real period) before $t_y$. Given two times $t_x$ and $t_{x+1}$, $t_x$ describes the interval immediately before $t_{x+1}$.

Publications can be uniquely attached to intervals if the time is sliced by year and their year of publication determines the assignment. Exact publication dates are mostly not available. This classification is an approximate observation raster, as in theory there is a time continuum and in reality we only have rough year specifications. States of topics are regarded at times.

A topic $s_i$ is said to be trending at time $t_{x+y}$, $y \geq 1$, if it is unpopular or not even existing at time $t_x$, but its significance soars. This could be indicated by an increasing number of publications targeting this subject or by its appearance in important journals or conferences. Essential members of the scientific community might start to work in this direction, or the subject develops its own experts who become widely known.

A topic that has not (yet) been assigned any publications is described by $s_\emptyset$. This case occurs before a topic is born or if it is inactive. A topic is inactive if the number of publications assigned to it does not surpass a threshold, or if the papers assigned to it only cite papers from the same topic and are only cited by papers from this area. Such a topic has hardly any influence on the rest of the corpus. The community which works on it is very tightly connected but relatively isolated from the rest of the scientific world. These enclaves can be described as sects.

The opposite of inactive topics are active topics. The set of active topics at a time $t_x$ is denoted by $k_x$. The set of inactive topics at a time $t_x$ is denoted by $\bar{k}_x$.

2.2 Data Set
The data set used in this research is an incompletely enriched form of the dblp computer science bibliography data with part of the data from open academic graph. The dblp data contains bibliographic information related to publications, authors, conferences and journals from the field of computer science and adjacent areas [15]. As of February 2018, it holds metadata of over 4 million publications and more than 2 million authors. Of open academic graph, the Microsoft Academic Graph is used. It contains over 166 million publications and, amongst others, citation information, abstracts and details on authors [22, 21].

In our set, data from dblp was used completely. In addition, information from open academic graph was included where publications could be matched based on DOI, or on title and authors where DOI information was not available. The extension contains author affiliations, citation data, abstracts, full texts, keywords and topics. The structure of the data set is depicted in Figure 2. Because we only focus on bibliographic information, further data sources like Twitter are not incorporated in our set.

Figure 2: Simplified depiction of the composition of the extended dblp data set. Data is partial.

For the experiments in this paper, only the data contained in dblp as well as abstracts were taken into consideration. At the moment, full texts are only available for a certain small area in computer science, so using them could have distorted the outcome of our initial trials drastically.

2.3 Methodology
Of the enriched dblp data, only English publications whose abstract was of considerable length (≥ 10 words, fewer words indicate flawed data) were taken into account. The titles and abstracts were cleaned and stemmed with a Porter stemmer. Afterwards, LDA [4] with k = 100 was run on all 2.5 million remaining publications. We ignored terms occurring in over 50 percent of publications (collection-dependent stop words) or in fewer than 100 papers, as the latter are often system names.

A visualisation of the data enabled us to draw conclusions about the characteristics of topics.

2.4 Initial Observations
In Figure 3, the popularity of a topic in relation to all topics in the corpus per year is visualised for the years 1990 to 2015 for four selected topics. We assume the number of topics is appropriate. Different settings can be observed:

• There are subjects which are inactive and whose popularity rises so that they become active, like topic 12, which is about mobile devices.

• There are subjects which were always active and whose popularity increases, as seen in topic 13, which covers terms like management, knowledge and business.

• There are subjects whose popularity declines, as seen with topic 27, which includes papers concerning logic programming and reasoning.

• There are subjects whose popularity does not really seem to change over the course of years, such as topic 76, which deals with image processing.

(a) Overview of popularity of selected topics; topic distributions of papers are sliced by year. The size of a bubble indicates the relative importance of the topic in all papers from this year.

(b) Topic number with corresponding most important stems.

Topic   10 most important stems
12      mobil, devic, network, commun, peer, music, ad, hoc, messag, wireless
13      manag, knowledg, studi, inform, research, technolog, organ, busi, factor, effect
27      program, logic, fuzzi, oper, reason, gener, comput, base, languag, execut
76      imag, color, reconstruct, map, method, algorithm, base, render, resolut, pixel

Figure 3: Exemplary illustration of the development of selected topics over time and their associated stems, obtained by running LDA with k = 100 on the whole extended dblp data set.

In our data set, we found the case of a topic being active at a point in time but unrepresented by publications for a few following years. Later, it re-emerged. The topic's top keywords contained cloud, so early publications with a portion of this topic might have a background in weather, whereas the late publications which were (partly) assigned to the topic probably pick up on cloud computing.

The importance and number of active topics vary strongly throughout the years.

3. PROBLEM
Topics can be generated from a corpus by several probabilistic topic models. The most popular ones all have the significant weakness of an unchangeable number of topics. Before we dive into the problem, we present some existing methods.

3.1 Topic Models
The assignment of topics to papers can be performed by a number of approaches. The simplest one is Latent Dirichlet Allocation (LDA). Here, it is assumed that every document is a mixture of topics and every word in the documents comes from a specific drawn topic. No word is left unassigned or assigned to a residual topic. Hidden random variables contain information on the structure of topics in the documents. First, topic proportions for a document are drawn. After this step, for every position of a word in the document, a topic is drawn from this distribution. In the last part, actual words are drawn from the topic word distribution. LDA and the models building on it assume that documents are interchangeable in time. The number of topics k is fixed for a corpus and has to be chosen beforehand. The vocabulary of the corpus is also fixed [4].

A lot of approaches build upon LDA, such as the Author-Topic Model (ATM). Here, an additional dimension, the authors, is taken into account. The individual author codetermines the topic from which a word is drawn [18].

The correlation of topics was addressed with Correlated Topic Models (CTM). Here, LDA was modified so that instead of drawing topic distributions for documents from a Dirichlet distribution, they are taken from a logistic normal distribution [2].

The temporal aspect of a collection and the development of topics had been widely disregarded until the introduction of Dynamic Topic Models (DTM). This method extends CTM by dividing a corpus by year so the topic distribution can change over time. Topics in slice $t_{x+1}$ are derived from the topics in slice $t_x$. Words assigned to a subject are variable but k is still fixed. Information relating to authors is not used, but papers are no longer interchangeable [3].

3.2 Problem Description
The described methods cannot fully map the dynamics in a corpus, as the number of topics k is unchangeable. If data up until a point in time $t_x$ is used to generate a DTM, at time $t_{x+1}$ new publications can only be assigned to these already existing k topics. If DTM were run with new publications and k + n topics, the resulting topics would not necessarily represent the former k and n additional new ones even closely. Changing k slightly results in a different document topic distribution.

An easy way to capture the dynamics of topics would be to find a suitable k, perform LDA on the whole corpus, slice the corpus by year and look at topics changing over time as we did in our experiment. Trends could be found retrospectively. If new data is integrated, LDA could be run another time on all the publications. Again, trends could only be located in retrospect. Big disadvantages are the determination of k and the inability to map the topics of the first run to the topics of the subsequent runs, especially if k is incremented. Terms which get mapped to subjects shift and it is impossible to regain old patterns. It would be infeasible to measure whether the identification of future trends was successful.

Emergence, disappearance, splitting and merging of topics over the course of time cannot be modelled with existing probabilistic topic models. Changes in subjects are indicators for trends and should therefore be observed.

There are other approaches to find trends which make use of a number of other features: Asooja et al. utilise keyword distributions on textual information [1], Glänzel et al. work on citations and textual information [9], Salatino et al. observe a topic network built from connections between keywords, publications, authors, venues and organisations [19].

Current methods usually only use a small portion of the spectrum of available data. A model which incorporates authors, affiliations as well as scientometric measures [20, 13, 10], publication information such as citations [17] and venues in addition to titles, abstracts, full texts, keywords and topics has the potential to detect trends reliably.

4. HYBRID APPROACH
Our theoretic approach is based on the assumption that there are different topic state transitions. They need to be represented by our model.

4.1 Evolution of Topics over Time
We identified possible state transitions with which the evolution of topics can be described; they are shown in Figure 4. There are six distinguishable forms: Case a) shows a topic which does not significantly change, b) shows the split of a topic $s_i$ into possibly numerous topics $s_{i'}, \dots, s_{i''}$ that are somewhat coherent, or the emergence of a topic $s_{i''}$ from an already existing (and persisting) topic $s_i$, c) shows the merging of possibly numerous disconnected topics $s_i, \dots, s_j$ into one, d) shows a vanishing topic, e) shows the birth of a new topic and f) shows a combination of cases d) and e) with the particularity that the topic $s_i$ is inactive for a span of time and then re-emerges as the same topic. The different transitions can be combined freely.

Figure 4: Possible state transitions of topics $s_i$ over time $t$ (time axis: $t_x$, $t_{x+1}$, $\dots$, $t_{x+y}$).
a) $s_i \to s_i$
b) $s_i \to s_{i'}, \dots, s_{i''}$
c) $s_i, \dots, s_j \to s_{ij}$
d) $s_i \to s_\emptyset$
e) $s_\emptyset \to s_i$
f) $s_i \to s_\emptyset \to s_i$

An example for a) could be the image topic we already encountered in Figure 3. The distribution of words in the topic surely changes over time, because the fundamental terms vary, though the overall motive in them stays the same. As an instance of case b), algorithms concerning depth first search could be the base from which other algorithms, such as ones for the computation of strongly connected components, were derived. The original topic persisted while new ones were emerging from it. A topic describing machine learning might be a good example of case c). Many areas treating algorithms are collapsing into this big one, as machine learning has the potential to outperform even the most refined hand-crafted approaches. If a topic describes RSA, it could fall into category d): if it is no longer considered safe, publications concerning this subject are most likely going to decrease over the next years until the topic is inactive. This is a good candidate for the forming of a sect. The development of a topic for quantum computers could be mapped to case e), as it marked, to some extent, the birth of this topic in computer science. There certainly were influences from different communities on the subject, but in a corpus restricted to information technology, the representation might be fitting. As neural networks are currently experiencing a renaissance, they are an example of f).

4.2 Hybrid Topic Model
Our future model needs to be able to find and represent all described transitions of topics. In the following, we explain the core components of a hybrid model.

The rough plan would be to split $t$ into years and use LDA to generate a baseline of topics for $t_0$. For every new year, the topics of the prior year need to be considered when calculating the current developments. Citations are a key part in this as they indicate how information is being spread. At time $t_{x+1}$, we examine $k_x$ as well as $\bar{k}_x$ and observe co-authorships, used words and how new publications cite already classified papers. By looking at the topic distributions of the cited papers and summing the percentages for each topic, it can be calculated which topics are cited by a new paper and with which weights. With, for example, the Wasserstein metric [8], the distance $dist_{td}$ between the term distributions of topics is calculated. A threshold $th_{td}$ describes the distance value above which topic term distributions are considered dissimilar.

For every topic, the following strategies decide which state transition has occurred from $t_x$ to $t_{x+1}$:

a) In the first case, there is no major change in underlying motives from $t_x$ to $t_{x+1}$. Publications in this topic reference about the same topics that were cited at $t_x$, and $dist_{td} < th_{td}$. The content of cited publications is typically very similar to the content of the new ones.

b) In this situation, we have the same phenomena as in case a), but a clustering on publications of this topic produces multiple distinguishable groups which are regarded as new topics split from the old one, with $dist_{td} > th_{td}$ amongst the new topics. New words are likely to occur in the publications. If they solely appear in the papers from this area and not throughout the whole corpus, they strongly hint at a change or split in the topic.

c) If a merging of topics occurs, the witnessed effects will resemble those of case a), although publications which would have been assigned to the prior topics harmonise their term distributions and citation behaviour. A clustering would group the topics together.

d) A dying topic gets no or few new publications assigned to it. The number of papers in this topic might already have been declining for a few years. A topic becoming inactive all of a sudden is highly unlikely.

e) If a new topic emerges, its publications do not really match the term distributions of existing topics. They usually cite a lot of different topics as they have no clear predecessor. The overlap of content between a new publication and the papers (not topics) it cites should be calculated, as it is expected to be rather small.

f) With the sudden re-emergence of a topic, the term distribution of publications matches a topic in $\bar{k}_x$.

After the topic distributions for the new publications are computed, the then active and inactive topics are assigned to $k_{x+1}$ and $\bar{k}_{x+1}$ respectively. A run concludes with the processing of the next year of papers in the same manner.

4.3 Topic Development Prediction and Trend Mining
Predicting the development of a topic is directly linked to trend mining. Topics which are about to grow rapidly are future trends. The upcoming number of publications in a field, the estimation of citations a new paper is going to gain [17] and possible collaborations between researchers can only be computed if the underlying author-publication-graph of the past is thoroughly analysed and influences on its evolution are discovered.

The computation of trends in currently active topics is a step which follows directly from the hybrid topic model. Topics which changed a lot from $t_x$ to $t_{x+1}$ are candidates for trends. Not only the development of topics from the last to the current time frame is going to be observed; the overall behaviour of the term distributions and cited topics is relevant as well. The appearance of new and popular words in the assigned terms of a topic could signal the beginning of a trend and is worth further investigation.

Often, popular papers are written by well-known and highly linked authors, appear in journals with a lot of impact or are presented at seminal conferences. Here, the enriched data is going to be used. A co-author-graph with researchers' affiliations linked to a paper-citation-graph complete with venues and relationships between journals and conferences could help discover core persons [7], venues and publications in topics and trends. Sometimes, trends also develop from sects, so these have to be observed continuously. Topics which were active in $t_{x+1}$ are judged on whether they are likely to be trending in the future. The evolution can be predicted based on the progress of the topic and the influences found.

5. FUTURE PROSPECTS
After completing the construction of our hybrid approach, an evaluation of the proposed system needs to prove and quantify its validity. Furthermore, several practical uses for the model are presented.

5.1 Evaluation Plan
The evaluation of our planned system, which includes the trend mining part, contains multiple steps. The results need to be cross-validated.

Our hybrid model is going to be run on a base of data up until 1995, then topic developments are computed by the iterative part with data for the next 10 years. For the following 5 years, trends are predicted. Afterwards, a manual evaluation of our model and the trends found involves expert researchers from different domains within computer science. A list which contains our results is presented to them. They should rate it against the real trends with corresponding years.

Additionally, the trends, important researchers and venues identified by our system will be presented to those experts. They then should rank the correctness of the findings.

An automatic method to quantify the accuracy of the model would involve the observation of data up until a time $t_x$. Potential trends at this time will be detected, their evolution and future importance are going to be predicted for the succeeding five years and the predictions will be compared to the real development of significance of these topics. Numbers of papers from topics and citation behaviour could be forecast. If there are discrepancies between predicted and real data, a manual step could be added in which experts are asked to explain the actual development.

The hybrid approach also needs to be tested against the purely incremental model which does not use LDA with a predetermined k as first step.

5.2 Applications
Possible applications of the dynamic topic model with varying number of topics, complete with the identification of trends, are manifold. A reviewer recommendation system for given publications, a citation recommendation system, a keynote speaker recommendation system or a visualisation tool for exploring bibliographic data with special focus on trends could be constructed.

Some reviewer recommendation systems work on word topic and topic citation distributions [11] or are only usable for already established conferences as they use former program committees [23]. Others are more refined and aim to integrate the research interest and direction of scientists into the recommendations [16, 12]. Our model is independent of past conferences. It could make use of the enriched author-publication-graph to find scientists capable and willing to review new publications from the field of their current research interest. As the available data for this task is extensive, the results could be excellent.

Citation recommendation systems suggest fitting publications based on their content, but they do not focus on returning fundamental papers which lead the way of a topic or those written by influential authors for an area [11]. The relative importance of a paper for an area and its development is not considered. With our hybrid model, the identification of influential papers and persons is a by-product and could be easily incorporated in such a system.

Keynote speakers for a conference from topic $s_i$ should be influential scientists from a different topic $s_j$ which is related to $s_i$. A linkage of the topics could be predicted if the term distributions of the topics harmonise or one topic adopts words from the other area. The findings in one topic could highly benefit the other. Our model contains this information, so it could be used for this application.

A visualisation tool for the exploration of found topics, relationships and trends in the data would be beneficial for researchers, politicians and entrepreneurs [5]. Past work on the exploration of topics or trends in bibliographic data sometimes lacks support for growing and big data sets [14] or is based on a topic model with a fixed number of topics [6]. A tool using our model and data would inherently avoid these weaknesses.

6. CONCLUSION
This work proposed a hybrid approach which aims at modelling the agile evolution of topics and trends in a growing corpus of bibliographic data without a fixed and predefined number of topics, with the help of an LDA base. Different state transitions were used to describe the development of topics over time in detail. A link to trend mining was drawn. The work concludes with the presentation of an evaluation concept to confirm the utility of the approach and numerous examples of use to underline the potential of our future model.

Acknowledgements
Special thanks go to my supervisor Ralf Schenkel for his invaluable support.

7. REFERENCES
[1] K. Asooja, G. Bordea, G. Vulcu, and P. Buitelaar. Forecasting emerging trends from scientific literature. In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016, 2016.
[2] D. M. Blei and J. D. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS 2005, December 5-8, 2005, Vancouver, British Columbia, Canada], pages 147–154, 2005.
[3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 113–120, 2006.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] J. Boyd-Graber, Y. Hu, and D. Mimno. Applications of topic models. 11:143–296, 01 2017.
[6] A. J. Chaney and D. M. Blei. Visualizing topic models. In Proceedings of the Sixth International Conference on Weblogs and Social Media, Dublin, Ireland, June 4-7, 2012, 2012.
[7] A. Fiallos Ordoñez, K. Jimenes, C. Vaca, and X. Ochoa. Scientific communities detection and analysis in the bibliographic database: Scopus, 04 2017.
[8] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. INTERNAT. STATIST. REV., pages 419–435, 2002.
[9] W. Glänzel and B. Thijs. Using 'core documents' for detecting and labelling new emerging topics. Scientometrics, 91(2):399–416, 2012.
[10] D. Herrmannova and P. Knoth. Semantometrics: Towards fulltext-based research evaluation. In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, JCDL 2016, Newark, NJ, USA, June 19 - 23, 2016, pages 235–236, 2016.
[11] W. Huang, Z. Wu, P. Mitra, and C. L. Giles. Refseer: A citation recommendation system. In IEEE/ACM Joint Conference on Digital Libraries, JCDL 2014, London, United Kingdom, September 8-12, 2014, pages 371–374, 2014.
[12] J. Jin, Q. Geng, Q. Zhao, and L. Zhang. Integrating the trend of research interest for reviewer assignment. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017, pages 1233–1241, 2017.
[13] P. Knoth and D. Herrmannova. Towards semantometrics: A new semantic similarity based measure for assessing a research publication's contribution. D-Lib Magazine, 20(11/12), 2014.
[14] B. Lee, G. Smith, G. G. Robertson, M. Czerwinski, and D. S. Tan. Facetlens: exposing trends and relationships to support sensemaking within faceted datasets. In Proceedings of the 27th International Conference on Human Factors in Computing Systems, CHI 2009, Boston, MA, USA, April 4-9, 2009, pages 1293–1302, 2009.
[15] M. Ley. DBLP - some lessons learned. PVLDB, 2(2):1493–1500, 2009.
[16] X. Liu, T. Suel, and N. D. Memon. A robust model for paper reviewer assignment. In Eighth ACM Conference on Recommender Systems, RecSys '14, Foster City, Silicon Valley, CA, USA - October 06 - 10, 2014, pages 25–32, 2014.
[17] A. Livne, E. Adar, J. Teevan, and S. Dumais. Predicting citation counts using text and graph mining. February 2013.
[18] M. Rosen-Zvi, T. L. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI '04, Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence, Banff, Canada, July 7-11, 2004, pages 487–494, 2004.
[19] A. A. Salatino and E. Motta. Detection of embryonic research topics by analysing semantic topic networks. In A. González-Beltrán, F. Osborne, and S. Peroni, editors, Semantics, Analytics, Visualization. Enhancing Scholarly Data, pages 131–146, Cham, 2016. Springer International Publishing.
[20] S. Siebert, S. Dinesh, and S. Feyer. Extending a research-paper recommendation system with bibliometric measures. In Proceedings of the Fifth Workshop on Bibliometric-enhanced Information Retrieval (BIR) co-located with the 39th European Conference on Information Retrieval (ECIR 2017), Aberdeen, UK, April 9th, 2017, pages 112–121, 2017.
[21] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang. An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, pages 243–246, New York, NY, USA, 2015. ACM.
[22] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. Arnetminer: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 990–998, New York, NY, USA, 2008. ACM.
[23] H. D. Tran, G. Cabanac, and G. Hubert. Expert suggestion for conference program committees. In 11th International Conference on Research Challenges in Information Science, RCIS 2017, Brighton, United Kingdom, May 10-12, 2017, pages 221–232, 2017.
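
As a rough illustration of the decision rules in Section 4.2, the sketch below labels state transitions between two consecutive years from term distributions alone. It is a strongly simplified reading under several assumptions and not the planned model: citation behaviour and co-authorships are left out, total variation distance stands in for the Wasserstein metric, new publications are assumed to be already clustered, and all names (`tv_distance`, `classify_transitions`, the topic identifiers and the probabilities) are hypothetical.

```python
from typing import Dict, List

# Assumed representation: a topic or cluster is a mapping stem -> probability.
TermDist = Dict[str, float]


def tv_distance(p: TermDist, q: TermDist) -> float:
    """Total variation distance; a stand-in for the Wasserstein metric."""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)


def classify_transitions(
    active_prev: Dict[str, TermDist],    # active topics at the previous time
    inactive_prev: Dict[str, TermDist],  # inactive topics at the previous time
    clusters_now: Dict[str, TermDist],   # clusters of the new publications
    th_td: float,                        # distance threshold for "dissimilar"
) -> Dict[str, str]:
    """Label previous topics and new clusters with cases a) to f)."""
    labels: Dict[str, str] = {}
    successors: Dict[str, List[str]] = {s: [] for s in active_prev}
    merged = set()

    for c, dist_c in clusters_now.items():
        close = [s for s, d in active_prev.items()
                 if tv_distance(dist_c, d) < th_td]
        if len(close) > 1:
            # One cluster resembles several old topics: they merged.
            labels[c] = "c) merge"
            merged.update(close)
        elif len(close) == 1:
            successors[close[0]].append(c)
        elif any(tv_distance(dist_c, d) < th_td
                 for d in inactive_prev.values()):
            labels[c] = "f) re-emergence"
        else:
            labels[c] = "e) birth"

    for s, succ in successors.items():
        if s in merged:
            labels[s] = "c) merge"
        elif not succ:
            labels[s] = "d) vanishing"
        elif len(succ) == 1:
            labels[s] = "a) unchanged"
        else:
            labels[s] = "b) split"
    return labels


# Toy input with made-up probabilities (hypothetical, for illustration).
active = {"s1": {"imag": 0.6, "pixel": 0.4},
          "s2": {"logic": 0.7, "reason": 0.3}}
inactive = {"s3": {"cloud": 0.8, "weather": 0.2}}
clusters = {"c1": {"imag": 0.5, "pixel": 0.5},
            "c2": {"cloud": 0.8, "comput": 0.2},
            "c3": {"quantum": 0.6, "qubit": 0.4}}

labels = classify_transitions(active, inactive, clusters, th_td=0.3)
```

In a fuller version, the citation weights per topic and a clustering step over the new publications would feed into the same decisions, and the sets of active and inactive topics for the next year would be updated from the resulting labels.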