Exploring the Evolution of Science with Pivot Topic Graphs Ke Li Hubert Naacke Bernd Amann LIP6, CNRS, Sorbonne Université LIP6, CNRS, Sorbonne Université LIP6, CNRS, Sorbonne Université Paris, France Paris, France Paris, France ke.li@lip6.fr hubert.naacke@lip6.fr bernd.amann@lip6.fr ABSTRACT period between 2000 and 2006 decomposed into three overlap- In this article we propose a data model for the visualisation ping time periods (3 year periods with one year overlap). Each and exploration of topic evolution networks representing the topic is represented by a rectangle containing the top-10 topic research progress in scientific document archives. Our model terms obtained by a simple NLP document pre-processing step. is independent of a particular topic extraction and alignment Emerging terms are shown in green, decaying term boxes are method and proposes a set of semantic and structural metrics for colored in red, stable terms which exist both, in ancestor topics characterizing and filtering meaningful topic evolution patterns. and in descendant topics, are grouped in a blue boxe and specific These metrics are particularly useful for the visualization and terms which appear only in the current topic are in white. The the exploration of large topic evolution graphs. We also present thickness of the alignment edges reflects the similarity of the a first implementation of our model on top of Apache Spark and connected topics. Several topics in both subgraphs contain the experimental results obtained for three well-known document term "database" and we can observe different evolution patterns. archives. The left subgraph shows that in period 2002 − 2004, topic 77 ("databases, queries, optimization, integration") splits in two re- KEYWORDS search directions "databases, queries and constraints" (topics 100, 188) and "prediction, probability, random" (topics 104, 191, 152). Topic Modeling, LDA, Science Evolution, Big data The right subgraph covers the same period with topics related to "data mining" (83), "data access interfaces" (90), "information re- 1 INTRODUCTION trieval" (92), "logics, semantics" (80) and "knowledge, reasoning" (54). The first three topics converge in 2002 − 2004 into a single There is an increasing demand for practical tools to explore topic on "object, xml, store, data mining" (146) which splits in the the evolution of scientific research published in bibliographic period of 2004 − 2006 into "storage servers" (170), "data mining archives such as the Web of Science (WoS), ISTEX, arXiv or and management" (158) and "knowledge and ontologies" (150). PubMed. Revealing meaningful evolution patterns from docu- Building such meaningful topic evolution networks is still dif- ment archives has many applications and can be used to synthe- ficult and needs an important expertise in statistical text mining. size narratives from datasets across multiple domains, including A first challenge for domain experts is to correctly tune method news stories, research papers, legal cases and works of litera- specific hyper parameters with respect to a given dataset and ture [16]. an expected output. A second challenge concerns the visual ex- The cognitive view of scientific evolution emphasizes the shared ploration of large topic evolution networks. Whereas existing knowledge and the change of ideas present in the document con- graph visualisation standards and tools like Gephi 3 or Graphviz 4 tents [13], whereas the social view takes account of authorship can be used to generate high-quality visualisations, their use for information and social interactions represented, for example, in exploring large graphs and identifying meaningful evolution co-authorship and citation graphs [8, 17]. Bibliographic archives patterns is still limited. In this article we propose a generic evo- often include both kinds of information and there also exist meth- lution network computation and visualization framework which ods which combine both views to study science evolution [9]. combines a high-level data model with big data technology for In the interdisciplinary EPIQUE project1 , we adopt the cogni- extracting and exploring topic evolution networks. The graph tive view for modeling science evolution and assume that the model relies on the notion of pivot topic graphs, which describe evolution only depends on the textual document contents (ti- the contents and the evolution dynamics of topics at different tle, abstract, main contents). This choice reduces the expressive levels of detail. The model also includes a number of high-level power of our evolution model, but it also decreases the "social" semantic metrics which enable domain experts to specify mean- bias and detects more easily possible interactions between sci- ingful topic evolution patterns (queries) for exploring large topic entific ideas and contributions, independently of any particular evolution networks. scientific community. The remainder of this paper is organized as follows. The next Graph-based topic evolution analysis builds on topic evolu- section introduces the related work on topic evolution models and tion networks which track complex temporal evolution dynamics is followed by Section 3 which defines the EPIQUE topic evolution by periodical topic discovery and similarity-based topic align- model including the evolution pattern metrics and a simple query ment. Figure 1 shows two snippets of a single topic evolution language. Section 4 describes our workflow and gives an outline graph extracted from the arXiv2 corpus. The graph covers the of the algorithms for building topic evolution networks. Section 5 1 This work was funded by French ANR-16-CE38-0002-01 project EPIQUE illustrates some experimental results obtained by applying our 2 https://arxiv.org/ evolution pattern metrics on three different document archives. The final section presents our conclusions and outlines future © 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed- work. ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- 3 https://gephi.org/ tribution 4.0 International (CC BY 4.0) 4 https://www.graphviz.org/ Figure 1: Pivot topics containing term "database" extracted from arXiv, green = emerging terms, blue = stable terms, red = decaying terms 2 RELATED WORK alignment methods. [5] comes up with a method to enable a Topic modeling is a text mining task which consists in extract- bottom-up reconstruction of the dynamics of scientific fields. ing a compact representation of the content of a collection of They generate topics by word co-occurrence graphs and align documents. Statistical topic models like LDA [4] define a proba- inter-temporal topics by Jaccard similarity [11]. [1] generates bilistic procedure to generate documents as mixtures of a low- topics by a Hierarchical Dirichlet Process (HDP) [18] and uses dimensional set of topics. The goal of dynamic topic model [3, 19, Bhattacharyya similarity [2], representing the gradual speciation 20] is to capture the evolution of topics in a sequential document and convergence similar to biologic evolution, for identifying corpus. They generally achieve better predictive accuracy than topic alignments. The alignment process also applies (asymmet- static topic models which ignore the temporal dimension. In our ric) Kullback-Leibler divergence (KLD) for detecting topic split work we use a simple topic evolution model where documents and merge. [15] introduces a novel approach to the early detec- are organized into possibly overlapping time periods and LDA is tion of research topics by using the Computer Science Ontology5 applied on each document sub-collection independently from the to model research topics in the Rexplore system. They apply a other time slices. As our experiments show, the results we obtain Clique Percolation Method (ACPM) for analyzing the dynamics with this strategy are already of good quality and the integration between existent topics. Other examples of science evolution of dynamic topic models is an open future work. studies explore how "cognitive science" as a field has changed Topic detection and trend analysis systems [12] aim to iden- over the last three decades [7] or analyze topic evolution patterns tify and follow event-based topics across incoming streams of (split, merge and knowledge transfer) in the field of Information documents. Usually, a tracking system is given seed documents retrieval (IR) [6]. to monitor the document stream for further documents on the same topic, whereas a detection system performs unsupervised 3 TOPIC EVOLUTION MODEL clustering of the incoming document stream. For example, Hu This section presents the topic evolution model implemented in et. al. [10] applied LDA and regression analysis to identify differ- the EPIQUE workflow. The model is based on a multi-stage graph ent topic evolution patterns for preprints and papers from arXiv representation of topic evolution networks and introduces the and the Web of Science (WoS) in astrophysics for the last 20 years notion of pivot evolution graphs with appropriate evolution pat- (1992 − 2011). The paper redefines the notion of topic trend and tern metrics. The sections also presents a simple query language popularity, and demonstrates that open access preprints have for searching meaningful evolution patterns. stronger growth tendency as compared to traditional printed publications. Topic Evolution Graphs. We consider a corpus C of time-stamped Trend analysis describes the temporal evolution of the popu- documents and a list P of periods. Cp denotes the corpus of doc- larity, the utility or interest of topics, but does not take account uments having their timestamp in period p ∈ P. We consider of their structural evolution where one topic can evolve into set of topics T and denote Tp ⊆ T the topics extracted from the several sub-topics or several topics can merge into a single topic. documents in Cp . A topic t = (v, p) ∈ Tp is defined by a (sparse) Topic evolution networks represent this kind of structural topic weighted term vector v ∈ R |V | where V is a vocabulary of terms. evolution and track complex temporal changes by periodic topic We will denote by t .v the terms and by t .p the period of t. We discovery and directed acyclic networks aligning topics of dif- also define a function sim : T × T → [0, 1] estimating the simi- ferent periods. Existing evolution network based frameworks larity between topics in T . The similarity measure depends on mainly can be distinguished by the chosen topic extraction and 5 http://cso.kmi.open.ac.uk/ the term vectors of topics and estimates their semantic proxim- with respect to the pivot topic t. ity. For example in Figure 1, P has 3 periods: p1="2000 − 2002", pevol(G β (t)) = 1 − avдti ∈T ′ (sim(t, ti )) p2="2002 − 2004" p3="2004 − 2006", Tp1 contains topics 54 to 92, and topic 92 = (v, p1), where v is a weighted vector containing A low pivot evolution degree signifies that the pivot topic does terms "queri", "optim", "databas", ... with a positive weight. The not evolve much (all other topics are similar), whereas a high similarity between topic 77 and topic 100 is sim(77, 100) = 0.74. value indicates that the pivot topic evolves rapidly . Based on the previous definitions, we define a topic evolution - The split degree split(G β (t)) of a pivot topic (t, β) is defined graph as a directed labeled multistage graph G β = (T , E, sim, β) by the average outdegree of G β (t). overT where the edges E connect topics from consecutive periods |E ′ | and their similarity is higher or equal to some threshold β. That split(G β (t)) = |{ti |ti ∈ T ∧ outdeд(ti ) > 0}| is, E = {(ti , t j ) ∈ T |sim(ti , t j ) ≥ β ∧ t j .p = ti .p + 1}. A low value signifies that the topics evolve along linear paths Pivot Evolution Graphs. Threshold β strongly influences the and a high value signifies that the topics split into several complexity of the obtained evolution graphs. It is easy to see that future sub-topics. G β is a subgraph of G β for all β ′ ≥ β and G 0 is the complete - The convergence degree conv(G β (t)) of a pivot topic (t, β) is ′ graph connecting all topics of two consecutive periods. More defined by the average indegree of G β (t). exactly, higher β values generate more "linear" graphs with many |E ′ | isolated topics, whereas lower values generate more complex conv(G β (t)) = |{ti |ti ∈ T ∧ indeд(ti ) > 0}| and heterogeneous graphs containing a variety of potentially A low value signifies that many topics depend on a single interesting structures. Analyzing science evolution by using topic parent topic and a high value signifies that many topics are evolution graphs then becomes a complex task which consists in the result of the fusion of past topics. computing and visually exploring multiple graphs for different β values. Topic labeling. All topics t of some evolution graph G β are To solve this problem, we propose a different approach which labeled by the top-k highest weighted terms t .l in the topic term allows users to formulate queries for characterising and extracting vector t .v. Let t .lp ⊆ t .l and t .l f ⊆ t .l be the sets of past and interesting subgraphs from a set of evolution graphs defined by future terms which appear, respectively, in the ancestor topics a set of β thresholds. For this, we decompose topic evolution and in the descendant topics of t. Then, the terms in some topic graphs into the set of all connected subgraphs defined by all vector t .l are partitioned into the following four subsets of : paths containing a given topic t (one graph per topic). More • emerging future terms t .le = t .l f − t .lp which do not exist in formally, a pivot evolution graph G β (t) = (T ′, E ′, sim, β) of topic past topics, t in G β is the subgraph of G β which contains t and all ancestors • decaying past terms t .ld = t .lp − t .l f which do not exist in and descendants of t. The subgraph of G β (t) containing all nodes future topics, which are reachable from t is called the future of t, denoted by • stable terms t .lд = t .lp ∩ t .l f which exist in the past and the F β (t), and the subgraph of nodes which reach t is called the future topics of t, and past of t, denoted by P β (t). The couple (t, β) is called a pivot • specific terms t .ls = t .l − (t .lp ∪ t .l f ) which neither exist in the topic with pivot graph G β (t), future F β (t) and past P β (t). It is past nor in the future topics of t. easy to see that if t 1 appears in the future (past) of t 2 , then the The quadruple [t .le , t .ld , t .lд , t .ls ] is called the term label of t. future (past) of t 1 is a subgraph of the future (past) of t 2 and t 2 appears in the past (future) of t 1 . This property can be exploited Pivot Topic Query Language. Liveliness, relative evolution de- to filter topics wrt. future and past topics (see the definition of gree, pivot evolution degree, split degree and convergence degree Path Filters below). allow to characterize the amount and complexity of the evolution The evolution of topics within their evolution graphs can be of a topic in some evolution graph G β . Combined with other characterized by the following metrics: filters on the topic labels and the graph structure, it is possible to filter pivot topics satisfying rich evolution patterns within a - The liveliness live(G β (t)) of a pivot topic (t, β) is defined by set of evolution graphs G βi , 1 ≤ i ≤ n. the diameter of its pivot graph G β (t). Let DB be the union of all future and past pivot topic graphs live(G β (t)) = max {lenдth(p)|p = path in G β (t)} F β (t) and P β (t). Operators can be composed by concatenation (similar to Scala methods). For each attribute A we define a filter A high liveliness value describes a long living topic, whereas A(X ) where X is a valid value, and for each ordinal attribute the a value equal to 0 corresponds to an isolated topic without filter A(X, Y ) contains a second attribute Y restricting X to be ancestors and descendants. the minimal (Y = 0) and maximal (Y = 1) value respectively. - The relative evolution degree revol(G β (t)) of a pivot topic (t, β) • Term Filters select pivot graphs with respect to the pivot topic is defined by the average topic dissimilarity (edge) weight in labels. In particular, they can be applied to filter pivot graphs G β (t). wrt. to their emerging, decaying, stable, and specific terms. revol(G β (t)) = 1 − avд(ti ,t j )∈E ′ (sim(ti , t j )) Find all pivot topics where the term "deep learning" is emerging and the term "big data" is decaying: A low relative evolution degree states that most topics evolve DB . Emerge ( " deep ␣ learning " ) . Decay ( " big ␣ data " ) slowly. On the other hand, a high value signifies that most topics have an important "semantic gap". By definition, we • Temporal Filters allow the expert to filter all topics situated have revol(G β (t)) ≤ 1 − β. within a certain time period. Find all topics between 2012 and - The pivot evolution degree pevol(G β (t)) of a pivot topic (t, β) 2017: is defined by the average dissimilarity of all topics in G β (t) DB . Period ( 2 0 1 2 , 0 ) . Period ( 2 0 1 7 , 1 ) • Pattern Filters can filter topics by their pivot graph structure along their liveliness, split degree and convergence degree. Find all pivot graphs which cover at least 6 periods where all topics split into at least three subtopics in average: DB . Live ( 6 , 0 ) . S p l i t ( 3 , 0 ) • Evolution Filters are applied to filter topics by their relative and pivot evolution degrees. Find all topics which are evolving "slowly": DB . Revol ( 0 . 3 , 1 ) The previous filters are applied to sets of pivot topics and can be combined with other operators: • Temporal Projection allows to project the pivot graph to its Figure 2: Topic evolution model of EPIQUE past and future. Find all pivot topics with a linear future of a minimal length of 3 periods: DB . Future . Live ( 3 , 0 ) . S p l i t ( 1 , 1 ) period. We will illustrate in our experiments how experts can be • Set Operations (union, intersection, minus) combine sets of assisted in choosing the right number of topics. topics. Find all topics with decaying term "big data" and without The topic-term vectors generated by the topic extraction step emerging term "deep learning": are combined with an appropriate similarity measure for aligning topics from different periods. In our experiments, we use cosine DB . Decay ( " big ␣ data " ) similarity which performs well in measuring the correlations be- . Minus ( DB . Emerge ( " deep ␣ learning " ) ) tween sparse vectors. Observe that the choice of LDA and cosine • Path Filters select pivot topics by the existence of a path to/from similarity does not exclude the use of other topic models and other topics. Find all topics with an emerging term "deep learn- similarity measures like Jaccard similarity [11] or Battacharya ing" where the future contains a path to a topic with the decaying similarity [2]. term "big data": The next step produces instances of topic evolution graphs fol- DB . Emerge ( " deep ␣ learning " ) lowing the model introduced in Section 3. The topics are aligned . Future . Path ( DB . Decay ( " big ␣ data " ) ) to generate a single topic evolution graph G β0 for some small alignment threshold β 0 (see the central part of Figure 2). This • Ordering: Find all topics about "big data" ordered by period: global evolution graph is then transformed into n families of DB . Term ( " big ␣ data " ) . S o r t ( " Period " , " asc " ) pivot evolution graphs defined by a set of alignment thresholds βi > β 0 , 1 ≤ i ≤ n. Each family contains the pivot graphs G βi (t) 4 IMPLEMENTATION of all pivot topics (t, βi ). We only consider pivot graphs with This section presents the implementation of the topic evolution at least one edge and ignore isolated pivot topics (single node model in the context of the EPIQUE project. Our implemented graphs). The final database then contains at most n × |T | pivot workflow is able to handle any scientific document corpus where graphs where |T | is the number of topics in G β0 . These graphs each document has a publication date and some text content (title, can then be queried using the filters defined in Section 3 and we abstract, keywords, ...). will illustrate the result of some queries in Section 5. EPIQUE Workflow. Figure 2 details the main steps of the EPIQUE Pivot Graph Computation. We first present an algorithm that workflow. The workflow starts with a standard document pre- computes G βi for a sequence of relative evolution bounds β 0 , processing step composed of some lexical analysis, stop-word β 1 , ..., βn where βi < βi+1 . The basic idea of this algorithm is removal, stemming, index term generation and term selection. to exploit the monotonicity property that G βi +1 is a subgraph The main goal of this step is to extract for each document a term of G βi for all 0 ≤ i < n. The input of the algorithm is a set of index which precisely describes the scientific document contents. time-stamped topics T , a similarity function sim and a sequence The document preprocessing step is followed by corpus periodiza- of βi values. Topics are represented as a binary table Topics(t, a) tion step which decomposes the document collection according storing the topics t and their periods a, the sequence of βi values to several continuous, possibly overlapping, time windows, i.e., is defined as a binary table Beta(b, i) where b = βi and the the same document may appear in two periods. Each time win- similarity function is defined as a table Sim(x, y, s) where s = dow defines a corpus period, which is the subset of documents sim(x, y). The following recursive Datalog program computes all published during the corresponding time period. The choice of G βi as a relational table Graph(x, y, s, a, i) connecting all topics the window size and sliding step depends on the granularity of x of period a to all topics y of period a + 1 where there exists an the document time-stamps (year, month, day) and on the number evolution edge of similarity s = sim(x, y) ≥ βi . of available documents in each period. Graph ( x , y , s , a , 0 ) : − Topics ( x , a ) , Topics ( y , a + 1 ) , In the following step, each corpus period is analyzed by a topic Sim ( x , y , s ) , Beta ( b , 0 ) , s >=b model. In our implementation, we use LDA [4] to extract the Graph ( x , y , s , a , i ) : − Graph ( x , y , s , i − 1 ) , topics of each corpus period. The main output of LDA is a topic- Beta ( b , i ) , s >=b term matrix describing each topic as a weighted term vector. LDA requires to set the number of topics to be generated in advance. Starting from Graph we can compute all pivot topic evolution Tuning this parameter is important and subtle because it strongly graphs for all topics in T and beta value βi . This is done by influences the diversity of the topics generated for each time generating first a table TC(x, y, s, l, a, i) containing the transitive closure of graph G βi where l is the distance between x and y 6 , a is the period of x and s is the similarity between x and y. TC( x , y , s , 1 , a , i ) : − Graph ( x , y , s , a , i ) TC( x , y , s , l + 1 , a , i ) : −TC( x , z , _ , l , a , i ) , Graph ( z , y , _ , _ , i ) , Sim ( x , y , s ) This table can then be used to compute the future and the past graph for each pivot topic p. Future ( p , x , y , r s , ps , l , a , i ) : −TC( p , x , _ , _ , a , i ) , TC( p , y , ps , l , _ , i ) , Graph ( x , y , r s , _ , i ) Past ( p , x , y , r s , ps , l , a , i ) : −TC( x , p , ps , l , a , i ) , TC( y , p , _ , _ , _ , i ) , Graph ( x , y , r s , _ , i ) Figure 3: EPIQUE web application architecture Graphs Past and Future contain the pivot topic evolution graphs of all topics where a tuple (p, x, y, rs, ps, l, a, i) represents an edge Architecture. Figure 3 gives an overview of the architecture of (x, y) in G βi (p) for pivot p in period a with relative evolution sim- our web application implemented on top of Apache Spark and ilarity rs, pivot evolution similarity ps of x (Past) and y (Future) Jupyter Notebook. The entire process to study science evolution and distance l of x (Past) and y (Future) from p. over a corpus is splitted into two steps for building the pivot Table size estimation. We can estimate the size of each table evolution graphs and for interactively exploring these graphs. as follows. Let p be the number of periods, t be the number of Each step corresponds to a separate user interface. The evolution topics per period and k be the maximal outdegree and indegree graph generation is implemented in Scala and exectued through in G βi . Then the size of Graph is bound by |Graph| ≤ k ∗ t ∗ the Spylon7 kernel. Evolution graph exploration uses a standard (p − 1) (remind that Graph is a multistage graph). The size of the Python kernel to take advantage of advanced Python 3 graphical transitive closure TC is bound by |TC | ≤ k ∗t ∗p ∗(p −1)/2. Finally, user interface libraries for facilitating user interaction. both graphs Future and Past, contain for each tuple (x, y, s, l, a, i) in the transitive closure TC (x and y are connected by a path of 5 EVALUATION length l in G βi ), at most k tuples Future(x, y, _, _, _, _) and at Experiment Setting. We conducted our experimentations on most k tuples Past(y, x, _, _, _, _). Therefore, the size of Future three real-world data sets of different scales by using the titles and Past is bound by k times the size of the transitive closure: and the abstracts of each document. The smallest dataset, called |Future | ≤ k 2 ∗ t ∗ p ∗ (p − 1)/2. As our experiments show, even ISTEX, contains 14 851 papers in the domain of ecological eco- for small β-thresholds (β = 0.2) the maximum indegree and nomics and environmental economics. The second one is arXiv, a outdegree of a topics is smaller than k = 10 and we generally repository of electronic preprints approved for publication after assume about t = 100 topics over p = 20 periods. Then, the size moderation, which consists of scientific papers in the fields of of the transitive closure is |TC | ≤ 10 ∗ 100 ∗ 10 ∗ 19 = 1.9 ∗ 105 mathematics, physics, astronomy, electrical engineering, com- edges and the size of |Future | ≤ 1.9 ∗ 106 edges. These numbers puter science, quantitative biology, statistics, and quantitative are much smaller in practice (see Section 5) and current big data finance, etc. This data set contains about 1.1 million documents. frameworks can easily manage graphs of this size. We plan to The third dataset is from the Wiley online library which contains study possible optimizations in the future. 1 million documents including additional domains such as agri- culture, art, humanities, etc. The statistics over these three data Metrics computation. The liveliness, relative evolution degree, sets are summarized in Table 1, where #D is the total number of pivot evolution degree, split degree and convergence degrees of documents, #T is the number of topics per period, #G is the total all pivot evolution graphs can directly be computed by a stan- number of pivot graphs, #E is the number of edges (total and per dard SQL aggregation query. For example, the following query pivot graph). computes these metrics for all future pivot evolution graphs: We run our EPIQUE workflow on an Apache Spark cluster in c r e a t e view P i v o t F u t u r e as standalone mode with Spark version 2.4, Scala version 2.11 and s e l e c t p , i max ( l ) as l i v e l i n e s s , Java version 8. The cluster consists of 11 machines: one driver 1−avg ( r s ) as r e v o l , has 20GB memory, and 10 worker nodes, each one having 24 1−avg ( ps ) as p e v o l , CPU cores with hyperthreading and 50 GB memory. For all of count ( ∗ ) / count ( d i s t i n c t x ) as s p l i t , our experiments, we used the documents extracted from a 20- count ( ∗ ) / count ( d i s t i n c t y ) as conv year period and split each corpus into 10 slices by using the time from F u t u r e window spanning 3 years with 1-year overlap. Thus, we have 10 group by p , i LDA models for each corpus. Column LDA in Table 1 shows the average execution time Observe that for evaluating the pivot query filters presented in for computing the LDA model per period. This computation is Section 3 without the Path-operator, it is sufficient to store Graph only done once and mainly depends on the number of extracted (for visualization) and PivotFuture, PivotPast and PivotAll (for topics per period. In Table 1, we can see that the Wiley corpus is filtering). An efficient implementation of Path queries, for exam- about 70 times larger than ISTEX but has a similar LDA execution ple by using graph-labeling schemes for checking node reacha- time for the same number of topics. On the other hand, the LDA bility in acyclic graphs, is part of our future work. execution time is more important for arXiv, where we extract a higher number of topics. 6 Since G β i is a multistage graph, all paths between two nodes are of the same length. 7 https://github.com/Valassis-Digital-Media/spylon-kernel Table 1: Dataset statistics Datasets #D Period #Periods #T #G #E #E LDA G G total total total / period total total per pivot sec / period sec (total) sec / pivot ISTEX 14 851 1991 − 2010 10 20 1806 211 850 117 28 490 0.27 arXiv 1 156 300 1998 − 2017 10 50 3364 272 944 81 40 641 0.19 Wiley 1 023 515 1996 − 2015 10 20 1360 198 179 145 30 505 0.37 The last two columns G give the total and average execution among which some become very complex. For example, in Fig- time for the pivot graph computation step. We computed the pivot ure 4g, pivot topics tend to evolve a lot even for a low relative graphs for 9 β-threshold values spanning from 0.1 to 0.9. The evolution degree. The pivot graphs in Figure 4h are much more total execution time obviously depends on the β-thresholds. The complex than the graphs generated by the same β-threshold with number of the thresholds increases the number of pivot graphs #T = 50 topics (Figure 4e and Figure 4f). As we can see, the split to be computed and the values of the thresholds defines the size degree attains a value of 19 compared with maximal split degree of the computed pivot graphs. The average execution time to 1.5 in Figure 4e. The increase of #T reduces the proportion of construct a pivot graph (last column) can be obtained by dividing isolated topics, 30% for #T = 150 compared with 60% for #T = 50. the total time to build all pivot graphs by the number of pivot As we see in the next section, this is also due to the existence graphs. For example, for ISTEX corpus, we obtained 1806 pivot of many similar topics in each period, which also increases the graphs in 490 seconds which gives the average value of 0.27 sec- probability that two topics can be aligned. onds. The average pivot graph computation time mainly depends on the pivot graph size. This can be seen for the arXiv dataset, Diversity-based Topic Number Selection. Figure 5 shows a fu- which has the smallest average number of pivot graph edges ture arXiv pivot graph generated for #T = 150 and β = 0.5, which (#E/pivot = 81) and the lowest average pivot graph computation corresponds to a data point in Figure 4h where split(G β (t)) = 8.3 time (0.19 seconds/graph). and conv(G β (t)) = 6.4. The graph connects topics with similarity higher than β = 0.5 and has nevertheless a high split and conver- gence degree. When looking in more detail, we can observe that Pivot Topic Analysis. The metrics defined in Section 3 can be the topics in each period are also very similar which explains used for the structural and quantitative analysis of the evolution why the single root pivot topic is connected to more than 20 of topics. The objective of this section is to explore the impact of topics in the second period. the main parameters,i.e., the β threshold and the topic number In order to build pivot graphs over more representative topic #T , on the structure and the semantics of the generated pivot sets, we use topic diversity for estimating the quality of a topic set. evolution graphs. The topic diversity inside a period can be estimated by observing Figure 4 shows the distribution of future pivot evolution graphs the dissimilarity distribution over all topic pairs inside the period. in arXiv wrt. three groups of metrics, the relative evolution degree For example, Figure 6a and Figure 6b shows the topic diversity vs. the pivot evolution degree , the split degree vs. convergence de- obtained for different LDA models applied to 1164 documents gree and the liveliness vs. the split degree. The figure is organized published in arXiv during 1998 to 2000 and 16 072 documents into 3 lines of 3 sub-graphs where each line corresponds to identi- published in arXiv during 2008 to 2010 respectively. Each LDA cal fixed parameters β and #T and each sub-graph corresponds to model corresponds to a different number of topics #T ranging a group of metrics. On the first line, we set β = 0.2 and #T = 50. from 10 to 150. For example, we can see in Figure 6a that for #T On the second line #T remains the same (#T = 50) whereas β is ranging between 40 and 60, less than 5 percent (blue line) of all increased to β = 0.5. On the third line, β remains the same as in topic pairs have a similarity value higher than 0.1 (dissimilarity the 2nd line (β = 0.5) whereas #T is increased to #T = 150. Each value lower than 0.9), but this diversity value rapidly drops for Figure only shows pivot topic graphs with at least two nodes and #T > 60. For example, for #T = 100, the same similarity bound the number of isolated topics is reported in the figure captions. of 0.1 only holds for the half of the topic pairs (green median When comparing Figure 4a with Figure 4d, we can see that for line). Figure 6b shows that for the larger corpus we can achieve the lower threshold β = 0.2, pivot topics evolve more than for the same diversity for much more topics (the topic diversity for the higher value β = 0.5. Lower β values also allow pivot topics #T = 140 topics in period 2008 − 2010 is similar to the topic to connect with more topics than higher β values which only diversity for #T = 60 topics in period 1998 − 2000. This kind connect similar topics. This is shown in Figure 4b which repre- of grid analysis allows experts to choose an optimal number of sents a large number of complex pivot topic graphs with higher topics for their analysis. split and convergence degrees than the pivot topic graphs in Fig- ure 4e. The previous observation is also confirmed in Figure 4c Pivot Topic Exploration. Our query language allows users to and Figure 4f which compare topic liveliness vs. split degree: the select topics in specific regions of the sub-figures in Figure 4. For lower threshold β = 0.2 generates pivot graphs which are more example, the following query Q1 chooses all topics which appear complex than pivot graphs with the same liveliness scores gen- in the upper right window of Figures 4a and on the right part of erated by β = 0.5. Therefore, for a fixed #T , varying β allows Figure 4b on the line corresponding to the liveliness value 5 in for revealing interesting evolution patterns at different levels of Figure 4c. detail where the evolution of some topic might be too complex for low β values and become more intelligible for higher β values. Q1 : = DB . Future . Revol ( 0 . 5 , 0 ) . Pevol ( 0 . 6 , 0 ) When the topic number per period increases (#T = 150 in Fig- . S p l i t ( 2 , 0 ) . Live ( 5 ) ures (g), (h) and (i)), the workflow generates more pivot graphs, (a) β = 0.2, #T = 50, #Pivot = 477, #Isolated = 23 (b) β = 0.2, #T = 50, #Pivot = 477, #Isolated = 23 (c) β = 0.2, #T = 50, #Pivot = 477, #Isolated = 23 (d) β = 0.5, #T = 50, #Pivot = 198, #Isolated = 302 (e) β = 0.5, #T = 50, #Pivot = 198, #Isolated = 302 (f) β = 0.5, #T = 50, #Pivot = 198, #Isolated = 302 (g) β = 0.5, #T = 150, #Pivot = 1014, #Isolated = 486 (h) β = 0.5, #T = 150, #Pivot = 1014, #Isolated = 486 (i) β = 0.5, #T = 150, #Pivot = 1014, #Isolated = 486 Figure 4: Distribution of future pivot evolution graphs in arXiv wrt. their metrics. Figure 5: G 0.5 (495) with #T = 150 (a) 1998-2000 (b) 2008-2010 Figure 6: Dissimilarity distribution by topic number (b) Q2: r evol ≥ 0.5 ∧ (c) Q3: r evol ≤ 0.4 ∧ (d) Q4: r evol ≤ 0.4 ∧ (a) Q1: r evol ≥ 0.5 ∧ pevol ≥ 0.6 ∧ spl it ≥ 2 ∧ l ive = 5 pevol ≥ 0.6∧split ≤ pevol ≤ 0.5 ∧ split ≥ pevol ≤ 0.5∧split ≤ 1.2 ∧ live = 5 1.5 ∧ live = 3 1.2 ∧ live = 5 Figure 7: Examples of query results which filter pivot topics by using aforementioned metrics Observe that the user does not specify the β-threshold. A result [5] David Chavalarias and Jean-Philippe Philippe Cointet. 2013. Phylomemetic example of query Q1 and three other queries is shown in Fig- patterns in science evolution—the rise and fall of scientific fields. PloS one 8, 2 (2013), e54847. ure 7. Although Figure 7b and Figure 7d have the same structure, [6] Baitong Chen, Satoshi Tsutsui, Ying Ding, and Feicheng Ma. 2017. Understand- they have different evolution pace (corresponding to different β ing the topic evolution in a scientific domain: An exploratory study for the field of information retrieval. Journal of Informetrics 11, 4 (2017), 1175–1189. values). The pivot graph in Figure 7b has more emerging terms [7] Uriel Cohen Priva and Joseph L. Austerweil. 2015. Analyzing the history of (green part) whereas the pivot graph in Figure 7d has more sta- Cognition using Topic Models. Cognition 135 (Feb. 2015), 4–9. ble terms (blue part) which correspond to our queries to select [8] Eugene Garfield. 1955. Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas. Science 122, 3159 (July 1955), high-evolution and low-evolution pivot topics respectively. 108–111. Apart from these metric-based filters, our query language also [9] Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, and Lee Giles. 2009. allows users to define other multi-dimensional filtering criteria Detecting Topic Evolution in Scientific Literature: How Can Citations Help?. In ACM Conf. on Information and Knowledge Management (CIKM ’09). New including topic labels and temporal conditions for the selection York, NY, USA, 957–966. of pivot topics. [10] Beibei Hu, Xianlei Dong, Chenwei Zhang, Timothy D. Bowman, Ying Ding, Staša Milojević, Chaoqun Ni, Erjia Yan, and Vincent Larivière. 2015. A Lead- lag Analysis of the Topic Evolution Patterns for Preprints and Publications. J. Assoc. Inf. Sci. Technol. 66, 12 (Dec. 2015), 2643–2656. 6 CONCLUSION AND FUTURE WORK [11] Paul Jaccard. 1912. The Distribution of the Flora in the Alpine Zone.1. New We have presented a new framework for the visualisation and ex- Phytologist 11, 2 (1912), 37–50. [12] April Kontostathis, Leon M. Galitsky, William M. Pottenger, Soma Roy, and ploration of topic evolution networks representing the progress Daniel J. Phelps. 2004. A survey of emerging trend detection in textual data and evolution of research in scientific document archives. This mining. In Survey of text mining. Springer, 185–224. framework has been implemented on top of Apache Spark using [13] Thomas S. Kuhn, Otto Neurath, and Thomas Samuel Kuhn. 1994. The Structure of scientific revolutions (2nd ed., enlarged ed.). Number ed.-in-chief: Otto Neu- LDA and cosine similarity for topic extraction and topic align- rath ; Vol. 2 No. 2 in International encyclopedia of unified science Foundations ment. The user can express complex evolution pattern queries to of the unity of science. Chicago Univ. Press, Chicago, Ill. OCLC: 258260085. [14] Ke Li, Hubert Naacke, and Bernd Amann. 2020. EPIQUE: Extracting Meaning- obtain the relevant pivot topic graphs. A first prototype [14] is ful Science Evolution Patterns from Large Document Archives (Demonstra- currently used to extract complex evolution patterns for different tion). In Int’l Conf. on Extending Database Technology (EDBT). Copenhagen, scientific domains as part of the EPIQUE project and in collabo- Denmark. [15] Angelo A. Salatino, Francesco Osborne, and Enrico Motta. 2018. AUGUR: ration with philosophers of science. As future work we intend to Forecasting the Emergence of New Research Topics. In ACM/IEEE on Joint optimize the computation of pivot topic evolution graphs and ex- Conference on Digital Libraries (JCDL ’18). ACM, New York, NY, USA, 303–312. ploit the LDA document-topic matrix for enriching the analysis. [16] Dafna Shahaf, Carlos Guestrin, Eric Horvitz, and Jure Leskovec. 2015. Infor- mation Cartography. Commun. ACM 58, 11 (2015), 62–73. Additionally, we plan to integrate other topic extraction methods [17] Xiaoling Sun, Jasleen Kaur, Staša Milojević, Alessandro Flammini, and Filippo than LDA. Menczer. 2013. Social Dynamics of Science. Scientific Reports 3 (Jan. 2013), 1069. https://doi.org/10.1038/srep01069 [18] Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. 2005. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances REFERENCES in neural information processing systems. 1385–1392. [1] Victor Andrei and Ognjen Arandjelović. 2016. Complex temporal topic evolu- [19] Chong Wang, David Blei, and David Heckerman. 2008. Continuous Time tion modelling using the Kullback-Leibler divergence and the Bhattacharyya Dynamic Topic Models. In Conference on Uncertainty in Artificial Intelligence distance. EURASIP Journal on Bioinformatics and Systems Biology 2016, 1 (UAI’08). AUAI Press, Arlington, Virginia, United States, 579–586. (2016), 16. [20] Xuerui Wang and Andrew McCallum. 2006. Topics over Time: A non-Markov [2] A. Bhattacharyya. 1943. On a measure of divergence between two statistical Continuous-time Model of Topical Trends. In Int’l Conf. on Knowledge Discov- populations defined by their probability distributions. Bulletin of the Calcutta ery and Data Mining (KDD ’06). ACM, New York, NY, USA, 424–433. Mathematical Society 35 (1943), 99–109. [3] David M. Blei and John D. Lafferty. 2006. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06). ACM, New York, NY, USA, 113–120. [4] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.