Exploring the Evolution of Science with Pivot Topic Graphs
                          Ke Li                                           Hubert Naacke                                     Bernd Amann
       LIP6, CNRS, Sorbonne Université                         LIP6, CNRS, Sorbonne Université                       LIP6, CNRS, Sorbonne Université
                Paris, France                                            Paris, France                                         Paris, France
                ke.li@lip6.fr                                        hubert.naacke@lip6.fr                                 bernd.amann@lip6.fr

ABSTRACT                                                                               period between 2000 and 2006 decomposed into three overlap-
In this article we propose a data model for the visualisation                          ping time periods (3 year periods with one year overlap). Each
and exploration of topic evolution networks representing the                           topic is represented by a rectangle containing the top-10 topic
research progress in scientific document archives. Our model                           terms obtained by a simple NLP document pre-processing step.
is independent of a particular topic extraction and alignment                          Emerging terms are shown in green, decaying term boxes are
method and proposes a set of semantic and structural metrics for                       colored in red, stable terms which exist both, in ancestor topics
characterizing and filtering meaningful topic evolution patterns.                      and in descendant topics, are grouped in a blue boxe and specific
These metrics are particularly useful for the visualization and                        terms which appear only in the current topic are in white. The
the exploration of large topic evolution graphs. We also present                       thickness of the alignment edges reflects the similarity of the
a first implementation of our model on top of Apache Spark and                         connected topics. Several topics in both subgraphs contain the
experimental results obtained for three well-known document                            term "database" and we can observe different evolution patterns.
archives.                                                                              The left subgraph shows that in period 2002 − 2004, topic 77
                                                                                       ("databases, queries, optimization, integration") splits in two re-
KEYWORDS                                                                               search directions "databases, queries and constraints" (topics 100,
                                                                                       188) and "prediction, probability, random" (topics 104, 191, 152).
Topic Modeling, LDA, Science Evolution, Big data
                                                                                       The right subgraph covers the same period with topics related to
                                                                                       "data mining" (83), "data access interfaces" (90), "information re-
1     INTRODUCTION                                                                     trieval" (92), "logics, semantics" (80) and "knowledge, reasoning"
                                                                                       (54). The first three topics converge in 2002 − 2004 into a single
There is an increasing demand for practical tools to explore                           topic on "object, xml, store, data mining" (146) which splits in the
the evolution of scientific research published in bibliographic                        period of 2004 − 2006 into "storage servers" (170), "data mining
archives such as the Web of Science (WoS), ISTEX, arXiv or                             and management" (158) and "knowledge and ontologies" (150).
PubMed. Revealing meaningful evolution patterns from docu-                                 Building such meaningful topic evolution networks is still dif-
ment archives has many applications and can be used to synthe-                         ficult and needs an important expertise in statistical text mining.
size narratives from datasets across multiple domains, including                       A first challenge for domain experts is to correctly tune method
news stories, research papers, legal cases and works of litera-                        specific hyper parameters with respect to a given dataset and
ture [16].                                                                             an expected output. A second challenge concerns the visual ex-
   The cognitive view of scientific evolution emphasizes the shared                    ploration of large topic evolution networks. Whereas existing
knowledge and the change of ideas present in the document con-                         graph visualisation standards and tools like Gephi 3 or Graphviz 4
tents [13], whereas the social view takes account of authorship                        can be used to generate high-quality visualisations, their use for
information and social interactions represented, for example, in                       exploring large graphs and identifying meaningful evolution
co-authorship and citation graphs [8, 17]. Bibliographic archives                      patterns is still limited. In this article we propose a generic evo-
often include both kinds of information and there also exist meth-                     lution network computation and visualization framework which
ods which combine both views to study science evolution [9].                           combines a high-level data model with big data technology for
In the interdisciplinary EPIQUE project1 , we adopt the cogni-                         extracting and exploring topic evolution networks. The graph
tive view for modeling science evolution and assume that the                           model relies on the notion of pivot topic graphs, which describe
evolution only depends on the textual document contents (ti-                           the contents and the evolution dynamics of topics at different
tle, abstract, main contents). This choice reduces the expressive                      levels of detail. The model also includes a number of high-level
power of our evolution model, but it also decreases the "social"                       semantic metrics which enable domain experts to specify mean-
bias and detects more easily possible interactions between sci-                        ingful topic evolution patterns (queries) for exploring large topic
entific ideas and contributions, independently of any particular                       evolution networks.
scientific community.                                                                      The remainder of this paper is organized as follows. The next
   Graph-based topic evolution analysis builds on topic evolu-                         section introduces the related work on topic evolution models and
tion networks which track complex temporal evolution dynamics                          is followed by Section 3 which defines the EPIQUE topic evolution
by periodical topic discovery and similarity-based topic align-                        model including the evolution pattern metrics and a simple query
ment. Figure 1 shows two snippets of a single topic evolution                          language. Section 4 describes our workflow and gives an outline
graph extracted from the arXiv2 corpus. The graph covers the                           of the algorithms for building topic evolution networks. Section 5
1 This work was funded by French ANR-16-CE38-0002-01 project EPIQUE
                                                                                       illustrates some experimental results obtained by applying our
2 https://arxiv.org/                                                                   evolution pattern metrics on three different document archives.
                                                                                       The final section presents our conclusions and outlines future
© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed-
                                                                                       work.
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-
                                                                                       3 https://gephi.org/
tribution 4.0 International (CC BY 4.0)
                                                                                       4 https://www.graphviz.org/
Figure 1: Pivot topics containing term "database" extracted from arXiv, green = emerging terms, blue = stable terms, red = decaying terms


2   RELATED WORK                                                       alignment methods. [5] comes up with a method to enable a
Topic modeling is a text mining task which consists in extract-        bottom-up reconstruction of the dynamics of scientific fields.
ing a compact representation of the content of a collection of         They generate topics by word co-occurrence graphs and align
documents. Statistical topic models like LDA [4] define a proba-       inter-temporal topics by Jaccard similarity [11]. [1] generates
bilistic procedure to generate documents as mixtures of a low-         topics by a Hierarchical Dirichlet Process (HDP) [18] and uses
dimensional set of topics. The goal of dynamic topic model [3, 19,     Bhattacharyya similarity [2], representing the gradual speciation
20] is to capture the evolution of topics in a sequential document     and convergence similar to biologic evolution, for identifying
corpus. They generally achieve better predictive accuracy than         topic alignments. The alignment process also applies (asymmet-
static topic models which ignore the temporal dimension. In our        ric) Kullback-Leibler divergence (KLD) for detecting topic split
work we use a simple topic evolution model where documents             and merge. [15] introduces a novel approach to the early detec-
are organized into possibly overlapping time periods and LDA is        tion of research topics by using the Computer Science Ontology5
applied on each document sub-collection independently from the         to model research topics in the Rexplore system. They apply a
other time slices. As our experiments show, the results we obtain      Clique Percolation Method (ACPM) for analyzing the dynamics
with this strategy are already of good quality and the integration     between existent topics. Other examples of science evolution
of dynamic topic models is an open future work.                        studies explore how "cognitive science" as a field has changed
    Topic detection and trend analysis systems [12] aim to iden-       over the last three decades [7] or analyze topic evolution patterns
tify and follow event-based topics across incoming streams of          (split, merge and knowledge transfer) in the field of Information
documents. Usually, a tracking system is given seed documents          retrieval (IR) [6].
to monitor the document stream for further documents on the
same topic, whereas a detection system performs unsupervised           3     TOPIC EVOLUTION MODEL
clustering of the incoming document stream. For example, Hu            This section presents the topic evolution model implemented in
et. al. [10] applied LDA and regression analysis to identify differ-   the EPIQUE workflow. The model is based on a multi-stage graph
ent topic evolution patterns for preprints and papers from arXiv       representation of topic evolution networks and introduces the
and the Web of Science (WoS) in astrophysics for the last 20 years     notion of pivot evolution graphs with appropriate evolution pat-
(1992 − 2011). The paper redefines the notion of topic trend and       tern metrics. The sections also presents a simple query language
popularity, and demonstrates that open access preprints have           for searching meaningful evolution patterns.
stronger growth tendency as compared to traditional printed
publications.                                                             Topic Evolution Graphs. We consider a corpus C of time-stamped
    Trend analysis describes the temporal evolution of the popu-       documents and a list P of periods. Cp denotes the corpus of doc-
larity, the utility or interest of topics, but does not take account   uments having their timestamp in period p ∈ P. We consider
of their structural evolution where one topic can evolve into          set of topics T and denote Tp ⊆ T the topics extracted from the
several sub-topics or several topics can merge into a single topic.    documents in Cp . A topic t = (v, p) ∈ Tp is defined by a (sparse)
Topic evolution networks represent this kind of structural topic       weighted term vector v ∈ R |V | where V is a vocabulary of terms.
evolution and track complex temporal changes by periodic topic         We will denote by t .v the terms and by t .p the period of t. We
discovery and directed acyclic networks aligning topics of dif-        also define a function sim : T × T → [0, 1] estimating the simi-
ferent periods. Existing evolution network based frameworks            larity between topics in T . The similarity measure depends on
mainly can be distinguished by the chosen topic extraction and         5 http://cso.kmi.open.ac.uk/
the term vectors of topics and estimates their semantic proxim-             with respect to the pivot topic t.
ity. For example in Figure 1, P has 3 periods: p1="2000 − 2002",
                                                                                         pevol(G β (t)) = 1 − avдti ∈T ′ (sim(t, ti ))
p2="2002 − 2004" p3="2004 − 2006", Tp1 contains topics 54 to 92,
and topic 92 = (v, p1), where v is a weighted vector containing             A low pivot evolution degree signifies that the pivot topic does
terms "queri", "optim", "databas", ... with a positive weight. The          not evolve much (all other topics are similar), whereas a high
similarity between topic 77 and topic 100 is sim(77, 100) = 0.74.           value indicates that the pivot topic evolves rapidly .
    Based on the previous definitions, we define a topic evolution        - The split degree split(G β (t)) of a pivot topic (t, β) is defined
graph as a directed labeled multistage graph G β = (T , E, sim, β)          by the average outdegree of G β (t).
overT where the edges E connect topics from consecutive periods                                                   |E ′ |
and their similarity is higher or equal to some threshold β. That                    split(G β (t)) =
                                                                                                   |{ti |ti ∈ T ∧ outdeд(ti ) > 0}|
is, E = {(ti , t j ) ∈ T |sim(ti , t j ) ≥ β ∧ t j .p = ti .p + 1}.
                                                                            A low value signifies that the topics evolve along linear paths
   Pivot Evolution Graphs. Threshold β strongly influences the              and a high value signifies that the topics split into several
complexity of the obtained evolution graphs. It is easy to see that         future sub-topics.
G β is a subgraph of G β for all β ′ ≥ β and G 0 is the complete          - The convergence degree conv(G β (t)) of a pivot topic (t, β) is
    ′


graph connecting all topics of two consecutive periods. More                defined by the average indegree of G β (t).
exactly, higher β values generate more "linear" graphs with many                                                   |E ′ |
isolated topics, whereas lower values generate more complex                           conv(G β (t)) =
                                                                                                     |{ti |ti ∈ T ∧ indeд(ti ) > 0}|
and heterogeneous graphs containing a variety of potentially
                                                                            A low value signifies that many topics depend on a single
interesting structures. Analyzing science evolution by using topic
                                                                            parent topic and a high value signifies that many topics are
evolution graphs then becomes a complex task which consists in
                                                                            the result of the fusion of past topics.
computing and visually exploring multiple graphs for different β
values.                                                                      Topic labeling. All topics t of some evolution graph G β are
   To solve this problem, we propose a different approach which           labeled by the top-k highest weighted terms t .l in the topic term
allows users to formulate queries for characterising and extracting       vector t .v. Let t .lp ⊆ t .l and t .l f ⊆ t .l be the sets of past and
interesting subgraphs from a set of evolution graphs defined by           future terms which appear, respectively, in the ancestor topics
a set of β thresholds. For this, we decompose topic evolution             and in the descendant topics of t. Then, the terms in some topic
graphs into the set of all connected subgraphs defined by all             vector t .l are partitioned into the following four subsets of :
paths containing a given topic t (one graph per topic). More              • emerging future terms t .le = t .l f − t .lp which do not exist in
formally, a pivot evolution graph G β (t) = (T ′, E ′, sim, β) of topic     past topics,
t in G β is the subgraph of G β which contains t and all ancestors        • decaying past terms t .ld = t .lp − t .l f which do not exist in
and descendants of t. The subgraph of G β (t) containing all nodes          future topics,
which are reachable from t is called the future of t, denoted by          • stable terms t .lд = t .lp ∩ t .l f which exist in the past and the
F β (t), and the subgraph of nodes which reach t is called the              future topics of t, and
past of t, denoted by P β (t). The couple (t, β) is called a pivot        • specific terms t .ls = t .l − (t .lp ∪ t .l f ) which neither exist in the
topic with pivot graph G β (t), future F β (t) and past P β (t). It is      past nor in the future topics of t.
easy to see that if t 1 appears in the future (past) of t 2 , then the       The quadruple [t .le , t .ld , t .lд , t .ls ] is called the term label of t.
future (past) of t 1 is a subgraph of the future (past) of t 2 and t 2
appears in the past (future) of t 1 . This property can be exploited          Pivot Topic Query Language. Liveliness, relative evolution de-
to filter topics wrt. future and past topics (see the definition of       gree, pivot evolution degree, split degree and convergence degree
Path Filters below).                                                      allow to characterize the amount and complexity of the evolution
   The evolution of topics within their evolution graphs can be           of a topic in some evolution graph G β . Combined with other
characterized by the following metrics:                                   filters on the topic labels and the graph structure, it is possible
                                                                          to filter pivot topics satisfying rich evolution patterns within a
- The liveliness live(G β (t)) of a pivot topic (t, β) is defined by
                                                                          set of evolution graphs G βi , 1 ≤ i ≤ n.
  the diameter of its pivot graph G β (t).
                                                                              Let DB be the union of all future and past pivot topic graphs
        live(G β (t)) = max {lenдth(p)|p = path in G β (t)}               F β (t) and P β (t). Operators can be composed by concatenation
                                                                          (similar to Scala methods). For each attribute A we define a filter
  A high liveliness value describes a long living topic, whereas          A(X ) where X is a valid value, and for each ordinal attribute the
  a value equal to 0 corresponds to an isolated topic without             filter A(X, Y ) contains a second attribute Y restricting X to be
  ancestors and descendants.                                              the minimal (Y = 0) and maximal (Y = 1) value respectively.
- The relative evolution degree revol(G β (t)) of a pivot topic (t, β)
                                                                          • Term Filters select pivot graphs with respect to the pivot topic
  is defined by the average topic dissimilarity (edge) weight in
                                                                             labels. In particular, they can be applied to filter pivot graphs
  G β (t).
                                                                             wrt. to their emerging, decaying, stable, and specific terms.
           revol(G β (t)) = 1 − avд(ti ,t j )∈E ′ (sim(ti , t j ))           Find all pivot topics where the term "deep learning" is emerging
                                                                             and the term "big data" is decaying:
  A low relative evolution degree states that most topics evolve
                                                                            DB . Emerge ( " deep ␣ learning " ) . Decay ( " big ␣ data " )
  slowly. On the other hand, a high value signifies that most
  topics have an important "semantic gap". By definition, we              • Temporal Filters allow the expert to filter all topics situated
  have revol(G β (t)) ≤ 1 − β.                                              within a certain time period. Find all topics between 2012 and
- The pivot evolution degree pevol(G β (t)) of a pivot topic (t, β)         2017:
  is defined by the average dissimilarity of all topics in G β (t)          DB . Period ( 2 0 1 2 , 0 ) . Period ( 2 0 1 7 , 1 )
• Pattern Filters can filter topics by their pivot graph structure
  along their liveliness, split degree and convergence degree.
  Find all pivot graphs which cover at least 6 periods where all
  topics split into at least three subtopics in average:
    DB . Live ( 6 , 0 ) . S p l i t ( 3 , 0 )
• Evolution Filters are applied to filter topics by their relative
  and pivot evolution degrees. Find all topics which are evolving
  "slowly":
    DB . Revol ( 0 . 3 , 1 )

   The previous filters are applied to sets of pivot topics and can
be combined with other operators:
• Temporal Projection allows to project the pivot graph to its
                                                                                     Figure 2: Topic evolution model of EPIQUE
  past and future. Find all pivot topics with a linear future of a
  minimal length of 3 periods:
    DB . Future . Live ( 3 , 0 ) . S p l i t ( 1 , 1 )                  period. We will illustrate in our experiments how experts can be
• Set Operations (union, intersection, minus) combine sets of           assisted in choosing the right number of topics.
  topics. Find all topics with decaying term "big data" and without        The topic-term vectors generated by the topic extraction step
  emerging term "deep learning":                                        are combined with an appropriate similarity measure for aligning
                                                                        topics from different periods. In our experiments, we use cosine
    DB . Decay ( " big ␣ data " )
                                                                        similarity which performs well in measuring the correlations be-
       . Minus ( DB . Emerge ( " deep ␣ learning " ) )
                                                                        tween sparse vectors. Observe that the choice of LDA and cosine
• Path Filters select pivot topics by the existence of a path to/from   similarity does not exclude the use of other topic models and
  other topics. Find all topics with an emerging term "deep learn-      similarity measures like Jaccard similarity [11] or Battacharya
  ing" where the future contains a path to a topic with the decaying    similarity [2].
  term "big data":                                                         The next step produces instances of topic evolution graphs fol-
    DB . Emerge ( " deep ␣ learning " )                                 lowing the model introduced in Section 3. The topics are aligned
       . Future . Path ( DB . Decay ( " big ␣ data " ) )                to generate a single topic evolution graph G β0 for some small
                                                                        alignment threshold β 0 (see the central part of Figure 2). This
• Ordering: Find all topics about "big data" ordered by period:
                                                                        global evolution graph is then transformed into n families of
    DB . Term ( " big ␣ data " ) . S o r t ( " Period " , " asc " )     pivot evolution graphs defined by a set of alignment thresholds
                                                                        βi > β 0 , 1 ≤ i ≤ n. Each family contains the pivot graphs G βi (t)
4      IMPLEMENTATION                                                   of all pivot topics (t, βi ). We only consider pivot graphs with
This section presents the implementation of the topic evolution         at least one edge and ignore isolated pivot topics (single node
model in the context of the EPIQUE project. Our implemented             graphs). The final database then contains at most n × |T | pivot
workflow is able to handle any scientific document corpus where         graphs where |T | is the number of topics in G β0 . These graphs
each document has a publication date and some text content (title,      can then be queried using the filters defined in Section 3 and we
abstract, keywords, ...).                                               will illustrate the result of some queries in Section 5.

   EPIQUE Workflow. Figure 2 details the main steps of the EPIQUE            Pivot Graph Computation. We first present an algorithm that
workflow. The workflow starts with a standard document pre-             computes G βi for a sequence of relative evolution bounds β 0 ,
processing step composed of some lexical analysis, stop-word            β 1 , ..., βn where βi < βi+1 . The basic idea of this algorithm is
removal, stemming, index term generation and term selection.            to exploit the monotonicity property that G βi +1 is a subgraph
The main goal of this step is to extract for each document a term       of G βi for all 0 ≤ i < n. The input of the algorithm is a set of
index which precisely describes the scientific document contents.       time-stamped topics T , a similarity function sim and a sequence
The document preprocessing step is followed by corpus periodiza-        of βi values. Topics are represented as a binary table Topics(t, a)
tion step which decomposes the document collection according            storing the topics t and their periods a, the sequence of βi values
to several continuous, possibly overlapping, time windows, i.e.,        is defined as a binary table Beta(b, i) where b = βi and the
the same document may appear in two periods. Each time win-             similarity function is defined as a table Sim(x, y, s) where s =
dow defines a corpus period, which is the subset of documents           sim(x, y). The following recursive Datalog program computes all
published during the corresponding time period. The choice of           G βi as a relational table Graph(x, y, s, a, i) connecting all topics
the window size and sliding step depends on the granularity of          x of period a to all topics y of period a + 1 where there exists an
the document time-stamps (year, month, day) and on the number           evolution edge of similarity s = sim(x, y) ≥ βi .
of available documents in each period.
                                                                        Graph ( x , y , s , a , 0 ) : − Topics ( x , a ) , Topics ( y , a + 1 ) ,
   In the following step, each corpus period is analyzed by a topic
                                                                                                        Sim ( x , y , s ) , Beta ( b , 0 ) , s >=b
model. In our implementation, we use LDA [4] to extract the
                                                                        Graph ( x , y , s , a , i ) : − Graph ( x , y , s , i − 1 ) ,
topics of each corpus period. The main output of LDA is a topic-
                                                                                                        Beta ( b , i ) , s >=b
term matrix describing each topic as a weighted term vector. LDA
requires to set the number of topics to be generated in advance.        Starting from Graph we can compute all pivot topic evolution
Tuning this parameter is important and subtle because it strongly       graphs for all topics in T and beta value βi . This is done by
influences the diversity of the topics generated for each time          generating first a table TC(x, y, s, l, a, i) containing the transitive
closure of graph G βi where l is the distance between x and y 6 , a
is the period of x and s is the similarity between x and y.
TC( x , y , s , 1 , a , i ) : − Graph ( x , y , s , a , i )
TC( x , y , s , l + 1 , a , i ) : −TC( x , z , _ , l , a , i ) ,
                              Graph ( z , y , _ , _ , i ) , Sim ( x , y , s )
This table can then be used to compute the future and the past
graph for each pivot topic p.
Future ( p , x , y , r s , ps , l , a , i ) : −TC( p , x , _ , _ , a , i ) ,
                  TC( p , y , ps , l , _ , i ) , Graph ( x , y , r s , _ , i )
Past ( p , x , y , r s , ps , l , a , i ) : −TC( x , p , ps , l , a , i ) ,
                  TC( y , p , _ , _ , _ , i ) , Graph ( x , y , r s , _ , i )                   Figure 3: EPIQUE web application architecture

Graphs Past and Future contain the pivot topic evolution graphs
of all topics where a tuple (p, x, y, rs, ps, l, a, i) represents an edge               Architecture. Figure 3 gives an overview of the architecture of
(x, y) in G βi (p) for pivot p in period a with relative evolution sim-              our web application implemented on top of Apache Spark and
ilarity rs, pivot evolution similarity ps of x (Past) and y (Future)                 Jupyter Notebook. The entire process to study science evolution
and distance l of x (Past) and y (Future) from p.                                    over a corpus is splitted into two steps for building the pivot
   Table size estimation. We can estimate the size of each table                     evolution graphs and for interactively exploring these graphs.
as follows. Let p be the number of periods, t be the number of                       Each step corresponds to a separate user interface. The evolution
topics per period and k be the maximal outdegree and indegree                        graph generation is implemented in Scala and exectued through
in G βi . Then the size of Graph is bound by |Graph| ≤ k ∗ t ∗                       the Spylon7 kernel. Evolution graph exploration uses a standard
(p − 1) (remind that Graph is a multistage graph). The size of the                   Python kernel to take advantage of advanced Python 3 graphical
transitive closure TC is bound by |TC | ≤ k ∗t ∗p ∗(p −1)/2. Finally,                user interface libraries for facilitating user interaction.
both graphs Future and Past, contain for each tuple (x, y, s, l, a, i)
in the transitive closure TC (x and y are connected by a path of                     5    EVALUATION
length l in G βi ), at most k tuples Future(x, y, _, _, _, _) and at                    Experiment Setting. We conducted our experimentations on
most k tuples Past(y, x, _, _, _, _). Therefore, the size of Future                  three real-world data sets of different scales by using the titles
and Past is bound by k times the size of the transitive closure:                     and the abstracts of each document. The smallest dataset, called
|Future | ≤ k 2 ∗ t ∗ p ∗ (p − 1)/2. As our experiments show, even                   ISTEX, contains 14 851 papers in the domain of ecological eco-
for small β-thresholds (β = 0.2) the maximum indegree and                            nomics and environmental economics. The second one is arXiv, a
outdegree of a topics is smaller than k = 10 and we generally                        repository of electronic preprints approved for publication after
assume about t = 100 topics over p = 20 periods. Then, the size                      moderation, which consists of scientific papers in the fields of
of the transitive closure is |TC | ≤ 10 ∗ 100 ∗ 10 ∗ 19 = 1.9 ∗ 105                  mathematics, physics, astronomy, electrical engineering, com-
edges and the size of |Future | ≤ 1.9 ∗ 106 edges. These numbers                     puter science, quantitative biology, statistics, and quantitative
are much smaller in practice (see Section 5) and current big data                    finance, etc. This data set contains about 1.1 million documents.
frameworks can easily manage graphs of this size. We plan to                         The third dataset is from the Wiley online library which contains
study possible optimizations in the future.                                          1 million documents including additional domains such as agri-
                                                                                     culture, art, humanities, etc. The statistics over these three data
   Metrics computation. The liveliness, relative evolution degree,                   sets are summarized in Table 1, where #D is the total number of
pivot evolution degree, split degree and convergence degrees of                      documents, #T is the number of topics per period, #G is the total
all pivot evolution graphs can directly be computed by a stan-                       number of pivot graphs, #E is the number of edges (total and per
dard SQL aggregation query. For example, the following query                         pivot graph).
computes these metrics for all future pivot evolution graphs:                           We run our EPIQUE workflow on an Apache Spark cluster in
c r e a t e view P i v o t F u t u r e as                                            standalone mode with Spark version 2.4, Scala version 2.11 and
  s e l e c t p , i max ( l ) as l i v e l i n e s s ,                               Java version 8. The cluster consists of 11 machines: one driver
              1−avg ( r s ) as r e v o l ,                                           has 20GB memory, and 10 worker nodes, each one having 24
              1−avg ( ps ) as p e v o l ,                                            CPU cores with hyperthreading and 50 GB memory. For all of
              count ( ∗ ) / count ( d i s t i n c t x ) as s p l i t ,               our experiments, we used the documents extracted from a 20-
              count ( ∗ ) / count ( d i s t i n c t y ) as conv                      year period and split each corpus into 10 slices by using the time
 from F u t u r e                                                                    window spanning 3 years with 1-year overlap. Thus, we have 10
  group by p , i                                                                     LDA models for each corpus.
                                                                                        Column LDA in Table 1 shows the average execution time
    Observe that for evaluating the pivot query filters presented in
                                                                                     for computing the LDA model per period. This computation is
Section 3 without the Path-operator, it is sufficient to store Graph
                                                                                     only done once and mainly depends on the number of extracted
(for visualization) and PivotFuture, PivotPast and PivotAll (for
                                                                                     topics per period. In Table 1, we can see that the Wiley corpus is
filtering). An efficient implementation of Path queries, for exam-
                                                                                     about 70 times larger than ISTEX but has a similar LDA execution
ple by using graph-labeling schemes for checking node reacha-
                                                                                     time for the same number of topics. On the other hand, the LDA
bility in acyclic graphs, is part of our future work.
                                                                                     execution time is more important for arXiv, where we extract a
                                                                                     higher number of topics.
6 Since G
            β i is a multistage graph, all paths between two nodes are of the same
length.                                                                              7 https://github.com/Valassis-Digital-Media/spylon-kernel
                                                          Table 1: Dataset statistics


   Datasets           #D          Period    #Periods          #T      #G          #E            #E             LDA                G              G
                    total           total       total   / period    total       total    per pivot     sec / period      sec (total)    sec / pivot
     ISTEX         14 851   1991 − 2010            10         20    1806    211 850             117                28            490          0.27
      arXiv     1 156 300   1998 − 2017            10         50    3364    272 944              81                40            641          0.19
     Wiley      1 023 515   1996 − 2015            10         20    1360    198 179             145                30            505          0.37


   The last two columns G give the total and average execution              among which some become very complex. For example, in Fig-
time for the pivot graph computation step. We computed the pivot            ure 4g, pivot topics tend to evolve a lot even for a low relative
graphs for 9 β-threshold values spanning from 0.1 to 0.9. The               evolution degree. The pivot graphs in Figure 4h are much more
total execution time obviously depends on the β-thresholds. The             complex than the graphs generated by the same β-threshold with
number of the thresholds increases the number of pivot graphs               #T = 50 topics (Figure 4e and Figure 4f). As we can see, the split
to be computed and the values of the thresholds defines the size            degree attains a value of 19 compared with maximal split degree
of the computed pivot graphs. The average execution time to                 1.5 in Figure 4e. The increase of #T reduces the proportion of
construct a pivot graph (last column) can be obtained by dividing           isolated topics, 30% for #T = 150 compared with 60% for #T = 50.
the total time to build all pivot graphs by the number of pivot             As we see in the next section, this is also due to the existence
graphs. For example, for ISTEX corpus, we obtained 1806 pivot               of many similar topics in each period, which also increases the
graphs in 490 seconds which gives the average value of 0.27 sec-            probability that two topics can be aligned.
onds. The average pivot graph computation time mainly depends
on the pivot graph size. This can be seen for the arXiv dataset,               Diversity-based Topic Number Selection. Figure 5 shows a fu-
which has the smallest average number of pivot graph edges                  ture arXiv pivot graph generated for #T = 150 and β = 0.5, which
(#E/pivot = 81) and the lowest average pivot graph computation              corresponds to a data point in Figure 4h where split(G β (t)) = 8.3
time (0.19 seconds/graph).                                                  and conv(G β (t)) = 6.4. The graph connects topics with similarity
                                                                            higher than β = 0.5 and has nevertheless a high split and conver-
                                                                            gence degree. When looking in more detail, we can observe that
   Pivot Topic Analysis. The metrics defined in Section 3 can be
                                                                            the topics in each period are also very similar which explains
used for the structural and quantitative analysis of the evolution
                                                                            why the single root pivot topic is connected to more than 20
of topics. The objective of this section is to explore the impact of
                                                                            topics in the second period.
the main parameters,i.e., the β threshold and the topic number
                                                                               In order to build pivot graphs over more representative topic
#T , on the structure and the semantics of the generated pivot
                                                                            sets, we use topic diversity for estimating the quality of a topic set.
evolution graphs.
                                                                            The topic diversity inside a period can be estimated by observing
   Figure 4 shows the distribution of future pivot evolution graphs
                                                                            the dissimilarity distribution over all topic pairs inside the period.
in arXiv wrt. three groups of metrics, the relative evolution degree
                                                                            For example, Figure 6a and Figure 6b shows the topic diversity
vs. the pivot evolution degree , the split degree vs. convergence de-
                                                                            obtained for different LDA models applied to 1164 documents
gree and the liveliness vs. the split degree. The figure is organized
                                                                            published in arXiv during 1998 to 2000 and 16 072 documents
into 3 lines of 3 sub-graphs where each line corresponds to identi-
                                                                            published in arXiv during 2008 to 2010 respectively. Each LDA
cal fixed parameters β and #T and each sub-graph corresponds to
                                                                            model corresponds to a different number of topics #T ranging
a group of metrics. On the first line, we set β = 0.2 and #T = 50.
                                                                            from 10 to 150. For example, we can see in Figure 6a that for #T
On the second line #T remains the same (#T = 50) whereas β is
                                                                            ranging between 40 and 60, less than 5 percent (blue line) of all
increased to β = 0.5. On the third line, β remains the same as in
                                                                            topic pairs have a similarity value higher than 0.1 (dissimilarity
the 2nd line (β = 0.5) whereas #T is increased to #T = 150. Each
                                                                            value lower than 0.9), but this diversity value rapidly drops for
Figure only shows pivot topic graphs with at least two nodes and
                                                                            #T > 60. For example, for #T = 100, the same similarity bound
the number of isolated topics is reported in the figure captions.
                                                                            of 0.1 only holds for the half of the topic pairs (green median
   When comparing Figure 4a with Figure 4d, we can see that for
                                                                            line). Figure 6b shows that for the larger corpus we can achieve
the lower threshold β = 0.2, pivot topics evolve more than for
                                                                            the same diversity for much more topics (the topic diversity for
the higher value β = 0.5. Lower β values also allow pivot topics
                                                                            #T = 140 topics in period 2008 − 2010 is similar to the topic
to connect with more topics than higher β values which only
                                                                            diversity for #T = 60 topics in period 1998 − 2000. This kind
connect similar topics. This is shown in Figure 4b which repre-
                                                                            of grid analysis allows experts to choose an optimal number of
sents a large number of complex pivot topic graphs with higher
                                                                            topics for their analysis.
split and convergence degrees than the pivot topic graphs in Fig-
ure 4e. The previous observation is also confirmed in Figure 4c                Pivot Topic Exploration. Our query language allows users to
and Figure 4f which compare topic liveliness vs. split degree: the          select topics in specific regions of the sub-figures in Figure 4. For
lower threshold β = 0.2 generates pivot graphs which are more               example, the following query Q1 chooses all topics which appear
complex than pivot graphs with the same liveliness scores gen-              in the upper right window of Figures 4a and on the right part of
erated by β = 0.5. Therefore, for a fixed #T , varying β allows             Figure 4b on the line corresponding to the liveliness value 5 in
for revealing interesting evolution patterns at different levels of         Figure 4c.
detail where the evolution of some topic might be too complex
for low β values and become more intelligible for higher β values.          Q1 : = DB . Future . Revol ( 0 . 5 , 0 ) . Pevol ( 0 . 6 , 0 )
   When the topic number per period increases (#T = 150 in Fig-                       . S p l i t ( 2 , 0 ) . Live ( 5 )
ures (g), (h) and (i)), the workflow generates more pivot graphs,
 (a) β = 0.2, #T = 50, #Pivot = 477, #Isolated = 23     (b) β = 0.2, #T = 50, #Pivot = 477, #Isolated = 23     (c) β = 0.2, #T = 50, #Pivot = 477, #Isolated = 23


 (d) β = 0.5, #T = 50, #Pivot = 198, #Isolated = 302    (e) β = 0.5, #T = 50, #Pivot = 198, #Isolated = 302    (f) β = 0.5, #T = 50, #Pivot = 198, #Isolated = 302


(g) β = 0.5, #T = 150, #Pivot = 1014, #Isolated = 486 (h) β = 0.5, #T = 150, #Pivot = 1014, #Isolated = 486   (i) β = 0.5, #T = 150, #Pivot = 1014, #Isolated = 486


                                Figure 4: Distribution of future pivot evolution graphs in arXiv wrt. their metrics.


                                                             Figure 5: G 0.5 (495) with #T = 150


                                   (a) 1998-2000                                                                 (b) 2008-2010


                                                   Figure 6: Dissimilarity distribution by topic number
                                                                                (b) Q2: r evol ≥ 0.5 ∧           (c) Q3: r evol ≤ 0.4 ∧                 (d) Q4: r evol ≤ 0.4 ∧
    (a) Q1: r evol ≥ 0.5 ∧ pevol ≥ 0.6 ∧ spl it ≥ 2 ∧ l ive = 5                 pevol ≥ 0.6∧split ≤              pevol ≤ 0.5 ∧ split ≥                  pevol ≤ 0.5∧split ≤
                                                                                1.2 ∧ live = 5                   1.5 ∧ live = 3                         1.2 ∧ live = 5

                           Figure 7: Examples of query results which filter pivot topics by using aforementioned metrics


Observe that the user does not specify the β-threshold. A result                            [5] David Chavalarias and Jean-Philippe Philippe Cointet. 2013. Phylomemetic
example of query Q1 and three other queries is shown in Fig-                                    patterns in science evolution—the rise and fall of scientific fields. PloS one 8, 2
                                                                                                (2013), e54847.
ure 7. Although Figure 7b and Figure 7d have the same structure,                            [6] Baitong Chen, Satoshi Tsutsui, Ying Ding, and Feicheng Ma. 2017. Understand-
they have different evolution pace (corresponding to different β                                ing the topic evolution in a scientific domain: An exploratory study for the
                                                                                                field of information retrieval. Journal of Informetrics 11, 4 (2017), 1175–1189.
values). The pivot graph in Figure 7b has more emerging terms                               [7] Uriel Cohen Priva and Joseph L. Austerweil. 2015. Analyzing the history of
(green part) whereas the pivot graph in Figure 7d has more sta-                                 Cognition using Topic Models. Cognition 135 (Feb. 2015), 4–9.
ble terms (blue part) which correspond to our queries to select                             [8] Eugene Garfield. 1955. Citation Indexes for Science: A New Dimension in
                                                                                                Documentation through Association of Ideas. Science 122, 3159 (July 1955),
high-evolution and low-evolution pivot topics respectively.                                     108–111.
   Apart from these metric-based filters, our query language also                           [9] Qi He, Bi Chen, Jian Pei, Baojun Qiu, Prasenjit Mitra, and Lee Giles. 2009.
allows users to define other multi-dimensional filtering criteria                               Detecting Topic Evolution in Scientific Literature: How Can Citations Help?.
                                                                                                In ACM Conf. on Information and Knowledge Management (CIKM ’09). New
including topic labels and temporal conditions for the selection                                York, NY, USA, 957–966.
of pivot topics.                                                                           [10] Beibei Hu, Xianlei Dong, Chenwei Zhang, Timothy D. Bowman, Ying Ding,
                                                                                                Staša Milojević, Chaoqun Ni, Erjia Yan, and Vincent Larivière. 2015. A Lead-
                                                                                                lag Analysis of the Topic Evolution Patterns for Preprints and Publications. J.
                                                                                                Assoc. Inf. Sci. Technol. 66, 12 (Dec. 2015), 2643–2656.
6     CONCLUSION AND FUTURE WORK                                                           [11] Paul Jaccard. 1912. The Distribution of the Flora in the Alpine Zone.1. New
We have presented a new framework for the visualisation and ex-                                 Phytologist 11, 2 (1912), 37–50.
                                                                                           [12] April Kontostathis, Leon M. Galitsky, William M. Pottenger, Soma Roy, and
ploration of topic evolution networks representing the progress                                 Daniel J. Phelps. 2004. A survey of emerging trend detection in textual data
and evolution of research in scientific document archives. This                                 mining. In Survey of text mining. Springer, 185–224.
framework has been implemented on top of Apache Spark using                                [13] Thomas S. Kuhn, Otto Neurath, and Thomas Samuel Kuhn. 1994. The Structure
                                                                                                of scientific revolutions (2nd ed., enlarged ed.). Number ed.-in-chief: Otto Neu-
LDA and cosine similarity for topic extraction and topic align-                                 rath ; Vol. 2 No. 2 in International encyclopedia of unified science Foundations
ment. The user can express complex evolution pattern queries to                                 of the unity of science. Chicago Univ. Press, Chicago, Ill. OCLC: 258260085.
                                                                                           [14] Ke Li, Hubert Naacke, and Bernd Amann. 2020. EPIQUE: Extracting Meaning-
obtain the relevant pivot topic graphs. A first prototype [14] is                               ful Science Evolution Patterns from Large Document Archives (Demonstra-
currently used to extract complex evolution patterns for different                              tion). In Int’l Conf. on Extending Database Technology (EDBT). Copenhagen,
scientific domains as part of the EPIQUE project and in collabo-                                Denmark.
                                                                                           [15] Angelo A. Salatino, Francesco Osborne, and Enrico Motta. 2018. AUGUR:
ration with philosophers of science. As future work we intend to                                Forecasting the Emergence of New Research Topics. In ACM/IEEE on Joint
optimize the computation of pivot topic evolution graphs and ex-                                Conference on Digital Libraries (JCDL ’18). ACM, New York, NY, USA, 303–312.
ploit the LDA document-topic matrix for enriching the analysis.                            [16] Dafna Shahaf, Carlos Guestrin, Eric Horvitz, and Jure Leskovec. 2015. Infor-
                                                                                                mation Cartography. Commun. ACM 58, 11 (2015), 62–73.
Additionally, we plan to integrate other topic extraction methods                          [17] Xiaoling Sun, Jasleen Kaur, Staša Milojević, Alessandro Flammini, and Filippo
than LDA.                                                                                       Menczer. 2013. Social Dynamics of Science. Scientific Reports 3 (Jan. 2013),
                                                                                                1069. https://doi.org/10.1038/srep01069
                                                                                           [18] Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. 2005. Sharing
                                                                                                clusters among related groups: Hierarchical Dirichlet processes. In Advances
REFERENCES                                                                                      in neural information processing systems. 1385–1392.
 [1] Victor Andrei and Ognjen Arandjelović. 2016. Complex temporal topic evolu-            [19] Chong Wang, David Blei, and David Heckerman. 2008. Continuous Time
     tion modelling using the Kullback-Leibler divergence and the Bhattacharyya                 Dynamic Topic Models. In Conference on Uncertainty in Artificial Intelligence
     distance. EURASIP Journal on Bioinformatics and Systems Biology 2016, 1                    (UAI’08). AUAI Press, Arlington, Virginia, United States, 579–586.
     (2016), 16.                                                                           [20] Xuerui Wang and Andrew McCallum. 2006. Topics over Time: A non-Markov
 [2] A. Bhattacharyya. 1943. On a measure of divergence between two statistical                 Continuous-time Model of Topical Trends. In Int’l Conf. on Knowledge Discov-
     populations defined by their probability distributions. Bulletin of the Calcutta           ery and Data Mining (KDD ’06). ACM, New York, NY, USA, 424–433.
     Mathematical Society 35 (1943), 99–109.
 [3] David M. Blei and John D. Lafferty. 2006. Dynamic Topic Models. In Proceedings
     of the 23rd International Conference on Machine Learning (ICML ’06). ACM,
     New York, NY, USA, 113–120.
 [4] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet
     allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.