Exploiting Contextual Embeddings to Extract Topic Genealogy from Scientific Literature

Alfio Ferrara, Stefano Montanelli, Sergio Picascia*, and Davide Riva*
Università degli Studi di Milano, Department of Computer Science, Via Celoria, 18 - 20133 Milano, Italy

SDU2023: The Third AAAI Workshop on Scientific Document Understanding, February 14, 2023, Washington, DC
*Corresponding author.
alfio.ferrara@unimi.it (A. Ferrara); stefano.montanelli@unimi.it (S. Montanelli); sergio.picascia@unimi.it (S. Picascia); davide.riva1@unimi.it (D. Riva)

Abstract
Modeling the evolution of topics and forecasting future trends is a crucial task when analyzing scientific papers. In this work we propose tASKE (temporal Automated System for Knowledge Extraction), a dynamic topic modeling approach that exploits zero-shot classification and contextual embeddings to track topic evolution through time. The approach is evaluated on a corpus of data science papers, assessing the ability of tASKE both to correctly classify documents and to retrieve relevant derivation relationships between older and newer topics over time.

Keywords
Natural Language Processing, Scientometrics, Topic Genealogy

1. Introduction

With the amount of published scientific literature increasing each year, keeping track of newly formulated topics and their derivation process becomes a challenge for researchers, scholars, and publishers. The problem lies in the fact that the total number of definitions, theorems, properties, tasks, and subdomains tends to grow exponentially, since several of them may be conceived starting from a single one or from the interaction of a few. For instance, in the domain of Machine Learning, the idea of neural networks gave rise to that of deep learning, which has since been applied to problems such as image reconstruction and partial differential equations, and was further deepened with topics such as attention, which in turn provided the intuition behind transformers and a basis for explainability.

Referring to definitions, theorems, properties, tasks, subdomains, and the like with the generic label of "topics", i.e. abstract objects a text refers to, it is possible to study "topic genealogy" in a diachronic corpus, i.e. the descent of topics from older ones over time. The task of extracting topic genealogy falls within the scope of Knowledge Extraction (KE), and it consists of two main sub-tasks: i) topic extraction, by which we aim to retrieve topics that are important in a written document, possibly in a timely manner in order to discover topics when they actually appear, and ii) genealogy reconstruction, in which extracted topics are placed in a tree structure representing their lineage in the history of the discipline.

In this paper, we present tASKE, a method to extract topics from a diachronic corpus of scientific papers and reconstruct their genealogy in a completely unsupervised way. Our method is built upon our Automated System for Knowledge Extraction (ASKE) framework [1], which relies on pre-trained contextual embedding models to represent documents and topics in the same vector space, and on a cyclical term extraction and clustering phase to extract new topics. Besides presenting tASKE as a time-aware extension of ASKE, we introduce an evaluation framework and a case study on a corpus of abstracts of scientific papers related to the Data Science domain, with the goal of demonstrating the effectiveness of tASKE both for topic extraction and for extracting topic-to-topic derivation relationships.

The work is organized as follows: Section 2 (Related Work) reports on the literature about topic modeling as well as the technology underlying our method. Section 3 (Methodology) presents the methodology and techniques enforced in tASKE.
Section 4 (Case Study and Evaluation) presents the case study on a Data Science literature corpus, on which the evaluation was conducted. Section 5 (Concluding Remarks) draws some conclusions and sketches future work.

2. Related Work

The task of classifying large amounts of textual documents without relying on labeled data, while presenting latent features of texts such as hidden topics, is commonly addressed by employing topic modeling techniques. Latent Semantic Analysis (LSA) [2] was one of the first proposed approaches, exploiting Singular Value Decomposition (SVD) to reduce the number of dimensions of a document-term matrix and to easily compute similarity between document vector representations. LSA was soon followed by Latent Dirichlet Allocation (LDA) [3], which employs Bayesian analysis to optimize the distributions of documents over topics, and of words defining those topics. The majority of recent work in topic modeling takes its inspiration from the original LDA, with several variations proposed, such as Correlated Topic Modeling [4] and Hierarchical Topic Modeling [5].

Common topic modeling methods are not able to capture the changes of topics over time. For this reason, techniques of Dynamic Topic Modeling (DTM) are employed when dealing with diachronic corpora. Since the first approach (Dynamic LDA [6]) was proposed, the field has been attracting attention among researchers. Among the possible applications of the designed methods, the study of scientific papers, also known as "Scientometrics", has been addressed with the aim of assessing past and present trends in a specific discipline [7] or forecasting possible future subareas of research interest [8].

Later studies have taken into consideration the integration of DTM with word embeddings [9], so as to further capture the semantic aspects of the analyzed documents [10]. Embedding techniques are widely employed in the field of Natural Language Processing (NLP) to represent textual data in a vector space. Several models capable of computing contextual token embeddings have been released since the presentation of BERT [11], each of them tailored to specific tasks, such as semantic similarity [12] and zero-shot learning [13].

Zero-Shot Learning (ZSL) is a problem setup in the field of machine learning where a classifier is required to predict labels of examples extracted from classes that were never observed in the training phase. It was first referred to as dataless classification in 2008 [14] and has quickly become a subject of interest, particularly in the field of NLP. The great advantage of this approach is that the resulting classifier is able to operate efficiently in a partially or totally unlabeled environment.

tASKE aims at dynamically modeling the presence and evolution of latent topics in a diachronic corpus of documents. It exploits zero-shot learning and contextual embeddings not only to perform the classification task, but also to extract relevant knowledge from textual data.

3. Methodology

The objective of tASKE is to extract a genealogy of topics from a diachronic corpus of documents.
Every piece of information is stored in a graph-based data structure called the tASKE Conceptual Graph (ACG), whose architecture is illustrated in Figure 1.

Figure 1: The tASKE Conceptual Graph, with classification, derivation, and belonging relations among its nodes.

The nodes in the ACG model belong to three different categories:

• document chunks $K$: the object of the analysis, they are small portions of the original documents extracted through the application of tokenization techniques. They are tuples of the form $(k, \mathbf{k})$, where $k$ is the text of the document chunk and $\mathbf{k}$ is its vector representation;

• topics $C$: they represent the abstract objects to which document chunks are assigned and, in practice, they are clusters of related terms. They are tuples of the form $(c, \mathbf{c})$, where $c$ is the label given to the topic and $\mathbf{c}$ is its vector representation;

• terms $W$: they are extracted from document chunks and clustered together in order to form topics. They are triplets of the form $(w_s, w_d, \mathbf{w})$, where $w_s$ is the label of the term, $w_d$ is a short sentence giving the term definition, and $\mathbf{w}$ is the term vector representation.

The vector representations $\mathbf{k}$ of document chunks and $\mathbf{w}$ of terms are computed by an embedding model which maps a text into a vector space: for $\mathbf{k}$, the embedding model is applied over the document chunk text $k$, while for $\mathbf{w}$, it is applied over the term definition $w_d$. The vector representation $\mathbf{c}$ of topics is computed as the mean of the vectors $\mathbf{w}_i$ of all the terms $w_i$ belonging to $c$. The label of each topic corresponds to the label $w_s$ of the term $w$ that is the closest to $\mathbf{c}$.
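As a concrete illustration, the following minimal sketch shows how chunk, term, and topic vectors could be computed as described above. It is not the authors' implementation: the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are assumptions standing in for whichever Sentence-BERT variant tASKE actually uses, and the example terms are taken from the paper.

```python
# Minimal sketch (assumed model and names, not the authors' code):
# computing k, w, and c vectors with a Sentence-BERT model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# Terms are (label w_s, definition w_d) pairs; w is embedded from w_d.
terms = [
    ("causality", "the relation between causes and effects"),
    ("etiologic", "relating to the study of causes"),
]
term_vecs = model.encode([w_d for _, w_d in terms])  # one vector per term

# The topic vector c is the mean of the vectors w of its terms ...
topic_vec = term_vecs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# ... and the topic label is the label w_s of the term closest to c.
label = max(zip(terms, term_vecs), key=lambda tw: cosine(tw[1], topic_vec))[0][0]

# Document chunks k are embedded from their text, in the same vector space.
chunk_vecs = model.encode(["graphical representations of causation ..."])
```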
At the beginning of the analysis (i.e., time 0) the user is required to define a set of initial topics $C^{(0)}$ of interest. Each topic $c_i^{(0)} \in C^{(0)}$ is associated with a set of corresponding terms $W_i^{(0)}$, whose definitions are also provided by the user. At each subsequent time $t$, tASKE performs one or more iterations of the cycle depicted in Figure 2. As a first step, tASKE extracts the set of document chunks $K^{(t)}$ from the subset $D^{(t)} \subseteq D$ belonging to that period. Such document chunks are classified with respect to the topics discovered up to the previous time period $t^{\leftarrow}$, $C^{(t^{\leftarrow})}$. Moreover, tASKE extracts new terms $W^{(t)}$ from the document chunks and assigns them to the topics $C^{(t^{\leftarrow})}$, finally updating the set of current topics $C^{(t)}$.

Figure 2: The tASKE cycle at time $t$: data preprocessing, zero-shot classification, terminology enrichment, and topic formation, connected to the corpus, the embedding model, the knowledge base, and the ACG.

As a consequence of this process, a topic $c_j^{(t^{\leftarrow})}$ in the ACG can have multiple relations with the other components of the ACG. In particular, for $c_j^{(t^{\leftarrow})}$, we have: i) a relation classification with document chunks $K^{(t)}$; ii) a relation derivation with terms $W_j^{(t)}$ discovered from document chunks associated with $c_j^{(t^{\leftarrow})}$; iii) a relation belonging with terms $W_j^{(t)}$ in its cluster; iv) a relation derivation with a new topic $c_l^{(t)}$ formed by some of the terms in $W_j$. It can happen that a topic $c_k^{(t^{\leftarrow})}$ is not associated with any document chunk at time $t$. This means $c_k^{(t^{\leftarrow})}$ is no longer a useful topic with respect to the documents of time $t$. In this case, the topic $c_k^{(t^{\leftarrow})}$ becomes inactive, together with the set of terms $W_k^{(t^{\leftarrow})}$ belonging to it, and it will not be able to form new topics. This can be interpreted as the disappearance of interest towards a certain topic, which emerged in past periods $t^{\leftarrow}$ but has lost its relevance in the current corpus $K^{(t)}$.

In the remaining part of this section, we discuss each phase in Figure 2, explaining in deeper detail how each of the aforementioned relations is discovered.

3.1. Data Preprocessing

Preprocessing is the starting point of the tASKE cycle. At each time period $0, \ldots, t$, the model retrieves documents from the period-specific subcorpus $D^{(t)}$. Documents are first split into document chunks $K^{(t)}$, each of which can fit into the maximum input length of a contextual embedding model. In this case, we employed Sentence-BERT [12], a modification of the original BERT model which exploits siamese and triplet networks, being able to derive semantically meaningful sentence embeddings in the form of numeric vectors. Such a model is employed to extract the semantic features of term definitions and document chunks and map them into the same vector space.

3.2. Zero-Shot Classification

In the zero-shot classification phase, document-topic classification relationships are defined. Given the coexistence of topic and document chunk embeddings in the same vector space, it is possible to perform a zero-shot classification, $f : K^{(t)} \rightarrow C^{(t^{\leftarrow})}$, without having the model exposed to training examples. A similarity measure $\sigma$ (e.g., cosine similarity) between the embedding vector $\mathbf{k}_j^{(t)}$ of each document chunk $k_j^{(t)}$ in $K^{(t)}$ and the embedding vector $\mathbf{c}_i^{(t^{\leftarrow})}$ of each topic $c_i^{(t^{\leftarrow})}$ in $C^{(t^{\leftarrow})}$ is computed and, eventually, the two are associated if their similarity is higher than a predefined threshold $\alpha$:

$$f_{C^{(t^{\leftarrow})}}(k_j^{(t)}) = \{\, c_i^{(t^{\leftarrow})} \in C^{(t^{\leftarrow})} : \sigma(\mathbf{k}_j^{(t)}, \mathbf{c}_i^{(t^{\leftarrow})}) \geq \alpha \,\}$$

Tuning the hyperparameter $\alpha$ is crucial, since it may remarkably affect the classification output: for example, choosing a high value of $\alpha$ could result in a highly precise classification, despite potentially finding only a small set of document chunks for each topic (low recall).

Finally, classification relationships are stored in the ACG by considering documents as the simple concatenation of their chunks, so that a document $d_j$ is labelled with all topics its chunks are labelled with. For example, the document chunk

"[...] graphical representations of causation have been used for at least seventy years, and the modern development of directed acyclic graphs to portray causal systems continues the trend. It is sometimes difficult to understand, however, what it is about these diagrams that is causal [...]"

is classified by tASKE with the topic 'causality' with a similarity score of 0.652.
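To make the classification rule above concrete, here is a minimal sketch of the zero-shot step under the stated definitions; all names are illustrative, the vectors are assumed to come from the same embedding model as above, and the threshold value matches the one used in the experiments of Section 4.

```python
# Minimal sketch (assumed names, not the authors' code): zero-shot
# classification of chunks against topics by thresholded cosine similarity.
import numpy as np

def classify_chunks(chunk_vecs: np.ndarray,
                    topic_vecs: np.ndarray,
                    alpha: float = 0.35) -> list[list[int]]:
    """For each chunk k, return indices of topics with sigma(k, c) >= alpha."""
    # Normalize rows so the dot product equals cosine similarity.
    k = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    c = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    sims = k @ c.T  # (n_chunks, n_topics) similarity matrix
    return [list(np.where(row >= alpha)[0]) for row in sims]

def classify_document(chunk_topic_ids: list[list[int]]) -> set[int]:
    """A document is labelled with all topics its chunks are labelled with."""
    return set().union(*chunk_topic_ids) if chunk_topic_ids else set()
```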
3.3. Terminology Enrichment

For each topic $c_i^{(t^{\leftarrow})}$ in the ACG, tASKE retrieves the set of lemmatized terms $W_i^{(t)}$ appearing in the subset of document chunks $K_i^{(t)}$ associated with $c_i^{(t^{\leftarrow})}$ by a classification relation. These term vectors are placed in the same semantic space, together with $\mathbf{k}$ and $\mathbf{c}$, by retrieving their definition $w_d$ from an external knowledge base, such as WordNet [15], and computing its vector representation $\mathbf{w}$ with the aforementioned embedding model. This approach addresses the problem of sense disambiguation, since it maps distinct senses of polysemic words to different embedding vectors.

For each retrieved term sense, the same similarity measure $\sigma$ used for classification is exploited to compute the similarity between $\mathbf{w}$ and the vectors representing topics and document chunks. The terms whose sum of similarities is greater than the hyperparameter $\beta$ become candidates for enriching the terminology of the topic $c_i^{(t^{\leftarrow})}$:

$$g(c_i^{(t^{\leftarrow})}, W_i^{(t)}, K_i^{(t)}) = \{\, w^{(t)} \in W_i^{(t)} : \sigma(\mathbf{w}^{(t)}, \mathbf{c}_i^{(t^{\leftarrow})}) + \sigma(\mathbf{w}^{(t)}, \bar{\mathbf{k}}_i^{(t)}) \geq \beta \,\}$$

where $\bar{\mathbf{k}}_i^{(t)}$ is the centre of the embeddings of the chunks in $K_i^{(t)}$.

The set of candidate terms is sorted in descending order according to the similarity score. In addition, one can also define a learning rate $\gamma$, which represents the maximum number of terms that can be associated to a certain topic at each iteration. Applying the bounds $\beta$ and $\gamma$ ensures that, at each iteration, the process of terminology enrichment includes only a small set of terms that are supposed to be meaningful with respect to the topic at hand.

Taking as an example the topic mentioned in the previous section, 'causality', it has been associated, among others, with the following terms and similarity scores: 'causality' (0.773), 'etiologic' (0.741), 'noncausal' (0.737).
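A minimal sketch of this selection rule follows. The WordNet lookup via NLTK is one possible realization of the external knowledge base, the value of gamma is an arbitrary assumption (the paper only fixes beta = 0.35), and all function names are illustrative.

```python
# Minimal sketch (illustrative, not the authors' code): candidate-term
# selection for terminology enrichment with bounds beta and gamma.
import numpy as np
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def sense_definitions(lemma: str) -> list[str]:
    """Each WordNet sense of a lemma yields its own definition w_d."""
    return [s.definition() for s in wn.synsets(lemma)]

def enrich_topic(term_vecs: np.ndarray,     # w, one row per candidate sense
                 topic_vec: np.ndarray,     # c of topic c_i
                 chunk_centre: np.ndarray,  # mean embedding of chunks in K_i
                 beta: float = 0.35,
                 gamma: int = 10) -> list[int]:
    """Indices of terms with sigma(w, c) + sigma(w, k̄) >= beta, top gamma."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = [cos(w, topic_vec) + cos(w, chunk_centre) for w in term_vecs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [i for i in ranked if scores[i] >= beta][:gamma]
```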
3.4. Topic Formation

Finally, tASKE may generate new topics in a topic formation phase. In this phase, a clustering algorithm, such as Affinity Propagation [16], is applied over the embedding vectors $\mathbf{w}$ of the terms $W_i^{(t)}$ related to each topic $c_i^{(t^{\leftarrow})}$. According to the results, a different operation is enforced:

• derivation: if new clusters, different from $c_i^{(t^{\leftarrow})}$, are formed, each of them becomes a new topic, derived from $c_i^{(t^{\leftarrow})}$, whose label is set equal to the term $w$ closest to the cluster center;

• conservation: if no new cluster is formed, the original topic $c_i^{(t^{\leftarrow})}$ is preserved, represented by the cluster in which the term $w$ corresponding to the concept label of $c_i^{(t^{\leftarrow})}$ is present;

• pruning: if a new cluster $c_j^{(t)}$ is formed but all its member terms also belong to $c_i^{(t^{\leftarrow})}$, the newer topic is absorbed by the older one.

In the end, term-topic belonging relationships and topic-topic derivation relationships are stored in the ACG, together with the document-topic classification relationships defined in the zero-shot classification phase, building up the topic genealogy. Topics $C^{(t)}$ defined in this phase will serve as input for the next iteration.

Considering the topic 'causality', consisting of the following set of terms {'causality', 'etiologic', 'noncausal', 'event', 'issue', 'circumstance', 'interpretation', 'explanandum'}, the tASKE model has formed three topics with the corresponding sets of terms: 'causality' = {'causality', 'etiologic', 'noncausal'}, 'event' = {'event', 'issue', 'circumstance'}, 'interpretation' = {'interpretation', 'explanandum'}.
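The following sketch illustrates, under assumed data structures, how the clustering step and the derivation/conservation/pruning decision could look with scikit-learn's Affinity Propagation. It is a simplified reading of the three rules above, not the authors' implementation; in particular, the cluster exemplar is used directly as the term closest to the cluster center.

```python
# Minimal sketch (simplified, not the authors' code): topic formation by
# clustering the term vectors of one topic and applying the three rules.
import numpy as np
from sklearn.cluster import AffinityPropagation

def form_topics(labels: list[str],        # term labels w_s for topic c_i
                term_vecs: np.ndarray,    # corresponding vectors w
                old_label: str,           # concept label of c_i
                old_terms: set[str]):     # terms already belonging to c_i
    clustering = AffinityPropagation(random_state=0).fit(term_vecs)
    new_topics = []
    for cid in np.unique(clustering.labels_):
        members = [labels[i] for i in np.where(clustering.labels_ == cid)[0]]
        if old_label in members:
            # conservation: this cluster keeps representing the old topic
            continue
        if set(members) <= old_terms:
            # pruning: nothing new here, absorbed by the older topic
            continue
        # derivation: a new topic, labelled by the exemplar term of the cluster
        exemplar = labels[clustering.cluster_centers_indices_[cid]]
        new_topics.append((exemplar, members))
    return new_topics
```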
4. Case Study and Evaluation

tASKE is here evaluated on a case study on Data Science literature. The evaluation framework has to account for three targets:

1. correctness of extracted topics;
2. correctness of the time of extraction;
3. correctness of topic-topic derivation relationships.

First, a "Data Science in Scopus" corpus (hereon ScopusDS corpus), made of abstracts of journal papers ranging from January 2000 to December 2021, is constructed. Then the keywords defined by the authors of each paper are exploited to generate a ground truth for all three targets, and our method is evaluated against this ground truth. Finally, we perform a brief qualitative analysis of the results, which is complementary to the quantitative evaluation.

4.1. Corpus Construction

The ScopusDS corpus has been retrieved from Elsevier Scopus by downloading publications in the time interval from January 2000 to December 2021 according to selected subject areas that are concerned with the "data science" subject. For each publication, eid, year, title, abstract, document type, and author-assigned keywords have been downloaded. Furthermore, additional metadata are retrieved (e.g., author name and affiliation, journal/conference name, ISSN, publication type). The corpus content is described in Table 1 in terms of considered subject areas and corresponding number of retrieved publications.

ID     Scopus Subject area                          # of pub.
1702   Artificial Intelligence                      1,024,703
1800   General Decision Sciences                    65,254
1801   Decision Sciences (miscellaneous)            39,058
1802   Information Systems and Management           377,259
1803   Management Science and Operations Research   258,898
1804   Statistics, Probability and Uncertainty      168,219
2613   Statistics and Probability                   426,341
Total                                               2,359,732

Table 1: Composition of the ScopusDS corpus used for evaluation.

Besides the paper abstract, two pieces of metadata were taken into account in the analysis: the publication date and the list of keywords provided by the author(s). We selected only documents of type "article" that are accompanied by at least 3 keywords and are at least 30 words long, finally amounting to 766,867 documents. Figure 3 shows the number of documents and keywords per year.

Figure 3: Number of documents and keywords per year in the ScopusDS corpus.

4.2. Definition of a Ground Truth

Keywords provided by the authors of each paper are natural candidates to form a ground truth for topic modelling of scientific papers. Exact matching between keywords and extracted topics, however, would yield no significant result, because topics are defined as sets of terms whereas keywords are strings, and author-assigned keywords may not be linked to terms in the external knowledge base employed in tASKE. Hence we define an alternative evaluation methodology which makes use of a non-contextual word embedding model to compute the similarity between keywords and extracted topics.

For target (1), we compare the clusters extracted by tASKE with the set of keywords at each time $t$.

For target (2), we are interested in knowing whether the topics were extracted at the correct time, so we compare the clusters extracted at each time with the entire set of keywords. A comparison of the resulting metrics with the ones obtained for target (1) provides an indicator of the timeliness of tASKE extraction: if a topic $c$, extracted by tASKE at time $t$, is more similar to keywords from time $t' \neq t$ than to the ones from $t$, then $c$ can be deemed more appropriate to describe the subcorpus at time $t'$ and was extracted either "too soon" or "too late".

Defining target (3) is more complicated, since no genealogical structure is inherently defined on paper keywords. We must first define a set of heuristics to derive a ground truth from the keyword lists assigned to documents. Specifically, we say that a subsequent keyword $w'$ is derived from an antecedent keyword $w$ at time $t$ if (a sketch of this rule follows below):

• $w$ was associated to some document at some time $t^{\leftarrow} < t$;
• $w'$ has never been associated to any document at any $t^{\leftarrow} < t$;
• the number of keyword co-occurrences at $t$, $F_t$, is such that $F_t(w, w') \geq 1$.
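A minimal sketch of these heuristics follows; the input representation (a mapping from year to per-document keyword lists) and the function name are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the authors' code): deriving ground-truth
# keyword derivation pairs from per-year keyword lists.
from itertools import combinations

def derivation_pairs(docs_by_year: dict[int, list[list[str]]]) -> set[tuple[str, str, int]]:
    """docs_by_year maps a year t to the keyword lists of its documents.
    Returns triples (w, w', t) where w appeared before t, w' is new at t,
    and w, w' co-occur in at least one document of year t (F_t(w, w') >= 1)."""
    seen: set[str] = set()  # keywords observed at any earlier time
    pairs: set[tuple[str, str, int]] = set()
    for year in sorted(docs_by_year):
        year_kws = {kw for kws in docs_by_year[year] for kw in kws}
        new_at_t = year_kws - seen
        for kws in docs_by_year[year]:
            for a, b in combinations(set(kws), 2):
                # Check both orientations of the co-occurring pair.
                for w, w_new in ((a, b), (b, a)):
                    if w in seen and w_new in new_at_t:
                        pairs.add((w, w_new, year))
        seen |= year_kws
    return pairs
```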
4.3. Quantitative Evaluation

We run tASKE on the ScopusDS corpus by selecting years as the time units into which the corpus is split. Since tASKE requires to be initialized with a set of input topics $C^{(0)} = \{c_1^{(0)}, \ldots, c_n^{(0)}\}$, we exclude papers of year 2000 from the evaluation and use the set of keywords assigned to them to derive $C^{(0)}$. This set of terms $W^{(0)}$ is first filtered to retain only terms that appear in WordNet, i.e. the knowledge base used for this evaluation. To avoid the injection of spurious topics into the system, $C^{(0)}$ is further filtered to keep only monosemic terms, i.e. terms that are linked to a single WordNet synset, and the 100 with the highest frequency are sampled. In order to retrieve initial topics from this set of terms, we apply Affinity Propagation [16], eventually obtaining $n = 20$ topic clusters, mostly related to mathematics (e.g. regression analysis = {regression analysis, linear regression, multiple regression}) and computer science (internet = {internet, information system, bandwidth, world wide web, electronic mail}), but also to domains of application (air pollution = {air pollution, air transport}).

As for hyperparameters, we set the thresholds $\alpha$ and $\beta$ equal to one another, so as to have a single value to tune, and since we found the system to be effective for $\beta \leq 0.35$, the experiments were conducted with $\alpha = \beta = 0.35$ to achieve efficiency in terms of computation time.

To assess the closeness of the topics retrieved by tASKE to the ground truth, we train a Word2Vec model on a pseudo-corpus whose documents are a concatenation of document chunks and their ground-truth keywords. By exploiting this model, as was done for instance in [17], it is possible to embed keywords and extracted terms in the same vector space. For each year, we define topic embeddings again as the centroids of the embeddings of topic-related terms, which may change from year to year even for the same topic, and we compute the cosine similarity between the resulting vectors and the set of keywords. This is done by single linkage, i.e. finding the closest keyword for each topic embedding. Figure 4 reports the mean and the standard deviation of the results for each year.

Figure 4: Distributions of similarities between topics extracted in each year and the closest keyword from the same year, or the closest keyword from all years.

The outcomes displayed in Figure 4 are promising, with a mean similarity going from 86.99% in 2001 (sd = 9.98%) to 80.20% in 2021 (sd = 13.47%), touching a minimum of 77.12% (sd = 10.98%) for year 2006. The figure does not only show the effectiveness of tASKE for target (1), i.e. discovering topics in a corpus, but also for target (2), i.e. discovering them at the proper time. Indeed, at each year, matching with keywords from other years yields better similarities only for a few topics per year, as is shown by the overlap of the similarity distributions.

In the same way as we did for each topic, we can measure the maximal similarity between each keyword and the set of topics in each year, which may be considered a proxy for recall. The resulting similarity distributions, going from a 34.77% mean (sd = 11.28%) in 2001 to a 65.81% mean (sd = 12.76%) in 2021, are displayed in Figure 5. Although maximising recall was not our main interest, we found that the system gets closer and closer to finding at least a topic for each keyword.

Figure 5: Distributions of similarities between keywords from each year and the closest topic from the same year, or the closest topic from all years.

As for target (3), we experimented with the same evaluation method, taking into account the derivation pairs defined in the ground truth, of the type (antecedent topic, subsequent topic), together with the year of derivation. Topic and keyword embeddings are concatenated, forming derivation pair embeddings; similarities are then computed by finding the keyword pair closest to each topic pair. Results for target (3) are shown in Table 2, both in the case that accounts only for direct derivation relationships $c_i^{(t^{\leftarrow})} \rightarrow c_j^{(t)}$ and in the one where indirect derivations are considered as well, i.e. $c_i^{(t^{\leftarrow})} \rightarrow c_j^{(t)}$ if $\exists\, c^{(\tau_1)}, \ldots, c^{(\tau_L)}$ with $\tau_1, \ldots, \tau_L \in (t^{\leftarrow}, t)$ such that $c_i^{(t^{\leftarrow})} \rightarrow c^{(\tau_1)}$, $c^{(\tau_l)} \rightarrow c^{(\tau_{l+1})}$ for all $l = 1, \ldots, L-1$, and $c^{(\tau_L)} \rightarrow c_j^{(t)}$.

                                  Mean     Std
Only direct derivations           67.24%   14.51%
Including indirect derivations    69.79%   14.43%

Table 2: Mean and standard deviation of similarities between topic derivation pairs and keyword derivation pairs.

Results are naturally better when indirect derivation relationships are included, but the difference between these and the ones that account only for direct relationships is small enough to assume that tASKE can find short-term derivations, but has more difficulty in managing long-term ones, likely due to cumulative errors.
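A minimal sketch of the topic-keyword matching used in this evaluation follows; gensim's Word2Vec is assumed as the non-contextual embedding model, the tiny pseudo-corpus is a toy placeholder, and all names are illustrative.

```python
# Minimal sketch (assumed setup, not the authors' code): single-linkage
# similarity between a topic centroid and a set of keywords.
import numpy as np
from gensim.models import Word2Vec

# Pseudo-corpus: each document is its chunk tokens followed by its keywords.
pseudo_corpus = [
    ["causal", "graph", "diagram", "causality"],
    ["regression", "linear", "model", "causality"],
]
w2v = Word2Vec(sentences=pseudo_corpus, vector_size=100, min_count=1)

def topic_centroid(term_labels: list[str]) -> np.ndarray:
    """Topic embedding = centroid of the embeddings of its terms."""
    vecs = [w2v.wv[t] for t in term_labels if t in w2v.wv]
    return np.mean(vecs, axis=0)

def closest_keyword_similarity(topic_terms: list[str],
                               keywords: list[str]) -> float:
    """Single linkage: cosine similarity of the closest keyword."""
    c = topic_centroid(topic_terms)
    c /= np.linalg.norm(c)
    sims = [float(c @ (w2v.wv[k] / np.linalg.norm(w2v.wv[k])))
            for k in keywords if k in w2v.wv]
    return max(sims, default=0.0)
```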
4.4. Qualitative Analysis

To grasp the potential as well as the current limitations of tASKE in a broader perspective, we looked at the genealogy it produces and at the topics having low similarity with keywords from the same year, as well as the derivation relationships they are involved in. An example of the topic genealogy produced by tASKE is shown in Figure 6.

Figure 6: A sample of the final topic genealogy produced by tASKE.

We noticed that the number of extracted topics tends to grow quadratically in the first iterations, going from 57 (containing 178 terms) in year 2001 to 3,039 (with 8,950 terms) in 2009, while slowing down at later iterations, reaching 6,135 topics and 18,189 terms in 2021. This behavior indicates that the system accelerates until most common knowledge is retrieved. A surplus of generic topics is produced. Such topics contain few terms and also contribute to lowering the similarity with keywords, as most of these belong to the domain lexicon. For instance, the topics 'diagram', 'cast', 'fill', 'known', 'let', 'lie', 'play' all have similarity lower than 0.5 with keywords from the same year, and give rise to relationships that further diverge from the domain of interest: from 'play' to 'toy' and 'fun', from 'diagram' to 'display' and 'drafting'. These are topics that do appear in the form of terms in the ScopusDS corpus, but attention has to be paid to the system misinterpreting their meaning or their importance.

tASKE has proved to be capable of capturing some of the topics that marked recent developments or applications in the Data Science domain, such as: 'face recognition' (2014, from 'biometric identification'), as shown in Figure 6; 'speech production' (2004, from 'wavelet'); 'search engine' (2004, from 'internet'); 'ontology' (2006, from 'knowledge'); 'clustering' (2006, from 'class'); 'natural language processor' (2008, from 'internet'); 'graphical user interface' (2008, from 'internet'); 'cryptanalytic' (2010, from 'cryptography'); 'flight control' (2012, from 'flight simulator'); 'machine readable' (2017, from 'internet'); 'automatic face recognition' (2016, from 'face recognition' through 'identity verification'). Another category is the one that includes topics of interest but with a spurious derivation, e.g. 'neural network' (2006), here derived from 'internet', or 'cryptography' (2008), which descends from 'air pollution'. An even clearer example of the boundaries the external knowledge base imposes on tASKE is given by the topic 'percolation', which may refer to the 'clique percolation technique' in the documents but is here linked to 'air pollution' due to the absence of any non-physical sense of the term 'percolation' from WordNet. We also acknowledge that most extracted topics are related to domains of application, e.g. medicine, physics, chemistry, social sciences, etc. Including these topics in a hierarchical class structure may prove beneficial to simplify the visualization of the topic genealogy.

5. Concluding Remarks

Starting from the increasingly pressing need to understand the evolution of ideas and research themes in scientific literature, in this work we have presented tASKE, a method for identifying topics in a diachronic corpus of scientific articles. Time in tASKE is a crucial aspect, as the goal is not only to identify the topics in their right temporal collocation, but also to understand how a topic can derive from previous topics, in order to reconstruct the genealogy of the topics in time. tASKE makes it possible to achieve these objectives with an unsupervised approach, i.e., without the need to resort to large and complex pre-annotated datasets. The experimental results, obtained on a corpus of real scientific publications covering a period of 21 years, show how tASKE is able to identify the topics deemed relevant by the authors of the papers and expressed by means of thematic keywords. In particular, the topics identified by tASKE are not only adequate, but also placed in the correct time period and related to each other in a genealogy that describes their evolution. Our current and future work on tASKE is aimed at three main goals: i) introducing an adaptive learning rate, with the aim of controlling the number of new topics discovered by tASKE in each time period according not only to topic relevance but also to the capability of each topic to potentially induce the discovery of new topics in future iterations; ii) making tASKE independent from external knowledge bases, exploiting contextual embeddings, so as to avoid restricting a priori the vocabulary of terms that can be extracted; iii) performing further evaluations, both by comparing tASKE with other temporal topic modeling methods and by assessing the quality of topics and their genealogy through the evaluation of domain experts.

References

[1] A. Ferrara, S. Picascia, D. Riva, Context-aware knowledge extraction from legal documents through zero-shot classification, in: R. Guizzardi, B. Neumayr (Eds.), Advances in Conceptual Modeling, Springer International Publishing, Cham, 2022, pp. 81–90.

[2] T. K. Landauer, P. W. Foltz, D. Laham, An introduction to latent semantic analysis, Discourse Processes 25 (1998) 259–284. doi:10.1080/01638539809545028.

[3] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.

[4] M. Rabinovich, D. Blei, The inverse regression topic model, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Beijing, China, 2014, pp. 199–207.
[5] A. Gruber, Y. Weiss, M. Rosen-Zvi, Hidden topic Markov models, in: M. Meila, X. Shen (Eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, PMLR, San Juan, Puerto Rico, 2007, pp. 163–170.

[6] D. M. Blei, J. D. Lafferty, Dynamic topic models, in: Proceedings of the 23rd International Conference on Machine Learning, ICML '06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 113–120. doi:10.1145/1143844.1143859.

[7] L. Sun, Y. Yin, Discovering themes and trends in transportation research using topic modeling, Transportation Research Part C: Emerging Technologies 77 (2017) 49–66. doi:10.1016/j.trc.2017.01.013.

[8] T. M. Abuhay, Y. G. Nigatie, S. V. Kovalchuk, Towards predicting trend of scientific research topics using topic modeling, Procedia Computer Science 136 (2018) 304–310. doi:10.1016/j.procs.2018.08.284. 7th International Young Scientists Conference on Computational Science, YSC2018, 02–06 July 2018, Heraklion, Greece.

[9] A. B. Dieng, F. J. R. Ruiz, D. M. Blei, The dynamic embedded topic model, CoRR abs/1907.05545 (2019). URL: http://arxiv.org/abs/1907.05545. arXiv:1907.05545.

[10] Q. Gao, X. Huang, K. Dong, Z. Liang, J. Wu, Semantic-enhanced topic evolution analysis: a combination of the dynamic topic model and word2vec, Scientometrics 127 (2022) 1543–1563. URL: https://doi.org/10.1007/s11192-022-04275-z. doi:10.1007/s11192-022-04275-z.

[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.

[13] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.

[14] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic representation: Dataless classification, in: AAAI, volume 2, 2008, pp. 830–835.

[15] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Language, Speech, and Communication, MIT Press, Cambridge, MA, 1998.

[16] D. Dueck, Affinity propagation: clustering data by passing messages, University of Toronto, Toronto, ON, Canada, 2009.

[17] F. Role, S. Morbieu, M. Nadif, Unsupervised evaluation of text co-clustering algorithms using neural word embeddings, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1827–1830. URL: https://doi.org/10.1145/3269206.3269282. doi:10.1145/3269206.3269282.