Exploiting Contextual Embeddings to Extract Topic Genealogy from Scientific Literature

Alfio Ferrara, Stefano Montanelli, Sergio Picascia*, and Davide Riva*
Università degli Studi di Milano, Department of Computer Science, Via Celoria, 18 - 20133 Milano, Italy

SDU2023: The Third AAAI Workshop on Scientific Document Understanding, February 14, 2023, Washington, DC
*Corresponding author.
alfio.ferrara@unimi.it (A. Ferrara); stefano.montanelli@unimi.it (S. Montanelli); sergio.picascia@unimi.it (S. Picascia); davide.riva1@unimi.it (D. Riva)

Abstract
Modeling the evolution of topics and forecasting future trends is a crucial task when analyzing scientific papers. In this work we propose tASKE (temporal Automated System for Knowledge Extraction), a dynamic topic modeling approach that exploits zero-shot classification and contextual embeddings to track topic evolution through time. The approach is evaluated on a corpus of data science papers, assessing the ability of tASKE both to correctly classify documents and to retrieve relevant derivation relationships between older and newer topics over time.

Keywords
Natural Language Processing, Scientometrics, Topic Genealogy

1. Introduction

With the amount of published scientific literature increasing each year, keeping track of newly formulated topics and their derivation process becomes a challenge for researchers, scholars, and publishers. The problem lies in the fact that the total number of definitions, theorems, properties, tasks, and subdomains tends to grow exponentially, since several of them may be conceived starting from a single one or from the interaction of a few. For instance, in the domain of Machine Learning, the idea of neural networks gave rise to that of deep learning, which has since been applied to problems such as image reconstruction and partial differential equations, and was further deepened with topics such as attention, which in turn provided the intuition behind transformers and a basis for explainability.

Referring to definitions, theorems, properties, tasks, subdomains, and the like with the generic label of "topics", i.e. abstract objects a text refers to, it is possible to study "topic genealogy" in a diachronic corpus, i.e. the descent of topics from older ones over time. The task of extracting topic genealogy falls within the scope of Knowledge Extraction (KE), and it consists of two main sub-tasks: i) topic extraction, by which we aim to retrieve topics that are important in a written document, possibly in a timely manner in order to discover topics when they actually appear, and ii) genealogy reconstruction, in which extracted topics are placed in a tree structure representing their lineage in the history of the discipline.

In this paper, we present tASKE, a method to extract topics from a diachronic corpus of scientific papers and reconstruct their genealogy in a completely unsupervised way. Our method is built upon our Automated System for Knowledge Extraction (ASKE) framework [1], which relies on pre-trained contextual embedding models to represent documents and topics in the same vector space, and on a cyclical term extraction and clustering phase to extract new topics. Besides presenting tASKE as a time-aware extension of ASKE, we introduce an evaluation framework and a case study on a corpus of abstracts of scientific papers related to the Data Science domain, with the goal of demonstrating the effectiveness of tASKE both for topic extraction and for extracting topic-to-topic derivation relationships.

The work is organized as follows: Section 2 (Related Work) reports on the literature about topic modeling as well as the technology underlying our method. Section 3 (Methodology) presents the methodology and techniques enforced in tASKE.
Section 4 (Case Study and Evaluation) presents the case study on a Data Science literature corpus, on which the evaluation was conducted. Section 5 (Concluding Remarks) draws some conclusions and sketches future work.

2. Related Work

The task of classifying large amounts of textual documents without relying on labeled data, while presenting latent features of texts such as hidden topics, is commonly addressed by employing topic modeling techniques. Latent Semantic Analysis (LSA) [2] was one of the first proposed approaches, exploiting Singular Value Decomposition (SVD) to reduce the number of dimensions of a document-term matrix and to easily compute similarity between document vector representations. LSA was soon followed by Latent Dirichlet Allocation (LDA) [3], which employs Bayesian analysis to optimize the distributions of documents over topics, and of words defining those topics. The majority of recent work in topic modeling takes its inspiration from the original LDA, with several variations proposed, such as Correlated Topic Modeling [4] and Hierarchical Topic Modeling [5].

Common topic modeling methods are not able to capture the changes of topics over time. For this reason, techniques of Dynamic Topic Modeling (DTM) are employed when dealing with diachronic corpora. Since the first approach (Dynamic LDA [6]) was proposed, the field has been attracting attention among researchers. Among the possible applications of the designed methods, the study of scientific papers, also known as "Scientometrics", has been addressed with the aim of assessing past and present trends in a specific discipline [7] or forecasting possible future subareas of research interest [8].

Later studies have taken into consideration the integration of DTM with word embeddings [9], so as to further capture the semantic aspects of the analyzed documents [10]. Embedding techniques are widely employed in the field of Natural Language Processing (NLP) to represent textual data in a vector space. Several models capable of computing contextual token embeddings have been released since the presentation of BERT [11], each of them tailored to specific tasks, such as semantic similarity [12] and zero-shot learning [13].

Zero-Shot Learning (ZSL) is a problem setup in the field of machine learning where a classifier is required to predict labels of examples extracted from classes that were never observed in the training phase. It was first referred to as dataless classification in 2008 [14] and has quickly become a subject of interest, particularly in the field of NLP. The great advantage of this approach is that the resulting classifier is able to operate efficiently in a partially or totally unlabeled environment.

tASKE aims at dynamically modeling the presence and evolution of latent topics in a diachronic corpus of documents. It exploits zero-shot learning and contextual embeddings not only to perform the classification task, but also to extract relevant knowledge from textual data.

3. Methodology

The objective of tASKE is to extract a genealogy of topics from a diachronic corpus of documents.
Every piece of information is stored in a graph-based data structure called the tASKE Conceptual Graph (ACG), whose architecture is illustrated in Figure 1.

Figure 1: The tASKE Conceptual Graph, with classification, derivation, and belonging relations among its nodes.

The nodes in the ACG model belong to three different categories:

• document chunks $K$: the object of the analysis, they are small portions of the original documents extracted through the application of tokenization techniques. They are tuples of the form $(k, \mathbf{k})$, where $k$ is the text of the document chunk and $\mathbf{k}$ is its vector representation;

• topics $C$: they represent the abstract objects to which document chunks are assigned and, in practice, they are clusters of related terms. They are tuples of the form $(c, \mathbf{c})$, where $c$ is the label given to the topic and $\mathbf{c}$ is its vector representation;

• terms $W$: they are extracted from document chunks and clustered together in order to form topics. They are triplets of the form $(w_s, w_d, \mathbf{w})$, where $w_s$ is the label of the term, $w_d$ is a short sentence giving the term definition, and $\mathbf{w}$ is the term vector representation.

The vector representations $\mathbf{k}$ of document chunks and $\mathbf{w}$ of terms are computed by an embedding model which maps a text into a vector space: for $\mathbf{k}$, the embedding model is applied over the document chunk text $k$, while for $\mathbf{w}$, it is applied over the term definition $w_d$. The vector representation $\mathbf{c}$ of topics is computed as the mean of the vectors $\mathbf{w}_i$ of all the terms $w_i$ belonging to $c$. The label of each topic corresponds to the label $w_s$ of the term $w$ that is the closest to $\mathbf{c}$.
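As a concrete illustration, the following minimal sketch shows how chunk, term, and topic vectors could be computed as described above. It is not the authors' implementation: the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are assumptions standing in for whichever Sentence-BERT variant tASKE actually uses, and the example terms are taken from the paper.

```python
# Minimal sketch (assumed model and names, not the authors' code):
# computing k, w, and c vectors with a Sentence-BERT model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# Terms are (label w_s, definition w_d) pairs; w is embedded from w_d.
terms = [
    ("causality", "the relation between causes and effects"),
    ("etiologic", "relating to the study of causes"),
]
term_vecs = model.encode([w_d for _, w_d in terms])  # one vector per term

# The topic vector c is the mean of the vectors w of its terms ...
topic_vec = term_vecs.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# ... and the topic label is the label w_s of the term closest to c.
label = max(zip(terms, term_vecs), key=lambda tw: cosine(tw[1], topic_vec))[0][0]

# Document chunks k are embedded from their text, in the same vector space.
chunk_vecs = model.encode(["graphical representations of causation ..."])
```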
At the beginning of the analysis (i.e., time 0) the user is required to define a set of initial topics $C^{(0)}$ of interest. Each topic $c_i^{(0)} \in C^{(0)}$ is associated with a set of corresponding terms $W_i^{(0)}$, whose definitions are also provided by the user. At each subsequent time $t$, tASKE performs one or more iterations of the cycle depicted in Figure 2. As a first step, tASKE extracts the set of document chunks $K^{(t)}$ from the subset $D^{(t)} \subseteq D$ belonging to that period. Such document chunks are classified with respect to the topics discovered up to the previous time period $t^{\leftarrow}$, $C^{(t^{\leftarrow})}$. Moreover, tASKE extracts new terms $W^{(t)}$ from the document chunks and assigns them to the topics $C^{(t^{\leftarrow})}$, finally updating the set of current topics $C^{(t)}$.

Figure 2: The tASKE cycle at time $t$: data preprocessing, zero-shot classification, terminology enrichment, and topic formation, connected to the corpus, the embedding model, the knowledge base, and the ACG.

As a consequence of this process, a topic $c_j^{(t^{\leftarrow})}$ in the ACG can have multiple relations with the other components of the ACG. In particular, for $c_j^{(t^{\leftarrow})}$, we have: i) a relation classification with document chunks $K^{(t)}$; ii) a relation derivation with terms $W_j^{(t)}$ discovered from document chunks associated with $c_j^{(t^{\leftarrow})}$; iii) a relation belonging with terms $W_j^{(t)}$ in its cluster; iv) a relation derivation with a new topic $c_l^{(t)}$ formed by some of the terms in $W_j$. It can happen that a topic $c_k^{(t^{\leftarrow})}$ is not associated with any document chunk at time $t$. This means $c_k^{(t^{\leftarrow})}$ is no longer a useful topic with respect to the documents of time $t$. In this case, the topic $c_k^{(t^{\leftarrow})}$ becomes inactive, together with the set of terms $W_k^{(t^{\leftarrow})}$ belonging to it, and it will not be able to form new topics. This can be interpreted as the disappearance of interest towards a certain topic, which emerged in past periods $t^{\leftarrow}$ but has lost its relevance in the current corpus $K^{(t)}$.

In the remaining part of this section, we discuss each phase in Figure 2, explaining in deeper detail how each of the aforementioned relations is discovered.

3.1. Data Preprocessing

Preprocessing is the starting point of the tASKE cycle. At each time period $0, \ldots, t$, the model retrieves documents from the period-specific subcorpus $D^{(t)}$. Documents are first split into document chunks $K^{(t)}$, each of which can fit into the maximum input length of a contextual embedding model. In this case, we employed Sentence-BERT [12], a modification of the original BERT model which exploits siamese and triplet networks, being able to derive semantically meaningful sentence embeddings in the form of numeric vectors. Such a model is employed to extract the semantic features of term definitions and document chunks and map them into the same vector space.

3.2. Zero-Shot Classification

In the zero-shot classification phase, document-topic classification relationships are defined. Given the coexistence of topic and document chunk embeddings in the same vector space, it is possible to perform a zero-shot classification, $f : K^{(t)} \rightarrow C^{(t^{\leftarrow})}$, without having the model exposed to training examples. A similarity measure $\sigma$ (e.g., cosine similarity) between the embedding vector $\mathbf{k}_j^{(t)}$ of each document chunk $k_j^{(t)}$ in $K^{(t)}$ and the embedding vector $\mathbf{c}_i^{(t^{\leftarrow})}$ of each topic $c_i^{(t^{\leftarrow})}$ in $C^{(t^{\leftarrow})}$ is computed and, eventually, the two are associated if their similarity is higher than a predefined threshold $\alpha$:

$$f_{C^{(t^{\leftarrow})}}(k_j^{(t)}) = \{\, c_i^{(t^{\leftarrow})} \in C^{(t^{\leftarrow})} : \sigma(\mathbf{k}_j^{(t)}, \mathbf{c}_i^{(t^{\leftarrow})}) \geq \alpha \,\}$$

Tuning the hyperparameter $\alpha$ is crucial, since it may remarkably affect the classification output: for example, choosing a high value of $\alpha$ could result in a highly precise classification, despite potentially finding only a small set of document chunks for each topic (low recall).

Finally, classification relationships are stored in the ACG by considering documents as the simple concatenation of their chunks, so that a document $d_j$ is labelled with all topics its chunks are labelled with. For example, the document chunk

"[...] graphical representations of causation have been used for at least seventy years, and the modern development of directed acyclic graphs to portray causal systems continues the trend. It is sometimes difficult to understand, however, what it is about these diagrams that is causal [...]"

is classified by tASKE with the topic 'causality' with a similarity score of 0.652.
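To make the classification rule above concrete, here is a minimal sketch of the zero-shot step under the stated definitions; all names are illustrative, the vectors are assumed to come from the same embedding model as above, and the threshold value matches the one used in the experiments of Section 4.

```python
# Minimal sketch (assumed names, not the authors' code): zero-shot
# classification of chunks against topics by thresholded cosine similarity.
import numpy as np

def classify_chunks(chunk_vecs: np.ndarray,
                    topic_vecs: np.ndarray,
                    alpha: float = 0.35) -> list[list[int]]:
    """For each chunk k, return indices of topics with sigma(k, c) >= alpha."""
    # Normalize rows so the dot product equals cosine similarity.
    k = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    c = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    sims = k @ c.T  # (n_chunks, n_topics) similarity matrix
    return [list(np.where(row >= alpha)[0]) for row in sims]

def classify_document(chunk_topic_ids: list[list[int]]) -> set[int]:
    """A document is labelled with all topics its chunks are labelled with."""
    return set().union(*chunk_topic_ids) if chunk_topic_ids else set()
```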
3.3. Terminology Enrichment

For each topic $c_i^{(t^{\leftarrow})}$ in the ACG, tASKE retrieves the set of lemmatized terms $W_i^{(t)}$ appearing in the subset of document chunks $K_i^{(t)}$ associated with $c_i^{(t^{\leftarrow})}$ by a classification relation. These term vectors are placed in the same semantic space, together with $\mathbf{k}$ and $\mathbf{c}$, by retrieving their definition $w_d$ from an external knowledge base, such as WordNet [15], and computing its vector representation $\mathbf{w}$ with the aforementioned embedding model. This approach addresses the problem of sense disambiguation, since it maps distinct senses of polysemic words to different embedding vectors.

For each retrieved term sense, the same similarity measure $\sigma$ used for classification is exploited to compute the similarity between $\mathbf{w}$ and the vectors representing topics and document chunks. The terms whose sum of similarities is greater than the hyperparameter $\beta$ become candidates for enriching the terminology of the topic $c_i^{(t^{\leftarrow})}$:

$$g(c_i^{(t^{\leftarrow})}, W_i^{(t)}, K_i^{(t)}) = \{\, w^{(t)} \in W_i^{(t)} : \sigma(\mathbf{w}^{(t)}, \mathbf{c}_i^{(t^{\leftarrow})}) + \sigma(\mathbf{w}^{(t)}, \bar{\mathbf{k}}_i^{(t)}) \geq \beta \,\}$$

where $\bar{\mathbf{k}}_i^{(t)}$ is the centre of the embeddings of the chunks in $K_i^{(t)}$.

The set of candidate terms is sorted in descending order according to the similarity score. In addition, one can also define a learning rate $\gamma$, which represents the maximum number of terms that can be associated to a certain topic at each iteration. Applying the bounds $\beta$ and $\gamma$ ensures that, at each iteration, the process of terminology enrichment includes only a small set of terms that are supposed to be meaningful with respect to the topic at hand.

Taking as an example the topic mentioned in the previous section, 'causality', it has been associated, among others, with the following terms and similarity scores: 'causality' (0.773), 'etiologic' (0.741), 'noncausal' (0.737).
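A minimal sketch of this selection rule follows. The WordNet lookup via NLTK is one possible realization of the external knowledge base, the value of gamma is an arbitrary assumption (the paper only fixes beta = 0.35), and all function names are illustrative.

```python
# Minimal sketch (illustrative, not the authors' code): candidate-term
# selection for terminology enrichment with bounds beta and gamma.
import numpy as np
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def sense_definitions(lemma: str) -> list[str]:
    """Each WordNet sense of a lemma yields its own definition w_d."""
    return [s.definition() for s in wn.synsets(lemma)]

def enrich_topic(term_vecs: np.ndarray,     # w, one row per candidate sense
                 topic_vec: np.ndarray,     # c of topic c_i
                 chunk_centre: np.ndarray,  # mean embedding of chunks in K_i
                 beta: float = 0.35,
                 gamma: int = 10) -> list[int]:
    """Indices of terms with sigma(w, c) + sigma(w, k̄) >= beta, top gamma."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = [cos(w, topic_vec) + cos(w, chunk_centre) for w in term_vecs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [i for i in ranked if scores[i] >= beta][:gamma]
```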
3.4. Topic Formation

Finally, tASKE may generate new topics in a topic formation phase. In this phase, a clustering algorithm, such as Affinity Propagation [16], is applied over the embedding vectors $\mathbf{w}$ of the terms $W_i^{(t)}$ related to each topic $c_i^{(t^{\leftarrow})}$. According to the results, a different operation is enforced:

• derivation: if new clusters, different from $c_i^{(t^{\leftarrow})}$, are formed, each of them becomes a new topic, derived from $c_i^{(t^{\leftarrow})}$, whose label is set equal to the term $w$ closest to the cluster center;

• conservation: if no new cluster is formed, the original topic $c_i^{(t^{\leftarrow})}$ is preserved, represented by the cluster in which the term $w$ corresponding to the concept label of $c_i^{(t^{\leftarrow})}$ is present;

• pruning: if a new cluster $c_j^{(t)}$ is formed but all its member terms also belong to $c_i^{(t^{\leftarrow})}$, the newer topic is absorbed by the older one.

In the end, term-topic belonging relationships and topic-topic derivation relationships are stored in the ACG, together with the document-topic classification relationships defined in the zero-shot classification phase, building up the topic genealogy. Topics $C^{(t)}$ defined in this phase will serve as input for the next iteration.

Considering the topic 'causality', consisting of the following set of terms {'causality', 'etiologic', 'noncausal', 'event', 'issue', 'circumstance', 'interpretation', 'explanandum'}, the tASKE model has formed three topics with the corresponding sets of terms: 'causality' = {'causality', 'etiologic', 'noncausal'}, 'event' = {'event', 'issue', 'circumstance'}, 'interpretation' = {'interpretation', 'explanandum'}.
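The following sketch illustrates, under assumed data structures, how the clustering step and the derivation/conservation/pruning decision could look with scikit-learn's Affinity Propagation. It is a simplified reading of the three rules above, not the authors' implementation; in particular, the cluster exemplar is used directly as the term closest to the cluster center.

```python
# Minimal sketch (simplified, not the authors' code): topic formation by
# clustering the term vectors of one topic and applying the three rules.
import numpy as np
from sklearn.cluster import AffinityPropagation

def form_topics(labels: list[str],        # term labels w_s for topic c_i
                term_vecs: np.ndarray,    # corresponding vectors w
                old_label: str,           # concept label of c_i
                old_terms: set[str]):     # terms already belonging to c_i
    clustering = AffinityPropagation(random_state=0).fit(term_vecs)
    new_topics = []
    for cid in np.unique(clustering.labels_):
        members = [labels[i] for i in np.where(clustering.labels_ == cid)[0]]
        if old_label in members:
            # conservation: this cluster keeps representing the old topic
            continue
        if set(members) <= old_terms:
            # pruning: nothing new here, absorbed by the older topic
            continue
        # derivation: a new topic, labelled by the exemplar term of the cluster
        exemplar = labels[clustering.cluster_centers_indices_[cid]]
        new_topics.append((exemplar, members))
    return new_topics
```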
4. Case Study and Evaluation

tASKE is here evaluated on a case study on Data Science literature. The evaluation framework has to account for three targets:

1. correctness of extracted topics;
2. correctness of the time of extraction;
3. correctness of topic-topic derivation relationships.

First, a "Data Science in Scopus" corpus (hereon ScopusDS corpus), made of abstracts of journal papers ranging from January 2000 to December 2021, is constructed. Then the keywords defined by the authors of each paper are exploited to generate a ground truth for all three targets, and our method is evaluated against this ground truth. Finally, we perform a brief qualitative analysis of the results, which is complementary to the quantitative evaluation.

4.1. Corpus Construction

The ScopusDS corpus has been retrieved from Elsevier Scopus by downloading publications in the time interval from January 2000 to December 2021 according to selected subject areas that are concerned with the "data science" subject. For each publication, eid, year, title, abstract, document type, and author-assigned keywords have been downloaded. Furthermore, additional metadata are retrieved (e.g., author name and affiliation, journal/conference name, ISSN, publication type). The corpus content is described in Table 1 in terms of considered subject areas and corresponding number of retrieved publications.

ID     Scopus Subject area                          # of pub.
1702   Artificial Intelligence                      1,024,703
1800   General Decision Sciences                    65,254
1801   Decision Sciences (miscellaneous)            39,058
1802   Information Systems and Management           377,259
1803   Management Science and Operations Research   258,898
1804   Statistics, Probability and Uncertainty      168,219
2613   Statistics and Probability                   426,341
Total                                               2,359,732

Table 1: Composition of the ScopusDS corpus used for evaluation.

Besides the paper abstract, two pieces of metadata were taken into account in the analysis: the publication date and the list of keywords provided by the author(s). We selected only documents of type "article" that are accompanied by at least 3 keywords and are at least 30 words long, finally amounting to 766,867 documents. Figure 3 shows the number of documents and keywords per year.

Figure 3: Number of documents and keywords per year in the ScopusDS corpus.

4.2. Definition of a Ground Truth

Keywords provided by the authors of each paper are natural candidates to form a ground truth for topic modelling of scientific papers. Exact matching between keywords and extracted topics, however, would yield no significant result, because topics are defined as sets of terms whereas keywords are strings, and author-assigned keywords may not be linked to terms in the external knowledge base employed in tASKE. Hence we define an alternative evaluation methodology which makes use of a non-contextual word embedding model to compute the similarity between keywords and extracted topics.

For target (1), we compare the clusters extracted by tASKE with the set of keywords at each time $t$.

For target (2), we are interested in knowing whether the topics were extracted at the correct time, so we compare the clusters extracted at each time with the entire set of keywords. A comparison of the resulting metrics with the ones obtained for target (1) provides an indicator of the timeliness of tASKE extraction: if a topic $c$, extracted by tASKE at time $t$, is more similar to keywords from time $t' \neq t$ than to the ones from $t$, then $c$ can be deemed more appropriate to describe the subcorpus at time $t'$ and was extracted either "too soon" or "too late".

Defining target (3) is more complicated, since no genealogical structure is inherently defined on paper keywords. We must first define a set of heuristics to derive a ground truth from the keyword lists assigned to documents. Specifically, we say that a subsequent keyword $w'$ is derived from an antecedent keyword $w$ at time $t$ if (a sketch of this rule follows below):

• $w$ was associated to some document at some time $t^{\leftarrow} < t$;
• $w'$ has never been associated to any document at any $t^{\leftarrow} < t$;
• the number of keyword co-occurrences at $t$, $F_t$, is such that $F_t(w, w') \geq 1$.
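A minimal sketch of these heuristics follows; the input representation (a mapping from year to per-document keyword lists) and the function name are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the authors' code): deriving ground-truth
# keyword derivation pairs from per-year keyword lists.
from itertools import combinations

def derivation_pairs(docs_by_year: dict[int, list[list[str]]]) -> set[tuple[str, str, int]]:
    """docs_by_year maps a year t to the keyword lists of its documents.
    Returns triples (w, w', t) where w appeared before t, w' is new at t,
    and w, w' co-occur in at least one document of year t (F_t(w, w') >= 1)."""
    seen: set[str] = set()  # keywords observed at any earlier time
    pairs: set[tuple[str, str, int]] = set()
    for year in sorted(docs_by_year):
        year_kws = {kw for kws in docs_by_year[year] for kw in kws}
        new_at_t = year_kws - seen
        for kws in docs_by_year[year]:
            for a, b in combinations(set(kws), 2):
                # Check both orientations of the co-occurring pair.
                for w, w_new in ((a, b), (b, a)):
                    if w in seen and w_new in new_at_t:
                        pairs.add((w, w_new, year))
        seen |= year_kws
    return pairs
```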
4.3. Quantitative Evaluation

We run tASKE on the ScopusDS corpus by selecting years as the time units into which the corpus is split. Since tASKE requires to be initialized with a set of input topics $C^{(0)} = \{c_1^{(0)}, \ldots, c_n^{(0)}\}$, we exclude papers of year 2000 from the evaluation and use the set of keywords assigned to them to derive $C^{(0)}$. This set of terms $W^{(0)}$ is first filtered to retain only terms that appear in WordNet, i.e. the knowledge base used for this evaluation. To avoid the injection of spurious topics into the system, $C^{(0)}$ is further filtered to keep only monosemic terms, i.e. terms that are linked to a single WordNet synset, and the 100 with the highest frequency are sampled. In order to retrieve initial topics from this set of terms, we apply Affinity Propagation [16], eventually obtaining $n = 20$ topic clusters, mostly related to mathematics (e.g. regression analysis = {regression analysis, linear regression, multiple regression}) and computer science (internet = {internet, information system, bandwidth, world wide web, electronic mail}), but also to domains of application (air pollution = {air pollution, air transport}).

As for hyperparameters, we set the thresholds $\alpha$ and $\beta$ equal to one another, so as to have a single value to tune, and since we found the system to be effective for $\beta \leq 0.35$, the experiments were conducted with $\alpha = \beta = 0.35$ to achieve efficiency in terms of computation time.

To assess the closeness of the topics retrieved by tASKE to the ground truth, we train a Word2Vec model on a pseudo-corpus whose documents are a concatenation of document chunks and their ground-truth keywords. By exploiting this model, as was done for instance in [17], it is possible to embed keywords and extracted terms in the same vector space. For each year, we define topic embeddings again as the centroids of the embeddings of topic-related terms, which may change from year to year even for the same topic, and we compute the cosine similarity between the resulting vectors and the set of keywords. This is done by single linkage, i.e. finding the closest keyword for each topic embedding. Figure 4 reports the mean and the standard deviation of the results for each year.

Figure 4: Distributions of similarities between topics extracted in each year and the closest keyword from the same year, or the closest keyword from all years.

The outcomes displayed in Figure 4 are promising, with a mean similarity going from 86.99% in 2001 (sd = 9.98%) to 80.20% in 2021 (sd = 13.47%), touching a minimum of 77.12% (sd = 10.98%) for year 2006. The figure does not only show the effectiveness of tASKE for target (1), i.e. discovering topics in a corpus, but also for target (2), i.e. discovering them at the proper time. Indeed, at each year, matching with keywords from other years yields better similarities only for a few topics per year, as is shown by the overlap of the similarity distributions.

In the same way as we did for each topic, we can measure the maximal similarity between each keyword and the set of topics in each year, which may be considered a proxy for recall. The resulting similarity distributions, going from a 34.77% mean (sd = 11.28%) in 2001 to a 65.81% mean (sd = 12.76%) in 2021, are displayed in Figure 5. Although maximising recall was not our main interest, we found that the system gets closer and closer to finding at least a topic for each keyword.

Figure 5: Distributions of similarities between keywords from each year and the closest topic from the same year, or the closest topic from all years.

As for target (3), we experimented with the same evaluation method, taking into account the derivation pairs defined in the ground truth, of the type (antecedent topic, subsequent topic), together with the year of derivation. Topic and keyword embeddings are concatenated, forming derivation pair embeddings; similarities are then computed by finding the keyword pair closest to each topic pair. Results for target (3) are shown in Table 2, both in the case that accounts only for direct derivation relationships $c_i^{(t^{\leftarrow})} \rightarrow c_j^{(t)}$ and in the one where indirect derivations are considered as well, i.e. $c_i^{(t^{\leftarrow})} \rightarrow c_j^{(t)}$ if $\exists\, c^{(\tau_1)}, \ldots, c^{(\tau_L)}$ with $\tau_1, \ldots, \tau_L \in (t^{\leftarrow}, t)$ such that $c_i^{(t^{\leftarrow})} \rightarrow c^{(\tau_1)}$, $c^{(\tau_l)} \rightarrow c^{(\tau_{l+1})}$ for all $l = 1, \ldots, L-1$, and $c^{(\tau_L)} \rightarrow c_j^{(t)}$.

                                  Mean     Std
Only direct derivations           67.24%   14.51%
Including indirect derivations    69.79%   14.43%

Table 2: Mean and standard deviation of similarities between topic derivation pairs and keyword derivation pairs.

Results are naturally better when indirect derivation relationships are included, but the difference between these and the ones that account only for direct relationships is small enough to assume that tASKE can find short-term derivations, but has more difficulty in managing long-term ones, likely due to cumulative errors.
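A minimal sketch of the topic-keyword matching used in this evaluation follows; gensim's Word2Vec is assumed as the non-contextual embedding model, the tiny pseudo-corpus is a toy placeholder, and all names are illustrative.

```python
# Minimal sketch (assumed setup, not the authors' code): single-linkage
# similarity between a topic centroid and a set of keywords.
import numpy as np
from gensim.models import Word2Vec

# Pseudo-corpus: each document is its chunk tokens followed by its keywords.
pseudo_corpus = [
    ["causal", "graph", "diagram", "causality"],
    ["regression", "linear", "model", "causality"],
]
w2v = Word2Vec(sentences=pseudo_corpus, vector_size=100, min_count=1)

def topic_centroid(term_labels: list[str]) -> np.ndarray:
    """Topic embedding = centroid of the embeddings of its terms."""
    vecs = [w2v.wv[t] for t in term_labels if t in w2v.wv]
    return np.mean(vecs, axis=0)

def closest_keyword_similarity(topic_terms: list[str],
                               keywords: list[str]) -> float:
    """Single linkage: cosine similarity of the closest keyword."""
    c = topic_centroid(topic_terms)
    c /= np.linalg.norm(c)
    sims = [float(c @ (w2v.wv[k] / np.linalg.norm(w2v.wv[k])))
            for k in keywords if k in w2v.wv]
    return max(sims, default=0.0)
```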
4.4. Qualitative Analysis

To grasp the potential as well as the current limitations of tASKE in a broader perspective, we looked at the genealogy it produces and at the topics having low similarity with keywords from the same year, as well as the derivation relationships they are involved in. An example of the topic genealogy produced by tASKE is shown in Figure 6.

Figure 6: A sample of the final topic genealogy produced by tASKE.

We noticed that the number of extracted topics tends to grow quadratically in the first iterations, going from 57 (containing 178 terms) in year 2001 to 3,039 (with 8,950 terms) in 2009, while slowing down at later iterations, reaching 6,135 topics and 18,189 terms in 2021. This behavior indicates that the system accelerates until most common knowledge is retrieved. A surplus of generic topics is produced. Such topics contain few terms and also contribute to lowering the similarity with keywords, as most of these belong to the domain lexicon. For instance, the topics 'diagram', 'cast', 'fill', 'known', 'let', 'lie', 'play' all have similarity lower than 0.5 with keywords from the same year, and give rise to relationships that further diverge from the domain of interest: from 'play' to 'toy' and 'fun', from 'diagram' to 'display' and 'drafting'. These are topics that do appear in the form of terms in the ScopusDS corpus, but attention has to be paid to the system misinterpreting their meaning or their importance.

tASKE has proved to be capable of capturing some of the topics that marked recent developments or applications in the Data Science domain, such as: 'face recognition' (2014, from 'biometric identification'), as shown in Figure 6; 'speech production' (2004, from 'wavelet'); 'search engine' (2004, from 'internet'); 'ontology' (2006, from 'knowledge'); 'clustering' (2006, from 'class'); 'natural language processor' (2008, from 'internet'); 'graphical user interface' (2008, from 'internet'); 'cryptanalytic' (2010, from 'cryptography'); 'flight control' (2012, from 'flight simulator'); 'machine readable' (2017, from 'internet'); 'automatic face recognition' (2016, from 'face recognition' through 'identity verification'). Another category is the one that includes topics of interest but with a spurious derivation, e.g. 'neural network' (2006), here derived from 'internet', or 'cryptography' (2008), which descends from 'air pollution'. An even clearer example of the boundaries the external knowledge base imposes on tASKE is given by the topic 'percolation', which may refer to the 'clique percolation technique' in the documents but is here linked to 'air pollution' due to the absence of any non-physical sense of the term 'percolation' from WordNet. We also acknowledge that most extracted topics are related to domains of application, e.g. medicine, physics, chemistry, social sciences, etc. Including these topics in a hierarchical class structure may prove beneficial to simplify the visualization of the topic genealogy.

5. Concluding Remarks

Starting from the increasingly pressing need to understand the evolution of ideas and research themes in scientific literature, in this work we have presented tASKE, a method for identifying topics in a diachronic corpus of scientific articles. Time in tASKE is a crucial aspect, as the goal is not only to identify the topics in their right temporal collocation, but also to understand how a topic can derive from previous topics, in order to reconstruct the genealogy of the topics in time. tASKE makes it possible to achieve these objectives with an unsupervised approach, i.e., without the need to resort to large and complex pre-annotated datasets. The experimental results, obtained on a corpus of real scientific publications covering a period of 21 years, show how tASKE is able to identify the topics deemed relevant by the authors of the papers and expressed by means of thematic keywords. In particular, the topics identified by tASKE are not only adequate, but also placed in the correct time period and related to each other in a genealogy that describes their evolution. Our current and future work on tASKE is aimed at three main goals: i) introducing an adaptive learning rate, with the aim of controlling the number of new topics discovered by tASKE in each time period according not only to topic relevance but also to the capability of each topic to potentially induce the discovery of new topics in future iterations; ii) making tASKE independent from external knowledge bases, exploiting contextual embeddings, so as to avoid restricting a priori the vocabulary of terms that can be extracted; iii) performing further evaluations, both by comparing tASKE with other temporal topic modeling methods and by assessing the quality of topics and their genealogy through the evaluation of domain experts.

References

[1] A. Ferrara, S. Picascia, D. Riva, Context-aware knowledge extraction from legal documents through zero-shot classification, in: R. Guizzardi, B. Neumayr (Eds.), Advances in Conceptual Modeling, Springer International Publishing, Cham, 2022, pp. 81–90.

[2] T. K. Landauer, P. W. Foltz, D. Laham, An introduction to latent semantic analysis, Discourse Processes 25 (1998) 259–284. doi:10.1080/01638539809545028.

[3] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.

[4] M. Rabinovich, D. Blei, The inverse regression topic model, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Beijing, China, 2014, pp. 199–207.
[5] A. Gruber, Y. Weiss, M. Rosen-Zvi, Hidden topic Markov models, in: M. Meila, X. Shen (Eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, PMLR, San Juan, Puerto Rico, 2007, pp. 163–170.

[6] D. M. Blei, J. D. Lafferty, Dynamic topic models, in: Proceedings of the 23rd International Conference on Machine Learning, ICML '06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 113–120. doi:10.1145/1143844.1143859.

[7] L. Sun, Y. Yin, Discovering themes and trends in transportation research using topic modeling, Transportation Research Part C: Emerging Technologies 77 (2017) 49–66. doi:10.1016/j.trc.2017.01.013.

[8] T. M. Abuhay, Y. G. Nigatie, S. V. Kovalchuk, Towards predicting trend of scientific research topics using topic modeling, Procedia Computer Science 136 (2018) 304–310. doi:10.1016/j.procs.2018.08.284. 7th International Young Scientists Conference on Computational Science, YSC2018, 02–06 July 2018, Heraklion, Greece.

[9] A. B. Dieng, F. J. R. Ruiz, D. M. Blei, The dynamic embedded topic model, CoRR abs/1907.05545 (2019). URL: http://arxiv.org/abs/1907.05545. arXiv:1907.05545.

[10] Q. Gao, X. Huang, K. Dong, Z. Liang, J. Wu, Semantic-enhanced topic evolution analysis: a combination of the dynamic topic model and word2vec, Scientometrics 127 (2022) 1543–1563. URL: https://doi.org/10.1007/s11192-022-04275-z. doi:10.1007/s11192-022-04275-z.

[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.

[13] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.

[14] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic representation: Dataless classification, in: AAAI, volume 2, 2008, pp. 830–835.

[15] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Language, Speech, and Communication, MIT Press, Cambridge, MA, 1998.

[16] D. Dueck, Affinity propagation: clustering data by passing messages, University of Toronto, Toronto, ON, Canada, 2009.

[17] F. Role, S. Morbieu, M. Nadif, Unsupervised evaluation of text co-clustering algorithms using neural word embeddings, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1827–1830. URL: https://doi.org/10.1145/3269206.3269282. doi:10.1145/3269206.3269282.