<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Contextual Embeddings to Extract Topic Genealogy from Scientific Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alfio Ferrara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Montanelli</string-name>
          <email>stefano.montanelli@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Picascia</string-name>
          <email>sergio.picascia@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Riva</string-name>
          <email>davide.riva1@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>Via Celoria, 18 - 20133 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modeling the evolution of topics and forecasting future trends is a crucial task when analyzing scientific papers. In this work we propose tASKE (temporal Automated System for Knowledge Extraction), a dynamic topic modeling approach which exploits zero-shot classification and contextual embeddings in order to track topic evolution through time. The approach is evaluated on a corpus of data science papers, assessing the ability of tASKE to correctly classify documents and to retrieve relevant derivation relationships between older and newer topics over time.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Scientometrics</kwd>
        <kwd>Topic Genealogy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>With the amount of published scientific literature
increasing each year, keeping track of newly formulated topics
and their derivation process becomes a challenge for
researchers, scholars, and publishers. The problem lies in
the fact that the total number of definitions, theorems,
properties, tasks, and subdomains tends to grow
exponentially, since several of them may be conceived
starting from a single one or from the interaction of a few.
For instance, in the domain of Machine Learning, the
idea of neural networks gave rise to that of deep learning,
which has then been applied to problems such as image
reconstruction and partial differential equations, and was
further deepened with topics such as attention, which
in turn provided the intuition behind transformers and a
basis for explainability.</p>
      <p>Referring to definitions, theorems, properties, tasks,
and subdomains as “topics”, i.e. the abstract objects a text refers to, it is possible to study
“topic genealogy” in a diachronic corpus, i.e. the descent
of topics from older ones over time. The task of
extracting topic genealogy falls within the scope of Knowledge
Extraction (KE), and it consists of two main sub-tasks: i)
topic extraction, by which we aim to retrieve topics that
are important in a written document, possibly in a timely
manner in order to discover topics when they actually
appear, and ii) genealogy reconstruction, in which extracted
topics are placed in a tree structure representing their
lineage in the history of the discipline.</p>
      <p>SDU2023: The Third AAAI Workshop on Scientific Document
Understanding, February 14, 2023, Washington, DC.
CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.</p>
      <p>
        In this paper, we present tASKE, a method to extract
topics from a diachronic corpus of scientific papers and
reconstruct their genealogy in a completely unsupervised
way. Our method builds upon our Automated
System for Knowledge Extraction (ASKE) framework [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
which relies on pre-trained contextual embedding models
to represent documents and topics in the same vector
space and on a cyclical term extraction and clustering
phase to extract new topics. Besides presenting tASKE
as a time-aware extension of ASKE, we introduce an
evaluation framework and a case study on a corpus of
papers from the Data Science domain, with the goal of
demonstrating the effectiveness of tASKE both for topic
extraction and for extracting topic genealogy.
      </p>
      <p>The paper is organized as follows. Section 2 (Related
Work) reports on the literature about topic modeling as
well as the technology underlying our method. Section 3
(Methodology) presents the methodology and techniques
adopted in tASKE. Section 4 (Case Study and Evaluation)
presents the case study on a Data Science literature
corpus, on which the evaluation was conducted.
Section 5 (Concluding Remarks) draws some conclusions and
sketches future work.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Related Work</title>
      <p>The task of classifying large amounts of textual documents without relying on labeled data has been addressed by employing topic modeling techniques. Latent Semantic Analysis (LSA) [<xref ref-type="bibr" rid="ref2">2</xref>] was one of the first proposed approaches, exploiting Singular Value Decomposition (SVD) in order to reduce the number of dimensions of a document-term matrix and to easily compute similarity between document vector representations. LSA was soon followed by Latent Dirichlet Allocation (LDA) [<xref ref-type="bibr" rid="ref3">3</xref>], which employs Bayesian analysis in order to optimize the distributions of documents belonging to topics, and of words defining these topics. The majority of recent works in topic modeling take their inspiration from the original LDA, with several variations proposed, such as Correlated Topic Modeling [<xref ref-type="bibr" rid="ref4">4</xref>] and Hierarchical Topic Modeling [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>Common topic modeling methods are not able to capture the changes of topics over time. For this reason, techniques of Dynamic Topic Modeling (DTM) are employed when dealing with diachronic corpora. Since the first approach (Dynamic LDA [<xref ref-type="bibr" rid="ref6">6</xref>]) was proposed, the field has been attracting attention among researchers. Among the possible applications of the designed methods, the study of scientific papers, also known as “Scientometrics”, was addressed with the aim to assess past and present trends in a specific discipline [<xref ref-type="bibr" rid="ref7">7</xref>] or to forecast possible future subareas of research interest [<xref ref-type="bibr" rid="ref8">8</xref>].</p>
      <p>Later studies have taken into consideration the integration between DTM and word embeddings [<xref ref-type="bibr" rid="ref9">9</xref>] so as to further capture the semantic aspect of the analyzed documents [<xref ref-type="bibr" rid="ref10">10</xref>]. Embedding techniques are widely employed in the field of Natural Language Processing (NLP) in order to represent textual data in a vector space. Several models capable of computing contextual token embeddings have been released since the presentation of BERT [<xref ref-type="bibr" rid="ref11">11</xref>], each of them being tailored to specific tasks, such as semantic similarity [<xref ref-type="bibr" rid="ref12">12</xref>] and zero-shot learning [<xref ref-type="bibr" rid="ref13">13</xref>]. Zero-Shot Learning (ZSL) is a problem setup in the field of machine learning, where a classifier is required to predict labels of examples extracted from classes that were never observed in the training phase. It was first referred to as dataless classification in 2008 [<xref ref-type="bibr" rid="ref14">14</xref>] and has quickly become a subject of interest, particularly in the field of NLP. The great advantage of this approach consists in the resulting classifier being able to operate efficiently in a partially or totally unlabeled environment.</p>
      <p>tASKE aims at dynamically modeling the presence and evolution of latent topics in a diachronic corpus of documents. It exploits zero-shot learning and contextual embeddings not only to perform the classification task, but also to extract relevant knowledge from textual data.</p>
    </sec>
    <sec id="sec-methodology">
      <title>3. Methodology</title>
      <p>The objective of tASKE is to extract a genealogy of topics from a diachronic corpus of documents. Every piece of information is stored in a graph-based data structure called tASKE Conceptual Graph (ACG), whose architecture is illustrated in Figure 1.</p>
      <p>[Figure 1: architecture of the ACG, with relations labelled classification, derivation, and belonging.]</p>
      <p>The nodes in the ACG model belong to three different categories:</p>
      <list list-type="bullet">
        <list-item><p>document chunks <inline-formula><tex-math>K</tex-math></inline-formula>: the object of the analysis, they are small portions of the original documents extracted through the application of tokenization techniques. They are tuples of the form <inline-formula><tex-math>(k, \mathbf{k})</tex-math></inline-formula>, where <inline-formula><tex-math>k</tex-math></inline-formula> is the text of the document chunk and <inline-formula><tex-math>\mathbf{k}</tex-math></inline-formula> is its vector representation;</p></list-item>
        <list-item><p>topics <inline-formula><tex-math>C</tex-math></inline-formula>: they represent the abstract objects to which document chunks are assigned and, in practice, they are clusters of related terms. They are tuples of the form <inline-formula><tex-math>(c, \mathbf{c})</tex-math></inline-formula>, where <inline-formula><tex-math>c</tex-math></inline-formula> is the label given to the topic and <inline-formula><tex-math>\mathbf{c}</tex-math></inline-formula> is its vector representation;</p></list-item>
        <list-item><p>terms <inline-formula><tex-math>W</tex-math></inline-formula>: they are extracted from document chunks and clustered together in order to form topics. They are triplets of the form <inline-formula><tex-math>(w_l, w_d, \mathbf{w})</tex-math></inline-formula>, where <inline-formula><tex-math>w_l</tex-math></inline-formula> is the label of the term, <inline-formula><tex-math>w_d</tex-math></inline-formula> is a short sentence giving the term definition, and <inline-formula><tex-math>\mathbf{w}</tex-math></inline-formula> is the term vector representation.</p></list-item>
      </list>
      <p>The vector representations <inline-formula><tex-math>\mathbf{k}</tex-math></inline-formula> of document chunks and <inline-formula><tex-math>\mathbf{w}</tex-math></inline-formula> of terms are computed by an embedding model which maps a text into a vector space: for <inline-formula><tex-math>\mathbf{k}</tex-math></inline-formula>, the embedding model is applied over the document chunk text <inline-formula><tex-math>k</tex-math></inline-formula>, while for <inline-formula><tex-math>\mathbf{w}</tex-math></inline-formula>, it is applied over the term definition <inline-formula><tex-math>w_d</tex-math></inline-formula>. The vector representation <inline-formula><tex-math>\mathbf{c}</tex-math></inline-formula> of a topic is computed as the mean of the vectors <inline-formula><tex-math>\mathbf{w}</tex-math></inline-formula> of all the terms belonging to the topic. The label of each topic corresponds to the label <inline-formula><tex-math>w_l</tex-math></inline-formula> of the term that is the closest to <inline-formula><tex-math>\mathbf{c}</tex-math></inline-formula>.</p>
      <p>At the beginning of the analysis (i.e., time 0) the user is required to define a set of initial topics <inline-formula><tex-math>C^{(0)}</tex-math></inline-formula> of interest. Each topic <inline-formula><tex-math>c_i^{(0)} \in C^{(0)}</tex-math></inline-formula> is associated with a set of corresponding terms <inline-formula><tex-math>W_i^{(0)}</tex-math></inline-formula>, whose definitions are also provided by the user. At each subsequent time <inline-formula><tex-math>t</tex-math></inline-formula>, tASKE performs one or more iterations of the cycle depicted in Figure 2. As a first step, tASKE extracts the set of document chunks <inline-formula><tex-math>K^{(t)}</tex-math></inline-formula> from the subset <inline-formula><tex-math>D^{(t)} \subseteq D</tex-math></inline-formula> of the corpus belonging to that period. Such document chunks are classified with respect to the topics discovered up to the previous time period <inline-formula><tex-math>t^{\leftarrow}</tex-math></inline-formula>, <inline-formula><tex-math>C^{(t^{\leftarrow})}</tex-math></inline-formula>. Moreover, tASKE extracts new terms <inline-formula><tex-math>W^{(t)}</tex-math></inline-formula> from the document chunks and assigns them to the topics <inline-formula><tex-math>C^{(t^{\leftarrow})}</tex-math></inline-formula>, finally updating the set of current topics <inline-formula><tex-math>C^{(t)}</tex-math></inline-formula>.</p>
      <p>[Figure 2: the tASKE cycle. Recovered labels: corpus, embedding model, knowledge base, data preprocessing, terminology enrichment, topic formation, ACG.]</p>
      <p>As a consequence of this process, in ACG a topic can have multiple relations with the other components of ACG. In particular, for <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula>, we have i) a relation classification with document chunks <inline-formula><tex-math>K_i^{(t)}</tex-math></inline-formula>; ii) a relation derivation with terms <inline-formula><tex-math>W_i^{(t)}</tex-math></inline-formula> discovered from document chunks associated with <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula>; iii) a relation belonging with terms <inline-formula><tex-math>W_i</tex-math></inline-formula> in its cluster; iv) a relation derivation with a new topic <inline-formula><tex-math>c_j^{(t)}</tex-math></inline-formula> formed by some of the terms in <inline-formula><tex-math>W_i</tex-math></inline-formula>. It can happen that topic <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula> is not associated with any document chunk at time <inline-formula><tex-math>t</tex-math></inline-formula>. This means <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula> is no longer a useful topic with respect to the documents of time <inline-formula><tex-math>t</tex-math></inline-formula>. In this case, the topic <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula> becomes inactive, together with the set of terms <inline-formula><tex-math>W_i</tex-math></inline-formula> belonging to it, and it will not be able to form new topics. This can be interpreted as the disappearance of interest towards a certain topic, which emerged in past periods <inline-formula><tex-math>t^{\leftarrow}</tex-math></inline-formula> but has lost its relevance in the current corpus, <inline-formula><tex-math>D^{(t)}</tex-math></inline-formula>.</p>
      <p>In the remaining part of this section, we discuss each phase in Figure 2, explaining in deeper detail how each of the aforementioned relations is discovered.</p>
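      <p>As a minimal sketch of the node types described above, the following Python fragment models ACG nodes as plain data classes and derives the vector and the label of a topic from its terms. The names <monospace>Chunk</monospace>, <monospace>Term</monospace>, <monospace>Topic</monospace>, <monospace>cosine</monospace>, and <monospace>make_topic</monospace> are illustrative assumptions rather than the actual tASKE implementation, cosine similarity is assumed as the notion of closeness, and the toy two-dimensional vectors stand in for real embeddings.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class Chunk:
    """Document chunk: its text and its embedding vector."""
    text: str
    vector: Sequence[float]


@dataclass
class Term:
    """Term: label, short definition, and embedding of the definition."""
    label: str
    definition: str
    vector: Sequence[float]


@dataclass
class Topic:
    """Topic: label, centroid vector, and the terms in its cluster."""
    label: str
    vector: List[float]
    terms: List[Term]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_topic(terms):
    """Topic vector = mean of the term vectors; topic label = label of the
    term closest to that mean (closeness by cosine is an assumption)."""
    n = len(terms)
    centre = [sum(column) / n for column in zip(*(t.vector for t in terms))]
    closest = max(terms, key=lambda t: cosine(t.vector, centre))
    return Topic(label=closest.label, vector=centre, terms=list(terms))


# Hypothetical toy data: placeholder definitions and 2-d vectors.
terms = [
    Term("causality", "placeholder definition", [1.0, 1.0]),
    Term("etiologic", "placeholder definition", [1.0, 0.0]),
    Term("noncausal", "placeholder definition", [0.0, 1.0]),
]
topic = make_topic(terms)
print(topic.label, topic.vector)
```
]]></preformat>
      <p>In this toy setting the centroid lies along the direction of the vector for ‘causality’, so that term provides the topic label; with real embeddings the label may change whenever the set of terms of a topic changes.</p>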
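      <sec id="sec-4-1">
        <title>3.1. Data Preprocessing</title>
        <p>Preprocessing is the starting point of the tASKE cycle. At each time period <inline-formula><tex-math>t = 0, \ldots, T</tex-math></inline-formula>, the model retrieves documents from the period-specific subcorpus <inline-formula><tex-math>D^{(t)}</tex-math></inline-formula>.</p>
        <p>Documents are first split into document chunks <inline-formula><tex-math>K^{(t)}</tex-math></inline-formula>, each of which can fit into the maximum input length of a contextual embedding model. In this case, we employed Sentence-BERT [<xref ref-type="bibr" rid="ref12">12</xref>], a modification of the original BERT model which exploits siamese and triplet networks and is able to derive semantically meaningful sentence embeddings in the form of numeric vectors. Such a model is employed in order to extract the semantic features of term definitions and document chunks and map them into the same vector space.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Zero-Shot Classification</title>
        <p>In the zero-shot classification phase, document-topic classification relationships are defined. Given the coexistence of topics and document chunk embeddings in the same vector space, it is possible to perform a zero-shot classification, <inline-formula><tex-math>\phi : K^{(t)} \rightarrow C^{(t^{\leftarrow})}</tex-math></inline-formula>, without having the model exposed to training examples. A similarity measure <inline-formula><tex-math>\sigma</tex-math></inline-formula> (e.g., cosine similarity) between the embedding vector <inline-formula><tex-math>\mathbf{k}^{(t)}</tex-math></inline-formula> of each document chunk <inline-formula><tex-math>k^{(t)}</tex-math></inline-formula> in <inline-formula><tex-math>K^{(t)}</tex-math></inline-formula> and the embedding vector <inline-formula><tex-math>\mathbf{c}^{(t^{\leftarrow})}</tex-math></inline-formula> of each topic <inline-formula><tex-math>c^{(t^{\leftarrow})}</tex-math></inline-formula> in <inline-formula><tex-math>C^{(t^{\leftarrow})}</tex-math></inline-formula> is computed and, eventually, the two are associated if their similarity is higher than a predefined threshold <inline-formula><tex-math>\alpha</tex-math></inline-formula>:</p>
        <disp-formula><tex-math>\phi(k^{(t)}) = \{\, c \in C^{(t^{\leftarrow})} : \sigma(\mathbf{k}^{(t)}, \mathbf{c}^{(t^{\leftarrow})}) \geq \alpha \,\}</tex-math></disp-formula>
        <p>Tuning hyperparameter <inline-formula><tex-math>\alpha</tex-math></inline-formula> is crucial since it may remarkably affect the classification output: for example, choosing a high value of <inline-formula><tex-math>\alpha</tex-math></inline-formula> could result in a highly precise classification, despite potentially finding only a small set of document chunks for each topic (low recall).</p>
        <p>Finally, classification relationships are stored in ACG by considering documents as the simple concatenation of their chunks, so that a document <inline-formula><tex-math>d</tex-math></inline-formula> is labelled with all topics its chunks are labelled with.</p>
        <p>For example, the document chunk</p>
        <disp-quote>
          <p>[...] graphical representations of causation have been used for at least seventy years, and the modern development of directed acyclic graphs to portray causal systems continues the trend. It is sometimes difficult to understand, however, what it is about these diagrams that is causal [...]</p>
        </disp-quote>
        <p>is classified by tASKE with the topic ‘causality’ with a similarity score of 0.652.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Terminology Enrichment</title>
        <p>For each topic <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula> in the ACG, tASKE retrieves the set of lemmatized terms <inline-formula><tex-math>W_i^{(t)}</tex-math></inline-formula> appearing in the subset of document chunks <inline-formula><tex-math>K_i^{(t)}</tex-math></inline-formula> associated with <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula> by a classification relation. These terms are placed in the same semantic space as <inline-formula><tex-math>K</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{c}</tex-math></inline-formula>, by retrieving their definition <inline-formula><tex-math>w_d</tex-math></inline-formula> from an external knowledge base, such as WordNet [<xref ref-type="bibr" rid="ref15">15</xref>], and computing its vector representation <inline-formula><tex-math>\mathbf{w}</tex-math></inline-formula> by the aforementioned embedding model. This approach addresses the problem of sense disambiguation, since it maps distinct senses of polysemic words to different embedding vectors.</p>
        <p>For each retrieved term sense, the same similarity measure <inline-formula><tex-math>\sigma</tex-math></inline-formula> used for classification is exploited in order to compute the similarity between <inline-formula><tex-math>\mathbf{w}</tex-math></inline-formula> and the vectors representing topics and document chunks. The terms whose sum of similarities is greater than the hyperparameter <inline-formula><tex-math>\beta</tex-math></inline-formula> become candidates for enriching the terminology of the topic <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula>:</p>
        <disp-formula><tex-math>\sigma(\mathbf{w}, \mathbf{c}_i^{(t^{\leftarrow})}) + \sigma(\mathbf{w}, \bar{\mathbf{k}}^{(t)}) &gt; \beta</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>\bar{\mathbf{k}}^{(t)}</tex-math></inline-formula> is the centre of the embeddings of chunks in <inline-formula><tex-math>K_i^{(t)}</tex-math></inline-formula>.</p>
        <p>The set of candidate terms is sorted in descending order according to the similarity score. In addition, one can also define a learning rate <inline-formula><tex-math>\gamma</tex-math></inline-formula>, which represents the maximum number of terms that can be associated to a certain topic at each iteration. Applying the bounds <inline-formula><tex-math>\beta</tex-math></inline-formula> and <inline-formula><tex-math>\gamma</tex-math></inline-formula> ensures that, at each iteration, the process of terminology enrichment will include only a small set of terms that are supposed to be meaningful with respect to the topic at hand.</p>
        <p>Taking as an example the topic mentioned in the previous section, ‘causality’, it has been associated, among others, with the following terms and similarity scores: ‘causality’ (0.773), ‘etiologic’ (0.741), ‘noncausal’ (0.737).</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Topic Formation</title>
        <p>Finally, tASKE may generate new topics in a topic formation phase. In this phase a clustering algorithm, such as Affinity Propagation [<xref ref-type="bibr" rid="ref16">16</xref>], is applied over the embedding vectors <inline-formula><tex-math>\mathbf{w}</tex-math></inline-formula> of the terms <inline-formula><tex-math>W_i^{(t)}</tex-math></inline-formula> related to each topic <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula>. According to the results, a different operation is enforced:</p>
        <list list-type="bullet">
          <list-item><p>derivation: if new clusters, different from <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula>, are formed, each of them becomes a new topic, derived from <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula>, whose label is set equal to the term <inline-formula><tex-math>w</tex-math></inline-formula> closer to the cluster center;</p></list-item>
          <list-item><p>conservation: if no new cluster is formed, the original topic <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula> is preserved, represented by the cluster in which the term <inline-formula><tex-math>w</tex-math></inline-formula> corresponding to the concept label of <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula> is present;</p></list-item>
          <list-item><p>pruning: if a new cluster <inline-formula><tex-math>c_j^{(t)}</tex-math></inline-formula> is formed but all its member terms belong also to <inline-formula><tex-math>c_i^{(t^{\leftarrow})}</tex-math></inline-formula>, the newer topic is absorbed by the older one.</p></list-item>
        </list>
        <p>In the end, term-topic belonging relationships and
topic-topic derivation relationships are stored in the ACG
together with document-topic classification relationships
defined in the zero-shot classification phase, building up
the topic genealogy. Topics <inline-formula><tex-math>C^{(t)}</tex-math></inline-formula> defined in this phase will
serve as input for the next iteration.</p>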
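        <p>The choice among derivation, conservation, and pruning can be sketched as a simple decision rule over the output of the clustering step. The fragment below is a hedged illustration, not the tASKE implementation: the function <monospace>form_topics</monospace> is hypothetical, clusters are assumed to be given as a mapping from an exemplar term to its member terms (as a clustering algorithm such as Affinity Propagation could provide), and <monospace>parent_terms</monospace> is assumed to hold the terms that already belonged to the topic before the current iteration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Set, Tuple


def form_topics(parent_label: str,
                parent_terms: Set[str],
                clusters: Dict[str, Set[str]]) -> List[Tuple[str, str, Set[str]]]:
    """Map each cluster to a triple (operation, topic label, member terms)."""
    outcomes = []
    for exemplar, members in clusters.items():
        if parent_label in members:
            # The cluster holding the parent's label term represents
            # the preserved original topic.
            outcomes.append(("conservation", parent_label, members))
        elif members.issubset(parent_terms):
            # Every member already belongs to the parent topic:
            # the new cluster is absorbed by the older topic.
            outcomes.append(("pruning", parent_label, members))
        else:
            # A genuinely new cluster becomes a topic derived from the
            # parent, labelled with its exemplar term.
            outcomes.append(("derivation", exemplar, members))
    return outcomes


# Clusters taken from the 'causality' example in the text; the content of
# parent_terms is a hypothetical assumption made for illustration only.
clusters = {
    "causality": {"causality", "etiologic", "noncausal"},
    "event": {"event", "issue", "circumstance"},
    "interpretation": {"interpretation", "explanandum"},
}
parent_terms = {"causality", "etiologic", "noncausal"}
for operation, label, members in form_topics("causality", parent_terms, clusters):
    print(operation, label, sorted(members))
```
]]></preformat>
        <p>Under these assumptions the cluster containing ‘causality’ is conserved, while ‘event’ and ‘interpretation’ are recorded as topics derived from ‘causality’.</p>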
        <p>Considering the topic ‘causality’, consisting of the
following set of terms {‘causality’, ‘etiologic’, ‘noncausal’,
‘event’, ‘issue’, ‘circumstance’, ‘interpretation’,
‘explanandum’}, the tASKE model has formed three topics with
the corresponding sets of terms: ‘causality’ =
{‘causality’, ‘etiologic’, ‘noncausal’}, ‘event’ = {‘event’, ‘issue’,
‘circumstance’}, ‘interpretation’ = {‘interpretation’,
‘explanandum’}.</p>
      </sec>
    </sec>
    <sec id="sec-case-study">
      <title>4. Case Study and Evaluation</title>
      <p>tASKE is here evaluated on a case study on Data Science
literature. The evaluation framework has to account for
three targets:</p>
      <list list-type="order">
        <list-item><p>correctness of extracted topics,</p></list-item>
        <list-item><p>correctness of the time of extraction,</p></list-item>
        <list-item><p>correctness of topic-topic derivation relationships.</p></list-item>
      </list>
      <p>First, a “Data Science in Scopus” corpus (hereafter
ScopusDS corpus), made of abstracts of journal papers
ranging from January 2000 to December 2021, is constructed.
Then keywords defined by the authors of each paper are
exploited to generate a ground truth for all three targets,
and our method is evaluated against the ground truth.
Finally, we perform a brief qualitative analysis of results,
which is complementary to the quantitative evaluation.</p>
      <sec id="sec-corpus-construction">
        <title>4.1. Corpus Construction</title>
        <p>The ScopusDS corpus has been retrieved from Elsevier Scopus by downloading publications in the time interval from January 2000 to December 2021 according to selected subject areas that are concerned with the “data science” subject. For each publication, eid, year, title, abstract, document type, and author-assigned keywords have been downloaded. Furthermore, additional metadata are retrieved (e.g., author name and affiliation, journal/conference name, ISSN, publication type). The corpus content is described in Table 1 in terms of considered subject areas and corresponding number of retrieved publications.</p>
        <p>Besides the paper abstract, two pieces of metadata were taken into account in the analysis: the publication date and the list of keywords provided by the author(s). We selected only documents of type “article” that are accompanied by at least 3 keywords and are at least 30 words long, finally amounting to 766,867 documents. Figure 3 shows the number of documents and keywords per year.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.2. Definition of a Ground Truth</title>
        <p>Keywords provided by the authors of each paper are natural candidates to form a ground truth for topic modelling of scientific papers. Exact matching between keywords and extracted topics, however, would yield no significant result, because topics are defined as sets of terms whereas keywords are strings, and author-assigned keywords may not be linked to terms in the external knowledge base employed in tASKE. Hence we define an alternative evaluation methodology which makes use of a non-contextual word embedding model to compute the similarity between keywords and extracted topics.</p>
        <p>For target (1), we compare clusters extracted by tASKE with the set of keywords at each time <inline-formula><tex-math>t</tex-math></inline-formula>.</p>
        <p>For target (2), we are interested in knowing whether the topics were extracted at the correct time, so we compare clusters extracted at each time with the entire set of keywords. A comparison of the resulting metrics with the ones obtained for target (1) provides an indicator of the timeliness of tASKE extraction: if a topic <inline-formula><tex-math>c</tex-math></inline-formula>, extracted by tASKE at time <inline-formula><tex-math>t</tex-math></inline-formula>, is more similar to keywords from time <inline-formula><tex-math>t' \neq t</tex-math></inline-formula> than to the ones from <inline-formula><tex-math>t</tex-math></inline-formula>, then <inline-formula><tex-math>c</tex-math></inline-formula> can be deemed more appropriate to describe the subcorpus at time <inline-formula><tex-math>t'</tex-math></inline-formula> and was extracted either “too soon” or “too late”.</p>
        <p>Defining target (3) is more complicated, since no
genealogical structure is inherently defined on paper
keywords. We must first define a set of heuristics to derive
a ground truth from the keyword lists assigned to
documents. Specifically, we say that a subsequent keyword
<inline-formula><tex-math>y'</tex-math></inline-formula> is derived from an antecedent keyword <inline-formula><tex-math>y</tex-math></inline-formula> at time <inline-formula><tex-math>t</tex-math></inline-formula> if:</p>
        <list list-type="bullet">
          <list-item><p><inline-formula><tex-math>y</tex-math></inline-formula> was associated to any document at any time <inline-formula><tex-math>t^{\leftarrow} &lt; t</tex-math></inline-formula>;</p></list-item>
          <list-item><p><inline-formula><tex-math>y'</tex-math></inline-formula> has never been associated to any document at any <inline-formula><tex-math>t^{\leftarrow} &lt; t</tex-math></inline-formula>;</p></list-item>
          <list-item><p>the number of keyword co-occurrences at <inline-formula><tex-math>t</tex-math></inline-formula>, <inline-formula><tex-math>n_t</tex-math></inline-formula>, is such that <inline-formula><tex-math>n_t(y, y') \geq 1</tex-math></inline-formula>.</p></list-item>
        </list>
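        <p>The three conditions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <monospace>derived_pairs</monospace>, the input layout (a mapping from a time unit to the keyword lists of its documents), and the sample keywords are hypothetical and are not taken from the ScopusDS corpus.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import permutations
from typing import Dict, List, Set, Tuple


def derived_pairs(docs_by_time: Dict[int, List[List[str]]],
                  t: int) -> Set[Tuple[str, str]]:
    """Return the pairs (antecedent, subsequent) such that, at time t,
    the subsequent keyword is derived from the antecedent one."""
    # Keywords associated with any document at any time before t.
    seen_before = set()
    for time, docs in docs_by_time.items():
        if time < t:
            for keywords in docs:
                seen_before.update(keywords)

    pairs = set()
    for keywords in docs_by_time.get(t, []):
        # Two keywords of the same document co-occur at least once at time t.
        for antecedent, subsequent in permutations(set(keywords), 2):
            if antecedent in seen_before and subsequent not in seen_before:
                pairs.add((antecedent, subsequent))
    return pairs


# Hypothetical keyword lists, one inner list per document.
docs_by_time = {
    2000: [["internet", "bandwidth"]],
    2001: [["internet", "search engine"], ["bandwidth", "internet"]],
}
print(derived_pairs(docs_by_time, 2001))
```
]]></preformat>
        <p>In this toy input only ‘search engine’ is new in 2001 and it co-occurs with ‘internet’, so the single derivation pair links the two; keywords that were both already known before 2001 produce no pair.</p>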
      </sec>
      <sec id="sec-4-6">
        <title>4.3. Quantitative Evaluation</title>
        <p>We run tASKE on the ScopusDS corpus by selecting years as the time units in which the corpus is split. Since tASKE requires to be initialized with a set of input topics <inline-formula><tex-math>C^{(0)} = \{c_1^{(0)}, \ldots, c_n^{(0)}\}</tex-math></inline-formula>, we exclude papers of year 2000 from the evaluation and use the set of keywords assigned to them to derive <inline-formula><tex-math>C^{(0)}</tex-math></inline-formula>. This set of terms <inline-formula><tex-math>W^{(0)}</tex-math></inline-formula> is first filtered to retain only terms that appear in WordNet, i.e. the knowledge base used for this evaluation. To avoid the injection of spurious topics into the system, <inline-formula><tex-math>W^{(0)}</tex-math></inline-formula> is further filtered in order to keep only monosemic terms, i.e. terms that are linked to a single WordNet synset, and the 100 with the highest frequency are sampled. In order to retrieve initial topics from this set of terms, we apply Affinity Propagation [<xref ref-type="bibr" rid="ref16">16</xref>], eventually obtaining <inline-formula><tex-math>n = 20</tex-math></inline-formula> topic clusters, mostly related to mathematics (e.g. regression analysis = {regression analysis, linear regression, multiple regression}) and computer science (internet = {internet, information system, bandwidth, world wide web, electronic mail}), but also to domains of application (air pollution = {air pollution, air transport}).</p>
        <p>As for hyperparameters, we set thresholds <inline-formula><tex-math>\alpha</tex-math></inline-formula> and <inline-formula><tex-math>\beta</tex-math></inline-formula> equal to one another so as to have a single learning rate, and since we found the system to be effective for <inline-formula><tex-math>\alpha \leq 0.35</tex-math></inline-formula>, the experiments were conducted with <inline-formula><tex-math>\alpha = \beta = 0.35</tex-math></inline-formula> to achieve efficiency in terms of computation time.</p>
        <p>To assess the closeness of topics retrieved by tASKE to the ground truth, we train a Word2Vec model on a pseudo-corpus whose documents are a concatenation of document chunks and their ground truth keywords. By exploiting this model, as was done for instance in [<xref ref-type="bibr" rid="ref17">17</xref>], it is possible to embed keywords and extracted terms in the same vector space. For each year, we define topic embeddings again as the centroids of the embeddings of topic-related terms, which may change from year to year even for the same topic, and we compute cosine similarity between the resulting vectors and the set of keywords. This is done by single linkage, i.e. finding the closest keyword for each topic embedding. Figure 4 reports the mean and the standard deviation of the results for each year.</p>
        <p>Outcomes displayed in Figure 4 are promising, with a mean similarity going from 86.99% in 2001 (standard deviation 9.98%) to 80.20% in 2021 (standard deviation 13.47%), touching a minimum equal to 77.12% (standard deviation 10.98%) for year 2006. The figure shows the effectiveness of tASKE not only for target (1), i.e. to discover topics in a corpus, but also for target (2), i.e. to discover them at the proper time. Indeed, at each year, matching with keywords from other years yields better similarities only for few topics per year, as is shown by the overlapping of the similarity distributions.</p>
        <p>In the same way as we did for each topic, we can measure the maximal similarity between each keyword and the set of topics in each year, which may be considered a proxy for recall. Resulting similarity distributions, going from 34.77% mean (standard deviation 11.28%) in 2001 to 65.81% mean (standard deviation 12.76%) in 2021, are displayed in Figure 5. Although maximising recall was not our main interest, we found that the system gets closer and closer to finding at least a topic for each keyword.</p>
        <fig id="fig-5">
          <label>Figure 5</label>
          <caption>
            <p>Distributions of similarities between keywords from each year and the closest topic from the same year (blue bars), or the closest topic from all years (orange bars).</p>
          </caption>
        </fig>
        <p>As for target (3), we experimented with the same evaluation […] these and the ones that accounted only for direct relationships is small enough to assume tASKE can find short-term derivations, but has more difficulty in managing long-term ones, likely due to cumulative errors.</p>
      </sec>
      <sec id="sec-qualitative-analysis">
        <title>4.4. Qualitative Analysis</title>
        <p>[…] the topics that marked recent developments or applications in the Data Science domain, such as: ‘face recognition’ (2014, from ‘biometric identification’) (as shown in Figure 6), ‘speech production’ (2004, from ‘wavelet’), ‘search engine’ (2004, from ‘internet’), ‘ontology’ (2006, from ‘knowledge’), ‘clustering’ (2006, from ‘class’), ‘natural language processor’ (2008, from ‘internet’), ‘graphical user interface’ (2008, from ‘internet’), ‘cryptanalytic’ (2010, from ‘cryptography’), ‘flight control’ (2012, from ‘flight simulator’), ‘machine readable’ (2017, from ‘internet’), ‘automatic face recognition’ (2016, from ‘face recognition’ through ‘identity verification’). Another category of topics is the one that includes topics of interest but provides a spurious derivation, e.g. ‘neural network’ (2006), here derived from ‘internet’, or ‘cryptography’ (2008), that descends from ‘air pollution’. An even clearer example of the boundaries the external knowledge base imposes on tASKE is given by the topic ‘percolation’, which may refer to ‘clique percolation technique’ in the documents but is here linked to ‘air pollution’ due to the absence of any non-physical sense of the term ‘percolation’ from WordNet.</p>
        <p>We also observed that most extracted topics are
related to domains of application, e.g. medicine, physics,
chemistry, social sciences, etc. Including these topics
in a hierarchical class structure may prove beneficial to
simplify visualization of the topic genealogy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Concluding Remarks</title>
      <p>Starting from the increasingly current need to understand
the evolution of ideas and research themes in scientific
literature, in this work we have presented tASKE, a method
for identifying topics in a diachronic corpus of scientific
articles. Time in tASKE is a crucial aspect, as the goal
is not only to identify the topics in their right temporal
collocation, but also to understand how a topic can derive
from previous topics, in order to reconstruct the
genealogy of the topics in time. tASKE makes it possible to
achieve these objectives with an unsupervised approach,
i.e., without the need to resort to large and complex
pre-annotated datasets. The experimental results, obtained
on a corpus of real scientific publications covering a
period of 21 years, show how tASKE is able to identify the
topics deemed relevant by the authors of the papers and
expressed by means of thematic keywords. In particular,
the topics identified by tASKE are not only adequate, but
also placed in the correct time period and related to each
other in a genealogy that describes their evolution. Our
current and future work on tASKE is aimed at three main
goals: i) introduce an adaptive learning rate, with the
aim of controlling the number of new topics discovered
by tASKE for each time period according not only to the
topic relevance but also to the capability of each topic to
potentially induce the discovery of new topics in future
iterations; ii) make tASKE independent from external
knowledge bases, exploiting contextual embeddings, so as
to avoid restricting a priori the vocabulary of terms that
can be extracted; iii) perform further evaluations both by
comparing tASKE with other temporal topic modeling
methods and by assessing the quality of topics and their
genealogy through the evaluation of domain experts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Picascia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Riva</surname>
          </string-name>
          ,
          <article-title>Context-aware knowledge extraction from legal documents through zero-shot classification</article-title>
          , in:
          <string-name>
            <given-names>R.</given-names>
            <surname>Guizzardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Neumayr</surname>
          </string-name>
          (Eds.),
          <source>Advances in Conceptual Modeling</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Foltz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Laham</surname>
          </string-name>
          ,
          <article-title>An introduction to latent semantic analysis</article-title>
          ,
          <source>Discourse Processes</source>
          <volume>25</volume>
          (
          <year>1998</year>
          )
          <fpage>259</fpage>
          -
          <lpage>284</lpage>
          . doi:10.1080/01638539809545028.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Latent Dirichlet allocation</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          )
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>The inverse regression topic model</article-title>
          , in: E. P. Xing, T. Jebara (Eds.),
          <source>Proceedings of the 31st International Conference on Machine Learning</source>
          , volume
          <volume>32</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR, Beijing, China,
          <year>2014</year>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gruber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosen-Zvi</surname>
          </string-name>
          ,
          <article-title>Hidden topic Markov models</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Meila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics</source>
          , volume
          <volume>2</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR, San Juan, Puerto Rico,
          <year>2007</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <article-title>Dynamic topic models</article-title>
          ,
          in:
          <source>Proceedings of the 23rd International Conference on Machine Learning</source>
          , ICML '06, Association for Computing Machinery, New York, NY, USA,
          <year>2006</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
          . doi:10.1145/1143844.1143859.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Discovering themes and trends in transportation research using topic modeling</article-title>
          ,
          <source>Transportation Research Part C: Emerging Technologies</source>
          <volume>77</volume>
          (
          <year>2017</year>
          )
          <fpage>49</fpage>
          -
          <lpage>66</lpage>
          . doi:10.1016/j.trc.2017.01.013.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Abuhay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. G.</given-names>
            <surname>Nigatie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Kovalchuk</surname>
          </string-name>
          ,
          <article-title>Towards predicting trend of scientific research topics using topic modeling</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>136</volume>
          (
          <year>2018</year>
          )
          <fpage>304</fpage>
          -
          <lpage>310</lpage>
          . doi:10.1016/j.procs.2018.08.284, 7th International Young Scientists Conference on Computational Science, YSC2018, 02-06 July 2018, Heraklion, Greece.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Dieng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J. R.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>The dynamic embedded topic model</article-title>
          , CoRR abs/1907.05545 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1907.05545. arXiv:1907.05545.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Semantic-enhanced topic evolution analysis: a combination of the dynamic topic model and word2vec</article-title>
          ,
          <source>Scientometrics</source>
          <volume>127</volume>
          (
          <year>2022</year>
          )
          <fpage>1543</fpage>
          -
          <lpage>1563</lpage>
          . URL: https://doi.org/10.1007/s11192-022-04275-z. doi:10.1007/s11192-022-04275-z.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks</article-title>
          , CoRR abs/1908.10084 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , CoRR abs/1910.13461 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-A.</given-names>
            <surname>Ratinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          ,
          <article-title>Importance of semantic representation: Dataless classification</article-title>
          , in: AAAI, volume
          <volume>2</volume>
          ,
          <year>2008</year>
          , pp.
          <fpage>830</fpage>
          -
          <lpage>835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          (Ed.),
          <source>WordNet: An Electronic Lexical Database</source>
          , Language, Speech, and Communication, MIT Press, Cambridge, MA,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dueck</surname>
          </string-name>
          ,
          <article-title>Affinity propagation: clustering data by passing messages</article-title>
          , University of Toronto, Toronto, ON, Canada,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Role</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Morbieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadif</surname>
          </string-name>
          ,
          <article-title>Unsupervised evaluation of text co-clustering algorithms using neural word embeddings</article-title>
          ,
          in:
          <source>Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '18, Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>1827</fpage>
          -
          <lpage>1830</lpage>
          . URL: https://doi.org/10.1145/3269206.3269282. doi:10.1145/3269206.3269282.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>