<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>from Academic Articles⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yang Yang</string-name>
          <email>yangyang1221@njust.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Xiang</string-name>
          <email>xiangyi@njust.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengzhi Zhang</string-name>
          <email>zhangcz@njust.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Management, Nanjing University of Science and Technology 210094 Nanjing</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>15</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>With the exponential growth of academic articles, identifying emerging topics from vast amounts of literature has become a critical task. Currently, researchers employ methods such as bibliometrics and natural language processing to accomplish the task. This study attempts to identify emerging topics by focusing on novel future work sentences. These sentences describe authors' prospects for subsequent research directions and provide a reference for grasping the latest research trends. This study focuses on the field of Natural Language Processing (NLP) and constructs a corpus of future work sentences. We then demonstrate the effectiveness of future work sentences in the identification of emerging topics. Finally, we apply the life-index novelty measurement method to assess the novelty of entities in future work sentences and filter emerging entities based on their novelty and influence. Building on this, we identify emerging research topics in conjunction with the corresponding research tasks of the papers. The results indicate that optimizations and applications of pre-trained language models represent a significant emerging research topic in this domain.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Future work sentence</kwd>
        <kwd>Emerging research topics</kwd>
        <kwd>Entity extraction</kwd>
        <kwd>Academic articles</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The identification of emerging topics can provide researchers with valuable insights into future
research directions and help funding agencies to optimize the allocation of research funds [1].
Currently, most researchers focus on historical data such as citations to identify emerging topics.
However, identifying emerging topics through past topics has a time lag and does not meet the
predictive needs of policy makers and researchers [2]. Predicting future research topics is often
uncertain, making this task even more challenging.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Introduction</title>
      <p>The identification of emerging topics can provide researchers with valuable insights into future
research directions and help funding agencies to optimize the allocation of research funds [1].
Currently, most researchers focus on historical data such as citations to identify emerging topics.
However, identifying emerging topics through past topics has a time lag and does not meet the
predictive needs of policy makers and researchers [2]. Predicting future research topics is often
uncertain, making this task even more challenging.</p>
      <p>In the conclusion of academic articles, authors outline their perspectives on future research
directions, termed future work sentences. Future work sentences emphasize potential directions for
improvement and are more forward-looking compared to titles or abstracts, making them valuable
clues for identifying emerging research topics. However, researchers' expressions of future work
sentences are either explicit or ambiguous. Some sentences with low reference significance will be
found when reading them [3]. As illustrated in Figure 1, the paragraph includes four future work
sentences. The first sentence highlights the need to improve word prediction accuracy but does not
specify methodologies, while the following sentences provide detailed directions for enhancement.</p>
      <p>Thus, future work sentences must be screened to enhance understanding of subsequent
research. Current research effectively identified future work in papers and summarizes its
characteristics by extracting keywords. However, most studies primarily analyzed word frequency
in sentences, neglecting to explore whether annual future work sentences can reflect emerging
topics of the time [4]. Therefore, this paper takes the field of Natural Language Processing (NLP) as
an example to further analyze the specific content of future work sentences, combining emerging
entities with research topics to discover emerging research topics.</p>
      <p>Specifically, we first construct a corpus of future work sentences in the field of NLP, then
analyze their effectiveness in identifying emerging research topics through semantic similarity
analysis. We also assess how many years' worth of future work sentences can contribute to this
identification. Finally, we extract emerging entities from each year's future work sentences using
the entity novelty measurement method, enabling us to infer emerging research topics in NLP.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>This section provides an overview of related work from two perspectives: the identification and
analysis of future work sentences, and the prediction of emerging research topics.</p>
      <sec id="sec-3-1">
        <title>3.1. Extraction and Content Analysis of Future Work Sentences</title>
        <p>Current research on future work sentences primarily falls into identification and content analysis.
In identification, researchers mainly employ rule-based matching and machine learning or deep
learning methods. While rule-based matching achieves high accuracy, its reliance on numerous
rules makes it impractical for large academic texts. Consequently, recent studies increasingly
utilize machine learning and deep learning models. Hao et al. provided an annotated dataset of
future work sentences from the Association for Computational Linguistics (ACL) conference [5].
Zhang et al. expanded this dataset and used it as a training set to train a machine learning
classification model for identifying future work sentences [4]. Zhu et al. utilized the BERT model
for the automatic extraction of future work sentences [6].</p>
        <p>In terms of content mining of future work sentences, Hu and Wan first categorized future work
sentences into four types and analyzed the distribution of keywords in different research areas
within computational linguistics [7]. Li et al. matched keywords in future work sentences with
those in titles and abstracts to explore the conceptual connections between scientific papers and
their future work sentences [8]. Hao et al. further categorized future work sentences in the NLP
domain into six major categories and seventeen subcategories, analyzing the specific distribution of
each type [5]. Qian et al. specifically analyzed the distribution characteristics of six future work
11 https://aclanthology.org/2018.iwslt-1.3/
sentence types in the field of natural language processing, as well as the focus of future work on
different tasks in this field [9]. Zhang et al. analyzed research tasks and hot topics in future work
sentences within NLP, and demonstrated the feasibility of using future work sentences to predict
future research priorities [4]. Song et al. utilized future work sentences to generate academic
innovation topics, offering new insights for technological innovation [10]. Xie et al. examined
future work in integrated publishing to explore future research focuses and the evolution of
frontier topics, providing valuable references for subsequent studies [11]. Suray et al. analyzed
future work sentences from 29 papers presented at the SOUPS symposium, finding most sentences
to be broad and vague, with limited impact on subsequent citations [12].</p>
        <p>In summary, the identification of future work sentences has become a relatively straightforward
task, with machine learning algorithms enabling accurate and efficient recognition. However, in
terms of content analysis, current research mainly focuses on the frequency of keywords. Further
studies are needed to conduct more fine-grained analyses of future work sentences.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Identifying of Emerging Research Topics</title>
        <p>In recent years, identifying emerging topics has become a significant focus in academia, yet
researchers lack a unified consensus on the definition and related attributes of this concept. [13].
Rotolo et al. identified several attributes of emerging technologies: radical novelty, rapid growth,
coherence, significant impact, and uncertainty [14]. On this basis, Wang provided a comprehensive
definition of emerging research topics as those that are novelty, rapidly growing, coherent, and
influential [1]. This definition serves as the primary detection criteria for most researchers
identifying emerging topics.</p>
        <p>The study of emerging research topics dates back to 1965 [15]. Identifying these topics primarily
relies on bibliometric and NLP methods. Some researchers applied bibliometric methods based on
citation networks, including direct citation, co-citation, and citation coupling networks [16][17]
[18]. Shibata et al. argued that direct citation network can better discover emerging topics from the
perspectives of visibility, speed, and relevance [19]. Kwon et al. demonstrated through direct
citations that the emergence of emerging concepts is directly proportional to their future influence
[20]. Meanwhile, Boyack et al. verified that using citation coupling for identification the best
results from the perspectives of text coupling and network centrality [21].</p>
        <p>Additionally, researchers utilized NLP techniques for identification. Ohniwa et al. formed
emerging themes through co-word analysis of keywords [22]. Liu et al. combined keyword
cooccurrence networks, co-citation networks to study emerging research trends [23]. Xu et al.
considered four attributes of emerging topics and employed various machine learning methods for
emerging topic identification [24]. Ma et al. combined LDA, SAO analysis, machine learning and
expert judgement to identify potential development opportunities for emerging technologies [25].
Alattar and Shaalan applied a filtered-LDA model to discover emerging themes [26]. Yang et al.
used ecological theories to assess the emergence potential of keywords and identify emerging
topics [27]. Wei et al. framed the detection of emerging topics as a cover article prediction problem,
using various machine learning methods to predict cover papers [28]. Song et al. proposed a
method combining the BERT model with semantic analysis to identify the proportion of emerging
technologies [29].</p>
        <p>In addition to identifying emerging research topics, researchers have also made efforts to
predict future emerging topics. Jung et al. constructed a thematic network to analyze the evolution
of themes, enabling them to prospectively predict subsequent research topics [30]. Yang et al.
employed an LSTM model to predict the future emerging index of entity features, thereby
identifying emerging research topics [31].</p>
        <p>In summary, the development of NLP techniques has transcended the limitations of
citationbased analysis, enabling researchers to explore more textual features for deeper and more
comprehensive insights. Additionally, researchers are making efforts to further predict emerging
research topics in the future. This study broadens the analytical perspective on this task and
conducts the analysis by referencing existing emerging indicators.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This research aims to analyze and summarize emerging entities found in future work sentences
to identify emerging research topics in specific fields. The overall research process is illustrated in
Figure 2, which is divided into three main steps. The first step involves constructing a future work
sentence corpus. The second step is to validate the effectiveness of these sentences in identifying
emerging topics. The third part involves calculating entity novelty to identify emerging entities
and recognizing emerging research topics in conjunction with specific research tasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Construction of future work sentences corpus in the field of NLP</title>
        <p>We choose the field of NLP due to its significant role in handling vast amounts of data and
advancing artificial intelligence, especially with the emergence of large models like GPT, making it
a widely recognized area of research in recent years. Currently, the future work sentence corpus in
the field of NLP primarily consists of collections from three highly regarded conferences: the
Association for Computational Linguistics (ACL), Empirical Methods in Natural Language
Processing (EMNLP), and the North American Chapter of the Association for Computational
Linguistics (NAACL) [4]. Identifying emerging research topics in a field requires analysis based on
a substantial body of existing research. Therefore, to provide reliable data support for subsequent
analyses of future work sentences, we construct a more comprehensive corpus of future work
sentences in the NLP field.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1. Data filtering and acquisition</title>
        <p>The ACL Anthology2 includes major conference and journal papers in Computational Linguistics
(CL) and Natural Language Processing (NLP). Some researchers have already built NLP databases
based on this site [32]. However, most datasets were constructed earlier and have not been updated
in a timely manner. We selected the newly released ACL OCL dataset, which includes the
structured full texts of 74,000 academic papers as of September 2022 [33]. This corpus significantly
reduces the time cost of data acquisition. Considering factors such as paper quality, type, and
language, we select all long and short papers in English from 2000 to 2022 across 46 major
conferences, excluding demo and workshop papers. These conferences are authoritative in the NLP
field and provide a comprehensive overview of various research topics and the latest
advancements.</p>
        <p>Additionally, to ensure data completeness, we supplement the corpus with papers published
after September 2022 that were initially missing, resulting in a total of 37,791 academic articles. The
number of papers published each year is shown in Figure 3. The lower counts in some years
compared to the previous year can be attributed to certain conferences not being held that year.
Overall, the number of research papers has shown an upward trend, particularly after 2019, when
the volume nearly doubled. This indicates the accelerating pace of development in the NLP field in
recent years, further highlighting the growing need to quickly grasp the latest research trends.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.1.2. Identification of future work sentences in the field of NLP</title>
        <p>First, we apply a set of rules to initially extract paragraphs likely to contain future work. If a paper
has a dedicated Future Work section, all sentences within that section are included as candidates.
Otherwise, we only select paragraphs containing any of 42 relevant phrases, such as "In the future",
"Future research”, or "Future direction". Finally, we segment the selected paragraphs into individual
sentences.</p>
        <p>To further identify which sentences are future work sentences, we employ the future work
sentence identification model developed by Zhang et al. [4]. They trained a future work sentence
identification model based on Naive Bayes using over 9,000 labeled samples, achieving the F1 score
of 90.73%. Thus, using this model allows for accurate identification of future work sentences in our
dataset. We use the training set they provided and apply the future work sentence recognition
model to automatically identify all the selected target sentences. Additionally, to ensure greater
accuracy and completeness of the extraction, we manually perform further filtering by removing
sentences in the past tense. We also identify sentences starting with pronouns and merge them
with the preceding ones.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.1.3. Fine-grained entity extraction in future work sentences</title>
        <p>To thoroughly analyze the content value of future work sentences, we extract fine-grained
knowledge entities from these sentences for further research. We utilize the entity extraction
model developed by Zhang et al. [34], which focuses on the NLP domain. They randomly selected
50 NLP papers for entity annotation, categorizing entities into four types: methods, tools, metrics,
and datasets, as summarized in Table 1. These four types of entities effectively encompass the
research content in the NLP field, particularly the method entities, which serve as the main driving
force behind advancements in NLP research [35].</p>
        <p>Moreover, the entity recognition model they developed is based on SciBERT and employs a
cascading binary tagging framework. To enhance the model's performance and robustness, they
designed a semi-supervised approach to expand the dataset. This was achieved by matching
sentences containing annotated entities from unannotated abstracts, with only those samples
where all matched entities were already annotated being included. An equal number of sample data
points were added to the training set, effectively addressing data scarcity.</p>
        <sec id="sec-4-4-1">
          <title>Algorithms or models to tackle NLP tasks</title>
          <p>Example
SVM,
BERT,</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>LSTM,</title>
        </sec>
        <sec id="sec-4-4-3">
          <title>Dataset</title>
        </sec>
        <sec id="sec-4-4-4">
          <title>Relevant data resources</title>
        </sec>
        <sec id="sec-4-4-5">
          <title>Twitter, WordNet</title>
        </sec>
        <sec id="sec-4-4-6">
          <title>Metric</title>
        </sec>
        <sec id="sec-4-4-7">
          <title>Tool</title>
        </sec>
        <sec id="sec-4-4-8">
          <title>Evaluation metrics tailored to specific tasks</title>
        </sec>
        <sec id="sec-4-4-9">
          <title>Accuracy, BLEU</title>
        </sec>
        <sec id="sec-4-4-10">
          <title>Open-source tools etc. used in the experiment</title>
        </sec>
        <sec id="sec-4-4-11">
          <title>Python, SQL</title>
          <p>The entity extraction model constructed using this method outperforms existing baseline
models, achieving an F1 score of 87%, and demonstrates strong performance on the SciERC and
TDM open datasets. This indicates that the model is well-suited for entity extraction tasks in the
NLP field, fulfilling the needs of our research for extracting entities from future work sentences.</p>
          <p>Finally, due to the existence of different expressions for the same entity, we normalize them by
constructing an entity normalization dictionary to eliminate their impact on the experimental
results. Specifically, we first match high-frequency abbreviated entities with their full forms, while
low-frequency abbreviations are identified by searching for their full forms within the
corresponding papers. Additionally, we perform lemmatization and remove plural forms. By
calculating the similarity between entities, we standardize different expressions of the same entity,
determining whether entities with high similarity share the same meaning, and replace them with
the most formal full form. Lastly, we replace all entities requiring conversion with their
standardized representations.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.2. Analysis of content differences and effectiveness of future work sentences</title>
        <p>Researchers generally believe that future work sentences can, to some extent, reflect the
development trends of a field. However, there is a lack of empirical studies confirming their role in
this regard. Therefore, we aim to analyze whether future work sentences can capture the evolution
of research fields by examining the content differences across years and the similarity between
future work sentences and subsequent research. This analysis will help determine which years'
future work sentences should be used to identify emerging research topics.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.2.1. Analysis of changes in the content of future work sentences</title>
        <p>We calculate the similarity differences among future work statements by year ， to analyze
whether the annual variations in future work content can reflect shifts in research directions.
Specifically, we extract the set of knowledge entities contained in all future work sentences,
denoted as W= {w1, …, wn}, where n is the total number of entities. For each year's future work
sentences, we construct an n-dimensional vector, with each dimension corresponding to a
knowledge entity. The value of each dimension is determined by the frequency of the
corresponding entity appearing in the sentences of that year. After constructing the feature vectors
for each year’s future work, we calculate the cosine similarity between the future work vector of a
specific year and the future work vectors of subsequent years to reflect the differences among the
collections of future work sentence across different years. The formula for cosine similarity is as
follows:
cos ( vi , v j )=</p>
        <p>
          vi⋅ v j ,
|vi|⋅|v j|
(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
        </p>
        <p>Where, vi represents the collection of future work sentences for year i, and vj represents the
collection for year j, where  &gt; j.</p>
      </sec>
      <sec id="sec-4-7">
        <title>4.2.2. Effectiveness analysis of using future work sentences to identify emerging topics</title>
        <p>
          Furthermore, we analyze the relevance between the content of future work sentences from specific
years and the content of abstracts from subsequent research papers. Based on this, we can
determine how many years of past future work content are needed to predict subsequent research
directions by analyzing the trend of similarity changes. Specifically, we again construct
entitybased vector representations for the future work vector fk of year k, and the abstract vector ak+n of
year k+n following the same steps as before. We then calculate the cosine similarity between the
future work sentence vector from year k and the abstract vector from year k+n:
cos ( f k , ak+n )= f k⋅ ak+n ( n&gt;0 ) (
          <xref ref-type="bibr" rid="ref2">2</xref>
          )
        </p>
        <p>|f k|⋅|ak+n|</p>
        <p>If the similarity is high, it indicates that the future work proposed by researchers in that year is
reflected in subsequent research, demonstrating its reference value.</p>
      </sec>
      <sec id="sec-4-8">
        <title>4.3. Identifying emerging research topics through novelty analysis of entities in future work sentences</title>
        <p>After confirming that future work sentences can reflect changes in research content, we extract
fine-grained entities from the future work sentences of each year for analysis. The rationale for
selecting knowledge entities lies in their ability to clearly represent the improvement directions
researchers focus on, while also minimizing the noise often introduced by keyword extraction.</p>
        <p>Novelty is a critical metric for evaluating emerging research topics. Researchers commonly
assess the novelty of academic papers through combinatorial innovation [36]. However, at a
finegrained semantic level, greater emphasis is placed on identifying new content within existing work
[37]. We use the life-index novelty measurement to characterize the novelty of each entity [41].
This method introduces the concept of a term life index to characterize the recency of terms. In this
paper, the life index of a single entity e in a future work sentence S is calculated as follows:</p>
        <p>Lifeindex ( e )= N ( e )× ln ( T S−T e+1 )</p>
        <p>Where, TS represents the time when the future work sentence S was proposed; Te denotes the
time when the entity e first appeared in the dataset; and N(e) indicates the number of times the
entity e appeared in the future work sentence dataset during the period [Te，TS]. The smaller the
value of Lifeindex(e), the shorter the lifecycle of the entity.</p>
        <p>To better represent the novelty of each entity, we define the novelty score for each entity as
follows:</p>
        <p>Lif en(e)=1−</p>
        <p>ln ( xi+1 )
lnmax ( xi+1 )</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref3">3</xref>
          )
(
          <xref ref-type="bibr" rid="ref4">4</xref>
          )
        </p>
        <p>Where, xi represents the Lifeindex(e) value of entity e in a specific year, and max(xi+1) is the
maximum value of all Lifeindex(e) values for that entity. We do not consider the case where max(xi)
is 0, as this indicates that the entity was mentioned only once across all years, potentially
compromising the reliability of the results due to noise.</p>
        <p>Next, we filter out the emerging entities for each year based on their novelty. These entities
must appear for the first time in a given year and be mentioned in subsequent research to ensure
their continuity and impact. Additionally, we annotate the research tasks in abstracts that contain
these emerging entities. Although many research papers have multiple tasks, and some tasks may
overlap, our annotation focuses only on the primary and most specific task of each paper. Finally,
we analyze the co-occurrence of these entities with research tasks to identify emerging research
topics in the NLP field.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Result</title>
      <p>In this section, we present the results of entity extraction from future work sentences, analyze the
changes in the content of these sentences and their similarity to subsequent research. Finally, by
incorporating the novelty of the entities, we identify emerging entities within the future work
sentences and summarize the emerging topics in the field of NLP.</p>
      <sec id="sec-5-1">
        <title>5.1. Entity extraction results for future working sentences in NLP</title>
        <p>Using the above method, we identify 19,730 academic articles containing future work sentences,
accounting for 42.4% of all papers. A total of 45,799 future work sentences are collected, mainly
appearing in sections such as Conclusion, Conclusion and Future Work, and Future Work, with
most found in the Conclusion section.</p>
        <p>Subsequently, we conduct entity extraction and normalization on the future work sentences. A
total of 22,358 future work sentences yield 28,533 extracted knowledge entities. The entity
extraction model categorizes these entities into four main types: methods, dataset, metrics, and
tools, with the specific distribution detailed in Table 2. Among these, method-related entities are
the most prevalent, with a total of 19,246, followed by data-related entities.</p>
        <p>Finally, we examine the high-frequency entities among the four types identified in the
sentences pertaining to future work, with Table 3 illustrating the top five entities by frequency. It
is evident that BERT is a prominent topic within the method entities, being referenced with greater
frequency than other entities. Language models also receive considerable mentions, as they serve
as foundational elements in the NLP field and remain crucial across various developmental phases.
They have evolved from early statistical models like N-grams to contemporary pre-trained models
like BERT. Regarding dataset entities, researchers primarily concentrate on two categories: one
includes datasets related to various domains of knowledge and information, such as Wikipedia and
Twitter, which are suitable for practical NLP tasks. The other category consists of datasets
primarily containing semantic information and syntactic data, like WordNet, Treebank, and
FrameNet, which serve to enhance the performance of language models. Overall, these two
categories of data underpin all research tasks in the NLP field, underscoring the importance of data
resources in this area.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results of the analysis of the changes in the content and validity of future work sentences</title>
        <p>To analyze whether future work sentences can be used to discover emerging research topics, we
consider two aspects: the evolution of future work content and its similarity to subsequent
research. On one hand, we can assess whether the content of future work changes each year in
alignment with shifts in research trends. On the other hand, we can examine whether future work
is reflected in later research, helping us identify which years' future work sentences are valuable
for detecting emerging topics. The detailed experimental results are presented below.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.2.1. Results of changes in the content of future work sentences in different years</title>
        <p>The results of the content similarity calculations for future work sentences across different years
are shown in Figure 4. Overall, there are noticeable differences in future work sentences from
different years, with greater disparities observed as the year gap increases. This trend is
particularly evident in recent years, especially after 2016, when the future work sentences from
that year are primarily similar to those from the previous two years. Notably, the similarity of the
future work in 2019 to the previous two years is relatively low. This may be attributed to disruptive
developments in the NLP field during 2018-2019, prompting researchers to make significant
adjustments in their research outlooks. After 2019, each year's future work exhibits substantial
differences from the content before 2019, while the similarity to the previous year has noticeably
increased. This indicates a major shift in research directions in the field after 2019, with some
research outcomes receiving widespread attention and consensus among researchers.</p>
        <p>In summary, the NLP field has witnessed some disruptive research outcomes in recent years,
capturing the attention of researchers. This shift is clearly evident in the changes in future work,
suggesting that using future work sentences to infer emerging research topics is a feasible
approach.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.2.2. Analysis of the results of the validity measure of future work sentences</title>
        <p>The cosine similarity calculations between the future work sentences of each year and the
subsequent yearly abstract collections are shown in Figure 5. There is a certain degree of similarity
between the content of each year's future work sentences and the content of subsequent research
papers, with the similarity tending to decrease as the year gap increases. This indicates that the
field is developing rapidly, making it challenging to predict recent research topics based on earlier
future work.</p>
        <p>It is noteworthy that before 2019, the similarity values between each year's future work and the
research content of subsequent years do not vary much. However, there is a significant decline in
similarity after 2019, indicating that influential new research outcomes emerge in 2018, resulting in
substantial differences from previous future work. After 2019, the future work becomes more
similar to the subsequent research content. Therefore, to grasp the current emerging research
topics in the field of NLP, it is crucial to pay particular attention to the future work produced after
2018.</p>
        <p>In conclusion, future work sentences effectively reflect the research trends in the field of NLP.
Significant changes in research content occur between 2018 and 2019, resulting in substantial shifts
in future work. Therefore, the period from 2018 to 2021 can be used as the time window for
identifying emerging research topics in subsequent studies.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.3. Results of emerging topic identification based on emerging entities in future work sentences</title>
        <p>In the previous section, we validated the reliability of future work sentences for identifying
emerging topics. In this section, we present the emerging entities found in future work sentences
in the NLP field in recent years, as well as the emerging topics.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.3.1. Emerging entities in future work sentences</title>
        <p>Based on the above research findings, we filter emerging entities from the future work sentences of
the years 2018 to 2021. To ensure the novelty of these entities and their impact on subsequent
research, we only select those that first appear in a specific year with a total frequency of at least
five occurrences and continue to appear in subsequent years. Statistical analysis reveals that novel
entities related to tools, evaluations, and datasets rarely have consistent occurrences. Therefore, we
ultimately focus only on novel entities related to methods.</p>
        <p>Based on the above screening criteria, we identify emerging entities in future work sentences
from 2018 to 2021. Figure 6 illustrates the novelty score changes for some of these entities. It can be
seen that a rapid decline in novelty over the years for these entities. This indicates that these
entities have been widely referenced in subsequent years, representing current research interests.
Notably, the BERT entity, in its first year of appearance in future work sentences, is mentioned in
65 research papers, highlighting its significance.</p>
        <p>Note: The numbers in parentheses indicate the total frequency of each entity when it first appears
in the future work sentences.</p>
        <p>Figure 6: Changes in novelty scores of some emerging entities in future work sentences from 2018 and 2019</p>
        <p>Additionally, we observe a few emerging entities with a decline in frequency in future work
sentences, so we analyze their frequency changes in abstracts. Statistics show that 17 of the top 20
entities appearing in abstracts after 2017 are among the emerging entities we selected, indicating
our method effectively captures current research trends.</p>
        <p>Figure 7 presents the frequency changes of the top 15 emerging entities over the past five years,
revealing a gradual increase, which suggests a rise in related research papers. Entities like BERT
and Transformer are frequently mentioned, but their mention rates have declined recently, likely
due to fewer research papers published since 2020, as well as the emergence of some
wellperforming model variants, which led researchers to only mention those optimized models instead.
Notably, ELMo shows a growth trend in 2018-2019 but declines afterward, appearing only once in
2022. Thus, we conclude that the influence of ELMo has decreased and it should not be regarded as
an emerging research entity.</p>
        <p>
          In summary, we consider the novelty and impact of entities in future work sentences to identify
emerging entities, as shown in Table 4. We find that most entities relate to research methods
associated with pre-trained language models, with the Transformer and BERT models being the
most prominent. Many of the subsequent emerging entities evolve from these two foundations.
Transformer model(15), Self-attention(
          <xref ref-type="bibr" rid="ref7">7</xref>
          ), Few-shot learning(
          <xref ref-type="bibr" rid="ref7">7</xref>
          ),
Backtranslation(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ), Meta-learning(
          <xref ref-type="bibr" rid="ref2">2</xref>
          ), Generative adversarial network(
          <xref ref-type="bibr" rid="ref2">2</xref>
          ), Graph
neural network(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), Knowledge graph embedding model(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), Multi-headed
attention mechanism(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
BERT(65), Data augmentation techniques(18), Pre-trained language model(12),
Masked language model(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ), Curriculum learning(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ), GPT(
          <xref ref-type="bibr" rid="ref2">2</xref>
          ), XLNet(
          <xref ref-type="bibr" rid="ref2">2</xref>
          ),
RoBERTa(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), Knowledge distillation(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
GPT-2(15), T5(
          <xref ref-type="bibr" rid="ref6">6</xref>
          ), ALBERT(
          <xref ref-type="bibr" rid="ref5">5</xref>
          ), BART(
          <xref ref-type="bibr" rid="ref4">4</xref>
          ), GPT-3(
          <xref ref-type="bibr" rid="ref4">4</xref>
          ), XLM-R(
          <xref ref-type="bibr" rid="ref4">4</xref>
          ), DistillBERT(
          <xref ref-type="bibr" rid="ref4">4</xref>
          ),
Multilingual NMT(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ), Non-autoregressive Transformer(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), Multilingual
BERT(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), Transformer-based language models(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
Longformer(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ), ELECTRA(
          <xref ref-type="bibr" rid="ref2">2</xref>
          ), DeBERTa(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), CharacterBERT(
          <xref ref-type="bibr" rid="ref1">1</xref>
          ), VisualBERT (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ),
SpanBERT(
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
Note: The number following the entity represents the frequency of that entity appearing in that
year. The bolded entities are those ranked among the top 20 in frequency within abstracts
published after 2017.
        </p>
      </sec>
      <sec id="sec-5-7">
        <title>5.3.2. Results of emerging topic identification</title>
        <p>Finally, we filter out the research papers from 2021 and 2022 that contain emerging entities,
resulting in a total of 1,629 papers. We annotate the research tasks for each paper. Next, we
construct a co-occurrence network of emerging entities and research tasks, as shown in Figure 8.</p>
        <p>It can be observed that most emerging entities are concentrated on the analysis and
optimization of pre-trained language models, particularly focusing on model compression
techniques like knowledge distillation and applying techniques such as data augmentation and
post-editing to enhance the transferability of models to specific tasks. Additionally, researchers
widely apply pre-trained models to research tasks such as machine translation, sentiment analysis,
and named entity recognition, while also optimizing models to meet the demands of specific
domains like Healthcare and Finance. Overall, the emerging research topics in the field of NLP
primarily center around pre-trained language models. Based on the co-occurrence network and
literature review, we summarize the following emerging research topics.</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) Optimization and development of existing pre-trained language models
        </p>
        <p>Since 2018, with the introduction of Transformer and pre-trained language models like BERT
and GPT, researchers have evaluated the reliability of these models from various perspectives. To
ensure model performance in specific tasks—especially in low-shot, zero-shot, and cross-lingual
tasks—researchers have implemented improvements such as few-shot or zero-shot learning
techniques to reduce reliance on data annotation. Additionally, model compression methods like
knowledge distillation and pruning have been employed to lower the cost of pre-trained models.
These enhancements are widely reflected in tasks such as machine translation, sentiment analysis,
and text generation.</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) Application of pre-trained language models and large models in other fields
        </p>
        <p>The transferability of pre-trained models to other domains has also been further explored,
particularly with large pre-trained models like GPT-3, which possess vast amounts of training data
and demonstrate strong transfer capabilities. These models not only excel in NLP tasks but also
find effective applications in specific fields such as Social Sciences and Healthcare. Especially, as
researchers' interest in applying NLP to the social sciences grows, numerous research tasks, such as
those addressing harmful comments, have emerged [39]. Therefore, the success of pre-trained
language models drives further advancements in other domains, which in turn stimulates
researchers to pursue more in-depth studies on domain adaptability.</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) Development of multimodal and multilingual models
        </p>
        <p>Additionally, we observe that recent optimization models integrate more diverse data sources,
such as audio and video. These optimized models are widely applied in tasks like machine
translation, dialogue, and interactive systems. Therefore, how to combine multiple modalities to
overcome the limitations of text data and achieve good results in low-resource language tasks is
currently a key concern for researchers in the field of natural language processing.
Note: Blue circles represent entities, red circles represent research tasks, and only research tasks
that appear five times or more between 2021 and 2022 are shown.</p>
        <p>Figure 8: Emerging entities and tasks co-occurrence network</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and future works</title>
      <p>With advancements in technology and the continuous increase in academic articles, the demand
for grasping the latest research trends in specific fields is also on the rise. This paper focused on the
field of NLP, constructing a corpus of future work sentences within this domain and employing
entity extraction techniques to collect a set of entities from these sentences. Based on this, we
validated the reliability of future work sentences in recognizing emerging topics by comparing the
similarities among future work sentence sets and with subsequent research content. Finally, we
filtered the emerging entities for each year based on entity novelty analysis and summarized the
emerging topics in the NLP field in conjunction with the research tasks in this domain. The results
indicated that pre-trained language models have garnered widespread attention in the NLP field,
suggesting that more research and analysis in this area will emerge in the future.</p>
      <p>However, this paper also has some limitations. Firstly, relying solely on entity extraction makes
it difficult to capture the true semantic information, resulting in somewhat rough predictions.
Therefore, fine-grained analysis of the semantic content in future work sentences is needed.
Secondly, we have only considered the novelty of entities in future work sentences; in the future,
we will incorporate other content from the text and use entity co-occurrence networks to better
assess entity novelty [40]. Finally, we have only identified emerging research topics in the NLP
field without evaluating the quality of the research outcomes. Consequently, we will further
evaluate the reliability of the research results in the future, such as by using topic modeling for
analysis and comparison [41].</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This study is supported by the National Natural Science Foundation of China (Grant No. 72074113).
The paper is presented at the second Workshop on “Innovation Measurement for Scientific
Communication (IMSC) in the Era of Big Data” at 2024 ACM/IEEE Joint Conference on Digital
Libraries (JCDL).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 in order to correct grammatical
errors, typos, and other writing mistakes. After using this tool, the authors reviewed and edited the
content as needed and takes full responsibility for the publication’s content.
[12] J. Suray, et al., How the future works at SOUPS: Analyzing future work statements and their
impact on usable security and privacy research, arXiv preprint arXiv:2405.20785, (2024).
[13] S. Xu, L. Hao, X. An, H. Pang, T. Li, Review on emerging research topics with key-route main
path analysis, Scientometrics, 122 (2020) 607–624.
[14] D. Rotolo, D. Hicks, B. R. Martin, What is an emerging technology?, Research Policy, 44 (2015)
1827–1843.
[15] D. J. D. S. Price, Networks of scientific papers: The pattern of bibliographic references
indicates the nature of the scientific research front, Science, 149 (1965) 510–515.
[16] Y. Kajikawa, et al., Tracking emerging technologies in energy research: Toward a roadmap for
sustainable energy, Technological Forecasting and Social Change, 75 (2008) 771–782.
[17] H. Small, K. W. Boyack, R. Klavans, Identifying emerging topics in science and technology,</p>
      <p>Research Policy, 43 (2014) 1450–1467.
[18] M.-H. Huang, C.-P. Chang, Detecting research fronts in OLED field using bibliographic
coupling with sliding window, Scientometrics, 98 (2014) 1721–1744.
[19] N. Shibata, et al., Comparative study on methods of detecting research fronts using different
types of citation, Journal of the American Society for Information Science and Technology, 60
(2009) 571–580.
[20] S. Kwon, et al., Research addressing emerging technological ideas has greater scientific impact,</p>
      <p>Research Policy, 48 (2019) 103834.
[21] K. W. Boyack, R. Klavans, Co‐citation analysis, bibliographic coupling, and direct citation:
Which citation approach represents the research front most accurately?, Journal of the
American Society for Information Science and Technology, 61 (2010) 2389–2404.
[22] R. Ohniwa, A. Hibino, K. Takeyasu, Trends in research foci in life science fields over the last
30 years monitored by emerging topics, Scientometrics, 85 (2010) 111–127.
[23] Z. Liu, et al., Visualizing the intellectual structure and evolution of innovation systems
research: A bibliometric analysis, Scientometrics, 103 (2015) 135–158.
[24] S. Xu, et al., Emerging research topics detection with multiple machine learning models,</p>
      <p>Journal of Informetrics, 13 (2019) 100983.
[25] T. Ma, et al., Combining topic modeling and SAO semantic analysis to identify technological
opportunities of emerging technologies, Technological Forecasting and Social Change, 173
(2021) 121159.
[26] F. Alattar, K. Shaalan, Emerging research topic detection using filtered-LDA, AI, 2 (2021) 578–
599.
[27] J. Yang, et al., A novel emerging topic detection method: A knowledge ecology perspective,</p>
      <p>Information Processing &amp; Management, 59 (2022) 102843.
[28] W. Wei, H. Liu, Z. Sun, Cover papers of top journals are reliable source for emerging topics
detection: A machine learning based prediction framework, Scientometrics, 127 (2022) 4315–
4333.
[29] B. Song, C. Luan, D. Liang, Identification of emerging technology topics (ETTs) using
BERTbased model and semantic analysis: A perspective of multiple-field characteristics of patented
inventions (MFCOPIs), Scientometrics, 128 (2023) 5883–5904.
[30] S. Jung, R. Datta, A. Segev, Identification and prediction of emerging topics through their
relationships to existing topics, in 2020 IEEE International Conference on Big Data (Big Data),
(2020) 5078–5087.
[31] Z. Yang, W. Zhang, Z. Wang, et al., A deep learning-based method for predicting the emerging
degree of research topics using emerging index, Scientometrics, 129 (2024) 4021–4042.
doi:10.1007/s11192-024-05068-2.
[32] Bollmann, Marcel, et al. Two decades of the ACL Anthology: Development, impact, and open
challenges, Proceedings of the 3rd Workshop for Natural Language Processing Open Source
Software (NLP-OSS 2023), (2023) 83-94.
[33] S. Rohatgi, et al., The ACL OCL corpus: Advancing open science in computational linguistics,
arXiv preprint arXiv:2305.14996, (2023).
[34] H. Zhang, C. Zhang, Y. Wang, Revealing the technology development of natural language
processing: A scientific entity-centric perspective, Information Processing &amp; Management, 61
(2024) 103574.
[35] A. Pramanick, et al., A diachronic analysis of paradigm shifts in NLP research: When, how,
and why?, arXiv preprint arXiv:2305.12920, (2023).
[36] B. Uzzi, et al., Atypical combinations and scientific impact, Science, 342.6157 (2013) 468-472.
[37] T. Ghosal, et al., Novelty detection: A perspective from natural language processing,</p>
      <p>Computational Linguistics, 48.1 (2022) 77-117.
[38] Z. Luo, et al., Combination of research questions and methods: A new measurement of
scientific novelty, Journal of Informetrics, 16 (2022) 101282.
[39] A. Pramanick, et al., The Nature of NLP: Analyzing Contributions in NLP Papers, arXiv
preprint arXiv:2409.19505, (2024).
[40] Z. Wang, et al., Content-based quality evaluation of scientific papers using coarse feature and
knowledge entity network, Journal of King Saud University-Computer and Information
Sciences 36.6, (2024) 102119.
[41] P. Savov, et al., Identifying breakthrough scientific papers, Information Processing &amp;
Management 57.2, (2020) 102168.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A bibliometric model for identifying emerging research topics</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          ,
          <volume>69</volume>
          (
          <year>2018</year>
          )
          <fpage>290</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liang</surname>
          </string-name>
          , et al.,
          <article-title>Combining deep neural network and bibliometric indicator for emerging research topic prediction</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          ,
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>102611</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Teufel</surname>
          </string-name>
          ,
          <article-title>Do "Future Work" sections have a real purpose? Citation links and entailment for global scientometric questions</article-title>
          ,
          <source>Proceedings of the 2nd Joint Workshop on Bibliometricenhanced Information Retrieval and Natural Language Processing for Digital Libraries</source>
          , Tokyo, Japan, (
          <year>2017</year>
          )
          <fpage>7</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Automatic recognition and classification of future work sentences from academic articles in a specific domain</article-title>
          ,
          <source>Journal of Informetrics</source>
          ,
          <volume>17</volume>
          (
          <year>2023</year>
          )
          <fpage>101373</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hao</surname>
          </string-name>
          , et al.,
          <article-title>The ACL FWS-RC: A dataset for recognition and classification of sentences about future works</article-title>
          ,
          <source>Proceedings of the ACM/IEEE Joint Conference on Digital Libraries</source>
          , (
          <year>2020</year>
          )
          <fpage>261</fpage>
          -
          <lpage>269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Recognizing sentences concerning future research from the full text of JASIST, Proceedings of the Association for Information Science</article-title>
          and Technology,
          <volume>56</volume>
          (
          <year>2019</year>
          )
          <fpage>858</fpage>
          -
          <lpage>859</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <article-title>Mining and analyzing the future works in scientific articles</article-title>
          ,
          <source>arXiv preprint arXiv:1507.02140</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Using a keyword extraction pipeline to understand concepts in future work sections of research papers</article-title>
          ,
          <source>Proceedings of International Conference on Scientometrics &amp; Informetrics</source>
          , (
          <year>2019</year>
          )
          <fpage>87</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          , et al.,
          <article-title>Using future work sentences to explore research trends of different tasks in a special domain</article-title>
          ,
          <source>Proceedings of the Association for Information Science and Technology</source>
          ,
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>532</fpage>
          -
          <lpage>536</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Identifying academic creative concept topics based on future work of scientific papers</article-title>
          ,
          <source>Data Analysis and Knowledge Discovery</source>
          ,
          <volume>5</volume>
          (
          <year>2021</year>
          )
          <fpage>10</fpage>
          -
          <lpage>20</lpage>
          . doi:
          <volume>10</volume>
          .11925/infotech.2096-
          <fpage>3467</fpage>
          .
          <year>2020</year>
          .
          <volume>1275</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. Zhang,</surname>
          </string-name>
          <article-title>Research on mining future work sentences of academic papers for discovering cutting-edge topics in integrated publishing</article-title>
          ,
          <source>Information Engineering</source>
          ,
          <volume>9</volume>
          (
          <year>2023</year>
          )
          <fpage>123</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>