A research topic evolution prediction approach based on multiplex-graph representation learning ⋆

Yang Zheng1, Kaiwen Shi2, Yuhang Dong1, Xiaoguang Wang3 and Hongyu Wang1,∗

1 School of Management, Wuhan University of Technology, Wuhan, China
2 School of Information Engineering, Zhongnan University of Economics and Law, Wuhan, China
3 School of Information Management, Wuhan University, Wuhan, China
Abstract
The intensification of international technological innovation competition and the evolution of scientific research paradigms have led to a continuous expansion of the scientific literature, making information analysis increasingly complex and diversified. Traditional methods of expert evaluation and visualization analysis based on scientific knowledge networks are inadequate for accurately assessing topic evolution in the context of vast literature big data. From the perspectives of artificial intelligence and big data, this paper proposes a universal method for the automated, intelligent discrimination and prediction of research topic evolution hotness. The method integrates content and structural features of keywords to track the evolution of keyword frequency strength over time in research topic networks characterized by keywords. A case analysis is conducted in the field of information science. The results demonstrate that the prediction of keyword strength improves after integrating content and structural features, which has significant reference value for tasks such as discriminating future research topic evolution trends, setting research directions, and policy planning.

Keywords
topic evolution, keyword citation network, text mining, graph representation learning


1. Introduction

With the intensification of international technological innovation competition and the evolution of the fourth paradigm of scientific research driven by big data development, the growing volume of scientific literature, shifting scholarly interests, and the emergence of new research topics pose significant challenges to traditional methods of research topic analysis [1,2,3]. How to comprehensively and finely reveal research topics and their characteristic keywords representing knowledge innovation within a vast array of scientific literature, track the evolution of research topics, and represent them on multiple knowledge networks that contain knowledge units and their complex interactions, so as to judge the future evolutionary trends of research topics, is a key direction for science and technology information construction and services, as well as a research focus in the fields of informetrics and scientometrics.


⋆ Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online
∗ Corresponding author.
zhengyang2002@whut.edu.cn (Y. Zheng); shikaiwen@stu.zuel.edu.cn (K. Shi); dongyuhang@whut.edu.cn (Y. Dong); wxguang@whu.edu.cn (X. Wang); hongyuwang@whut.edu.cn (H. Wang)
ORCID: 0000-0001-5635-1131 (Y. Zheng); 0000-0002-3563-982X (K. Shi); 0009-0005-4618-5906 (Y. Dong); 0000-0003-1284-7164 (X. Wang); 0000-0002-5063-9166 (H. Wang)
© Copyright 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




    Current analyses of the evolution of scientific research topics largely unfold across three dimensions: content, structure, and strength [4,5]. Utilizing the powerful representation and feature learning capabilities of deep representation learning algorithms such as word embedding and graph embedding [6], it is possible to model the complex nonlinear relationships between the entities represented by keywords, the smallest units of knowledge [7,8]. Tracking changes in topic strength, as indicated by keyword frequency over time, can reflect the evolving trends of topics [9,10,11].
    Therefore, this paper adopts a multiplex-graph representation learning method that combines the interactions in topic keyword content and structure to assess changes in topic strength, thereby predicting topic evolution hotness. The field of "Information Science" is selected for case analysis, aiming to address the following two scientific questions:

    1. How can the evolution process of topics be revealed at the micro level, thereby tracking the evolutionary trends of research topics?
    2. How can the multi-dimensional features of research topic evolution be effectively and comprehensively modeled and integrated as representations on knowledge networks? After tracking these multi-dimensional evolutionary features, will the assessment of topic evolution trends become more accurate?

2. Methodology

This study aims to predict the strength of keyword frequency, using changes in keyword frequency strength to reflect variation in topic evolution hotness.
    The research process is divided into three steps. Step 1 involves retrieving and cleaning the source data to obtain all the data needed for the subsequent experiments. Step 2 involves obtaining content, structure, and strength representations of keywords. Step 3 utilizes deep learning models to integrate the multi-dimensional data representations and predict topic evolution hotness with the integrated experimental data. Figure 1 shows the detailed research process.
[Figure 1 is a flowchart with three panels: Step 1, Data Preparation (WOS retrieval, cleaning, and extraction of the TI, AB, DE, CR, PY, and DOI fields for 58,119 records from 2010-2023); Step 2, Multidimensional data representation (content, reference-structure, and strength representations of keywords); Step 3, Evolution prediction (representation fusion with GAT, GCN, and MLP, evaluated with MAE and MSE).]
Figure 1: Research process.

2.1. Data preparation

Field-specific literature is selected from databases such as Web of Science and Scopus, with titles, keywords, abstracts, and references extracted as basic data. After cleaning and filtering the data, an original dataset U is constructed, which specifically includes the keyword citation relationship dataset C, the keyword frequency dataset K, and the integrated dataset N containing titles, abstracts, and keywords.
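    As a rough illustration of this step, the sketch below builds the integrated dataset N and the frequency dataset K from a Web of Science export with Pandas. The file name, delimiter, and field handling are assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch of Section 2.1, assuming a tab-delimited WOS export with
# the fields listed in Figure 1 (TI, AB, DE, CR, PY, DOI); names are illustrative.
import pandas as pd

records = pd.read_csv("wos_export.txt", sep="\t",
                      usecols=["TI", "AB", "DE", "CR", "PY", "DOI"])
records = records.dropna(subset=["DE", "PY"])            # null-data validation

# Integrated dataset N: titles, abstracts, and keywords per paper
N = records[["TI", "AB", "DE", "PY", "DOI"]].copy()
N["keywords"] = N["DE"].str.lower().str.split("; ")      # DE holds author keywords

# Keyword frequency dataset K: keyword counts per publication year
K = (N.explode("keywords")
       .groupby(["PY", "keywords"]).size()
       .rename("frequency").reset_index())

# The keyword citation dataset C is derived from the CR (cited references) field
# together with DOI matching; see the citation-network sketch in Section 2.2.
```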
2.2. Multi-dimensional feature extraction

To prepare for multi-dimensional feature integration, this study performs feature extraction on keyword data across three representational dimensions: the content, structure, and strength of keywords.
    (1) Content feature extraction
In extracting keyword content features, this study opts to use the GloVe static word embedding method to capture the semantic relationships of keywords in their global context [12], facilitating the embedding of keywords, as it offers greater stability and requires fewer computational resources [13].




The principle is shown in formulas 2-1 and 2-2, where X_ik is the number of times word k appears in the context of word i, X_i is the total number of words appearing in the context of word i, and P_ij is the probability that word j appears in the context of word i.

    $X_i = \sum_k X_{ik}$                                                        (2-1)

    $P_{ij} = P(j \mid i) = \dfrac{X_{ij}}{X_i}$                                 (2-2)

    The generated word vectors are then used to calculate the cosine similarity between words with formula 2-3, where A and B are the vectors of two keywords. This process yields the semantic distance matrix $ES = \{ES^t\} = (es^t_{i,j})$, $i \neq j$, for keyword content feature extraction.

    $A = [a_1, a_2, \dots, a_n], \quad B = [b_1, b_2, \dots, b_n]$

    $Distance(A, B) = \dfrac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$      (2-3)
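    A minimal sketch of this content-feature step is given below: pairwise cosine similarity (formula 2-3) over GloVe keyword vectors produces the matrix ES for one time window. The embedding lookup `glove` is an assumed mapping from keyword to vector, not part of the original code.

```python
# Sketch of content feature extraction: cosine similarity (formula 2-3) between
# GloVe keyword vectors; `glove` is an assumed dict keyword -> np.ndarray.
import numpy as np

def semantic_matrix(keywords, glove):
    vecs = np.stack([glove[k] for k in keywords])          # shape (n_keywords, dim)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.clip(norms, 1e-12, None)              # guard against zero vectors
    es = unit @ unit.T                                      # pairwise cosine similarity
    np.fill_diagonal(es, 0.0)                               # only pairs with i != j are kept
    return es                                               # ES^t for one time window
```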
    (2) Structure feature extraction
Some scholars have proposed a "keyword-citation-keyword" method for constructing keyword citation networks [14]: when literature T1 cites literature T2, a "keyword Cartesian-product mapping" citation relationship exists between the keywords of the two papers. The specific principle is shown in Figure 2. Based on this theory, this study constructs a keyword citation network with citation frequency as the edge weight, resulting in a keyword citation matrix $EC = \{EC^t\} = (ec^t_{i,j})$, $i \neq j$.

Figure 2: Construction of the keyword citation network.
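    The sketch below illustrates this "keyword-citation-keyword" construction: for every citation pair, the Cartesian product of the two papers' keyword sets contributes weighted edges to the matrix EC. The input structures are assumptions for illustration.

```python
# Sketch of structure feature extraction: weighted keyword citation matrix EC
# built from paper-level citations via the keyword Cartesian product.
from itertools import product
from collections import Counter
import numpy as np

def citation_matrix(citations, paper_keywords, keyword_index):
    """citations: iterable of (citing_id, cited_id) pairs (assumed input);
    paper_keywords: dict paper_id -> list of keywords;
    keyword_index: dict keyword -> row/column position in EC."""
    weights = Counter()
    for citing, cited in citations:
        for kw_a, kw_b in product(paper_keywords.get(citing, []),
                                  paper_keywords.get(cited, [])):
            if kw_a in keyword_index and kw_b in keyword_index and kw_a != kw_b:
                weights[(keyword_index[kw_a], keyword_index[kw_b])] += 1   # citation frequency as weight
    n = len(keyword_index)
    ec = np.zeros((n, n))
    for (i, j), w in weights.items():
        ec[i, j] = w
    return ec                                               # EC^t for one time window
```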

    (3) Strength feature extraction
Keyword frequency reflects the strength of a keyword. In generating the keyword frequency matrix, this study uses the Numpy and Pandas packages to process the keyword frequencies $k_w^t$ in dataset K, constructing a "frequency-year" frequency matrix $EH = \{EH^t\} = (eh_w^t)$. While extracting strength features, this also generates the strength representation H of keywords.
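    A short sketch of this strength-feature step, assuming the frequency dataset K built earlier: a Pandas pivot yields the keyword-by-year matrix EH.

```python
# Sketch of strength feature extraction: "frequency-year" matrix EH from dataset K.
import pandas as pd

def frequency_matrix(K: pd.DataFrame) -> pd.DataFrame:
    """K is assumed to have columns ['keywords', 'PY', 'frequency']."""
    eh = K.pivot_table(index="keywords", columns="PY",
                       values="frequency", aggfunc="sum", fill_value=0)
    return eh.sort_index()    # eh.loc[kw, year] is keyword kw's strength in that year
```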
2.3. Model construction and prediction

The Graph Attention Network (GAT), based on the attention mechanism, can effectively capture complex semantic dependencies between keywords, while the Graph Convolutional Network (GCN) can efficiently process the structural information of the graph itself by aggregating the features of neighboring nodes. Therefore, this study uses the GAT and GCN graph neural network models to capture the relationships between nodes in graph-structured data from the content and structure perspectives, respectively, and employs a Multilayer Perceptron (MLP) regression model to integrate the multi-dimensional features of keywords for strength prediction.
    This study constructs an ablation experiment group, as shown in Table 1, for predicting the hotness of topic evolution; details of the model settings are shown in Figure 1. It is worth noting that starting from an initial identity matrix allows the neural network to gradually adjust and optimize the feature representations during learning. Therefore, after obtaining the content matrix ES and the structure matrix EC, GAT and GCN are used to perform convolution operations over these two matrices on a predefined 50-dimensional identity matrix. After obtaining the content representation S and structure representation C of keywords, the three representations are concatenated directly for integration, and prediction is made with an MLP [15].

Table 1
Deep learning model group setting
    Tasks      Model      Composition
    Group1     Model1     MLP
    Group2     Model2     GAT + MLP
    Group3     Model3     GCN + MLP
    Group4     Model4     GAT + GCN + MLP

    After obtaining the prediction results, the study uses two metrics, Mean Squared Error (MSE) and Mean Absolute Error (MAE), to measure the predictive capability of the model [16,17]. The specific formulas are as follows.

    $MAE = \dfrac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n}$                          (2-4)

    $MSE = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}$                        (2-5)
    Above, $y_i$ represents the i-th observed value, $\hat{y}_i$ the corresponding predicted value, and $n$ the number of elements.
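    To make the Model4 setup concrete, the following is a minimal sketch, assuming PyTorch and PyTorch Geometric; the layer sizes, the edge_index tensors derived from ES and EC, and the one-hot identity input follow the description above but are not the authors' released code.

```python
# Hedged sketch of Model4 (GAT + GCN + MLP) and the MAE/MSE evaluation
# (formulas 2-4 and 2-5), assuming PyTorch Geometric; shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv, GCNConv

class FusionModel(nn.Module):
    def __init__(self, n_keywords, hidden_dim=50):
        super().__init__()
        self.n_keywords = n_keywords
        self.gat = GATConv(n_keywords, hidden_dim)    # content view over the semantic graph (ES)
        self.gcn = GCNConv(n_keywords, hidden_dim)    # structure view over the citation graph (EC)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim + 1, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))

    def forward(self, es_edge_index, ec_edge_index, strength):
        x = torch.eye(self.n_keywords)                # predefined identity matrix as node input
        s = F.relu(self.gat(x, es_edge_index))        # content representation S
        c = F.relu(self.gcn(x, ec_edge_index))        # structure representation C
        fused = torch.cat([s, c, strength], dim=1)    # concatenate with strength H (shape n x 1)
        return self.mlp(fused).squeeze(-1)            # predicted keyword frequency

def evaluate(pred, target):
    mae = torch.mean(torch.abs(pred - target)).item()     # formula 2-4
    mse = torch.mean((pred - target) ** 2).item()         # formula 2-5
    return mae, mse
```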
3. Experiment

The detailed process of data acquisition can be found in Appendix B, Data resources.

3.1. Experiment preparation

When extracting features from the experimental data, this study works with four sets of yearly data: 2017-2019, 2018-2020, 2019-2021, and 2020-2022. The first three sets are used as the training groups and the last set as the test group. Specifically, the data from 2019 to 2022 serve as the basis for operations (all operations on yearly data follow this four-year standard). Taking 2019 as an example, the semantic content matrix ES, the citation structure matrix EC, and the frequency strength matrix EH are constructed for that year's keyword data.

3.2. Model training and prediction

The objective is to determine the optimal parameters for the various model groups in order to ensure accurate predictions. This study searches over four hyperparameters: learning rate (1e-2, 1e-3, 1e-4), number of training epochs (10, 50, 100, 200), hidden layer dimension (10, 30, 50), and early-stopping steps (5, 10, 20), using the training data from 2019 to 2021 to train the models of the four experimental groups and evaluating the final MAE and MSE results on the training set to determine the most suitable hyperparameters for each model. The optimal parameter settings for the different models are shown in Table 2.

Table 2
Optimal setting of model parameters
    Model      LearningRate    Epoch    EarlyStop    HiddenDim
    Model1     0.01            100      20           50
    Model2     0.001           200      20           50
    Model3     0.01            200      20           50
    Model4     0.01            100      20           50

    Using the model groups above, the test data of 2022 are employed to predict topic evolution hotness for 2023.
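    The hyperparameter search described above can be sketched as a simple grid search; `train_and_score` is a hypothetical helper standing in for training one experimental group on the 2019-2021 windows and returning its MAE and MSE.

```python
# Sketch of the hyperparameter search in Section 3.2; train_and_score() is a
# hypothetical helper, not part of the original code base.
from itertools import product

learning_rates = [1e-2, 1e-3, 1e-4]
epoch_options  = [10, 50, 100, 200]
hidden_dims    = [10, 30, 50]
early_stops    = [5, 10, 20]

best = None
for lr, ep, hd, es in product(learning_rates, epoch_options, hidden_dims, early_stops):
    mae, mse = train_and_score(lr=lr, epochs=ep, hidden_dim=hd, early_stop=es)
    if best is None or mae < best[0]:
        best = (mae, mse, {"lr": lr, "epochs": ep, "hidden_dim": hd, "early_stop": es})

print("best setting:", best[2], "MAE:", best[0], "MSE:", best[1])
```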
                                                                      Owing to the desire to directly validate whether
3.3. Results and discussion

The prediction results are evaluated using two indicators, MAE and MSE, with the evaluation results listed in Table 3. From the table, it can be observed that when only the strength representation H is used to predict topic hotness, the MAE and MSE between the predicted and actual values are 1.93484 and 13.41658, respectively. However, after integrating the content representation S or the structure representation C, both MAE and MSE decrease. The best results for predicting topic hotness are achieved by integrating all three types of representations, which yields the lowest MAE and MSE.
    The results indicate that predicting the evolution hotness of research topics by integrating multi-dimensional features such as content and structure through multiplex-graph representation learning is more accurate than traditional prediction methods.

Table 3
Evaluation of research topic evolution hotness prediction results in 2023
    Forecasting task       MAE          MSE
    H^t                    1.93484      13.41658
    H^t + C^t              1.91702      12.63112
    H^t + S^t              1.90118      13.26141
    H^t + S^t + C^t        1.88017      12.24382

4. Conclusion

This study proposes a novel approach based on multiplex-graph representation learning to predict the evolution of research topics. The contributions are as follows. First, in feature modeling, the GCN and GAT graph neural network models are used to perform convolution operations on content and structure features over identity matrices of a specified dimension, adaptively aligning data across different dimensions and time windows to ensure comparability. Second, this study integrates the semantic content features, citation structure features, and frequency strength features of keywords for research topic hotness prediction, showcasing the interaction between knowledge structures and cognitive structures from a multidimensional perspective and offering deeper insight into predicting research topic evolution hotness. Third, after integrating content and structure features, a domain case analysis is conducted, and the results indicate that combining these two types of features indeed makes the prediction of research topic evolution hotness more accurate.
    Owing to the desire to directly validate whether integrating multiple representations of topic evolution enhances the accuracy of topic evolution analysis, this paper chooses to predict the future frequency of topic keywords, which has certain limitations.




Subsequent tasks such as research topic trend discrimination, research direction, and policy planning can be developed based on the effective analysis results of this study.

Acknowledgements

This work was funded by the National Natural Science Fund of China (No. 71874129), the Open-end Fund of the Information Engineering Lab of ISTIC, and the Independent Innovation Foundation of Wuhan University of Technology (No. 233103002).

References

[1]  Zhu, Hengmin, et al. "Evolution analysis of online topics based on 'word-topic' coupling network." Scientometrics 127.7 (2022): 3767-3792.
[2]  Hu, Kai, et al. "Understanding the topic evolution of scientific literatures like an evolving city: Using Google Word2Vec model and spatial autocorrelation analysis." Information Processing & Management 56.4 (2019): 1185-1203.
[3]  Huo, Chaoguang, Shutian Ma, and Xiaozhong Liu. "Hotness prediction of scientific topics based on a bibliographic knowledge graph." Information Processing & Management 59.4 (2022): 102980.
[4]  Z. Liu, X. Wang, and R. Bai. "Research on Visualization Analysis Method of Discipline Topics Evolution from the Perspective of Multi Dimensions: A Case Study of the Big Data in the Field of Library and Information Science in China." Journal of Library Science in China 42.6 (2016): 67-84. (in Chinese)
[5]  K. Cui. The Research and Implementation of Topic Evolution Based on LDA. Diss. National University of Defense Technology, 2010. (in Chinese)
[6]  Zhou, Yuan, et al. "A deep learning framework to early identify emerging technologies in large-scale outlier patents: An empirical study of CNC machine tool." Scientometrics 126 (2021): 969-994.
[7]  Şenel, Lütfi Kerem, et al. "Learning interpretable word embeddings via bidirectional alignment of dimensions with semantic concepts." Information Processing & Management 59.3 (2022): 102925.
[8]  Shi, Bin, et al. "RelaGraph: Improving embedding on small-scale sparse knowledge graphs by neighborhood relations." Information Processing & Management 60.5 (2023): 103447.
[9]  Raamkumar, Aravind Sesagiri, Schubert Foo, and Natalie Pang. "Using author-specified keywords in building an initial reading list of research papers in scientific paper retrieval and recommender systems." Information Processing & Management 53.3 (2017): 577-594.
[10] Yoon, Young Seog, et al. "Exploring the dynamic knowledge structure of studies on the Internet of things: Keyword analysis." ETRI Journal 40.6 (2018): 745-758.
[11] Ohniwa, Ryosuke L., and Aiko Hibino. "Generating process of emerging topics in the life sciences." Scientometrics 121.3 (2019): 1549-1561.
[12] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
[13] Wang, Yuxuan, et al. "From static to dynamic word representations: a survey." International Journal of Machine Learning and Cybernetics 11 (2020): 1611-1630.
[14] Q. Chen, J. Wang, and W. Lu. "Discovering Domain Vocabularies Based on Citation Co-word Network." Data Analysis and Knowledge Discovery 3.6 (2019): 57-65. (in Chinese)
[15] Liu, Weijia, et al. "Category-universal witness discovery with attention mechanism in social network." Information Processing & Management 59.4 (2022): 102947.
[16] Yan, Yuwei, et al. "Data mining of customer choice behavior in internet of things within relationship network." International Journal of Information Management 50 (2020): 566-574.
[17] Gandhudi, Manoranjan, et al. "Causal aware parameterized quantum stochastic gradient descent for analyzing marketing advertisements and sales forecasting." Information Processing & Management 60.5 (2023): 103473.

A. Online Resources

The resources of this article can be downloaded at https://github.com/Hipkevin/EEKE-hotness.

B. Data resources

This study uses "Information Science" as a case-study topic, selecting the SCI and SSCI core collections in WOS. Literature searches are conducted in the "Information Science & Library Science" field with the query "Document Types: Article or Review Article; Languages: English," ultimately selecting literature from 2010 to 2023, totaling 58,119 articles, as the experimental data.



