=Paper=
{{Paper
|id=Vol-3745/paper4
|storemode=property
|title=A Research Topic Evolution Prediction Approach Based on Multiplex-graph Representation Learning
|pdfUrl=https://ceur-ws.org/Vol-3745/paper4.pdf
|volume=Vol-3745
|authors=Yang Zheng,Kaiwen Shi,Yuhang Dong,Xiaoguang Wang,Hongyu Wang
|dblpUrl=https://dblp.org/rec/conf/eeke/ZhengSDWW24
}}
==A Research Topic Evolution Prediction Approach Based on Multiplex-graph Representation Learning==
A research topic evolution prediction approach based on multiplex-graph representation learning ⋆ Yang Zheng1, Kaiwen Shi2, Yuhang Dong1, Xiaoguang Wang3 and Hongyu Wang1,∗ 1 School of Management, Wuhan University of Technology, Wuhan, China 2 School of Information Engineering, Zhongnan University of Economics and Law, Wuhan, China 3 School of Information Management, Wuhan University, Wuhan, China Abstract The intensification of international technological innovation competition and the evolution of scientific research paradigms have led to a continuous expansion of scientific literature, making information analysis increasingly complex and diversified. To address the challenges of accurately assessing topic evolution within the context of vast literature big data, traditional methods of expert evaluation or visualization analysis based on scientific knowledge networks are inadequate. From the perspectives of artificial intelligence and big data, this paper proposes a universal method for automated and intelligent discrimination and prediction of research topic evolution hotness. This method involves integrating content and structural features of keywords to track the evolution of keyword frequency strength over time in research topic networks characterized by keywords. This study conducts a case analysis in the field of information science. The results demonstrate that the prediction of keyword strength is improved after integrating content and structural features, which has significant reference value for tasks such as future research topic evolution trend discrimination, research direction, and policy planning. Keywords topic evolution, keyword citation network, text mining, graph representation learning 1 1. Introduction keywords representing knowledge innovation within a vast array of scientific literature, track the evolution With the intensification of international technological of research topics, and represent them on multiple innovation competition and the evolution of the knowledge networks that contain knowledge units fourth paradigm of scientific research driven by big and their complex interactions, so as to judge the data development, the growing volume of scientific future evolutionary trends of research topics, is a key literature, shifting scholarly interests, and the direction for science and technology information emergence of new research topics pose significant construction and services, as well as a research focus challenges to traditional methods of research topic in the fields of informetrics and scientometrics. analysis[1,2,3]. How to comprehensively and finely Current analyses of the evolution of scientific reveal the research topics and their characteristic research topics largely unfold across three Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23~24, 2024, Changchun, China and Online ∗ Corresponding author. zhengyang2002@whut.edu.cn (Y. Zheng); shikaiwen@stu.zuel.edu.cn (K. Shi); dongyuhang@whut.edu.cn (Y. Dong); wxguang@whu.edu.cn (X. Wang); hongyuwang@whut.edu.cn (H. Wang) 0000-0001-5635-1131 (Y. Zheng); 0000-0002-3563-982X (K. Shi); 0009-0005-4618-5906 (Y. Dong); 0000-0003-1284-7164 (X. Wang); 0000-0002-5063-9166 (H. Wang) © Copyright 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 40 dimensions: content, structure, and strength[4,5]. features of research topic evolution Utilizing the powerful representation and feature representations on knowledge networks? learning capabilities of deep representation learning After tracking these multi-dimensional algorithms such as word embedding and graph evolutionary features, will the assessment of embedding[6], it is possible to model the complex topic evolution trends become more nonlinear relationships between entities represented accurate? by keywords, the smallest units of knowledge[7,8]. Tracking changes in topic strength, as indicated by 2. Methodology keyword frequency over time, can reflect the evolving trends of topics[9,10,11]. This study aims to predict the strength of keyword Therefore, this paper utilizes a multiplex-graph frequency, using changes in keyword frequency representation learning method combined with strength to reflect variation in topic evolution hotness. interactions in topic keyword content and structure to The research process is divided into 3 steps: Step assess changes in topic strength, achieving prediction 1 involves retrieving and cleaning the source data to of topic evolution hotness. And select the field of obtain all the data needed for subsequent experiment. "Information Science" for case analysis, aiming to Step 2 involves obtaining content, structure, and address the following two scientific questions: strength representations of keywords. Step 3 utilizes deep learning models to integrate multi-dimensional 1. How to reveal the evolution process of topics data representations and conduct prediction of topic at the micro level, thereby tracking the evolution hotness with the integrated experimental evolutionary trends of research topics? data. Figure 1 shows the detailed research process. 2. How to effectively and comprehensively model and integrate the multi-dimensional Step1:Data Preparation Step2:Multidimensional data representation Dataset of titles, abstracts, and keywords Data extraction Dataset of keyword citation relationships - Dataset : WOS Data cleaning Dataset of keyword frequencies Year : 2010 .. 2022 2023 - Language : English GloVe, Numpy, Pandas and other models and toolkits - Collection: SCI & SSCI Year: 2010 .. 2022 2023 - Query : ‘Information Science’ Hump naming Year: 2010 .. 2022 2023 Year: 2010 .. 2022 2023 Distance Null-data validation calculation - Year : 2010 – 2023 - Amount: 58,119 Datasets for data process - Fields : TI, AB, DE, CR, 2010-2023 PY, DOI Representation of content Representation of reference structure Representation of strength Step3: Evolution prediction Fusion of representations Hotness + Word Frequency MAE MSE + Ai 16 BigData 84 + + Citation Representation fusion Gcn 26 GCN Evaluation and testing experimental group Cat 9 Forecasting task MAE MSE Represent ... ... MLP Semantic data 1.93 13.41 Bert 33 1.91 12.63 forecast GAT Gpt 124 1.90 13.26 Word frequency 1.88 12.24 Example of result MAE Back-Propagation Figure 1: Research process. 2.2. Multi-dimensional feature extraction 2.1. Data preparation To prepare for multi-dimensional feature integration, Field-specific literature is selected from databases this study will perform feature extraction on keyword such as Web of Science and Scopus, with titles, data across three representational dimensions: keywords, abstracts, and references extracted as basic content, structure, and strength of keywords. data. After cleaning and filtering the data, an original (1)Content feature extraction dataset 𝑈 is constructed, which specifically includes In the process of extracting keyword content features, the keyword citation relationship dataset 𝐶 , the this study opts to use the 𝐺𝑙𝑜𝑉𝑒 static word keyword frequency dataset 𝐾 , and the integrated embedding method to capture the semantic dataset 𝑁 containing titles, abstracts, and keywords. relationships of keywords in their global context[12], facilitating the embedding of keywords, as it offers greater stability and requires less computational resources[13]. The principle is as shown in 2-1 and 2- 41 2, where 𝑋𝑖𝑘 is the number of times word 𝑘 appears in matrix 𝐸𝐻 = 𝐸𝐻𝑡 = 𝑒ℎ𝑤 𝑡 . While extracting strength the context of word 𝑖, 𝑋𝑖 is the total number of words features, this also generates the strength appearing in the context of word 𝑖 , and 𝑃𝑖𝑗 is the representation 𝐻 of keywords. probability of word 𝑗 appeare in the context of word 𝑖. 2.3. Model construction and prediction 𝑋𝑖 = ∑ 𝑋𝑖𝑘 (2-1) Graph Attention Network(GAT)based on the attention 𝑘 𝑋𝑖𝑗 (2-2) mechanism, can effectively capture complex semantic 𝑃𝑖𝑗 = 𝑃 𝑗 | 𝑖 = dependencies between keywords, while Graph 𝑋𝑖 The generated word vector is then used to Convolutional Network(GCN) can efficiently process calculate the cosine similarity between words using the strcture information of the graph structure itself formula 2-3. This process results in obtaining the by aggregating the features of neighboring nodes. 𝑡 semantic distance matrix 𝐸𝑆 = 𝐸𝑆𝑡 = (𝑒𝑠𝑖,𝑗 ), 𝑖 = 𝑗, for Therefore, this study chooses to use GAT and GCN keyword content feature extraction. graph neural network models to capture the relationships between nodes in graph-structured data 𝐴 = [𝑎1 , 𝑎2 , , 𝑎𝑛 ], B = [𝑏1 , 𝑏2 , , 𝑏𝑛 ] from content and structure perspectives, respectively, ∑𝑛𝑖=1 𝐴𝑖 × 𝐵𝑖 (2-3) and employs an Multilayer Perceptron(MLP) 𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒 = regression model to integrate multi-dimensional √∑𝑛𝑖=1 𝐴𝑖 2 × √∑𝑛𝑖=1 𝐵𝑖 2 features of keywords for strength prediction. (2)Structure feature extraction This study constructs an ablation experiment Some scholars have proposed using a "keyword- group, as shown in Table 1, for predicting the hotness citation-keyword" method to construct keyword of topic evolution. And details of the model settings citation networks[14], meaning when literature 𝑇1 are shown in Figure 1. It's worth noting that set an cites literature 𝑇2 , there exists a "Keyword-Cartesian initial identity matrix allows the neural network to product mapping" citation relationship between the gradually adjust and optimize feature representations keywords of the two literatures. The specific principle during the learning process. Therefore, after is shown in Figure 2. Based on this theory, this study obtaining the content matrix 𝐸𝑆 and the structure constructs a keyword citation network with citation matrix 𝐸𝐶 , GAT and GCN are used to perform frequency as edge weight, resulting in a keyword 𝑡 convolution operations on these two matrices on a citation matrix 𝐸𝐶 = 𝐸𝐶𝑡 = (𝑒𝑐𝑖,𝑗 ), 𝑖 = 𝑗. predefined 50-dimensional identity matrix. After obtaining content representation 𝑆 and structure representation 𝐶 of keywords, the three representation data are concatenated directly for integration, and prediction is made based on MLP[15]. Table 1 Deep Learning model group setting Tasks Model Composition 𝐺𝑟𝑜𝑢𝑝1 𝑀𝑜𝑑𝑒𝑙1 𝑀𝐿𝑃 𝐺𝑟𝑜𝑢𝑝2 𝑀𝑜𝑑𝑒𝑙2 𝐺𝐴𝑇 𝑀𝐿𝑃 𝐺𝑟𝑜𝑢𝑝3 𝑀𝑜𝑑𝑒𝑙3 𝐺𝐶𝑁 𝑀𝐿𝑃 𝐺𝑟𝑜𝑢𝑝4 𝑀𝑜𝑑𝑒𝑙4 𝐺𝐴𝑇 𝐺𝐶𝑁 𝑀𝐿𝑃 After obtaining the prediction results, the study chooses to use two metrics, Mean Square Error (MSE) and Mean Absolute Error (MAE), to measure the predictive capability of the model[16,17]. The specific Figure2: Construction of keyword citation network formulas are as follows. (3)Strength feature extraction The size of keyword frequency reflects the strength of ∑𝑛𝑖=1|𝑦𝑖 − 𝑦̂𝑖 | 𝑀𝐴𝐸 = (2-4) the keyword. In generating the keyword frequency 𝑛 matrix, this study chooses to use 𝑁𝑢𝑚𝑝𝑦 and 𝑃𝑎𝑛𝑑𝑎𝑠 ∑𝑛𝑖=1 𝑦𝑖 − 𝑦̂𝑖 2 (2-5) packages to process the keyword frequencies 𝑘𝑤 𝑡 in 𝑀𝑆𝐸 = dataset 𝐾, constructing a "frequency-year" frequency 𝑛 42 Above, 𝑦𝑖 represents the ith element of 𝑦, and 𝑛 is predicting topic hotness, the MAE and MSE between the number of elements. the predicted and actual values are 1.93484 and 13.41658, respectively. However, after integrating the 3. Experiment content representation 𝑆 or structure representation 𝐶, the values of MAE and MSE both decrease. The best The detailed process of data acquisition can be found result for predicting topic hotness are achieved by in the appendix under 𝐵. 𝐷𝑎𝑡𝑎 𝑅𝑒𝑠𝑜𝑢𝑟𝑐𝑒𝑠. integrating all three types of representations, resulting in the lowest values of MAE and MSE. 3.1. Experiment Preparation The results indicate that predicting the evolution When extracting features from experimental data, this hotness of research topics by integrating multi- study choose to work with four sets of yearly data: dimensional features such as content and structure 2017-2019, 2018-2020, 2019-2021, and 2020-2022. through multiplex-graph representation learning is The first three sets are used as the training groups, more accurate than traditional prediction methods. and the last set as the test group. Specifically, the data from 2019 to 2022 serve as the basis for operations Table 3 Evaluation of research topic evolution hotness (all operations on yearly data will follow this four- prediction results in 2023 year standard). Taking 2019 as an example, semantic content matrix 𝐸𝑆, citation structure matrix 𝐸𝐶, and Forecasting task MAE MSE frequency strength matrix 𝐸𝐻 are constructed for 𝐻𝑡 1.93484 13.41658 that year's keyword data. 𝐻𝑡 𝐶𝑡 1.91702 12.63112 𝐻𝑡 𝑆𝑡 1.90118 13.26141 3.2. Model training and prediction 𝐻𝑡 𝑆𝑡 𝐶𝑡 1.88017 12.24382 The objective is to determine the optimal parameters for various model groups in order to ensure accurate 4. Conclusion predictions. This study has set four hyperparameters: learning rate (1e-2, 1e-3, 1e-4), number of training This study proposes a novel approach based on epochs (10, 50, 100, 200), hidden layer dimensions multiplex-graph representation learning to predict (10, 30, 50), and stopping steps (5, 10, 20), using the the evolution of research topics. And the training data from 2019 to 2021 to train models contributions are follows: First, in feature modeling, across four experimental groups, and evaluating the GCN and GAT graph neural network models are used final MAE and MSE results on the training set to to perform convolution operations on content and determine the most suitable hyperparameters for structure features on unit matrices of specified each model. The optimal parameter settings for dimensions, adaptively aligning data across different different models are shown in Table 2. dimensions and time windows to ensure comparability. Second, this study integrates semantic Table 2 content features, citation structure features, and Optimal setting of model parameters frequency strength features of keywords for research Model LearningRate Epoch EarlyStop HiddenDim topic hotness prediction, showcasing the interaction 𝑀𝑜𝑑𝑒𝑙1 0.01 100 20 50 between knowledge structures and cognitive 𝑀𝑜𝑑𝑒𝑙2 0.001 200 20 50 structures from a multidimensional perspective, 𝑀𝑜𝑑𝑒𝑙3 0.01 200 20 50 offering a deeper insight into predicting research 𝑀𝑜𝑑𝑒𝑙4 0.01 100 20 50 topic evolution hotness. Third, after integrating content and structure features, a domain case analysis is conducted, and the result indicates that combining Utilizing the model groups above and employing the these two types of features indeed makes the test data of 2022 to conduct the prediction of topic prediction of research topic evolution hotness more evolution hotness for 2023. accurate. Owing to the desire to directly validate whether 3.3. Results and Discusstion integrating multiple representations of topic The prediction results are evaluated using two evolution enhances the accuracy of topic evolution indicators: MAE and MSE, with the evaluation results analysis, this paper choose to predict the future listed in Table 3. From the table, it can be observed frequency of topic keywords, which has certain that using only the strength representation 𝐻 for limitations. Subsequent tasks such as research topic 43 trend discrimination, research direction, and policy keywords in building an initial reading list of planning can be developed based on the effective research papers in scientific paper retrieval and analysis results of this study. recommender systems." Information Processing & Management 53.3 (2017): 577-594. Acknowledgements [10] Yoon, Young Seog, et al. "Exploring the dynamic knowledge structure of studies on the Internet This work was funded by the National Natural Science of things: Keyword analysis." ETRI Journal 40.6 Fund of China (No. 71874129), the Open-end Fund of (2018): 745-758. Information Engineering Lab of ISTIC and the [11] Ohniwa, Ryosuke L., and Aiko Hibino. Independent Innovation Foundation of Wuhan "Generating process of emerging topics in the University of Technology (No. 233103002). life sciences." Scientometrics 121.3 (2019): 1549-1561. References [12] Pennington, Jeffrey, Richard Socher, and [1] Zhu, Hengmin, et al. "Evolution analysis of online Christopher D. Manning. "Glove: Global vectors topics based on ‘word-topic’coupling network." for word representation." Proceedings of the Scientometrics 127.7 (2022): 3767-3792. 2014 conference on empirical methods in [2] Hu, Kai, et al. "Understanding the topic evolution natural language processing (EMNLP). 2014. of scientific literatures like an evolving city: [13] Wang, Yuxuan, et al. "From static to dynamic Using Google Word2Vec model and spatial word representations: a survey." International autocorrelation analysis." Information Journal of Machine Learning and Cybernetics 11 Processing & Management 56.4 (2019): 1185- (2020): 1611-1630. 1203. [14] Q. Chen, J. Wang, and W. Lu. "Discovering [3] Huo, Chaoguang, Shutian Ma, and Xiaozhong Liu. Domain Vocabularies Based on Citation Co-word "Hotness prediction of scientific topics based on Network" Data Analysis and Knowledge a bibliographic knowledge graph." Information Discovery 3.6 (2019): 57-65. (in Chinese) Processing & Management 59.4 (2022): 102980. [15] Liu, Weijia, et al. "Category-universal witness [4] Z. Liu, X. Wang, and R. Bai. "Research on discovery with attention mechanism in social Visualization Analysis Method of Discipline network." Information Processing & Topics Evolution from the Perspective of Multi Management 59.4 (2022): 102947. Dimensions:A Case Study of the Big Data in the [16] Yan, Yuwei, et al. "Data mining of customer Field of Library and Information Science in choice behavior in internet of things within China." Journal of Library Science in China 42.6 relationship network." International Journal of (2016): 67-84. (in Chinese) Information Management 50 (2020): 566-574. [5] K. Cui. The Research and lmplementation of [17] Gandhudi, Manoranjan, et al. "Causal aware Topic Evolution Based on LDA. Diss. National parameterized quantum stochastic gradient University of Defense Technology 2010. (in descent for analyzing marketing advertisements Chinese) and sales forecasting." Information Processing & [6] Zhou, Yuan, et al. "A deep learning framework to Management 60.5 (2023): 103473. early identify emerging technologies in large- scale outlier patents: An empirical study of CNC A. Online Resources machine tool." Scientometrics 126 (2021): 969- The resources of this article can be downloaded at 994. https://github.com/Hipkevin/EEKE-hotness. [7] Şenel, Lütfi Kerem, et al. "Learning interpretable word embeddings via bidirectional alignment of dimensions with semantic concepts." B. Data resources Information Processing & Management 59.3 This study uses "Information Science" as a case study (2022): 102925. topic, selecting the SCI and SSCI core databases in [8] Shi, Bin, et al. "RelaGraph: Improving embedding WOS. Conducting literature searches in the on small-scale sparse knowledge graphs by "Information Science & Library Science" field with the neighborhood relations." Information search query "Document Types: Article or Review Processing & Management 60.5 (2023): 103447. Article; Languages: English," ultimately selecting [9] Raamkumar, Aravind Sesagiri, Schubert Foo, literature from 2010 to 2023, totaling 58,119 articles, and Natalie Pang. "Using author-specified as experimental data. 44