=Paper=
{{Paper
|id=Vol-3745/paper17
|storemode=property
|title=Identification of Core Technological Topics in the New Energy Vehicle Industry: The SAO-BERTopic Topic Modeling Approach Based on Patent Text Mining
|pdfUrl=https://ceur-ws.org/Vol-3745/paper17.pdf
|volume=Vol-3745
|authors=Jianxin Zhu,Yutong Chuang,Zhinan Wang,Yunke Li
|dblpUrl=https://dblp.org/rec/conf/eeke/ZhuCWL24
}}
==Identification of Core Technological Topics in the New Energy Vehicle Industry: The SAO-BERTopic Topic Modeling Approach Based on Patent Text Mining==
Jianxin Zhu<sup>1,2</sup>, Yutong Chuang<sup>1</sup>, Zhinan Wang<sup>1,2,∗</sup>, Yunke Li<sup>1</sup>

<sup>1</sup> Harbin Engineering University, School of Economics and Management, 150001, China

<sup>2</sup> Key Laboratory of Big Data and Business Intelligence Technology, Ministry of Industry and Information Technology

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23~24, 2024, Changchun, China and Online

zhjx@vip.163.com (Jianxin Zhu); chuangyutong@hrbeu.edu.cn (Yutong Chuang); wzn6768@163.com (Zhinan Wang); liyunke1209@163.com (Yunke Li)

© Copyright 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract

In the new energy vehicle industry, precise identification of core technologies is key to promoting innovation and maintaining market competitiveness. This article proposes an approach that combines the information weight method with the SAO-BERTopic topic model to extract and analyze core technologies from large-scale patent data. Working from the Incopat patent database, we use the information weight method to select high-quality core patents along four dimensions: technological, strategic, legal and market value. The selected patents form the research data, to which the SAO-BERTopic model is then applied. The model combines the semantic understanding capabilities of BERTopic with the fine-grained structure of SAO (Subject-Action-Object) analysis, improving the efficiency and accuracy of identifying complex technical topics. The approach is applicable not only to analyzing technological development in the new energy vehicle industry; it can also serve as a reference for other high-tech industries such as biomedicine, renewable energy and information technology, which likewise need to identify core technologies from a large number of patents. SAO-BERTopic's structured analysis framework can help these industries gain insight into technology development trends, identify innovation opportunities, and provide data-driven decision support for enterprises, research institutions and governments, thus supporting technology planning and market commercialization.

Keywords

technical identification, patent analysis, new energy vehicle, SAO-BERTopic

In a globalized economic environment, the rapid development of the new energy automobile industry has become an important symbol of technological innovation and industrial transformation [1]. China has made remarkable progress in this field [2,3]. However, in the face of the strategic layout and potential containment by traditional automobile powers such as the United States, Europe, Japan and South Korea in terms of technology and industrial chain, the sustainable growth and competitiveness of China's new energy automobile industry face new challenges [4,5,6,7].

In research on technological progress and innovation, patents and scientific papers are indispensable resources. However, because of the large scale of this literature, traditional manual analysis methods struggle to cope, and expert analysis is subjective. Therefore, given the large amount of technical information contained in patents [8], many scholars have turned to data mining methods in recent years [9]. Among these, the topic model is an approach that can automatically extract key topics from documents, revealing technology trends and areas of innovation. Latent Dirichlet Allocation (LDA) is a word frequency-based model that may struggle to capture semantic complexity [10,11,12,13]. Latent Semantic Analysis (LSA) has limitations in handling word sense diversity and polysemy, which can lead to information loss when analyzing specialized technical documents [14,15]. The high computational complexity of the Correlated Topic Model (CTM) limits its rapid application to large document collections [11,16].

In view of the limitations of existing topic models in identifying core technologies in the new energy vehicle industry, we propose a new topic model, SAO-BERTopic. The model combines SAO analysis with BERTopic, an advanced natural language processing technique, to improve the accuracy and efficiency of technology recognition.

The main innovations of this article are:

1. By combining SAO analysis and the BERTopic topic model, we construct the SAO-BERTopic method, which addresses the problem of identifying deep-level technology trends and key innovation points in the new energy vehicle industry.

2. By combining the information weight method with the SAO-BERTopic model, we select, for the first time in patent analysis of the new energy vehicle industry, highly innovative and market-valued core patents from large-scale data, and improve the efficiency and accuracy of technology identification.

We adopt a hybrid approach that uses a phased process to identify the core technological topics within the new energy vehicle industry. The process is divided into six steps (as shown in Figure 1), designed to ensure the efficient and systematic extraction, analysis, and determination of core technologies from a broad range of patent data. After patents related to core technologies are selected from a large pool of patents using the information weight method, the SAO-BERTopic model is applied for an in-depth analysis of the selected patents. The model identifies the SAO triplets of the core technologies and defines these triplets as specific topics.

Step One (Patent Selection): Patents are downloaded from the Incopat patent database, and the information weight method is used to establish indicators in four dimensions: patent technical, strategic, legal and market value.

Step Two (Textual Vector Representation): The text is preprocessed and then vectorized with the BERT (Bidirectional Encoder Representations from Transformers) model.

Step Three (Dimensionality Reduction and Clustering): The UMAP (Uniform Manifold Approximation and Projection) algorithm reduces the dimensionality of the text vectors, and the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm then clusters them into topics.

Step Four (SAO Triplet Extraction): Using the C-TF-IDF (class-based term frequency-inverse document frequency) algorithm, SAO triples are extracted from the clusters, and the topic representation is refined using the MMR (Maximal Marginal Relevance) algorithm to balance relevance and diversity.

Step Five (Result Validation): The Calinski-Harabasz index and the Davies-Bouldin index are used to compare the model with the LDA, BERTopic and SAO-LDA topic models and to evaluate clustering effectiveness. In addition, dimensionality reduction visualization via UMAP provides an intuitive comparison of cluster distributions.

Step Six (Strategic Recommendations): Based on the results, recommendations are made from the corporate, industry and governance perspectives.

Figure 1: Overall research process. Step 1, patent screening (information weight method over patent technical, strategic, legal and market value, applied to the mass of patents in the new energy automobile industry to obtain core technology patents). Step 2, vector representation of text (preprocessing of core technology patent text by tokenization, POS tagging, stop-word filtering and lemmatization; text vectorization with BERT). Step 3, cluster representation of documents (dimensionality reduction of text vectors using UMAP; generating topic clusters using HDBSCAN). Step 4, topic results (extracting SAO triples from topic clusters using C-TF-IDF; adjusting topic representation using MMR). Step 5, topic verification (comparison of clustering results for LDA, BERTopic, SAO-LDA and SAO-BERTopic by index comparison with the Calinski-Harabasz and Davies-Bouldin indices and by visual comparison with dimensionality reduction). Step 6, technology identification suggestions to enterprise, industry and government.

This article uses the Incopat patent database to collect data pertinent to the new energy vehicle industry; the identified topics are finally refined into five core topics.

Topic 0 is "Energy Management and Power Transmission Technology in Hybrid Electric Vehicles".

Topic 1 is "Electric Vehicle Charging Systems and Energy Management", where the SAO triples focus on describing the functions of receiving, managing, and providing energy in charging panels and systems.

Topic 2 is "Battery System Integration and Energy Efficiency Management", where the SAO triples collectively depict various aspects of energy storage system design, management, and application, including modular composition, energy control and management, system power supply, and integration with electric vehicles.

Topic 3 is "Electric Vehicle Battery Pack Configuration and Structural Design", covering the physical configuration and structural design of battery packs and how these designs affect battery pack performance.

Topic 4 is "Cathode Materials and Composition for Secondary Batteries", focusing on the composition and design of cathode materials for batteries, especially secondary batteries, which directly affect battery performance aspects such as capacity, energy density, charge/discharge speed, and lifespan.

This research applied the information weight method to screen high-quality core patents from a large-scale patent dataset, and employed the SAO-BERTopic model to identify core technological topics from the filtered data. Our empirical analysis results highlight the superior clustering performance of SAO-BERTopic in identifying technological topics in the new energy vehicle domain compared to traditional models. Through its efficient processing of technical terms and complex concepts and its precise screening of patent data, the approach demonstrates the ability to identify core technologies and drive innovation across a wide range of fields. This research therefore not only contributes directly to the technological progress of the new energy vehicle industry, but also provides a new perspective and tool for technology identification and innovation management in other fields, helping to promote their technological development and innovation strategy formulation.

[1] Y. Li, The impact of economic systems and financial systems on new energy vehicle industry, Adv. Econ. Manag. Polit. Sci. 33 (2023) 162-170. doi:10.54254/2754-1169/33/20231622.
[2] Z. Liu, H. Hao, X. Cheng, F. Zhao, Critical issues of energy efficient and new energy vehicles development in china, Energy Policy 115 (2018) 92-97. doi:10.1016/J.ENPOL.2018.01.006.
[3] P. Yu, J. Zhang, D. Yang, X. Lin, T. Xu, The evolution of China's new energy vehicle industry from the perspective of a technology-market-policy framework, Sustainability 11 (2019) 1711. doi:10.3390/SU11061711.
[4] M. Kendall, Fuel cell development for new energy vehicles (NEVs) and clean air in china, Prog. Nat. Sci. Mater. Int. 28 (2018) 113-120. doi:10.1016/J.PNSC.2018.03.001.
[5] T. Yang, C. Xing, X. Li, Evaluation and analysis of new-energy vehicle industry policies in the context of technical innovation in china, J. Clean. Prod. 281 (2021) 125126. doi:10.1016/J.JCLEPRO.2020.125126.
[6] X. L. Xu, H. H. Chen, Exploring the innovation efficiency of new energy vehicle enterprises in china, Clean Technol. Environ. Policy 22 (2020) 1671-1685. doi:10.1007/S10098-020-01908-W.
[7] S. Chen, Y. Feng, C. Lin, Z. Liao, X. Mei, Research on the technology innovation efficiency of China's listed new energy vehicle enterprises, Math. Probl. Eng. 2021 (2021) 6613602. doi:10.1155/2021/6613602.
[8] C. Lee, A review of data analytics in technological forecasting, Technol. Forecast. Soc. Change 166 (2021) 120646. doi:10.1016/J.TECHFORE.2021.120646.
[9] N. Su, Z. Tan, Review and vision for the future of the research on technology opportunity analysis methods, Inf. Stud. Theory Appl. 43 (2020) 179-186.
[10] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993-1022.
[11] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proc. Natl. Acad. Sci. U. S. A. 101 (2004) 5228-5235. doi:10.1073/PNAS.0307752101.
[12] J. Lafferty, D. Blei, Topic models, in: A. N. Srivastava and M. Sahami (Eds.), Text mining, Chapman and Hall/CRC, New York, NY, 2009, pp. 71-93. doi:10.1201/9781420059458.CH4.
[13] M. Hoffman, F. Bach, D. Blei, Online learning for latent dirichlet allocation, in: 24th Annual Conference on Neural Information Processing Systems 2010, NeurIPS, New York, NY, 2010, pp. 856-864.
[14] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990) 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
[15] T. K. Landauer, P. W. Foltz, D. Laham, An introduction to latent semantic analysis, Discourse Process 25 (1998) 259-284. doi:10.1080/01638539809545028.
[16] J. Lafferty, D. Blei, Correlated topic models, in: Proceedings of the 18th International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, 2005, pp. 147-154. doi:10.5555/2976248.2976267.
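As an illustration of the C-TF-IDF weighting named in Step Four, the following is a minimal sketch, not the authors' implementation. It assumes that each SAO triple is treated as a single "term" and that a triple's weight in a cluster is its within-cluster frequency multiplied by log(1 + A / f), where A is the average number of terms per cluster and f is the triple's frequency over all clusters (the class-based TF-IDF form described for BERTopic). The function names `c_tf_idf` and `top_terms`, the triples, and the cluster contents are all hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score each term per cluster with class-based TF-IDF.

    `clusters` maps a cluster id to the list of terms found in it.
    Assumption: a "term" is a (subject, action, object) tuple.
    """
    if not clusters:
        return {}
    # Term frequency inside each cluster.
    tf = {cid: Counter(terms) for cid, terms in clusters.items()}
    # Frequency of each term summed over all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of terms per cluster (A in the lead-in).
    avg_terms = sum(total.values()) / len(clusters)

    scores = {}
    for cid, counts in tf.items():
        size = sum(counts.values())
        scores[cid] = {
            term: (freq / size) * math.log(1 + avg_terms / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_terms(scores, cluster_id, k=2):
    """Return the k highest-scoring terms of one cluster."""
    ranked = sorted(scores[cluster_id].items(),
                    key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Hypothetical SAO triples, invented for illustration only.
CHARGE = ("charging panel", "receive", "energy")
MANAGE = ("system", "manage", "energy")
CATHODE = ("cathode material", "improve", "capacity")

toy_clusters = {
    1: [CHARGE, CHARGE, MANAGE],    # stands in for a charging topic
    4: [CATHODE, CATHODE, MANAGE],  # stands in for a cathode topic
}
toy_scores = c_tf_idf(toy_clusters)
print(top_terms(toy_scores, 1, k=1))
print(top_terms(toy_scores, 4, k=1))
```

In this toy input the triple shared by both clusters has a lower within-cluster frequency than each cluster's dominant triple, so it ranks below them. A refinement step such as MMR would then re-rank the top candidates to reduce redundancy; that step is not shown here.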