=Paper= {{Paper |id=Vol-3745/paper17 |storemode=property |title=Identification of Core Technological Topics in the New Energy Vehicle Industry: The SAO-BERTopic Topic Modeling Approach Based on Patent Text Mining |pdfUrl=https://ceur-ws.org/Vol-3745/paper17.pdf |volume=Vol-3745 |authors=Jianxin Zhu,Yutong Chuang,Zhinan Wang,Yunke Li |dblpUrl=https://dblp.org/rec/conf/eeke/ZhuCWL24 }} ==Identification of Core Technological Topics in the New Energy Vehicle Industry: The SAO-BERTopic Topic Modeling Approach Based on Patent Text Mining== https://ceur-ws.org/Vol-3745/paper17.pdf
                                Jianxin Zhu1,2, Yutong Chuang1, Zhinan Wang1,2, ∗, Yunke Li1
                                1 Harbin Engineering University, School of Economics and Management 150001, China
                                2 Key Laboratory of Big Data and Business Intelligence Technology, Ministry of Industry and Information
                                Technology

Abstract
In the new energy vehicle industry, precise identification of core technologies is key to
promoting innovation and maintaining market competitiveness. This article proposes an
approach that combines the information weight method with the SAO-BERTopic topic model to
extract and analyze core technologies from large-scale patent data. Using the Incopat patent
database, we apply the information weight method to select high-quality core patents along
four dimensions: technological, strategic, legal and market value. The selected patents form
the research data, to which the SAO-BERTopic model is then applied. The model combines the
semantic understanding capabilities of BERTopic with the fine-grained structure of SAO
analysis, improving the efficiency and accuracy of identifying complex technical topics. The
approach is applicable not only to analyzing technological development in the new energy
vehicle industry; it can also serve as a reference for other high-tech industries such as
biomedicine, renewable energy and information technology, which likewise need to identify
core technologies from large numbers of patents. The structured analysis framework of
SAO-BERTopic can help these industries gain insight into technology development trends,
identify innovation opportunities, and provide data-driven decision support for enterprises,
research institutions and governments, thus playing an important role in technology planning
and market commercialization.

Keywords
technical identification, patent analysis, new energy vehicle, SAO-BERTopic


In a globalized economic environment, the rapid development of the new energy automobile
industry has become an important symbol of technological innovation and industrial
transformation [1]. China has made remarkable progress in this field [2,3]. However, in the
face of the strategic layout and potential containment of traditional automobile powers such
as the United States, Europe, Japan and South Korea in terms of technology and industrial
chain, the sustainable growth and competitiveness of China's new energy automobile industry
face new challenges [4,5,6,7].
                                Joint Workshop of the 5th Extraction and Evaluation of Knowledge
                                Entities from Scientific Documents and the 4th AI + Informetrics
                                (EEKE-AII2024), April 23~24, 2024, Changchun, China and Online
                                   zhjx@vip.163.com (Jianxin Zhu); chuangyutong@hrbeu.edu.cn
                                (Yutong Chuang); wzn6768@163.com (Zhinan Wang);
                                liyunke1209@163.com (Yunke Li)
                                               © Copyright 2024 for this paper by its authors. Use permitted under
                                               Creative Commons License Attribution 4.0 International (CC BY 4.0).




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
In research on technological progress and innovation, patents and scientific papers are
indispensable resources. However, the sheer volume of this literature is difficult to handle
with traditional manual analysis, and expert analysis is subjective. Therefore, given the
large amount of technical information contained in patents [8], many scholars have turned to
data mining methods in recent years [9]. Among these, topic models automatically extract key
topics from documents, revealing technology trends and areas of innovation. Latent Dirichlet
Allocation (LDA) is a word-frequency-based model that may struggle to capture semantic
complexity [10,11,12,13]. Latent Semantic Analysis (LSA) has limitations in handling word
sense diversity and polysemy, which can lead to information loss when analyzing specialized
technical documents [14,15]. The high computational complexity of the Correlated Topic Model
(CTM) limits its rapid application to large document collections [11,16].

In view of the limitations of existing topic models in identifying core technologies in the
new energy vehicle industry, we propose a new topic model, SAO-BERTopic. The model combines
SAO (subject-action-object) analysis with BERTopic, an advanced natural language processing
technique, to improve the accuracy and efficiency of technology identification.

The main innovations of this article are:
1. By combining SAO analysis with the BERTopic topic model, we construct the SAO-BERTopic
method, which effectively addresses the problem of identifying deep-level technology trends
and key innovation points in the new energy vehicle industry.
2. By combining the information weight method with the SAO-BERTopic model, we select highly
innovative, high-market-value core patents from large-scale data in the patent analysis of
the new energy vehicle industry for the first time, and improve the efficiency and accuracy
of technology identification.

We adopt a hybrid approach, using a phased process to identify the core technological topics
within the new energy vehicle industry. The process is divided into six steps (as shown in
Figure 1), designed to ensure the efficient and systematic extraction, analysis, and
determination of core technologies from a broad range of patent data. After patents related
to core technologies are selected from a large pool using the information weight method, the
SAO-BERTopic model is applied for an in-depth analysis of the selected patents. The model
identifies the SAO triples of the core technologies and defines these triples as specific
topics.

Step One: Patent Selection. Patents are downloaded from the Incopat patent database, and the
information weight method is used to establish indicators in four dimensions: patent
technical, strategic, legal and market value. Step Two: Textual Vector Representation. The
text is preprocessed and then vectorized with the BERT (Bidirectional Encoder Representations
from Transformers) model. Step Three: Dimensionality Reduction and Clustering. The UMAP
(Uniform Manifold Approximation and Projection) algorithm reduces the dimensionality of the
text vectors, and the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications
with Noise) algorithm then clusters them into topics. Step Four: SAO Triple Extraction. Using
the c-TF-IDF (class-based term frequency-inverse document frequency) algorithm, SAO triples
are extracted from the clusters, and the topic representation is refined using the MMR
(Maximal Marginal Relevance)
algorithm to balance relevance and diversity. Step Five: Result Validation. The
Calinski-Harabasz index and the Davies-Bouldin index are used to compare SAO-BERTopic with
the LDA, BERTopic and SAO-LDA topic models in order to evaluate clustering effectiveness. In
addition, dimensionality reduction visualization via UMAP provides an intuitive comparison of
cluster distributions. Step Six: Strategic Recommendations. Based on the results,
recommendations are made from the corporate, industry and government perspectives.

[Figure 1 flowchart. Step 1, Patent Screening: massive patents in the new energy automobile
industry are filtered by the information weight method (patent technical, strategic, legal
and market value) into core technology patents. Step 2, Vector Representation of Text: core
technology patent text is preprocessed (POS tagging, tokenization, stop words filtering,
lemmatization) and vectorized with BERT. Step 3, Cluster Representation of Documents:
dimensionality reduction of text vectors using UMAP, then generation of topic clusters using
HDBSCAN. Step 4, Topic Results: extraction of SAO triples from topic clusters using C-TF-IDF,
then adjustment of the topic representation using MMR. Step 5, Topic Verification: comparison
of clustering results for LDA, BERTopic, SAO-LDA and SAO-BERTopic by an index comparison
method (Calinski-Harabasz index, Davies-Bouldin index) and a visual comparison method
(dimensionality reduction visualization). Step 6, Put forward suggestions: technology
identification suggestions to enterprise, industry and government.]
Figure 1: Overall research process

This article uses the Incopat patent database to collect data pertinent to the new energy
vehicle industry; the topics are finally refined into five core topics.

Topic 0 is "Energy Management and Power Transmission Technology in Hybrid Electric Vehicles".
Topic 1 is "Electric Vehicle Charging Systems and Energy Management"; its SAO triples focus on
describing the functions of receiving, managing, and providing energy in charging panels and
systems. Topic 2 is "Battery System Integration and Energy Efficiency Management"; its SAO
triples collectively depict various aspects of energy storage system design, management, and
application, including modular composition, energy control and management, system power
supply, and integration with electric vehicles. Topic 3 is "Electric Vehicle Battery Pack
Configuration and Structural Design", covering the physical configuration and structural
design of battery packs and how these designs affect battery pack performance. Topic 4 is
"Cathode Materials and Composition for Secondary Batteries", focusing on the composition and
design of cathode materials for batteries, especially secondary batteries, which directly
affect battery performance aspects such as capacity, energy density, charge/discharge speed,
and lifespan.

This research successfully applied the information weight method to screen high-quality core
patents from a large-scale patent dataset, and employed the SAO-BERTopic model to identify
core technological topics from the filtered data. Our empirical results highlight the superior
clustering performance of SAO-BERTopic in identifying technological topics in the new energy
vehicle domain compared to traditional models. Through its efficient processing of technical
terms and complex concepts and its precise screening of patent data, the approach
demonstrates the ability to identify core technologies and drive innovation across a wide
range of fields. This research therefore not only contributes directly to the technological
progress of the new energy vehicle industry, but also provides a new perspective and tool for
technology identification and innovation management in other fields, helping to promote
technological development and the formulation of innovation strategy more broadly.
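To make the c-TF-IDF scoring in Step Four more concrete, the following is a minimal sketch in plain Python. It assumes the formulation commonly documented for BERTopic, in which the frequency of a term within a cluster is weighted by log(1 + A / f), where A is the average number of terms per cluster and f is the frequency of the term across all clusters. Treating each SAO triple as a single "term" is an assumption made here for illustration, and the toy triples and the names `c_tf_idf`, `top_terms` and `toy_clusters` are hypothetical; the exact variant used by the authors is not stated in the paper.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score terms per cluster with class-based TF-IDF.

    class_docs maps a cluster id to a list of documents, where each document
    is a list of terms (here: SAO triples represented as tuples).
    """
    # All documents of a cluster are merged into one "class document".
    class_counts = {
        c: Counter(term for doc in docs for term in doc)
        for c, docs in class_docs.items()
    }
    # Frequency of every term across all clusters.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    # Average number of terms per cluster (A in the formula).
    avg_terms = sum(total.values()) / len(class_counts)

    scores = {}
    for c, counts in class_counts.items():
        n_terms = sum(counts.values())
        scores[c] = {
            term: (freq / n_terms) * math.log(1 + avg_terms / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_terms(scores, k=3):
    """Return the k highest-scoring terms per cluster (ties broken by term order)."""
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, s in scores.items()
    }


# Hypothetical toy data: two clusters of patents, each patent a list of SAO triples.
toy_clusters = {
    0: [
        [("charger", "receive", "energy"), ("system", "manage", "energy")],
        [("charger", "receive", "energy"), ("panel", "provide", "energy")],
    ],
    1: [
        [("cathode", "improve", "capacity"), ("system", "manage", "energy")],
        [("cathode", "improve", "capacity")],
    ],
}
scores = c_tf_idf(toy_clusters)
print(top_terms(scores, k=2))
```

In this toy example a triple that occurs in both clusters, such as ("system", "manage", "energy"), receives a lower weight than a triple that is specific to one cluster, which is the behavior that makes the scores usable as topic representations.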



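The MMR refinement in Step Four can be illustrated with a short sketch. It assumes the usual greedy form of maximal marginal relevance: at each step, pick the candidate that maximizes (1 - diversity) * similarity to the topic minus diversity * maximum similarity to the candidates already selected. The two-dimensional vectors, the candidate strings and the `diversity` parameter name are hypothetical; in practice the vectors would be embeddings of the topic and of its candidate SAO triples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(topic_vec, candidates, top_n=3, diversity=0.5):
    """Greedy maximal marginal relevance selection.

    candidates maps a label to its embedding. diversity=0 ranks purely by
    relevance; larger values penalize similarity to already selected items.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        best_label, best_score = None, -math.inf
        for label, vec in remaining.items():
            relevance = cosine(vec, topic_vec)
            redundancy = max(
                (cosine(vec, candidates[s]) for s in selected), default=0.0
            )
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best_label, best_score = label, score
        selected.append(best_label)
        del remaining[best_label]
    return selected


# Hypothetical embeddings for one topic and three candidate representations.
topic = [1.0, 1.0]
cands = {
    "battery pack - supply - power": [1.0, 0.9],
    "battery module - supply - power": [1.0, 0.8],
    "cooling plate - dissipate - heat": [0.2, 1.0],
}
print(mmr(topic, cands, top_n=2, diversity=0.0))
print(mmr(topic, cands, top_n=2, diversity=0.7))
```

With `diversity=0.0` the two near-duplicate battery candidates are returned; with a higher value the second slot goes to the less redundant candidate, which is the trade-off between relevance and diversity that the step describes.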

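The two indices used for result validation in Step Five can be sketched in plain Python from their standard definitions. The Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom (higher is better); the Davies-Bouldin index averages, over clusters, the worst ratio of summed within-cluster scatter to centroid distance (lower is better). The toy points and labelings below are hypothetical and are not data from the paper; a real evaluation would use the reduced document embeddings and the cluster labels of each model.

```python
import math


def _centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def _dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def _group(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion; higher is better."""
    clusters = _group(points, labels)
    n, k = len(points), len(clusters)
    overall = _centroid(points)
    between = sum(
        len(members) * _dist(_centroid(members), overall) ** 2
        for members in clusters.values()
    )
    within = sum(
        _dist(p, _centroid(members)) ** 2
        for members in clusters.values()
        for p in members
    )
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(points, labels):
    """Average worst-case cluster similarity; lower is better."""
    clusters = _group(points, labels)
    cents = {l: _centroid(m) for l, m in clusters.items()}
    # Mean distance of each cluster's points to its centroid.
    scatter = {
        l: sum(_dist(p, cents[l]) for p in m) / len(m) for l, m in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / _dist(cents[i], cents[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups

ch_good = calinski_harabasz(points, good_labels)
ch_bad = calinski_harabasz(points, bad_labels)
db_good = davies_bouldin(points, good_labels)
db_bad = davies_bouldin(points, bad_labels)
print(ch_good, ch_bad, db_good, db_bad)
```

A labeling that follows the true groups yields a higher Calinski-Harabasz value and a lower Davies-Bouldin value than the mixed labeling, which is the direction of comparison used when ranking topic models by clustering quality.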
[1] Y. Li, The impact of economic systems and financial systems on new energy vehicle
    industry, Adv. Econ. Manag. Polit. Sci. 33 (2023) 162–170.
    doi:10.54254/2754-1169/33/20231622.
[2] Z. Liu, H. Hao, X. Cheng, F. Zhao, Critical issues of energy efficient and new energy
    vehicles development in china, Energy Policy 115 (2018) 92–97.
    doi:10.1016/J.ENPOL.2018.01.006.
[3] P. Yu, J. Zhang, D. Yang, X. Lin, T. Xu, The evolution of China's new energy vehicle
    industry from the perspective of a technology-market-policy framework, Sustainability 11
    (2019) 1711. doi:10.3390/SU11061711.
[4] M. Kendall, Fuel cell development for new energy vehicles (NEVs) and clean air in china,
    Prog. Nat. Sci. Mater. Int. 28 (2018) 113–120. doi:10.1016/J.PNSC.2018.03.001.
[5] T. Yang, C. Xing, X. Li, Evaluation and analysis of new-energy vehicle industry policies
    in the context of technical innovation in china, J. Clean. Prod. 281 (2021) 125126.
    doi:10.1016/J.JCLEPRO.2020.125126.
[6] X. L. Xu, H. H. Chen, Exploring the innovation efficiency of new energy vehicle
    enterprises in china, Clean Technol. Environ. Policy 22 (2020) 1671–1685.
    doi:10.1007/S10098-020-01908-W.
[7] S. Chen, Y. Feng, C. Lin, Z. Liao, X. Mei, Research on the technology innovation
    efficiency of China's listed new energy vehicle enterprises, Math. Probl. Eng. 2021
    (2021) 6613602. doi:10.1155/2021/6613602.
[8] C. Lee, A review of data analytics in technological forecasting, Technol. Forecast. Soc.
    Change 166 (2021) 120646. doi:10.1016/J.TECHFORE.2021.120646.
[9] N. Su, Z. Tan, Review and vision for the future of the research on technology
    opportunity analysis methods, Inf. Stud. Theory Appl. 43 (2020) 179–186.
[10] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res. 3
    (2003) 993–1022.
[11] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proc. Natl. Acad. Sci. U. S. A.
    101 (2004) 5228–5235. doi:10.1073/PNAS.0307752101.
[12] J. Lafferty, D. Blei, Topic models, in: A. N. Srivastava and M. Sahami (Eds.) Text
    mining, Chapman and Hall/CRC, New York, NY, 2009, pp. 71–93.
    doi:10.1201/9781420059458.CH4.
[13] M. Hoffman, F. Bach, D. Blei, Online learning for latent dirichlet allocation, in: 24th
    Annual Conference on Neural Information Processing Systems 2010, NeurIPS, New York, NY,
    2010, pp. 856–864.
[14] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by
    latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990) 391–407.
    doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
[15] T. K. Landauer, P. W. Foltz, D. Laham, An introduction to latent semantic analysis,
    Discourse Process 25 (1998) 259–284. doi:10.1080/01638539809545028.
[16] J. Lafferty, D. Blei, Correlated topic models, in: Proceedings of the 18th International
    Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, 2005,
    pp. 147–154. doi:10.5555/2976248.2976267.



