Patent Technology Competitor Group Analysis Method Based on IPC

Yuan Fu (fuyuan2014@istic.ac.cn), Hongqi Han (bithhq@163.com), Lijun Zhu (zhulj@istic.ac.cn)
Information Technology Supporting Center, Institute of Scientific and Technical Information of China
No. 15 Fuxing Rd., Haidian District, Beijing 100038, P.R. China
+86 10 5888 2447

ABSTRACT
It is crucial to understand the technical groups within an industry and to grasp the state of competition in a technology field. To provide valuable information for industry participants and policymakers, a process model for mining technical competitor groups based on IPC classification numbers is put forward. First, the number of patents under each IPC is counted to build a feature vector for each competitor. Then, the technical similarity between each pair of competitors is computed. Finally, the LinLog graph clustering algorithm is applied to discover groups at three levels, i.e. institution, province and country. To obtain patent data for this research, an acquisition system for Chinese patent data was developed. Experiments in the field of fuel cells were conducted, and the results show that the technique is helpful and effective.

Categories and Subject Descriptors
Information extraction from patent documents

General Terms
Experimentation

Keywords
LinLog; IPC classification number; Technology competitor group

1. INTRODUCTION
Competitiveness is a typical characteristic of industrial technology (Yoon, 2008) [1]. In practice, in almost every emerging industry, some kinds of technology become leading and predominant after developing over a period of time. Agglomeration is common in an industry. When the industrial technology agglomerates to the extent that it meets the needs of product functions well, the industry becomes mature and the industrial technology system is established. On the other hand, technology owners compete with one another and separate into different technical groups.

According to Porter's theory of competitive advantage, the real competitors of a company inside an industry are the companies similar to it (Lee, 2006) [2]. These similar companies constitute a strategic group and become a sub-industry. A company faces barriers to entering a different strategic group. Therefore, companies with very similar industrial technology are likely to be main competitors.

Clustering, which divides data into several clusters, can reflect the relational schema of the data and the knowledge hidden in it. Competitor group analysis of industrial technology uses an appropriate clustering algorithm to divide competitors into several groups, and thus identifies similar competitors inside an industry and their reciprocal influences. The analysis can be carried out at different levels, such as countries, provinces and institutions. Its purpose is to understand the technical groups inside an industry, to grasp the competition in the technology field from a higher level, and to provide valuable information for industry participants and policymakers.

Some common clustering algorithms can be used to identify the competitor groups of industrial technology, such as self-organizing maps (SOM), K-means (Lee, 2009) [3] and factor analysis. In these models, each competitor is usually expressed as a feature vector measured by several technical characteristics. Similar objects are clustered into one group by calculating the distances between them.
For example, Pilkington (2004) [4] used UPC numbers and IPC classifications respectively as the technical features of competitors, and used a factor analysis model to cluster 52 companies in the field of fuel cells into five groups.

Literature studies found that many researchers have used visualization methods. Traditional clustering algorithms are based on unsupervised learning, so people often doubt the effectiveness of the analysis results. Visualization methods can display abstract data as graphs or pictures, because they combine computer technology and human cognitive ability effectively. Visualization therefore enhances users' confidence in the analysis results and has been widely accepted in recent years. Considering these advantages, the proposed method uses a graph clustering method to find technical competitor groups.

Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published at Ceur-ws.org. Proceedings of the Second International Workshop on Patent Mining and its Applications (IPAMIN). May 27–28, 2015, Beijing, China.

2. RELATED WORK
2.1 LinLog graph clustering methods
The LinLog algorithm was first put forward by Noack (2007) [5]. The aim of the algorithm is to produce ideal, visual clustering graphs. Figure 1 shows an example from Noack's paper (Noack, 2005) [6]. In the example, the Spring and LinLog algorithms were each applied to the same data for graph clustering. LinLog clearly divided the data into two large clusters connected by two nodes, Dan and Upton, while Spring positioned nodes with high degree in the center and nodes with low degree near the borders.

Figure 1 Comparison of Spring and LinLog methods: (a) Spring model; (b) LinLog model

The LinLog model does not conform to traditional aesthetic standards; it aims to group closely connected nodes and to separate sparsely connected nodes. There are two kinds of LinLog model: the node-repulsion model and the edge-repulsion model (Coscia, 2009) [7]. The two models are based on two well-known clustering criteria respectively (Li, 2008) [8], namely density of cut and normalized cut. Normalized cut and the edge-repulsion model can produce unbiased results, and are therefore especially suitable for normally distributed data. In this paper, the LinLog algorithm with the Barnes and Hut hierarchical algorithm is used to draw clustered graphs (Stegmann, 2003) [9]. After drawing the graph, the algorithm also divides the nodes into several clusters.

2.2 IPC
IPC stands for International Patent Classification. It is an international standard used by the patent offices of all countries or regions in the world. Although some countries or regions maintain their own patent classification systems, such as the CPC system of the USPTO and the ECLA system of the EPO, they also provide IPC classification numbers. The Chinese patent classification system also uses the IPC system. A patent has at least one IPC classification number and may have two or more. When there are multiple classification numbers, the first one is called the main classification number.

According to the technical topics of inventions, the technology fields in the IPC system are divided into 8 sections. Each section represents a kind of technology and is designated by one of the capital letters A through H, as shown in Table 1.

Table 1 Sections of technology in the IPC system
Section    Section Title
A          HUMAN NECESSITIES
B          PERFORMING OPERATIONS; TRANSPORTING
C          CHEMISTRY; METALLURGY
D          TEXTILES; PAPER
E          FIXED CONSTRUCTIONS
F          MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
G          PHYSICS
H          ELECTRICITY

The structure of the IPC classification system is hierarchical. Sections are the highest level of the hierarchy. Each section is subdivided into classes, which are the second hierarchical level. Each class comprises one or more subclasses, which are the third hierarchical level. Each subclass is broken down into subdivisions referred to as "groups", which are either main groups (the fourth hierarchical level) or subgroups (lower hierarchical levels dependent upon the main group level). A complete classification symbol comprises the combined symbols representing the section, class, subclass and main group or subgroup, as shown in Figure 2. Currently, there are approximately 70,000 subdivisions in the classification system. Figure 3 is a sample of the hierarchical structure.

Figure 2 Hierarchical structure of the IPC classification system

Figure 3 A sample of IPC hierarchical structure

3. METHOD
An industrial technology field can be divided into several subfields, and each subfield may have smaller technology subfields. Technology competitors often have different research backgrounds, bases, objectives and priorities. Competitors with similar technology may be competitors or partners on the market, and they are likely to interact with each other. IPC classification codes are assigned by patent examiners with professional knowledge. Therefore, IPC provides an effective way to learn about industrial hot spots and the research and development directions of technology competitors. A technology competitor tends to invest in research in several technical subfields, so it is difficult to determine whether two competitors have similar research technology from IPC count statistics alone. Therefore, a graph clustering method based on the main IPC number is put forward to identify technology competitor groups within an industrial technology field. Figure 4 shows the process model of this method.

Figure 4 The process model of the graph clustering method

First, a clustering level is selected from three categories: institutions, provinces and countries. Then, the number of patents under each main IPC classification number is counted for each technology competitor, and an association matrix is established between technology competitors and main IPC classification numbers (Dibattista, 1994) [10]. Each technology competitor is expressed as a feature vector whose attributes are IPC classification numbers; the value of each attribute is the number of patents under that main IPC classification number. Finally, the similarity between each pair of technology competitors is calculated using the cosine formula (Fruchterman, 1991) [11]. Let $|IPC|$ be the number of main IPC classification numbers covered by the industrial technology, and let $IPC_{ki}$ be the number of patents of competitor $i$ under the $k$-th IPC classification number. Equation (1) shows how to compute the technological similarity between competitors $i$ and $j$.

$$
\mathrm{sim}(i, j) = \frac{\sum_{k=1}^{|IPC|} IPC_{ki} \cdot IPC_{kj}}{\sqrt{\sum_{k=1}^{|IPC|} IPC_{ki}^{2}} \; \sqrt{\sum_{k=1}^{|IPC|} IPC_{kj}^{2}}} \tag{1}
$$

To obtain a good visual graph, a minimum similarity threshold (Noack, 2004) [12] should be set. Generally, the threshold is set to the mean similarity, but it can also be determined by experiment. Two technology competitors are connected when the similarity between them is higher than the threshold. With technology competitors as nodes, the connections between them as edges, and the technological similarity values as edge weights, the LinLog graph clustering algorithm generates a visual map. The map shows the clusters that identify competitor groups.

4. DATA
4.1 Data acquisition
Nowadays, the patent offices of almost all major countries and regions provide patent databases on their official websites. People can connect to these websites anytime and anywhere via the Internet and obtain patent data for free. To get patent data quickly, a patent data acquisition system (Laura, 2008) [13] was developed. The model of the system is shown in Figure 5. The acquisition system fetches HTML web pages containing patent description information from the official website of the State Intellectual Property Office of China (http://www.sipo.gov.cn/). After the pages are collected, it automatically extracts the description items and legal status of the patents through content analysis of the web pages and saves them into local databases.

Figure 5 The data acquisition system model

To test the effectiveness of the proposed method, the acquisition system was run to download patent data in the field of fuel cell technology. In total, 6346 patents were collected. The following preprocessing steps and the empirical analysis use the downloaded patent data.

4.2 Data preprocessing
The collected data often have problems and must be preprocessed before formal analysis. In the experiment, the patent data are preprocessed to meet the analysis requirements, including identifying the patent categories, the countries and provinces of applicants, and the categories of applicants.

If the first applicant is a Chinese individual or organization, the applicant's address often contains information about its province (Kayal, 1999) [14]. Generally, the first 6 digits of the address description are the applicant's postcode, so the province can be obtained from the postcode. If the first applicant is a foreign individual or organization, the priority item and the international publication item in the patent description contain the country information. For example, in the priority item "1999.8.27 JP 242132/1999", JP means that the applicant is Japanese.

For the purposes of the research, applicants are divided into 5 categories: company, university, research institute, personal and others. The categories are identified by keywords in the applicant names. The correspondence between keywords and categories is shown in Table 2. If a patent description has more than one applicant, only the first applicant is considered. For example, patent No. 00112136.7 has two applicants, Nanjing Normal University and Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, so the system assigns the "university" category to the patent.

Table 2 Keywords for identifying applicant category
Category      Key words
company       company, partnership
university    university, college
institute     research institution
others        committee, association, foundation
personal      -

5. EXPERIMENTAL RESULTS
5.1 Research and development institutions
To obtain a clear visual map, we chose the top 20 research and development institutions for the graph clustering algorithm. The result is shown in Figure 6. In the map, the size of a node represents the number of granted invention patents, and the color of a node shows the group it belongs to (Reinhard, 2007) [15]. In this case, the LinLog algorithm identified two technology competitor groups. The first group, shown with red nodes, includes 10 institutions: Samsung (177), Chinese Academy of Sciences (128), Antiq (74), General Motors (56), Honda (52), Wuhan University of Technology (49), Shanghai Jiaotong University (38), Sanyo (37), BYD (32) and Harbin Institute of Technology (26). The second group, shown with orange nodes, includes the other 10 institutions: Shanghai Shen-Li High Tech (194), Panasonic (154), Toyota (120), Tsinghua University (72), Nissan (62), Toshiba (48), Sunrise Power (26), Hitachi (24), LG (20) and UTC (19). The numbers in parentheses after the institution names are the numbers of their granted invention patents. Table 3 gives the English names corresponding to the Chinese names in Figure 6.

Figure 6 Clustering result of R&D institutions

Table 3 Corresponding English names of Chinese names of R&D institutions in Figure 6
Chinese name              English name
清华大学                  Tsinghua University
新源动力股份有限公司      Sunrise Power
上海神力科技有限公司      Shanghai Shen-Li High Tech
松下公司                  Panasonic
日产公司                  Nissan
丰田公司                  Toyota
日立公司                  Hitachi
东芝公司                  Toshiba
BTC 公司                  BTC
乐金电子电器有限公司      LG
上海交通大学              Shanghai Jiaotong University
中国科学院                Chinese Academy of Sciences
三星公司                  Samsung
三洋公司                  Sanyo
通用汽车公司              General Motors
胜光科技股份有限公司      Antiq
哈尔滨工业大学            Harbin Institute of Technology
比亚迪股份有限公司        BYD
本田株式会社              Honda
武汉大学                  Wuhan University of Technology

5.2 Provinces
In this case, 22 provinces in total were extracted from all fuel cell patents. The graph clustering result is shown in Figure 7. The biggest node in the picture is Shanghai, which means that Shanghai has the strongest research strength in China. The smallest node is Hebei, which means that Hebei is the weakest of these provinces in fuel cell research.

Figure 7 The clustering results of provinces

Table 4 Corresponding English names of Chinese names of provinces in Figure 7
The Chinese Name    The English Name
上海                Shanghai
台湾                Taiwan
辽宁                Liaoning
江苏                Jiangsu
天津                Tianjin
山东                Shandong
安徽                Anhui
陕西                Shaanxi
四川                Sichuan
河北                Hebei
北京                Beijing
广东                Guangdong
湖北                Hubei
黑龙江              Heilongjiang
吉林                Jilin
重庆                Chongqing
湖南                Hunan
山西                Shanxi

At the province level, two technology competitor groups are identified. The first group, shown with red nodes, includes 10 provinces: Shanghai (311), Taiwan (152), Liaoning (127), Jiangsu (41), Tianjin (40), Shandong (23), Shaanxi (13), Anhui (19), Sichuan (4) and Hebei (2). The second group, shown with orange nodes, includes 8 provinces: Beijing (150), Guangdong (93), Hubei (58), Heilongjiang (29), Jilin (18), Chongqing (5), Hunan (4) and Shanxi (4). Because the technology similarity values of Zhejiang (16), Fujian (12), Yunnan (1) and Inner Mongolia (1) are lower than the set threshold, the clustering result does not include these provinces. As before, the numbers in parentheses are the numbers of granted patents of the provinces.

5.3 Countries
In this case, 17 countries or regions in total were extracted from all fuel cell patents. The graph clustering result is shown in Figure 8. The biggest node in the graph is China, with 1123 granted patents. The smallest nodes are Denmark and Finland, with 3 granted patents each.

Figure 8 The clustering result of countries

Table 5 Corresponding English names of Chinese names of countries in Figure 8
The Chinese Name    The English Name
中国                China
德国                Germany
英国                Britain
法国                France
欧洲专利局          EPO
瑞典                Sweden
荷兰                Netherlands
日本                Japan
美国                the United States
加拿大              Canada
澳大利亚            Australia
芬兰                Finland

At the country level, four technology competitor groups are identified, containing 16 countries and regional organizations. The first group, shown with red nodes, includes seven countries and regional organizations: China (1123), Germany (58), Britain (28), France (16), EPO (10), Sweden (6) and Netherlands (5). The second group, shown with orange nodes, includes 5 countries: Japan (740), the United States (292), Canada (33), Australia (4) and Finland (3). The third group consists of two countries, Korea (202) and Denmark (3). The fourth group includes Norway (3) and Italy (1). There is an edge between Norway and Italy, but neither has edges to other nodes (Figure 9); Figure 8 cannot show them because the LinLog algorithm has problems generating clusters for unconnected graphs. The technology similarity of Austria (1) with other countries is lower than the threshold, so the clustering figure does not include Austria.

Figure 9 The clustering figure of unconnected states

Table 6 Corresponding English names of Chinese names of countries in Figure 9
The Chinese Name    The English Name
中国                China
德国                Germany
英国                Britain
法国                France
欧洲专利局          EPO
瑞典                Sweden
荷兰                Netherlands
日本                Japan
美国                the United States
加拿大              Canada
澳大利亚            Australia
芬兰                Finland
挪威                Norway
意大利              Italy
韩国                Korea

6. CONCLUSION
In this paper, a graph clustering algorithm is used to perform technology competitor group analysis based on IPC. The proposed method consists of the following stages. First, the clustering level is determined; three levels are available, i.e. institution, province and country. Second, the number of patents under each IPC is counted for each object (competitor) at the selected level. Third, each object is expressed as a vector whose attributes are IPC classification codes and whose values are the corresponding patent counts. Fourth, technology similarities are computed between each pair of competitors. Finally, the LinLog algorithm is used to cluster competitors into groups and display them in a graph to improve confidence in the analysis results. Experimental results on fuel cells demonstrate the effectiveness of the proposed method.

7. ACKNOWLEDGMENTS
This work is partially supported by the National Natural Science Foundation of China (Project 71473237), and partially supported by the Key Work Project of the Institute of Scientific and Technical Information of China (ISTIC) (ZD2014-7-1). The authors are grateful to the National Natural Science Foundation of China and the Ministry of Science and Technology of China for financial support to carry out this work.

8. REFERENCES
[1] Yoon, B. and Lee, S. 2008. Patent analysis for technology forecasting: sector-specific applications. C. 2008 IEEE International Engineering Management Conference.
[2] Lee, C. K. and Ong, R. 2006. An analysis of the liquid crystal cell patents of LG and Samsung filed at the USPTO. C. 2006 IEEE International Conference on Management of Innovation and Technology.
[3] Lee, S., Yoon, B., and Park, Y. 2009. An approach to discovering new technology opportunities: Keyword-based patent map approach. J. Technovation. 29, 6, 481-497.
[4] Pilkington, A. 2004. Technology portfolio alignment commercialisation: an investigation of fuel cell patenting. J. Technovation. 24, 10, 761-771.
[5] Noack, A. 2007. Energy models for graph clustering. J. Journal of Graph Algorithms and Applications. 11, 2, 453-480.
[6] Noack, A. 2005. Energy-based clustering. C. 13th International Symposium on Graph Drawing.
[7] Coscia, M., Giannotti, F., and Pensa, R. 2009. Social network analysis as knowledge discovery process: a case study on digital bibliography. C. International Conference on Advances in Social Network Analysis and Mining (ASONAM).
[8] Li, Wanchun., Eades, P., and Nikolov, N. 2008. Using spring algorithms to remove node overlapping. C. Proceedings of the 2005 Asia-Pacific symposium on Information visualization.
[9] Stegmann, J. and Grohmann, G. 2003. Hypothesis generation guided by co-word clustering. J. Scientometrics. 56, 1, 111-135.
[10] Dibattista, G., Eades, P., Tamassia, R., and Tollis, I. G. 1994. Algorithms for drawing graphs: an annotated bibliography. J. Computational Geometry: Theory and Applications. 4, 5, 235-282.
[11] Fruchterman, T. M. J. and Reingold, E. M. 1991. Graph drawing by force-directed placement. J. Software-Practice and Experience. 21, 11, 1129-1164.
[12] Noack, A. 2004. An energy model for visual graph clustering. C. 11th International Symposium on Graph Drawing.
[13] Laura, R. 2008. Data mining tools for technology and competitive intelligence. R. VTT TIEDOTTEITA Research notes 2451.
[14] Kayal, A. A. and Waters, R. C. 1999. An empirical evaluation of the technology cycle time indicator as a measure of the pace of technological progress in superconductor technology. J. IEEE Transactions on Engineering Management. 46, 2, 127-131.
[15] Reinhard, H., Martin, K., and Marcus, K. 2007. Patent indicators for the technology life cycle development. J. Research Policy. 36, 3, 387-398.
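As an illustration of the similarity step described in Section 3 (patent counts per main IPC number, the cosine formula, and a threshold set to the mean similarity), the following is a minimal sketch, not the authors' implementation. The function names, the competitor labels A, B and C, and the IPC symbols in the sample records are hypothetical and chosen only for illustration. The resulting weighted edges would then be passed to a LinLog implementation, which is not shown here.

```python
import math
from collections import Counter
from itertools import combinations


def ipc_vectors(records):
    """Count patents per main IPC number for each competitor.

    records: iterable of (competitor, main_ipc) pairs, one per patent.
    Returns {competitor: Counter({main_ipc: patent_count})}.
    """
    vectors = {}
    for competitor, main_ipc in records:
        vectors.setdefault(competitor, Counter())[main_ipc] += 1
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[ipc] for ipc, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(vectors, threshold=None):
    """Keep competitor pairs whose similarity exceeds the threshold.

    If no threshold is given, the mean pairwise similarity is used.
    Returns ({(a, b): similarity}, threshold).
    """
    sims = {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }
    if threshold is None:
        threshold = sum(sims.values()) / len(sims) if sims else 0.0
    edges = {pair: s for pair, s in sims.items() if s > threshold}
    return edges, threshold


# Hypothetical sample: one (competitor, main IPC) record per patent.
records = [
    ("A", "H01M8/02"), ("A", "H01M8/02"), ("A", "H01M8/04"),
    ("B", "H01M8/02"), ("B", "H01M8/04"),
    ("C", "C01B3/00"),
]
vectors = ipc_vectors(records)
edges, threshold = similarity_edges(vectors)
print(edges, threshold)
```

In this sample, A and B share both IPC numbers and stay connected, while C shares none with either and ends up isolated. This mirrors how provinces and countries below the threshold drop out of the clustering figures.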
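The preprocessing rules in Section 4.2 (only the first applicant is considered, categories come from keywords, and the country comes from the priority item) can be sketched as follows. This is an illustrative sketch under several assumptions. The English keywords are taken from Table 2, whereas the actual applicant names in the data are presumably Chinese. The order in which categories are checked and the fallback to "personal" when no keyword matches are assumptions. The function names and the regular expression for the priority item are hypothetical.

```python
import re

# Keywords from Table 2; the check order is an assumption.
CATEGORY_KEYWORDS = [
    ("company", ("company", "partnership")),
    ("university", ("university", "college")),
    ("institute", ("research institution",)),
    ("others", ("committee", "association", "foundation")),
]

# Assumed layout of a priority item: "<date> <two-letter code> <number>".
PRIORITY_RE = re.compile(r"^\s*\d{4}\.\d{1,2}\.\d{1,2}\s+([A-Z]{2})\b")


def applicant_category(applicants):
    """Classify a patent by its first applicant only."""
    first = applicants[0].lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in first for keyword in keywords):
            return category
    # Assumption: applicants matching no keyword are treated as personal.
    return "personal"


def priority_country(priority_item):
    """Extract the two-letter country code from a priority item, if any."""
    match = PRIORITY_RE.match(priority_item)
    return match.group(1) if match else None


category = applicant_category([
    "Nanjing Normal University",
    "Changchun Institute of Applied Chemistry Chinese Academy of Sciences",
])
country = priority_country("1999.8.27 JP 242132/1999")
print(category, country)
```

For the two examples given in Section 4.2, this yields the "university" category and the country code JP. Mapping a postcode to a province would need a lookup table that the paper does not give, so it is left out.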