Mining Technical Topic Networks from Chinese Patents Hongqi Han Shuo Xu Lijun Zhu Institute of Scientific and Institute of Scientific and Institute of Scientific and Technical Information of China Technical Information of China Technical Information of China Fuxing road 15, haidian Fuxing road 15, haidian Fuxing road 15, haidian district(100038) district(100038) district(100038) Beijing, China Beijing, China Beijing, China bithhq@163.com xush@istic.ac.cn zhulj@istic.ac.cn Xiaodong Qiao Jie Gui Zhaofeng Zhang Institute of Scientific and Institute of Scientific and School of Information Technical Information of China Technical Information of China Management, Nanjing Fuxing road 15, haidian Fuxing road 15, haidian University, district(100038) district(100038) 22 Hankou Road, Nanjing Beijing, China Beijing, China Jiangsu (210093) qiaox@istic.ac.cn guij@istic.ac.cn Nanjing,China zhangzf@istic.ac.cn ABSTRACT 1. INTRODUCTION Patents are one of the most important innovative resources. It Today, along with the rapid development of science and technol- is a challenge and useful to discover technical topics and their ogy and integration of economic globalization process, innovation relations from patents. A process framework is proposed to mine is becoming an important means to obtain technological advan- technical topics and construct their relation network from Chinese tage[8]. Patent documents are one of the major innovative data patents. The process consists of four stages. First, technical resources of technical and commercial knowledge, and thus patent terms are extracted from patent texts and the equivalence index analysis has long been considered helpful for R&D management is selected to measure the link strength between them. Then, a and technoeconomic analysis[14]. By depicting technical topics clustering algorithm is used to group terms into topic clusters, in and mapping their relations, researchers can acquire novel ideas which terms are connected by internal links, and topic clusters are for technology breakthrough, while enterprises can find technical connected by external links. Afterwards, all topics are classified routes for product planning and development, and policy makers into three categories: isolated, principal and secondary. Finally, a can understand dynamic technology change for funding emerging technical topic network is created by using topic clusters as nodes, and potential fields. Traditionally, a small number of experts are external links as edges and the number of external links as weights. selected to undertake such work, yet the method has been widely Experimental results on Chinese fuel cell patents show the method criticized, such as weak representativeness, high cost, and low is effective in mining technical topics and mapping their relations, efficiency [5]. and the constructed network is helpful for technology innovation. It is a challenge to detect technical topics and find the relations between them. On the one hand, rapid developing technology makes it difficult for researchers to grasp the latest topics, on Categories and Subject Descriptors the other hand the amount of patents is huge and increasing H.4 [Information Systems Applications]: Data Mining sharply, which also makes it difficult to mine technical topics ; I.2 [Computing methodologies]: Artificial Intelligence hidden in the data. Yoon [15] and Lee [9] proposed approaches for identifying new technology opportunities using keyword-based General Terms morphology analysis and keyword-based patent maps respectively. Yoon [14] presented a network analysis for high technology Application trend forecast based on text mining technique, where nodes of the network are patents. However, these previous researches didn’t Keywords explore technology topics and map their relations. Callon [2] Technical topic network, topic relation, co-word analysis, patent presented co-word analysis techniques to map the relationship analysis, data mining between concepts, ideas and problems in science. The following researches, for example, Coulter [4], Van [12] and Cobo [3], extended the technique. Now it is common to find scientific papers and reports that contain a science mapping analysis to show and uncover the hidden key elements [3], however most of these works Copyright c 2014 for the individual papers by the papers’ authors. were undertaken for academic purposes using bibliographic data, Copying permitted for private and academic purposes. This volume and few are for competitive animus using patent data. is published and copyrighted by its editors. Published at Ceur-ws.org In this article, we propose an approach based on co-word Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN) 2014. Hildesheim. Oct. 7th. 2014. At analysis technique for detecting technical topics and mapping their KONVENS’14, October 8ĺC10, 2014, Hildesheim, Germany. relations using patent data. The co-word analysis technique is . Table 1: Linguistic Rules for Extracting Chinese Terms Length of Term Rule 1 n,v,l 2 nv+nv,a+nv,b+n, m+n 3 nv+nv+nv, a+nv+n,d+v+n,b+v+n Bibliometric data Extracted terms Clustered topics Topic network 4 nv+nv+nv+n 5 nv+nv+nv+nv+n,a+n+v+n+n, b+v+n+v+n Figure 1: Framework 6 nv+nv+c+vn+nv+n, nv+nv+nv+c+nv+nv, n+n+u+b+vn+n, n+vn+u+n+vn+n based on keywords and their co-occurrence, however most patent Table 2: Meaning of Letters in Table 1 databases don’t provide keywords, therefore one of the challenges Code Part of speech is to extract technical terms from patents. To extract terms from a adjective patent text, we use a hybrid automatic term recognition technique b determiners integrating linguistic rules and statistics indexes. Another chal- c conjunction lenge comes from the way to detect topics in patents, because d adverb technical topics and their numbers are usually unknown. The l idiom words presented approach is a process framework consisting of four steps. m quantifier The first step is data collection and pre-processing, the second step n noun or product terms is extracting technical terms, the third step is detecting technical v verb topics, and the last step is constructing technical topic network. u auxiliary words The remainder of the article is organized as follows. In section nv noun or adverb 2, the process framework is illustrated to introduce the basic idea to create topic network. In section 3, the main techniques used in the article are introduced. In section 4, an experimental results on unithood refers to the degree of strength or stability of syntagmatic Chinese fuel cell patents are described and discussed. Finally, the collocations[7]. Afterwards, the candidate terms are sorted accord- conclusion are made. ing to the statistics index and evaluated by domain experts, and finally, selected technical terms are stored for co-word analysis. 2. METHODS The process framework of the presented method is first intro- 2.3 Topics Detection duced. Then three core techniques in the framework are introduced, After technical terms are extracted, keyword network can be including technical terms extraction, topic detection, and topic constructed based on their co-occurrence relation, however such network construction. network contains so many nodes and complex relations that it can’t be better understood [2]. Therefore, Callon [2] presented a method 2.1 Framework to cluster keywords into topics, in which several keywords are The process framework contains four continuous stages (Fig.1): closely connected. Each topic represents an interested problem of patent data collection and pre-processing, technical terms extrac- researchers, and so it will be more easily understood than single tion, topic detection and topics network construction. In the first keywords. Moreover, the number of topics is far less than that of stage, patent data are collected and stored into database after pre- keywords, which makes it clearer to map the relations of concepts. processing operations. In the second stage, technical terms are To measure the link strength is an important process for detecting extracted from patent text. In the third stage, terms are clustered topics. Many metrics have been proposed for computing the into technical topics. In the last stage, topic clusters are used to link relations between keywords. The common indexes include . create network based on their link relations. association strength [4, 12], Equivalence Index [2], Inclusion Index[5], Jaccard Index [10], and Salton s cosine [10]. Among 2.2 Technical Terms Extraction these indexes, the Equivalence index shows the probability that Fig.2 depicts the overall process of extracting technical terms two keywords co-occur when given the frequency of two keywords from patents. The term extraction process integrates linguistic rules appearing in documents. It provides an intuitive measure of and statistics index. Because there is no delimiter between Chinese link strength between keywords, rather than imposing conceptual words like the space character in English text, word segmentation is inclusion property like other metrics. Moreover, the metric is easier necessary at first. Then POS tagger is used for identifying part-of- to be understood and utilized in the production and interpretation speech of words, e.g. nouns, verbs or adjectives, etc.. Afterwards, of keyword association maps than other metrics [4]. It also allows we collocated words into phrases. The collated phrases are filtered associations of both major and minor keywords and is symmetrical by linguistic rules and stop words for generating candidate terms. in their relationships [2]. Let Ci be the number of times keyword i The linguistic rules used in the article are shown in Table 1. The is used in the corpus, and let Cij be the number of co-occurrences letters in the second column of Table 1 are codes of part-of- of keyword i and j. The link strength Eij between keyword i and speech. These codes are defined in the Chinese segmentation tools j is given by Eq. 1: developed by Hailiang Technology Company. The meanings of these letters are shown in Table 2. 2 Eij = (Cij /Ci ) × (Cij /Cj ) = Cij /(Ci × Cj ) (1) Then statistics index are used to compute the termhood and unithood of the candidate terms. Termhood refers to the degree Based on link strength, research topics in a domain corpus that a linguistic unit is related to domain-specific concepts, while can be detected using keywords clustering algorithm. The main For each topic, ceiling measures the maximum link strength (Eq.2), Full Text of and saturation measures the minimum link strength (Eq.3). Patent ceiling(tl ) = max(Eij ) (2) Word Segmentation saturation(tl ) = min(Eij ) (3) POS Tagger Next, considering the external links and their association values, all the topic clusters can be classified into three categories: isolated, Phrase collocation secondary, principal. • isolated topics: which have no external links with other Linguistic rules Stop-word list filter filter topics, or the numbers of external links between which and other topics are below threshold, so the only question Term regarding them is their internal homogeneity; candidates • secondary topics: the strength values of external links be- Term tween which and other clusters are above the ceiling thresh- evaluation old, and so it is naturally considered that they are the extension of one of these; Technical Terms • principal topics: whose saturation values are greater than links associated to one or more other (secondary) clusters. According to such classification, principal topics seem to be Figure 2: Term extraction basic technologies for some other ones, and secondary topics seem to be dependant technologies on basic ones, while isolated topics seem to be independent technologies. effective clustering algorithms include Callon’s method [2], Coul- ter’s method [4], Multidimensional Scaling (MDS) [13], Latent 2.5 Topics Network Construction Dirichlet Allocation (LDA ) [1] and others. Based on the classification of topic clusters, using detected topics This study uses the two passes algorithm proposed by Coulter as nodes, the external links as edges, and the numbers of external [4]. Pass-1 builds keywords clusters that can identify areas of links as weights of edges, the topics network is constructed to strong focus as research topics. The nodes with big circular shape illustrate the relations between topics. We don’t use multiple edges in Fig.3 show such topics. The internal nodes with triangle shape to represent the relations between two topics. That is to say, if two in a topic node represent strongly connected keywords. The links topic clusters have external links, even when the number of external between keywords in a topic are called internal links. Pass-2 links are greater than 1, there is a single edge between them. In identifies keywords that associate in more than one topics, and practice, a threshold of minimum number of external links is used thereby generates links between Pass-1 nodes across topics and to remove weaker connected edges for getting better results. In indicate pervasive issues. The links between keywords in different order to illustrate the relation between two topics, the classification topics are called external links (Fig.3). information is used to decide the direction of edges. The direction of edges between principal and secondary topics are unidirectional, from the former to the latter, while the edges between two principal topics are bidirectional. 3. EXPERIMENTS 3.1 Data The experimental data is provided by SIPO (http://www.sipo.gov.cn). Chinese patents in the domain of fuel cell are collected using the retrieval strategy combining keywords and IPC codes. All collected patents are pre-precessed. Finally, we get 6,346 patents. Because Figure 3: Topic links the full text of patents are not provided, we just use title and abstract to extract terms in the experiment. 3.2 Technical terms 2.4 Topics Classification First, Chinese word segmentation tools are run to split sentences Denote the set of detected topics as T = {t1 , t2 , ..., tm }, where in patent title and abstract into words. Let the threshold of term m is the number of topics. Then for a topic, tl , where l ∈ [1, m], frequency be 2, we get 28,113 candidate terms. All single words denote the equivalence index of internal link between keywords are eliminated, because single words alone are often too general in ki and kj as Eij , where i, j ∈ [1, n], and n is the number of meanings or ambiguous to represent a concept in patent analysis, keywords in topic tl . Learning from Callon [2] and Coulter [4], while multi-word phrases can be more specific and desirable[11]. we use Eq.2 and Eq.3 to define two indexes: ceiling and saturation. Then the termhood and unithood of all candidates are computed Table 3: Parameters Used to Generate Network Parameter Value Minimum concurrence Number 2 Maximum Node Number in a Topic 20 Maximum Link Number 24 Figure 5: Technical topic network of fuel cell extensions from principal clusters. In the figure, the sizes of nodes represent the patent numbers related to a topic. If a term in a topic occurs in a patent, the patent is related to the topic. Therefore, from the figure, we know topic 1 is the most preferred developed Figure 4: The internal structure of the first topic technology in the domain of fuel cell. This gives useful information to find the popular technologies in the domain. using the methods in [6]. Afterwards, let the threshold of termhood 4. CONCLUSION and unithood be their mean value, and the threshold of document A method based on co-word analysis technique is presented to frequency be 5, we get 1,669 technical terms. Finally, 1,123 terms detect domain research topics and their link relations from Chinese are selected after the evaluation process of domain experts for patents. Because keywords are not provided in patent data, the detecting topics and creating network. method extracts terms from free text data in title and abstract. The term extraction technique integrates linguistic rules and statistics 3.3 Technical Topics indexes. Extracted terms are clustered into topic clusters based With the extracted 1,123 technical terms, we use the method in on equivalence index. Internal links and external links are defined [4] to detect topic clusters. The parameters used to generate topic to classified all topic clusters into three categories, viz. isolated, clusters are shown in Table 3. secondary and principal clusters. Using topic clusters as nodes, We get 62 topics totally. All the topics are numbered, ranging external links as edges, the number of external links as weights, from 1 to 62 according to the generation sequence. The first the technical topic graph is created. Experimental results on fuel generated topic is assigned number 1 and the last one is assigned cell patents show that it can map the relation of topics, and find number 62. The name of each topic is the internal terms with high important research topics. •í?Ñ+´ 6N?Ñ+´ degrees. Fig. 4 shows the internal structure of first detected topic Although the method is designed for Chinese patents, it is also cluster, the name of which is " - " applicable for other patent data, like USPTO and EPO. However, (Hydrogen outlet pipeline-liquid outlet pipeline). the discovered topics in the method are based on links, and we limit the number of keywords in clustered topics. In addition, threshold 3.4 Network values, such as document frequency and maximum external link Detected topics are used to construct networks (Fig. 5). In number in the experimental part is too naive. These human factors the network, nodes are detected topics, edges are external links will affect the clustering result, and maybe the topic clusters can between them. Isolated topics are not shown. With the information not cover relative technical terms. In the future, we will try more provided by the network, we can not only understand the relation specific methods to detect research topics for generating network, between topics but also find out the structure of domain technology. such as topic models based on statistics technology. In the theory In Fig. 5, the value of parameter Minimum External Links is set of co-word analysis, it is difficult to evaluate the accuracy of 4, i.e. only when the number of external links between any two topic selection and the effectiveness of topic network. Although topics are greater than 4, there is an edge between them. Under we believe researchers will be inspired with the topic network such condition, there are 10 sub-domain technology. Each sub- in technology innovation, the reliability of the method should be domain technology is composed of several connected topics. In a considered in the future. sub-domain technology, the importance of each topic is different. For example, in the sub-domain technology which contains topic 2, topic 24 is the joint of topic 2, 22, 24, 30, 39 and 48, so it may play Acknowledgments an essential role in the transformation of the network. Such topics The authors are grateful to Hailiang Technology Company for are called crossroads clusters. By identifying them, we can find the providing the Chinese word segmentation software for this re- important technology in the domain. search. This research was funded partially by "The study on the In Fig. 5, the arrow direction of an edge shows the extension disconnected problem of scientific collaboration network" which is relation between two topics. The topic nodes in the heads of sponsored by ISTIC Pre-research Foundation under grant number arrows are secondary clusters, while the topic nodes in the tails YY–201418 , the Key Technologies R&D Program of Chinese 12th are principal clusters. As stated before, the secondary clusters are Five-Year Plan (2011-2015): Key Technologies Research on Data Mining from the Multiple Electric Vehicle Information Sources under grant number 2013BAG06B01, and Key Technologies Re- search on Mining and Discovery from Patent Resources under grant number 2013BAH21B02. Authors are grateful to the Ministry of Science and Technology of China for financial support to carry out this work. 5. REFERENCES [1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003. [2] M. Callon, J. P. Courtial, and F. Laville. Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemsitry. Scientmetrics, (22):155–205, 1991. _ [3] M. Cobo, A. L pez-Herrera, E. Herrera-Viedma, and F. Herrera. Scimat: A new science mapping analysis software õ tool. Journal of the American Society for Information Science and Technology, 63(8):1609 1630, 2012. [4] N. Coulter, M. Ira, and K. Suresh. Software engineering as seen through its research literature: a study in co-word analysis. Journal of the American Society for Information Science, 49(13):1206–1223, 1998. [5] Q. H. Knowledge discovery through co-word analysis library trends. Library trends, 48(1):133–159, 1999. [6] H. Han and X. An. Chinese scientific and technical term extraction using c-value and unithood measure. Library and Information Service, 56 (19):85–89, 2012. [7] K. Kageura and B. Umino. Methods of automatic term recognition: A review. Terminology, 3(2):259–289, 1996. [8] H. Lee and D. H. Technological innovation of high-tech industry and patent policy-agent based simulation with double loop learning c,intelligent agents:specification,modeling and applications. In Proceedings of 4th Pacific Rim International Workshop on Multi-agents,PRIMA, 2001. [9] S. Lee, ByungunYoon, and Y. Park. An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation, 29:481–497, 2009. [10] H. Peters and A. F. van Raan. Co-word-based science maps of chemical engineering. part i: Representations by direct multidimensional scaling. Research Policy, 22(1):23–45, 1993. [11] Y.-H. Tseng, C.-J. Lin, and Y.-I. Lin. Text mining techniques for patent analysis. Information Processing & Management, 43(5):1216–1247, 2007. [12] N. J. Van Eck and L. Waltman. Bibliometric mapping of the computational intelligence field. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 15(05):625–645, 2007. [13] D. Ying, C. G. G., and F. Schubert. Bibliometric cartography of information retrieval research by using co-word analysis. Information Processing & Management, 37(6):817–842, 2001. [14] B. Yoon and Y. Park. A text-mining-based patent network:analytic tool for high-technology trend. The Journal of High Technology Management Research, 15(1):37–50, 2004. [15] B. Yoon and Y. Park. A systematic approach for identifying technology opportunities: Keyword-based morphology analysis. Technological Forecasting & Social Change, 72:145–160, 2005.