EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

A Novel Approach for Patent Similarity Measurement Based on Sequence Alignment

Xin An1, School of Economics & Management, Beijing Forestry University, Beijing, P.R. China, anxin@bjfu.edu.cn
Jinghong Li2, School of Economics & Management, Beijing Forestry University, Beijing, P.R. China, 724298617@qq.com
Shuo Xu3*, Research Base of Beijing Modern Manufacturing Development, College of Economics and Management, Beijing University of Technology, Beijing, P.R. China, xushuo@bjut.edu.cn
Liang Chen4, Institute of Scientific and Technical Information of China, Beijing, P.R. China, 25565853@qq.com
Sainan Pi5, School of Economics & Management, Beijing Forestry University, Beijing, P.R. China, silencepipi@bjfu.edu.cn
* Corresponding author

ABSTRACT

Patent similarity measurement, as one of the fundamental building blocks of patent analysis, can not only derive technical intelligence efficiently, but also detect the risk of infringement and evaluate whether an invention meets the criteria of novelty and innovation. However, traditional approaches implicitly make several assumptions, such as a bag of words in each component and the irrelevance of semantic direction. In order to relax these assumptions, this study proposes a novel approach on the basis of sequence alignment, which takes the semantic direction of each sequence structure and the word order information of each component into consideration. Meanwhile, an algorithm for calculating the global importance of each sequence structure is put forward. Finally, to verify the effectiveness and performance of the improved semantic analysis, a case study is conducted on the thin film head subfield in the field of hard disk drives. Extensive experimental results show that our approach is significantly more accurate and is not sensitive to several core parameters.

KEYWORDS

Patent similarity measurement, Semantic analysis, Entities and semantic relations, Sequence alignment

1 Introduction

According to many surveys by authorities, patents cover more than 90% of the latest technical information in the world, 80% of which is not published in other forms [5]. Thus, patent analysis is increasingly vital for mining technical intelligence. Patent similarity measurement, as one of the fundamental building blocks of patent analysis, can not only derive technical intelligence efficiently, but also detect the risk of infringement and evaluate whether an invention meets the criteria of novelty and innovation [13].

Nowadays, Subject-Action-Object (SAO) semantic analysis [2, 4, 9, 17] is the most widely used method to measure patent similarity, and it stresses the key concepts and functional relations. Here, function means “the action changing a feature of any object” [18]. That is to say, an SAO structure explicitly describes a relation between components in the patent documents. However, on closer examination, traditional SAO semantic analysis [2, 4, 9, 17] has several shortcomings. First, the semantic direction of each SAO structure and the word order in each component of an SAO structure are not taken into account. Second, each SAO structure intuitively carries a different amount of domain-specific information. In other words, the importance of each SAO structure should be different [13], yet SAO semantic analysis usually assigns equal weight to each SAO structure. Last but not least, SAO semantic analysis only focuses on functional relations and ignores the valuable technology intelligence underlying the non-functional relations, which are based on prepositions [1].

In order to overcome these issues, this article proposes an improved semantic analysis approach for assessing patent similarity on the basis of sequence alignment. Different from previous studies, sequence structures are used in this paper. A sequence structure can be explained as an “Entity(1) – Relation – Entity(2)” sequence. This type of structure embraces both functional and non-functional relations. For example, the phrases “…the seed film acting as a stop layer…” and “…planar layers on opposing sides of a pole piece…”, reflecting the form and spatial relation respectively, generate the two sequence structures “[seed film](E) – form(R) – [stop layer](E)” and “[planar layers](E) – spatial(R) – [pole piece](E)”. It is worth mentioning that “sequence” emphasizes two aspects in this study: the semantic direction of these functional and non-functional structures and the word order of each entity. Meanwhile, an algorithm for calculating the global importance of each sequence structure is put forward.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Related Work

Before delving into more specifics, a discussion of the literature pertinent to patent similarity measurement is in order.

2.1 Patent Similarity Measurement based on SAO structures

Some researchers have utilized SAO structures based on semantic similarity to evaluate the risk of patent infringement [2, 9], identify evolving technological trends for R&D planning [17], build a technology tree for technology planning [4] and so on. But in these approaches, each SAO structure is assigned the same weight. As an improvement, Wang et al. [13] constructed a DWSAO indicator that assigns different weights to SAO structures for measuring patent similarity. However, it neglects the influence of the number of SAO structures in each patent, so patents with high similarity values may actually not be similar. Besides, it is not a symmetrical indicator.

In addition, previous methods implicitly omit the word order information of each component in an SAO structure. The meaning of a phrase may change when its words are permuted. For example, the phrases “car gasoline” and “gasoline car” consist of the same words but in different orders. The former is a kind of fuel while the latter is a kind of car, so they should not be seen as the same thing.

Finally, as An et al. [1] mentioned, SAO analysis only focuses on functional relations between the components and ignores the valuable technology intelligence in the form of non-functional relations. They proposed an approach based on a preposition semantic network, where prepositions help to reveal the relations between technology-related keywords, and applied it to mine intelligence information in patents. Thus, prepositional semantic analysis can be viewed as complementary to SAO semantic analysis. This study integrates functional and non-functional relations, which are collectively referred to as sequence structures.

2.2 WordNet

In order to calculate lexical semantic similarity, WordNet is usually chosen as the source of word relations. WordNet is a lexical database which groups English concepts into sets of synonyms called “synsets” and connects the synsets in a hierarchical structure by means of hypernym/hyponym relations. Because of this property, WordNet is commonly used to calculate the semantic similarity of concepts. In this paper, the information-content (IC) based approach is used, which measures semantic similarity between concepts based on the notion of IC, calculated from the probability of encountering a concept [6, 7, 12]. The IC-based approach can be formally defined as follows [6]:

$$
sim(c_1, c_2) = \frac{2 \times IC(LCS)}{IC(c_1) + IC(c_2)} \tag{1}
$$

Here, $sim(c_1, c_2)$ is the similarity between two concepts $c_1$ and $c_2$, LCS is the Least Common Subsumer (hypernym) of the two concepts, and IC represents the information content value of a concept.

Note that a word may express different meanings (concepts) in different contexts, viz. polysemy. This paper uses the pair of concepts that yields the highest similarity between two words. Specifically, given that the synsets of word1 and word2 in WordNet are $Syn_1$ and $Syn_2$ respectively, the similarity of the two words is defined as follows.

$$
sim(word_1, word_2) = \max_{c_i \in Syn_1} \max_{c_j \in Syn_2} sim(c_i, c_j) \tag{2}
$$

3 Methodology

As shown in Figure 1, our research framework consists of four phases. The first is to extract sequence structures (functional and non-functional semantic relations) from patent documents through natural language processing (NLP) techniques and tools. In the second phase, the similarity between sequence structures is measured, taking the semantic direction of each sequence structure and the word order information of each component into consideration. The third phase is to calculate the global importance of each sequence structure based on the TV_LinkA algorithm [16]. Finally, the similarity between patents is assessed with the well-known optimal transportation problem solver [10, 14]. These phases are described in more detail in the following subsections.

Figure 1: The overall procedure for measuring patent similarity.

3.1 Sequence structures extraction

Recently, Chen et al. [3] proposed a promising patent information extraction framework, in which two deep-learning models are used for entity identification and semantic relation extraction respectively. This framework can be used here to extract the sequence structures mentioned in the patent documents. For more detailed descriptions, we refer the readers to Chen et al. [3].

3.2 Similarity between sequence structures

After extracting sequence structures, each patent can be represented by a collection of sequence structures, whose number differs from patent to patent. In this way, the patent similarity calculation problem is transformed into computing the similarity between collections of sequence structures. Before this, this subsection illustrates how to calculate the semantic similarity between two sequence structures, as shown in Figure 2. Since each sequence structure consists of three components, E(1) (Entity(1)), R (relation) and E(2) (Entity(2)), the key is how to align the components from different structures and even the words in each component.

Figure 2: The overall procedure for calculating the similarity between sequence structures.

As for the alignment of components, we argue that the semantic direction between E(1) and E(2), which can be judged by R (relation), is also very important. According to the relation types, we can define corresponding semantic directions to match the components. In this study, the similarity between two sequence structures is defined as the average of the similarity of the matched components as follows.

$$
sim(ERE_i, ERE_j) =
\begin{cases}
\dfrac{sim(E_i^{(1)}, E_j^{(1)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(2)})}{3}, & E_i^{(1)} \text{ matches } E_j^{(1)} \text{ and } E_i^{(2)} \text{ matches } E_j^{(2)} \\[2ex]
\dfrac{sim(E_i^{(1)}, E_j^{(2)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(1)})}{3}, & E_i^{(1)} \text{ matches } E_j^{(2)} \text{ and } E_i^{(2)} \text{ matches } E_j^{(1)}
\end{cases}
\tag{3}
$$

Here, $sim(ERE_i, ERE_j)$, ranging from 0 to 1, represents the similarity between $ERE_i$ and $ERE_j$. The larger this index is, the greater the similarity between the sequence structures is. $sim(E_i^{(1)}, E_j^{(1)})$, $sim(R_i, R_j)$, $sim(E_i^{(2)}, E_j^{(2)})$, $sim(E_i^{(1)}, E_j^{(2)})$ and $sim(E_i^{(2)}, E_j^{(1)})$ denote the similarity between the matched components of $ERE_i$ and $ERE_j$.

Of course, there exist undirected and bidirectional relations. In these two cases, we cannot assert whether E(1) of one sequence structure matches E(1) or E(2) of the other. In this situation, Eq. (4) is used to calculate the similarity between two sequence structures.

$$
sim(ERE_i, ERE_j) = \max
\begin{cases}
\dfrac{sim(E_i^{(1)}, E_j^{(1)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(2)})}{3} \\[2ex]
\dfrac{sim(E_i^{(1)}, E_j^{(2)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(1)})}{3}
\end{cases}
\tag{4}
$$

Now there remains the question of how to align the words in the matched components. Here, the Needleman-Wunsch algorithm [8, 15] is utilized to construct the correspondences between the words of the two components. As a comparison, the method of Wang et al. [13] is considered, which adopts a Cartesian-product form of alignment. That is, each word from one component is aligned to each word from the other component. If the similarity between aligned words is greater than a threshold, these two words are deemed to be matched.

Example 1. Consider the two entities “car gasoline” and “gasoline car”. Following Wang et al. [13], one can generate the correspondences of words as shown in Figure 3 (a), and the similarity between these two entities is 1.0000. It is counter-intuitive for the two entities to have such a high degree of similarity, since the former is a kind of fuel while the latter is a type of car. In our opinion, the main reason for this counter-intuitive similarity is that Wang et al. [13] omit the word order information. Figure 3 (b)-(c) illustrates the alignment of words in the entities “car gasoline” and “gasoline car” based on our approach, in which the symbol “_” denotes a gap. When a word corresponds to “_”, the resulting similarity is regarded as zero. Thus, the similarity between the two entities is the average of the similarity of the aligned words, that is, 0.3333. Compared to Wang et al. [13], this result seems more realistic and credible.

Figure 3: The correspondence of words in the entities “car gasoline” and “gasoline car”.

To show the procedure of calculating the similarity between two sequence structures more clearly, Figure 4 illustrates an example.

Example 2. One sequence structure is “[insulating material](E)-partof(R)-[planar layers](E)”, which means that “insulating material” is a whole and “planar layers” is part of it, and another is “[seed film](E)-form(R)-[stop layer](E)”, which means that “stop layer” is a whole or a product and “seed film” is part of it or the material it is made of. We can define the semantic direction of the former as “insulating material ← planar layers”, and that of the latter as “seed film → stop layer”. Hence, “insulating material” and “stop layer” are homogeneous components which can be considered to match, and so are “planar layers” and “seed film”. After matching the components, we use the Needleman-Wunsch algorithm to align words and then calculate the similarity between the aligned components. The similarity between two sequence structures is the average of the similarity of the aligned components.

Figure 4: The procedure of calculating the similarity of Example 2.

3.3 Weight estimation of sequence structures of each patent

Based on the idea that each sequence structure carries a different amount of domain-specific information, this paper introduces a new method to calculate the global importance of each component of the sequence structures based on the TV_LinkA algorithm [16]. First, the network $\mathcal{G}(\mathcal{V}, \mathcal{E})$ is constructed, where $\mathcal{V}$ is the set of nodes, which consist of abstracts, sentences and components (entities and relations), and $\mathcal{E}$ is the set of edges. Each abstract links to the sentences which originate from it, and each sentence links to the components which are extracted from it. Second, the values of sentence and component nodes are preset to 1. Third, an appropriate number of iterations is set. In each iteration, the value of each component node is updated to the sum of the values of the sentence nodes connected to it, and the updated values are standardized by the L2 norm. The value of each sentence node is updated in the same way. These steps are repeated to continuously update the node values until they are stable. At last, given that a term occurring a few times in domain-relevant sentences is more likely to be domain specific than one occurring many times in general sentences, the resulting value of each node is multiplied by its inverse document frequency (IDF).

After that, we obtain the global importance of each component over all patent documents. The importance of each sequence structure is then the average of the importance of its components. To let the weights lie between 0 and 1, the weights of all the sequence structures in the same patent are normalized so that their sum equals 1.

3.4 Patent similarity assessment

To move from the similarity matrix of sequence structures to the patent similarity while making full use of all the information, the patent similarity measurement problem can be transformed into the well-known optimal transportation problem [10, 14]. As shown in Figure 5, the distance matrix, which is obtained as 1 minus the similarity matrix, and the weight vectors are fed to an optimal transportation problem solver to obtain the shortest distance between two patents. The similarity of the two patents of interest is equal to 1 minus the shortest distance.

Figure 5: The procedure of calculating the similarity between two patents.

4 Case Study

4.1 Dataset

To evaluate the performance of our methodology, the annotated corpus of Chen et al. [3] (https://github.com/awesome-patent-mining/TFH_Annotated_Dataset) is used in this work. This dataset comes from the thin film head subfield in the field of hard disk drives. It contains 1,010 patent documents. Note that, in this dataset, there are 84 pairs of patents coming from the same patent family. That is, the two patents in each pair have the same abstract and an identical collection of sequence structures, so they should have a higher similarity than other pairs. These patents can be used to assess the effectiveness and performance of our method. If a method can better identify these 84 pairs of patents, its performance should be better.

Before comparing the sequence structures, we should judge the semantic direction according to the type of semantic relation between the components E(1) and E(2) in a sequence structure, so that they can be correctly matched to E(1) and E(2) of another sequence structure. As shown in Table 1, we have defined 4 types of semantic directions. If the sequence structures are both single-direction, we match the components E(1) and E(2) between the two sequence structures and apply Eq. (3) to calculate the similarity; otherwise Eq. (4) is applied.

Table 1: The semantic directions of each relation type.

No.  Relation Type        Semantic Direction
1    spatial relation     Undirected
2    part-of              E(1) ← E(2)
3    causative relation   E(1) ← E(2)
4    operation            E(1) ← E(2)
5    made-of              E(1) ← E(2)
6    instance-of          E(1) → E(2)
7    attribution          E(1) ← E(2)
8    generate             E(1) ← E(2)
9    purpose              E(1) ← E(2)
10   in-manner-of         E(1) ← E(2)
11   alias                Bidirectional
12   formation            E(1) → E(2)
13   comparison           Undirected
14   measurement          E(1) ← E(2)
15   others               Undirected

4.2 Experiment Setup

In this paper, we use WordNet as the source of word relations to calculate the semantic similarity of words, but unfortunately, some words in the dataset are not included in WordNet. To solve this problem, we apply the “gestalt pattern matching” algorithm [11] as a supplement, which computes the similarity of two strings as twice the number of matching characters divided by the total number of characters in the two strings.

In our methodology, two parameters need to be preset by the user. The first one is the number of iterations when calculating the weight of each sequence structure, and the second one is the gap penalty in the Needleman-Wunsch algorithm.

As for the number of iterations, one can determine whether the weights are stable by observing their trend after several iterations. Through the experiment, we find that the weights of components gradually stabilize after 4 iterations. Thus, the number of iterations is fixed to 10 in this article.

As for the gap penalty, to assess its impact on patent similarity, we choose multiple values for comparison: -0.05, -0.1, -0.15, -0.2 and -0.3. We find that, no matter which value is chosen, the word alignment, the patent similarity matrix and the patent similarity are not affected. Hence, the gap penalty is set to -0.05 in this paper.

4.3 Experimental results and discussions

To verify the effectiveness and performance of our approach, its results are compared with those of Wang et al. [13].

Figure 6 shows the results of the two approaches. Each patent is compared with all other patents, and the patents with the Top 1 (@1), Top 2 (@2), Top 3 (@3), Top 4 (@4) and Top 5 (@5) highest similarity are chosen to form 5 collections; we then count how many of the 84 pairs of patents are covered. If we select the Top 1 highest similarity of each patent, our method obtains 54 pairs of patents that come from the same patent family when the weights are determined by the weighting algorithm (Section 3.3), while 58 pairs are obtained by our approach with equal weights. The DWSAO analysis recognizes none of them. If the Top 2 collection is considered, our weighted and non-weighted versions contain 70 pairs and 78 pairs respectively, while the DWSAO analysis only identifies 2 pairs. When we enlarge to the Top 5 highest similarity of each patent, the weighted version identifies 81 pairs and the non-weighted version recognizes all 84 pairs, while only 3 pairs are identified by the DWSAO analysis.

Figure 6: The performance of our approach and DWSAO method.

On this dataset, our patent similarity measurement is significantly more accurate than the DWSAO approach. Meanwhile, a significant advantage of the improved semantic analysis is that the results are not sensitive to several core parameters. However, the method with different weights does not perform as well as the method with equal importance. In our opinion, the main reason is that the weighting algorithm actually considers the importance of each sequence structure in the global context, not its importance in the local context (i.e., each patent). In the near future, a local weighting method will be further investigated.

5 Conclusion

This study proposes an improved semantic analysis for assessing patent similarity on the basis of entities and semantic relations (functional and non-functional relations), which takes the semantic direction of each sequence structure and the word order information of each component into consideration. Meanwhile, we introduce a new method to calculate the global importance of each sequence structure. To verify the effectiveness and performance of the improved semantic analysis, a case study on patent similarity measurement in the thin film head subfield in the field of hard disk drives was conducted. Extensive experimental results demonstrate that our patent similarity measurement is significantly more accurate. Meanwhile, a significant advantage of the improved semantic analysis is that the results are not sensitive to several core parameters. However, the method with different weights does not perform as well as the method with equal importance. In our opinion, the main reason is that the weighting process actually considers the importance of each sequence structure in the global context, not its importance in the local context (i.e., each patent). In the near future, a local weighting method will be further investigated.

ACKNOWLEDGMENTS

This research received financial support from the Social Science Foundation of Beijing Municipality under grant number 17GLB074, and the Natural Science Foundation of Guangdong Province (Grant Number 2018A030313695).

REFERENCES

[1] An, J., Kim, K., Mortara, L., & Lee, S. (2018). Deriving technology intelligence from patents: Preposition-based semantic analysis. Journal of Informetrics, 12(1), 217-236. doi:10.1016/j.joi.2018.01.001
[2] Bergmann, I., Butzke, D., Walter, L., Fuerste, J. P., Moehrle, M. G., & Erdmann, V. A. (2008). Evaluating the risk of patent infringement by means of semantic patent analysis: the case of DNA chips. R&D Management, 38(5), 550-562.
[3] Chen, L., Xu, S., Zhu, L., Zhang, J., Lei, X., & Yang, G. (2020). A deep learning based method for extracting semantic information from patent documents. Scientometrics.
[4] Choi, S., Park, H., Kang, D., Lee, J. Y., & Kim, K. (2012). An SAO-based text mining approach to building a technology tree for technology planning. Expert Systems with Applications, 39(13), 11443-11455. doi:10.1016/j.eswa.2012.04.014
[5] Zha, X., & Chen, M. (2010). Study on early warning of competitive technical intelligence based on the patent map. Journal of Computers, 5(2). doi:10.4304/jcp.5.2.274-281
[6] Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, pp. 296-304.
[7] Jiang, J. J., & Conrath, D. W. (1997). Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. arXiv: Computation and Language.
[8] Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.
[9] Park, H., Yoon, J., & Kim, K. (2012). Identifying patent infringement using SAO based semantic technological similarities. Scientometrics, 90(2), 515-529.
[10] Rachev, S. T. (1998). In L. Ruschendorf (Ed.), Mass transportation problems: Volume I: Theory (probability and its applications). New York, NY: Springer.
[11] Ratcliff, J. W., & Metzener, D. E. (1988). Pattern-matching-the gestalt approach. Dr Dobbs Journal, 13(7), 46.
[12] Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. International Joint Conference on Artificial Intelligence, 448-453.
[13] Wang, X., Ren, H., Chen, Y., Liu, Y., Qiao, Y., & Huang, Y. (2019). Measuring patent similarity with SAO semantic analysis. Scientometrics, 121(1), 1-23. doi:10.1007/s11192-019-03191-z
[14] Xu, S., Zhai, D., Wang, F., An, X., Pang, H., & Sun, Y. (2019). A novel method for topic linkages between scientific publications and patents. Journal of the Association for Information Science and Technology, 70(9), 1026-1042. doi:10.1002/asi.24175
[15] Xu, S., Zhu, L., Qiao, X., & Xue, C. (2009). A novel approach for measuring Chinese terms semantic similarity based on pairwise sequence alignment. In Proceedings of the 5th International Conference on Semantics, Knowledge and Grid (pp. 92-98). IEEE.
[16] Yang, Y., Lu, Q., & Zhao, T. (2010). A delimiter-based general approach for Chinese term extraction. Journal of the American Society for Information Science and Technology, 61(1), 111-125. doi:10.1002/asi.21221
[17] Yoon, J., & Kim, K. (2011). Identifying rapidly evolving technological trends for R&D planning using SAO-based semantic patent networks. Scientometrics, 88(1), 213-228. doi:10.1007/s11192-011-0383-0
[18] Savransky, S. D. (2000). Engineering of creativity: Introduction to TRIZ methodology of inventive problem solving. Boca Raton, FL: CRC Press.
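
Illustrative sketch for Section 3.2. The following minimal Python sketch shows how the word alignment of Example 1 and the direction-aware matching of Eqs. (3) and (4) might be computed. It is not the authors' implementation. The names `toy_word_sim`, `align_words`, `entity_sim`, `structure_sim` and `DIRECTION` are hypothetical, the exact-match word similarity is only a stand-in for the WordNet-based similarity of Eq. (2), comparing relation labels with that same function is an assumption (the paper does not spell out how sim(R_i, R_j) is obtained), and the rule "same direction gives the straight match, opposite direction gives the crossed match" is our reading of Example 2. Only a subset of Table 1 is encoded.

```python
from typing import Callable, List, Optional, Tuple

WordSim = Callable[[str, str], float]
Structure = Tuple[List[str], str, List[str]]  # (E1 words, relation label, E2 words)

def toy_word_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for Eq. (2): 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0

def align_words(x: List[str], y: List[str], sim: WordSim = toy_word_sim,
                gap: float = -0.05) -> List[Tuple[Optional[str], Optional[str]]]:
    """Needleman-Wunsch global alignment of two word lists; None marks a gap."""
    n, m = len(x), len(y)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sim(x[i - 1], y[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim(x[i - 1], y[j - 1]):
            pairs.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif j == 0 or (i > 0 and score[i][j] == score[i - 1][j] + gap):
            pairs.append((x[i - 1], None))
            i -= 1
        else:
            pairs.append((None, y[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs

def entity_sim(x: List[str], y: List[str], sim: WordSim = toy_word_sim,
               gap: float = -0.05) -> float:
    """Average similarity over aligned positions; a gap contributes zero."""
    pairs = align_words(x, y, sim, gap)
    if not pairs:
        return 0.0
    return sum(sim(a, b) if a is not None and b is not None else 0.0
               for a, b in pairs) / len(pairs)

# Subset of Table 1; unknown labels are treated as undirected (assumption).
DIRECTION = {"part-of": "<-", "formation": "->",
             "spatial relation": "none", "alias": "both"}

def structure_sim(s: Structure, t: Structure, sim: WordSim = toy_word_sim,
                  gap: float = -0.05) -> float:
    """Eq. (3) when both relations are single-direction, Eq. (4) otherwise."""
    (e1s, rs, e2s), (e1t, rt, e2t) = s, t
    r = sim(rs, rt)  # assumption: relation labels compared like words
    straight = (entity_sim(e1s, e1t, sim, gap) + r + entity_sim(e2s, e2t, sim, gap)) / 3
    crossed = (entity_sim(e1s, e2t, sim, gap) + r + entity_sim(e2s, e1t, sim, gap)) / 3
    ds, dt = DIRECTION.get(rs, "none"), DIRECTION.get(rt, "none")
    if ds in ("<-", "->") and dt in ("<-", "->"):
        return straight if ds == dt else crossed
    return max(straight, crossed)

print(align_words(["car", "gasoline"], ["gasoline", "car"]))
print(round(entity_sim(["car", "gasoline"], ["gasoline", "car"]), 4))  # 0.3333
```

With the toy similarity, “car gasoline” and “gasoline car” align as (_, gasoline), (car, car), (gasoline, _), which gives the value 0.3333 reported in Example 1. With a real word similarity, the alignment may change if the score of pairing the two different words outweighs the gap penalties.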