A Patent Semantic Representation Using Technical Compound Sentences

Shuxuan Xiang1, Jin Mao2,3,* and Gang Li2,3

1 Laboratory of Data Intelligence and Interdisciplinary Innovation, Nanjing University, Nanjing 210000, China
2 School of Information Management, Wuhan University, Wuhan 430072, China
3 Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China

Abstract
The claims of a patent define the scope of exclusive rights to an invention and contain all essential technical features that reflect its novelty and non-obviousness. Current patent text mining methods have not fully leveraged patent claims, because they do not consider how technical features are expressed in the claims. In this study, we clarify the textual structure of patent claims and model the claims of a patent as a tree by capturing the dependency relationships among them. From this tree we derive technical compound sentences (TCS), and we then propose a novel patent semantic representation based on TCS. To evaluate the proposed representation, we apply relational and direct strategies of empirical evaluation to a USPTO dataset. The results show that our TCS-based, quantity-quality-weighted representation outperforms other methods on the tasks of patent-to-patent (P2P) similarity and automated IPC symbol classification, which suggests that TCS enables more efficient use of the technical information in patent claims. The potential application of the representation in novelty analysis is discussed as well. The fundamental patent representation method using TCS could unleash the value of patent claims as a technical information resource and has the potential to improve many subsequent patent mining tasks.

Keywords
Claim tree, patent semantic representation, technical compound sentence

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, July 27, 2023, Taipei, Taiwan.
* Corresponding author.
Email: xsx@smail.nju.edu.cn (S. Xiang); danveno@163.com (J. Mao); imiswhu@aliyun.com (G. Li)
ORCID: 0000-0002-3259-7169 (S. Xiang); 0000-0001-9572-6709 (J. Mao); 0000-0002-8336-4891 (G. Li)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Patent documents are valuable resources for technology mining. As a combination of legal and technical terms, patent text differs significantly from other types of documents such as scientific articles [1, 2]. The characteristics of patent text should therefore be considered and utilized in patent text mining. To this end, many recent patent mining techniques have increasingly employed methods such as information fusion and text reorganization [3, 4]. As an important element of a patent document, the patent claims outline the scope of an invention's exclusive rights and include all essential technical elements that demonstrate its novelty and non-obviousness. Patent claims have been exploited by many applications of patent mining, including patent infringement detection [5, 6, 7], patent evaluation [8, 9, 10], patent classification and clustering [11, 12, 13], and patent information representation. It is therefore an essential task to design text processing methods for patent claims that fully leverage their features. However, current studies have neither clarified the textual structure of patent claims nor designed improved methods to deal with them.

In this study, we propose technical compound sentences (TCS) as a method to structure patent claims, and we apply it to design a novel patent semantic representation. We evaluate the proposed representation on a patent dataset. The fundamental patent representation method based on TCS could unleash the resource value of patent claims and has the potential to improve many subsequent patent mining tasks.

2. Related work

For patent semantic representations, either terms and phrases [14, 15] or the original text [16, 17, 18, 19, 20] are used as the input. Keyword extraction and subject-action-object (SAO) analysis are leveraged to describe the technologies embedded in patent texts. These methods, however, may fail to capture the relationships among technical concepts and may overlook some technical specifics. With the advancement of NLP techniques, the original text may be the superior option in terms of information integrity. The title and abstract of a patent are desirable sources of technical information, yet the claims alone are able to achieve state-of-the-art results [12]. Recently, a growing body of research has concentrated on applying patent claims in patent semantic representation because of their careful drafting [3, 14, 18, 19, 20, 21]. Yet the value of the characteristics of patent claims for NLP tasks is not always recognized, and the particularities of patent claims are not dealt with properly. Some further studies optimize the input by attending to characteristics that distinguish patent text from other text types, such as information enhancement with patent citations [3] or input transformation according to claim structure [20]. These methods leverage the idiosyncrasies of claim text to some extent. To our knowledge, however, little research on patent semantic representation utilizes the specific structure and internal logic of the technical information within patent claims. We therefore contribute to the research on patent semantic representation by providing an embedding method that can capture the nuanced internal logic of patent claims.

3. A representation using technical compound sentence

3.1. The tree structure of patent claims

The claims of a patent can be classified into independent claims and dependent claims. Independent claims describe different embodiments or aspects, uses, or methods of producing the invention. Dependent claims refer back to another claim or claims in the same application in order to further limit the scope and complete the description with more details. The technological embodiments of dependent claims are embedded in the independent claims. With such a structure, the patent claims can be modeled as a tree. Typically, each patent claim is provided as a separate numbered sentence, and the referenced claim is easily identified in the sentence. In principle, it is therefore easy to identify the dependencies of patent claims and construct the tree structure of the claims [22, 23, 24]. We refer to this tree structure as the claim tree. In a claim tree, a claim that follows a serial dependency refers to the immediately preceding claim, whereas a claim that follows a parallel dependency refers to a claim or claims before the previous one. Serial dependencies between claims add to the depth of the tree and parallel dependencies add to its breadth, resulting in varying structures.

3.2. Construction of Technical Compound Sentence

The logical connections between the technicalities embodied in the claims are reflected by the dependencies of the claims. Therefore, a path from the root to a leaf node in the claim tree denotes a chain of claims that together provide a full statement of an aspect, use, or method of fabricating the invention. A technical compound sentence (TCS) is constructed by combining the claims on such a path in sequence. It is capable of grasping the progressive and explanatory relationships of claims, as well as the superior and subordinate relationships between technical concepts and technicalities. Furthermore, TCS makes it possible to distinguish claims that follow a serial dependency from claims that follow a parallel dependency. Claims that follow a serial dependency add to the length of a TCS, i.e., the volume of technical detail in a full description. Claims that follow a parallel dependency add to the number of TCSs, i.e., they generalize and thus expand the scope of the patent. As shown in Figure 1, the example patent claims can be broken down into 12 TCSs, each consisting of 5 claims.

Figure 1: Claim Tree and TCS

3.3. Patent Representation Learning using TCS

We develop a method for the semantic representation of patents based on the technical compound sentence (TCS). The embedding vector of a patent is the weighted average of the embedding vectors of its TCSs, where the weights are based on the quantity Q(s) and the quality F(s) of the information each TCS contains. The representation is obtained through

    \vec{P} = \frac{1}{\sum_{s \in S} W(s)} \sum_{s \in S} \vec{s} \times Q(s) \times F(s)    (1)

where S is the set of TCSs of the patent, \vec{s} is the embedding vector of TCS s, and W(s) is the normalizing weight of s.

A patent claim can be represented as a graph whose nodes are the terms of the claim. The graph-of-words of patent claim C is defined as G = (V, E), where V is the set of nodes representing the nouns and verbs of C, and E is the set of edges representing the co-occurrence of words within a window of size 1. The information quantity Q(s) of a TCS is determined by the cover of level H(s) and the cover of breadth R(s) of the claims it includes. The cover of level H(s) is the maximum depth in the claim tree of the claims that form the TCS; a greater depth corresponds to more technical detail. The cover of breadth R(s) is measured by the radius of the subgraph G_s of the TCS, which describes the scope of the technical information the TCS contains. The information quantity Q(s) is calculated as

    Q(s) = H(s) \times R(s)    (2)

For the information quality F(s) of a TCS, the k-core approach is employed [25], which focuses on the cohesiveness and connections of nodes (terms). The c_i-core of G is a subgraph G_{c_i} in which the degree of every node is greater than or equal to c_i. In G_{c_i}, the weight D(v_m, v_n) of the edge linking the terms v_m and v_n of G equals the number of co-occurrences of the two terms, and its core degree is c_i. The weight of the edge linking two terms and the core in which those two terms appear are combined to calculate the information quality F(s) as

    F(s) = \sum_{i=1}^{k} \sum_{\substack{(v_m, v_n) \in s \\ (v_m, v_n) \in G_{c_i}}} D(v_m, v_n) \times c_i    (3)

The TCSs are then embedded using a custom Bert+SimCSE-unsup model, and the claim representation is obtained by taking the weighted average of the TCS embeddings. The whole process is illustrated in Figure 2.

Figure 2: Patent Semantic Representation Using TCS

4. Experiments

4.1. Datasets

With the help of the Patent Public Search tool provided by the United States Patent and Trademark Office (USPTO), we gather the claims, descriptions, and IPC assignments of 2114 patents that were submitted between 2016 and 2017 and contained the terms "quantum computing", "quantum computer" and "quantum computation" in their abstracts.

4.2. Evaluation

We apply "relational" and "direct" methodologies to evaluate the TCS-based, quantity-quality-weighted representation for patents [18]. The former method assesses the similarity of two items both from the semantic representation and from regularly used observable metrics such as IPC assignments [26, 27], and then investigates the correlation between the two similarities. The latter method analyzes the performance of the representation in predicting the associated IPC classes [7].

First, we demonstrate the benefit of TCS and the weighting strategy by comparing with: (i.) full text of claim; (ii.) the first claim; (iii.) TCS + unweighted average; (iv.) TCS + quantity weighted average; (v.) TCS + quality weighted average. Note that all of the above methods share the Bert+SimCSE-unsup model for embeddings. For good measure, other baseline models include: (vi.) PatentSBERTa [20]; (vii.) Technological Signature [18]; (viii.) Doc2vec [28]; (ix.) tfidf-Mittens [29]; (x.) Mittens+WR [30].

Each IPC code of a patent can be represented by a tree, because the IPC is a hierarchically organized taxonomy, and the IPC tree of a patent is built by additionally inserting a root node that unifies the trees of all assigned IPC codes. The dissimilarity space embedding (DSE) is adapted for IPC representation [26, 31]; it transforms the IPC tree into a vector space. Given a distance function d, the dissimilarity space embedding of IPC is defined as

    \phi_n(c) : G \to \mathbb{R}^n, \quad \phi_n(c) = (d(c_1, c), d(c_2, c), \ldots, d(c_n, c))    (4)

Tree edit distance (TED) is employed as the distance function. It is given by the minimal-cost sequence of operations (insertion, deletion, and relabeling) that transforms one tree into another. We then calculate similarity as the dot product of two representation vectors. In addition, the absolute value of the difference between 1 and the ratio of the two similarities (i.e., the similarity derived from the representation and the similarity derived from the IPC assignment), which takes the form

    \mu = \left| \frac{\vec{p}_1 \cdot \vec{p}_2}{\phi_n(c_1) \cdot \phi_n(c_2)} - 1 \right|    (5)

is adopted in the variance analysis.

Using TCS as the input format considerably improves the overall performance, as illustrated by Table 1: relevance rises from 22.43% for the full claim and 24.55% for the first claim to 26.37% for the unweighted TCS average. The performance of the model is further enhanced by the quantity and quality weighting developed on TCS, which reaches 27.72%. As a result, weighted TCS increases the effectiveness of the model on the P2P similarity task, and the use of TCS alone is already able to boost the performance of the patent representation noticeably.

Table 1
Performance of Patent Semantic Representations (i.)

Method                                  Relevance(%)   p-value
First claim                             24.55          0.0043
Full claim                              22.43          0.0032
TCS + unweighted average                26.37          0.0033
TCS + quantity weighted                 27.67          0.0032
TCS + quality weighted                  26.41          0.0031
TCS + quantity and quality weighted     27.72          0.0031
PatentSBERTa                            13.63          0.0035
Technological Signature                 17.90          0.0021
Doc2vec                                 21.72          0.0036
tfidf-Mittens                           19.25          0.0035
Mittens+WR                              22.16          0.0032

We apply a Z-test on \mu to compare the average scores of two patent semantic representations, and thereby verify that the embedding using TCS and the TCS-based weighting strategy perform better. As Tables 2-4 show, the one-sided p-values are all below 0.01 (the largest, 0.0016, is for the comparison with the first claim), indicating that the null hypotheses are rejected and the differences across the models are not chance variations. We conclude that TCS facilitates more effective use of the technical information in the patent claims and could be effective in organizing the technical information of patents. In addition, the quantity and quality weighting based on TCS can yield a superior patent semantic representation, allowing the representation to maintain a balance between highlighting the key details and elaborating the full scope.

Table 2
Result of Z-Test (i.)

                       TCS + unweighted average   Full-claim
Avg.                   0.5797                     0.6678
Std.                   1.1598                     1.5396
Z value                -18.4620
P value (one-sided)    0.0000

Table 3
Result of Z-Test (ii.)

                       TCS + unweighted average   First claim
Avg.                   0.5797                     0.5929
Std.                   1.1598                     1.3834
Z value                -2.9395
P value (one-sided)    0.0016

Table 4
Result of Z-Test (iii.)

                       TCS + weighted   TCS + unweighted
Avg.                   0.5409           0.5797
Std.                   0.9152           1.1598
Z value                -10.6267
P value (one-sided)    0.0000

Furthermore, we examine whether the generated vectors can function as inputs for automated IPC symbol classification of the main section (in this case, binary classification of section G and section H). An artificial neural network (ANN) is deployed [18], which takes the representations as input and predicts the main section of the patent. Table 5 shows that our method outperforms the baseline methods on this task, which indicates the capability of the presented method in semantic representation and supports the effectiveness of TCS as well as the weighting strategy.

Table 5
Performance of Patent Semantic Representations (ii.)

Method                                  Loss    Acc(%)   Pre(%)
TCS + quantity and quality weighted     0.489   74.65    66.67
PatentSBERTa                            0.611   65.45    64.60
Technological Signature                 0.605   69.18    62.50
Doc2vec                                 0.598   69.44    53.80
tfidf-Mittens                           0.597   72.05    66.29
Mittens+WR                              0.638   64.93    52.20

4.3. Application

We apply the technical compound sentence (TCS) to novelty analysis. Innovation consists in carrying out new combinations; it is fundamentally the combination of facts, concepts, techniques, theories, goals, etc. [32]. For novelty analysis, the combinations held by a patent are therefore vital and should be considered when conducting patent semantic search. Patent claims define the boundary of an exclusive right granted by the patent office; put differently, each patent occupies a certain inventive space covering the protected parts of technologies, from which other inventions are excluded. A TCS derived from a patent claim tree naturally describes a relatively separate segment of the entire space the claims define, which means it contains the implicit combinations of an aspect or method the patent right intends to protect. Relevant patents can therefore be located and identified by matching similar TCSs. By using the TCS embedding as the query, we are able to retrieve more relevant items that might be novelty-prejudicial to the target patent. Thus, TCS could improve the recall of patent retrieval in patent semantic search for novelty analysis.

5. Conclusion

A technical compound sentence (TCS) is composed of the set of claims that lie on a path from the root to a leaf node in a claim tree. The experimental findings demonstrate that employing TCS enhances the performance of patent semantic representation, which indicates the capability of TCS in organizing the technical information of patents. Additionally, the quantity and quality weighting built on TCS achieves a balance between emphasizing the key information and elaborating the full scope, which further improves the semantic representation. For future work, we will further explore the uses of TCS in the field of patent text mining, attempting to achieve efficient processing, interpretation, and utilization of patent texts.

References

[1] S. Casola, A. Lavelli, Summarization, simplification, and generation: The case of patents, Expert Systems with Applications 205 (2022).
[2] J. Wang, W. Lu, L. HanTong, A two-level parser for patent claim parsing, Advanced Engineering Informatics 29 (2015) 431–439.
[3] J. Qi, L. Lei, K. Zheng, X. Wang, Patent analytic citation-based vsm: Challenges and applications, IEEE Access 8 (2020) 17464–17476. doi:10.1109/ACCESS.2020.2967817.
[4] Y. Chi, H. Wang, Establish a patent risk prediction model for emerging technologies using deep learning and data augmentation, Advanced Engineering Informatics 52 (2022).
[5] C. Lee, B. Song, Y. Park, How to assess patent infringement risks: a semantic patent claim analysis using dependency relationships, Technology analysis and strategic management 25 (2013) 23–28.
[6] H. Jang, S. Kim, B. Yoon, An explainable ai (xai) model for text-based patent novelty analysis, Available at SSRN (2023). doi:http://dx.doi.org/10.2139/ssrn.4341594.
[7] L. Du, W. Liu, K. Xiao, S. Gao, Y. Han, Technical function-effect based patent multi-to-one negation game model, in: 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2022, pp. 1443–1448. doi:10.1109/CSCWD54268.2022.9776122.
[8] S. Wittfoth, Measuring technological patent scope by semantic analysis of patent claims - an indicator for valuating patents, World Patent Information 58 (2019).
[9] A. J. H, Modeling patent clarity, Research Policy 51 (2022).
[10] E. Novelli, An examination of the antecedents and implications of patent scope, Research Policy 44 (2015) 493–507.
[11] D. H. Milanez, L. I. L. de Faria, R. M. do Amaral, J. A. R. Gregolin, Claim-based patent indicators: A novel approach to analyze patent content and monitor technological advances, World Patent Information 50 (2017) 64–72.
[12] J. Lee, J. Hsiang, Patent classification by fine-tuning bert language model, World Patent Information 61 (2020).
[13] S. Huang, H. Ke, W. Yang, Structure clustering for chinese patent documents, Expert Systems with Applications 34 (2008) 2290–2297.
[14] Z. Qiu, Z. Wang, Construction and application of patent technical element dependency network, IEEE Transactions on Engineering Management (2022) 1–15. doi:10.1109/TEM.2022.3227175.
[15] S. Yun, W. Cho, C. Kim, S. Lee, Technological trend mining: identifying new technology opportunities using patent semantic analysis, Information Processing and Management 59 (2022).
[16] Z. Qiu, Z. Wang, What is your next invention? - a framework of mining technological development rules and assisting in designing new technologies based on bert as well as patent citations, Computers in Industry 145 (2023).
[17] deGrazia Charles AW, J. P. Frumkin, N. A. Pairolero, Embracing invention similarity for the measurement of vertically overlapping claims, Economics of Innovation and New Technology 29 (2020) 113–146.
[18] D. S. Hain, R. Jurowetzki, T. Buchmann, P. Wolf, A text-embedding-based approach to measuring patent-to-patent technological similarity, Technological Forecasting and Social Change 177 (2022).
[19] L. Lei, J. Qi, K. Zheng, Patent analytics based on feature vector space model: A case of iot, 2019. arXiv:1904.08100.
[20] H. Bekamiri, D. S. Hain, R. Jurowetzki, Patentsberta: A deep nlp based hybrid model for patent distance and classification using augmented sbert, 2021. arXiv:2103.11933.
[21] T. Roh, Y. Jeong, B. Yoon, Developing a methodology of structuring and layering technological information in patent documents through natural language processing, Sustainability 11 (2017).
[22] Y. T. Demey, D. Golzio, Search strategies at the european patent office, World Patent Information 63 (2020).
[23] L. FuRen, C. K. Chen, L. SzuYin, A hybrid patent prior art retrieval approach using claim structure and description, 2014, pp. 231–248.
[24] J. Rossi, M. Wirth, E. Kanoulas, Query generation for patent retrieval with keyword extraction based on syntactic features, 2019. arXiv:1906.07591.
[25] H. Mirisaee, E. Gaussier, C. Lagnier, A. Guerraz, Terminology-based text embedding for computing document similarities on technical content, 2019. arXiv:1906.01874.
[26] K. Riesen, H. Bunke, Graph classification based on vector space embedding, International Journal of Pattern Recognition and Artificial Intelligence 23 (2009) 1053–1081.
[27] Y. M. G. Costa, D. Bertolini, A. S. B. Jr., G. D. C. Cavalcanti, L. E. S. Oliveira, The dissimilarity approach: a review, Artificial Intelligence Review 53 (2020) 2783–2808.
[28] J. H. Lau, T. Baldwin, An empirical evaluation of doc2vec with practical insights into document embedding generation, 2016. arXiv:1607.05368.
[29] N. Dingwall, C. Potts, Mittens: An extension of glove for learning domain-specialized representations, 2018. arXiv:1803.09901.
[30] E. Kawin, Unsupervised random walk sentence embeddings: A strong but simple baseline, in: Proceedings of the Third Workshop on Representation Learning for NLP, 2018, pp. 91–100. URL: https://aclanthology.org/W18-3012. doi:10.18653/v1/W18-3012.
[31] K. Frerich, M. Bukowski, S. Geisler, R. Farkas, On the potential of taxonomic graphs to improve applicability and performance for the classification of biomedical patents, Applied Sciences 11 (2023).
[32] H. Michael, T. Kiessling, M. Moeller, A view of entrepreneurship and innovation from the economist “for all seasons”: Joseph s. schumpeter, Journal of Management History 16 (2010) 527–531.
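The claim tree of Section 3.1 and the TCS construction of Section 3.2 can be illustrated with a minimal sketch. The input format (a dictionary from claim number to the claim it refers back to), the function names enumerate_tcs and compose_tcs, and the toy claim numbers are assumptions made for illustration; the sketch also assumes each dependent claim refers to exactly one earlier claim, so multiple dependent claims are not covered.

```python
def enumerate_tcs(parents):
    """Return every root-to-leaf path of the claim tree as a list of claim numbers.

    `parents` maps each claim number to the claim it refers back to, or to
    None for an independent claim (hypothetical input format).
    """
    children = {}
    for claim in sorted(parents):
        parent = parents[claim]
        if parent is not None:
            children.setdefault(parent, []).append(claim)

    paths = []

    def walk(claim, prefix):
        path = prefix + [claim]
        dependants = children.get(claim, [])
        if not dependants:              # leaf: the path is one complete TCS
            paths.append(path)
        for child in dependants:
            walk(child, path)

    for claim in sorted(parents):
        if parents[claim] is None:      # each independent claim roots a tree
            walk(claim, [])
    return paths


def compose_tcs(path, claim_texts):
    """Join the claim texts along one path, in order, into a single TCS string."""
    return " ".join(claim_texts[number] for number in path)


# Claims 2 and 3 follow a serial dependency; claim 4 follows a parallel one.
parents = {1: None, 2: 1, 3: 2, 4: 1, 5: None, 6: 5}
print(enumerate_tcs(parents))  # [[1, 2, 3], [1, 4], [5, 6]]
```

In this toy example the serial chain 1, 2, 3 lengthens one TCS, while the parallel claim 4 creates an additional TCS, which mirrors the depth and breadth distinction drawn in Section 3.1.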
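The weighted average of Eq. (1) in Section 3.3 can be sketched as follows. The paper does not spell out the normalizing weight W(s); this sketch assumes W(s) = Q(s) × F(s), so that the result is a proper weighted mean. The function name patent_embedding and the two-dimensional toy vectors are likewise illustrative only.

```python
def patent_embedding(tcs_vectors, quantities, qualities):
    """Weighted mean of TCS embeddings, with weight Q(s) * F(s) per TCS.

    Assumption: the normalizer W(s) of Eq. (1) equals Q(s) * F(s).
    """
    weights = [q * f for q, f in zip(quantities, qualities)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all TCS weights are zero")
    dim = len(tcs_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, tcs_vectors)) / total
        for i in range(dim)
    ]


vectors = [[1.0, 0.0], [0.0, 1.0]]
print(patent_embedding(vectors, quantities=[3, 1], qualities=[1, 1]))  # [0.75, 0.25]
```

A TCS with a larger quantity-quality product pulls the patent vector toward its own embedding, which is the intended effect of the weighting.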
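The information quantity Q(s) of Eq. (2) can be sketched with the standard library alone. Several points are assumptions rather than details from the paper: the token list is taken to be already restricted to nouns and verbs, the subgraph G_s is built directly from the tokens of the TCS, and H(s) is passed in as a plain integer. The toy tokens are invented for illustration.

```python
from collections import deque


def graph_of_words(tokens):
    """Undirected co-occurrence graph with a window of size 1.

    `tokens` is assumed to hold only the nouns and verbs of the TCS, in
    order. Returns {term: set of neighbouring terms}.
    """
    graph = {token: set() for token in tokens}
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def eccentricity(graph, source):
    """Largest shortest-path distance from `source`, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    if len(dist) != len(graph):
        raise ValueError("graph is not connected")
    return max(dist.values())


def radius(graph):
    """Graph radius: the smallest eccentricity over all nodes."""
    return min(eccentricity(graph, node) for node in graph)


def information_quantity(tokens, level):
    """Q(s) = H(s) * R(s), with H(s) supplied as `level`."""
    return level * radius(graph_of_words(tokens))


tokens = ["qubit", "couple", "resonator", "drive", "qubit", "measure"]
print(information_quantity(tokens, level=3))  # 6
```

Here the graph is a four-term cycle with one pendant term, so its radius is 2, and a TCS whose deepest claim sits at level 3 receives Q(s) = 6.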
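The k-core based information quality F(s) of Eq. (3) admits more than one reading, so the following is only one possible sketch. It assumes that cores are nested (an edge lying in the c-core also lies in every lower core) and therefore lets an edge whose deepest core is m contribute D × (1 + 2 + ... + m). It also assumes that node degree means the number of distinct neighbours. All function names and the toy tokens are hypothetical.

```python
from collections import Counter


def cooccurrence_edges(tokens):
    """Count adjacent co-occurrences; keys are unordered term pairs."""
    edges = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            edges[frozenset((left, right))] += 1
    return edges


def core_numbers(edges):
    """Core number of every node, by repeatedly peeling a minimum-degree node."""
    neighbours = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    degree = {node: len(adj) for node, adj in neighbours.items()}
    remaining = set(neighbours)
    core = {}
    current = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        current = max(current, degree[node])
        core[node] = current
        remaining.remove(node)
        for other in neighbours[node]:
            if other in remaining:
                degree[other] -= 1
    return core


def information_quality(graph_edges, tcs_edges):
    """Literal double sum of Eq. (3) under the nested-core assumption."""
    core = core_numbers(graph_edges)
    total = 0
    for edge in tcs_edges:
        if edge not in graph_edges:
            continue
        a, b = tuple(edge)
        deepest = min(core[a], core[b])   # deepest core containing the edge
        total += graph_edges[edge] * sum(range(1, deepest + 1))
    return total


tokens = ["qubit", "couple", "resonator", "drive", "qubit", "couple", "measure"]
edges = cooccurrence_edges(tokens)
print(information_quality(edges, set(edges)))  # 16
```

In the example the four cycle edges lie in the 2-core (weights 2, 1, 1, 1, each multiplied by 1 + 2) and the pendant edge lies only in the 1-core, giving 15 + 1 = 16. Under the alternative reading in which each edge counts once with its deepest core degree, the inner `sum(range(...))` would be replaced by `deepest`.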
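The dissimilarity space embedding of Eq. (4) and the gap measure of Eq. (5) in Section 4.2 can be sketched as below. The paper uses tree edit distance; to keep the sketch short, a much simpler stand-in distance is used (the number of nodes present in only one of two IPC trees, each tree given as a set of node labels), and this stand-in is not tree edit distance. Treating c_1, ..., c_n as a list of prototype trees, along with all names and toy values, is an assumption.

```python
def dissimilarity_embedding(item, prototypes, distance):
    """Eq. (4): the vector of distances from `item` to each prototype."""
    return [distance(prototype, item) for prototype in prototypes]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_gap(p1, p2, phi1, phi2):
    """Eq. (5): |(p1 . p2) / (phi1 . phi2) - 1|."""
    reference = dot(phi1, phi2)
    if reference == 0:
        raise ValueError("IPC-based similarity is zero; the ratio is undefined")
    return abs(dot(p1, p2) / reference - 1)


def node_set_distance(tree_a, tree_b):
    """Stand-in distance (NOT tree edit distance): nodes found in only one tree."""
    return len(tree_a ^ tree_b)


# Hypothetical IPC trees, each reduced to the set of its node labels.
ipc_a = {"G", "G06", "G06N"}
ipc_b = {"G", "G06", "G06F"}
ipc_c = {"H", "H01", "H01L"}
prototypes = [ipc_a, ipc_c]

phi_a = dissimilarity_embedding(ipc_a, prototypes, node_set_distance)
phi_b = dissimilarity_embedding(ipc_b, prototypes, node_set_distance)
print(phi_a, phi_b)  # [0, 6] [2, 6]
print(similarity_gap([0.6, 0.8], [0.8, 0.6], phi_a, phi_b))
```

Swapping node_set_distance for a real tree edit distance implementation would leave the other three functions unchanged, since the embedding only needs a callable that returns a number for a pair of trees.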