A Patent Semantic Representation Using Technical Compound Sentences
Shuxuan Xiang1, Jin Mao2,3,* and Gang Li2,3
1 Laboratory of Data Intelligence and Interdisciplinary Innovation, Nanjing University, Nanjing 210000, China
2 School of Information Management, Wuhan University, Wuhan 430072, China
3 Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China


Abstract
The claims of a patent define the scope of the exclusive rights to an invention and contain all essential technical features that reflect its novelty and non-obviousness. Current patent text mining methods have not fully leveraged patent claims, because they do not consider how technical features are expressed in the claims. In this study, we clarify the textual structure of patent claims and model the claims of a patent as a tree by capturing the dependency relationships among them. From this tree we derive technical compound sentences (TCS) and propose a novel patent semantic representation based on TCS. To evaluate the proposed representation, we apply relational and direct strategies of empirical evaluation to a dataset of USPTO patents. The results show that our TCS-based, quantity-quality-weighted representation outperforms other methods on the tasks of patent-to-patent (P2P) similarity and automated IPC symbol classification, which suggests that TCS enables more efficient use of the technical information in patent claims. The potential application of the novel representation in novelty analysis is discussed as well. This fundamental patent representation method based on TCS could unlock the value of patent claims as a technical information resource and has the potential to improve many downstream patent mining tasks.

Keywords
Claim tree, patent semantic representation, technical compound sentence


PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, July 27, 2023, Taipei, Taiwan.
* Corresponding author.
Email: xsx@smail.nju.edu.cn (S. Xiang); danveno@163.com (J. Mao); imiswhu@aliyun.com (G. Li)
ORCID: 0000-0002-3259-7169 (S. Xiang); 0000-0001-9572-6709 (J. Mao); 0000-0002-8336-4891 (G. Li)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction
Patent documents are valuable resources for technology text mining. As a combination of legal and technical terms, patent text differs significantly from other types of documents such as scientific articles [1, 2]. The characteristics of patent text should therefore be considered and utilized in patent text mining. To this end, recent patent mining techniques have increasingly employed methods such as information fusion and text reorganization [3, 4]. As an important element of a patent document, the patent claims outline the scope of an invention's exclusive rights and include all essential technical elements that demonstrate its novelty and non-obviousness. Patent claims have been exploited in many patent mining applications, including patent infringement detection [5, 6, 7], patent evaluation [8, 9, 10], patent classification and clustering [11, 12, 13], and patent information representation. It is therefore essential to design text processing methods for patent claims that fully leverage their features. However, current studies have neither clarified the textual structure of patent claims nor designed improved methods to deal with them. In this study, we propose technical compound sentences (TCS) as a way to structure patent claims, and then apply them to design a novel patent semantic representation. We evaluate the proposed representation on a patent dataset. This fundamental patent representation method based on TCS could unlock the resource value of patent claims and has the potential to improve many downstream patent mining tasks.

2. Related work
For patent semantic representations, either terms and phrases [14, 15] or the original text [16, 17, 18, 19, 20] are used as the input. Keyword extraction and subject-action-object (SAO) analysis are leveraged to describe the technologies embedded in patent texts. These methods, however, may be unable to capture the relationships among technical concepts and might overlook some technical specifics. With the advancement of NLP techniques, the original text may be a superior option in terms of information integrity. The title and abstract of a patent are desirable sources of technical information, yet the claims alone are able to achieve state-of-the-art results [12]. Recently, a growing body of research has concentrated on applying patent claims in patent semantic representation because of their careful drafting [3, 14, 18, 19, 20, 21]. Yet the value of the characteristics of patent claims for NLP tasks is not always recognized, and the particularities of patent claims are not dealt with properly. There have been some


further studies that optimize the input by attending to characteristics that distinguish patent text from other text types, such as information enhancement with patent citations [3] or input transformation according to claim structure [20]. These methods leverage the idiosyncrasies of claim text to some extent. To our knowledge, little research on patent semantic representation utilizes the specific structure and internal logic of the technical information within patent claims. We therefore contribute to the research on patent semantic representation by providing an embedding method that can capture the nuanced internal logic of patent claims.

3. A representation using technical compound sentence
3.1. The tree structure of patent claims
The claims of a patent can be classified into independent claims and dependent claims. Independent claims describe different embodiments or aspects, uses, or methods of producing the invention. Dependent claims refer back to another claim or claims in the same application, further limiting the scope and completing the description with more details. The technological embodiments of dependent claims are embedded in the independent claims. With such a structure, the patent claims can be modeled as a tree. Typically, each patent claim is provided as a separate numbered sentence, and the referenced claim is easily identified in the sentence. In principle, it is therefore easy to identify the dependencies among patent claims and construct the tree structure of the claims [22, 23, 24]. We refer to this tree structure as the claim tree. In a claim tree, a claim that follows a serial dependency refers to the immediately preceding claim, and a claim that follows a parallel dependency refers to a claim or claims before the preceding one. Serial dependencies between claims add to the depth of the tree and parallel dependencies add to its breadth, resulting in varying structures.

3.2. Construction of Technical Compound Sentence
The logical connections between the technicalities embodied in the claims are reflected by the dependencies of the claims. Therefore, a path from the root to a leaf node in the claim tree denotes a chain of claims that together provide a full statement of an aspect, use, or method of fabricating the invention. A technical compound sentence (TCS) is constructed by combining the claims on the path in sequence. It captures the progressive and explanatory relationships among claims, as well as the superordinate and subordinate relationships between technical concepts and technicalities. Furthermore, TCS disambiguates claims that follow a serial dependency from claims that follow a parallel dependency. Claims following a serial dependency add to the length of a TCS, i.e., the volume of technicalities in a full description. Claims following a parallel dependency add to the number of TCSs, i.e., they generalize and thus expand the scope of a patent. As shown in Figure 1, the example patent claims can be broken down into 12 TCSs, each of which consists of 5 claims.

Figure 1: Claim Tree and TCS

3.3. Patent Representation Learning using TCS
We develop a method for the semantic representation of patents based on technical compound sentences (TCS). The embedding vector of a patent is the weighted average of the embedding vectors of its TCSs, where the weights are based on the quantity Q(s) and quality F(s) of the information each TCS contains. The representation is obtained through

$$\vec{P} = \frac{1}{\sum_{s \in S} W(s)} \sum_{s \in S} \vec{s} \times Q(s) \times F(s) \qquad (1)$$

where S is the set of TCSs of the patent and W(s) is the weight of TCS s. A patent claim can be represented as a graph whose nodes are terms of the claim. The graph-of-words of a patent claim C is defined as G = (V, E), where V is the set of nodes representing the nouns and verbs of C and E is the set of edges representing the co-occurrence of words within a window of size 1. The information quantity Q(s) of a TCS is determined by the cover of level H(s) and the cover of breadth R(s) of the claims it includes. The cover of level H(s) is the maximum depth in the claim tree of the claims that form the TCS; a greater depth corresponds to more technical details. The cover of breadth R(s) is measured by the radius of the subgraph $G_s$ of the TCS, which describes the scope of the technical information the TCS contains. The information quantity Q(s) is calculated with

$$Q(s) = H(s) \times R(s) \qquad (2)$$
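
The claim tree of Section 3.1 and the TCS construction of Section 3.2 can be illustrated with a minimal Python sketch. It assumes that every dependent claim names its parent with the phrase "claim N" and, as a simplification, that only the first referenced claim is used as the parent, so multiple dependencies are not handled. The toy claims, the regular expression, and the helper names (`parent_of`, `build_claim_tree`, `enumerate_tcs`) are illustrative assumptions and not part of the original method. The cover of level H(s) is taken here as the length of the path, i.e., the root is counted as depth 1, which is also an assumption.

```python
import re
from collections import defaultdict

# Hypothetical toy claims; the wording is illustrative only.
claims = {
    1: "A quantum processor comprising a plurality of qubits.",
    2: "The processor of claim 1, wherein the qubits are superconducting.",
    3: "The processor of claim 2, further comprising a tunable coupler.",
    4: "The processor of claim 1, wherein the qubits are trapped ions.",
}

def parent_of(text):
    """Return the first referenced claim number, or None for an independent claim."""
    match = re.search(r"\bclaims?\s+(\d+)", text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

def build_claim_tree(claims):
    """Map each claim to its children; independent claims become roots."""
    children = defaultdict(list)
    roots = []
    for number in sorted(claims):
        parent = parent_of(claims[number])
        if parent is None:
            roots.append(number)
        else:
            children[parent].append(number)
    return roots, children

def enumerate_tcs(roots, children):
    """Each root-to-leaf path of claim numbers corresponds to one TCS."""
    paths = []

    def walk(node, prefix):
        path = prefix + [node]
        if not children.get(node):       # leaf: the path is complete
            paths.append(path)
        for child in children.get(node, []):
            walk(child, path)

    for root in roots:
        walk(root, [])
    return paths

roots, children = build_claim_tree(claims)
tcs_paths = enumerate_tcs(roots, children)
tcs_texts = [" ".join(claims[n] for n in path) for path in tcs_paths]
levels = [len(path) for path in tcs_paths]   # H(s), with the root at depth 1

print(tcs_paths)  # [[1, 2, 3], [1, 4]]
print(levels)     # [3, 2]
```

In this toy example, claims 2 and 3 extend one TCS serially, while claim 4 opens a second, parallel TCS.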


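
Equation (2) can be made concrete with a small Python sketch that builds a graph-of-words and measures its radius by breadth-first search. Several choices here are assumptions made only for illustration: a short stop-word list stands in for the noun and verb filter (a part-of-speech tagger would be needed for the real thing), the window of size 1 is read as linking adjacent retained terms, and the depth passed in as H(s) counts the root claim as depth 1. All function names are hypothetical.

```python
from collections import deque

# Crude stand-in for a noun/verb filter (assumption, not the paper's tagger).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "wherein", "is", "are"}

def terms(text):
    """Lower-case, strip punctuation, and drop stop words."""
    tokens = [tok.strip(".,;:()").lower() for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

def graph_of_words(term_seq):
    """Undirected graph linking terms that are adjacent in the sequence."""
    adj = {term: set() for term in term_seq}
    for a, b in zip(term_seq, term_seq[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def eccentricity(adj, source):
    """Largest shortest-path distance from source (graph assumed connected)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def radius(adj):
    """Graph radius: the smallest eccentricity over all nodes."""
    return min(eccentricity(adj, v) for v in adj)

def information_quantity(tcs_claims, depth):
    """Q(s) = H(s) * R(s) for one TCS given as a list of claim texts."""
    seq = [t for claim in tcs_claims for t in terms(claim)]
    return depth * radius(graph_of_words(seq))

tcs = [
    "A quantum processor comprising qubits.",
    "The processor of claim 1, wherein the qubits are superconducting.",
]
q = information_quantity(tcs, depth=2)
print(q)  # 4
```

Because the terms of a TCS are linked in reading order, the resulting graph is connected, so the radius is always defined in this sketch.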
As for the information quality F(s) of a TCS, the k-core approach is employed [25], which focuses on the cohesiveness and connections of nodes (terms). The $c_i$-core of G is a subgraph $G_{c_i}$ in which the degree of every node is greater than or equal to $c_i$. In $G_{c_i}$, for the edge $D(v_m, v_n)$ linking the terms $v_m$ and $v_n$ of G, the weight equals the number of co-occurrences of the two terms, and its core degree is $c_i$. The weight of the edge linking two terms and the core in which those two terms appear are combined to calculate the information quality F(s) as

$$F(s) = \sum_{i=1}^{k} \sum_{\substack{(v_m, v_n) \in s \\ (v_m, v_n) \in G_{c_i}}} D(v_m, v_n) \times c_i \qquad (3)$$

The TCSs are then embedded using a custom Bert+SimCSE-unsup model, and the claim representation is obtained by taking the weighted average of the TCS embeddings. The whole process is illustrated in Figure 2.

Figure 2: Patent Semantic Representation Using TCS

4. Experiments
4.1. Datasets
With the help of the Patent Public Search tool provided by the United States Patent and Trademark Office (USPTO), we gather the claims, descriptions, and IPC assignments of 2114 patents that were submitted between 2016 and 2017 and contained the terms "quantum computing", "quantum computer" and "quantum computation" in their abstracts.

4.2. Evaluation
We apply "relational" and "direct" methodologies to evaluate the TCS-based, quantity-quality-weighted representation for patents [18]. The former method compares the similarity of two items derived from the semantic representation with the similarity derived from regularly used observable metrics such as IPC assignments [26, 27]; the correlation between the two similarities is investigated. The latter method analyzes the representation's performance in predicting the associated IPC classes [7]. First, we demonstrate the benefit of TCS and the weighting strategy by comparing with: (i) the full text of the claims; (ii) the first claim; (iii) TCS + unweighted average; (iv) TCS + quantity-weighted average; (v) TCS + quality-weighted average. Note that the above methods share the Bert+SimCSE-unsup model for embeddings. For good measure, other baseline models include: (vi) PatentSBERTa [20]; (vii) Technological Signature [18]; (viii) Doc2vec [28]; (ix) tfidf-Mittens [29]; (x) Mittens+WR [30]. Each IPC code of a patent can be represented as a tree, because the IPC is a hierarchically organized taxonomy, and the IPC tree of a patent is built by additionally inserting a root node that unifies the trees of all assigned IPC codes. Dissimilarity space embedding (DSE) is adapted for IPC representation [26, 31]; it transforms the IPC tree into a vector space. Given a distance function d, the dissimilarity space embedding of an IPC tree is defined as

$$\phi_n : G \to \mathbb{R}^n, \qquad \phi_n(c) = (d(c_1, c), d(c_2, c), \ldots, d(c_n, c)) \qquad (4)$$

Tree edit distance (TED) is employed as the distance function. It is given by the minimal-cost sequence of operations (insertion, deletion, and relabeling) that transforms one tree into another. We then calculate similarity as the dot product of two representation vectors. In addition, the absolute value of the difference between 1 and the ratio of the two similarities (i.e., the similarity derived from the representation and the similarity derived from the IPC assignment), which takes the form of

$$\mu = \left| \frac{\vec{p}_1 \cdot \vec{p}_2}{\phi_n(c_1) \cdot \phi_n(c_2)} - 1 \right| \qquad (5)$$

is adopted in the variance analysis. Using TCS as the input format improves the overall performance, as illustrated by Table 1: relevance rises from 22.43% (full claim) and 24.55% (first claim) to 26.37% (TCS + unweighted average). The weighting of quantity and quality developed on TCS further raises relevance to 27.72%. As a result, weighted TCS increases the model's effectiveness on the task of P2P similarity, and the use of TCS alone is able to boost the performance of patent representation in an observable way. We apply a Z-test on $\mu$ to compare the average scores of two patent semantic representations, and thus to verify that the TCS-based embedding and the TCS-based weighting strategy perform better. As Tables 2-4 show, the one-sided p-values are all below 0.002, indicating that the null hypotheses are rejected and the differences across the models are not chance variations. We conclude that TCS facilitates more effective use of the technical information in patent claims and could be effective in organizing the technical information of patents. In addition, based on TCS, the weighting of quantity and quality yields a superior patent semantic representation, allowing the representation to maintain a balance


between highlighting the key details and elaborating the full scope.

Table 1
Performance of Patent Semantic Representations (i.)
Method                                  Relevance(%)    p-value
First claim                             24.55           0.0043
Full claim                              22.43           0.0032
TCS + unweighted average                26.37           0.0033
TCS + quantity weighted                 27.67           0.0032
TCS + quality weighted                  26.41           0.0031
TCS + quantity and quality weighted     27.72           0.0031
PatentSBERTa                            13.63           0.0035
Technological Signature                 17.90           0.0021
Doc2vec                                 21.72           0.0036
tfidf-Mittens                           19.25           0.0035
Mittens+WR                              22.16           0.0032

Table 2
Result of Z-Test (i.)
                       TCS + unweighted average    Full claim
Avg.                   0.5797                      0.6678
Std.                   1.1598                      1.5396
Z value                -18.4620
P value (one-sided)    0.0000

Table 3
Result of Z-Test (ii.)
                       TCS + unweighted average    First claim
Avg.                   0.5797                      0.5929
Std.                   1.1598                      1.3834
Z value                -2.9395
P value (one-sided)    0.0016

Table 4
Result of Z-Test (iii.)
                       TCS + weighted              TCS + unweighted
Avg.                   0.5409                      0.5797
Std.                   0.9152                      1.1598
Z value                -10.6267
P value (one-sided)    0.0000

Furthermore, we examine whether the generated vectors can function as inputs for automated IPC symbol classification of the main section (in this case, binary classification between section G and section H). An artificial neural network (ANN) is deployed [18], which takes the representations as input and predicts the main section of the patent. Table 5 shows that our method outperforms the baseline methods on this task, which indicates the capability of the presented method in semantic representation and supports the effectiveness of TCS and the weighting strategy.

Table 5
Performance of Patent Semantic Representations (ii.)
Method                                  Loss     Acc(%)   Pre(%)
TCS + quantity and quality weighted     0.489    74.65    66.67
PatentSBERTa                            0.611    65.45    64.60
Technological Signature                 0.605    69.18    62.50
Doc2vec                                 0.598    69.44    53.80
tfidf-Mittens                           0.597    72.05    66.29
Mittens+WR                              0.638    64.93    52.20

4.3. Application
We apply the technical compound sentence (TCS) to novelty analysis. Innovation consists in carrying out new combinations; it is fundamentally the combination of facts, concepts, techniques, theories, goals, etc. [32]. Thus, for novelty analysis, the combinations held by a patent are vital and should be considered when conducting patent semantic search. Patent claims define the boundary of an exclusive right granted by the patent office; put differently, each patent occupies a certain inventive space, made up of the protected parts of technologies, that excludes other inventions. A TCS derived from a patent claim tree naturally describes a relatively separate segment of the entire space the claims define, which means that it contains the implicit combinations of an aspect or method that the patent right intends to protect. Therefore, relevant patents can be located and identified by matching similar TCSs. By using TCS embeddings as the query, we are able to retrieve more of the relevant items that might be novelty-prejudicial to the target patent in novelty assessment. Thus, TCS could improve the recall of patent retrieval in patent semantic search for novelty analysis.
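
The weighting scheme of Equations (1) and (3) can be sketched in plain Python. Two readings are assumed here because the text leaves them open: the core degree of an edge is taken to be the smaller core number of its two endpoints, and the weight W(s) in Equation (1) is taken to be Q(s) × F(s). The function names and the toy inputs are hypothetical, and the TCS embeddings would in practice come from the sentence encoder rather than from hand-written vectors.

```python
from collections import Counter

def cooccurrence_weights(term_seq):
    """Count co-occurrences of adjacent terms (window of size 1)."""
    weights = Counter()
    for a, b in zip(term_seq, term_seq[1:]):
        if a != b:
            weights[tuple(sorted((a, b)))] += 1
    return weights

def core_numbers(edges):
    """Core number of each node, found by repeatedly peeling a minimum-degree node."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining, core, k = set(adj), {}, 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])            # core level never decreases while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

def information_quality(term_seq):
    """F(s): sum of edge weight times edge core degree (assumed reading of Eq. 3)."""
    weights = cooccurrence_weights(term_seq)
    core = core_numbers(weights)
    return sum(w * min(core[a], core[b]) for (a, b), w in weights.items())

def patent_vector(tcs_vectors, quantities, qualities):
    """Weighted average of TCS vectors, assuming W(s) = Q(s) * F(s)."""
    weights = [q * f for q, f in zip(quantities, qualities)]
    total = sum(weights)
    dim = len(tcs_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, tcs_vectors)) / total
            for i in range(dim)]

# Triangle a-b-c (core number 2) with a pendant node d (core number 1).
f = information_quality(["a", "b", "c", "a", "d"])
p = patent_vector([[1.0, 0.0], [0.0, 1.0]], quantities=[3, 1], qualities=[1, 1])
print(f, p)  # 7 [0.75, 0.25]
```

Under the other possible reading of Equation (3), where an edge is counted once in every nested core that contains it, the inner sum would accumulate c_i over all cores up to the edge's core degree instead of using the core degree alone.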


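
The relational evaluation of Section 4.2 can be sketched as follows. The function `mu` follows Equation (5) directly; `one_sided_z_test` is an assumed form of the comparison, namely an unpaired two-sample Z statistic on the mean of $\mu$ with a lower-tail p-value, since the text does not spell out the exact test setup or the sample size. The vectors and score lists are toy values, not data from the experiments.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mu(p1, p2, phi1, phi2):
    """Eq. (5): deviation of the similarity ratio from 1.

    p1, p2     : semantic representation vectors of two patents
    phi1, phi2 : dissimilarity space embeddings of their IPC trees
    Assumes the IPC-based similarity dot(phi1, phi2) is non-zero.
    """
    return abs(dot(p1, p2) / dot(phi1, phi2) - 1.0)

def one_sided_z_test(scores_a, scores_b):
    """Z statistic and lower-tail p-value for mean(scores_a) < mean(scores_b)."""
    standard_error = sqrt(variance(scores_a) / len(scores_a)
                          + variance(scores_b) / len(scores_b))
    z = (mean(scores_a) - mean(scores_b)) / standard_error
    return z, NormalDist().cdf(z)

# Toy example: one patent pair, then two small samples of mu scores.
example_mu = mu([1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [1.0, 0.0])
z, p_value = one_sided_z_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
print(example_mu)            # 0.5
print(round(z, 2), p_value < 0.001)
```

A smaller $\mu$ means the representation-based similarity is closer to the IPC-based similarity, so a negative Z value favours the first representation.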
5. Conclusion
A technical compound sentence (TCS) is composed of the set of claims that lie on a path from the root to a leaf node in a claim tree. The experimental findings demonstrate that employing TCS enhances the performance of patent semantic representation. This indicates the capability of TCS in organizing the technical information of patents. Additionally, the weighting of quantity and quality built on TCS achieves a balance between emphasizing the key information and elaborating the full scope, which further improves the semantic representation. In future work, we will further explore the uses of TCS in the field of patent text mining, aiming at efficient processing, interpretation, and utilization of patent texts.

References
[1] S. Casola, A. Lavelli, Summarization, simplification, and generation: The case of patents, Expert Systems with Applications 205 (2022).
[2] J. Wang, W. Lu, L. HanTong, A two-level parser for patent claim parsing, Advanced Engineering Informatics 29 (2015) 431–439.
[3] J. Qi, L. Lei, K. Zheng, X. Wang, Patent analytic citation-based vsm: Challenges and applications, IEEE Access 8 (2020) 17464–17476. doi:10.1109/ACCESS.2020.2967817.
[4] Y. Chi, H. Wang, Establish a patent risk prediction model for emerging technologies using deep learning and data augmentation, Advanced Engineering Informatics 52 (2022).
[5] C. Lee, B. Song, Y. Park, How to assess patent infringement risks: a semantic patent claim analysis using dependency relationships, Technology analysis and strategic management 25 (2013) 23–28.
[6] H. Jang, S. Kim, B. Yoon, An explainable ai (xai) model for text-based patent novelty analysis, Available at SSRN (2023). doi:10.2139/ssrn.4341594.
[7] L. Du, W. Liu, K. Xiao, S. Gao, Y. Han, Technical function-effect based patent multi-to-one negation game model, in: 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2022, pp. 1443–1448. doi:10.1109/CSCWD54268.2022.9776122.
[8] S. Wittfoth, Measuring technological patent scope by semantic analysis of patent claims–an indicator for valuating patents, World Patent Information 58 (2019).
[9] A. J. H, Modeling patent clarity, Research Policy 51 (2022).
[10] E. Novelli, An examination of the antecedents and implications of patent scope, Research Policy 44 (2015) 493–507.
[11] D. H. Milanez, L. I. L. de Faria, R. M. do Amaral, J. A. R. Gregolin, Claim-based patent indicators: A novel approach to analyze patent content and monitor technological advances, World Patent Information 50 (2017) 64–72.
[12] J. Lee, J. Hsiang, Patent classification by fine-tuning bert language model, World Patent Information 61 (2020).
[13] S. Huang, H. Ke, W. Yang, Structure clustering for chinese patent documents, Expert Systems with Applications 34 (2008) 2290–2297.
[14] Z. Qiu, Z. Wang, Construction and application of patent technical element dependency network, IEEE Transactions on Engineering Management (2022) 1–15. doi:10.1109/TEM.2022.3227175.
[15] S. Yun, W. Cho, C. Kim, S. Lee, Technological trend mining: identifying new technology opportunities using patent semantic analysis, Information Processing and Management 59 (2022).
[16] Z. Qiu, Z. Wang, What is your next invention? A framework of mining technological development rules and assisting in designing new technologies based on bert as well as patent citations, Computers in Industry 145 (2023).
[17] deGrazia Charles AW, J. P. Frumkin, N. A. Pairolero, Embracing invention similarity for the measurement of vertically overlapping claims, Economics of Innovation and New Technology 29 (2020) 113–146.
[18] D. S. Hain, R. Jurowetzki, T. Buchmann, P. Wolf, A text-embedding-based approach to measuring patent-to-patent technological similarity, Technological Forecasting and Social Change 177 (2022).
[19] L. Lei, J. Qi, K. Zheng, Patent analytics based on feature vector space model: A case of iot, 2019. arXiv:1904.08100.
[20] H. Bekamiri, D. S. Hain, R. Jurowetzki, Patentsberta: A deep nlp based hybrid model for patent distance and classification using augmented sbert, 2021. arXiv:2103.11933.
[21] T. Roh, Y. Jeong, B. Yoon, Developing a methodology of structuring and layering technological information in patent documents through natural language processing, Sustainability 11 (2017).
[22] Y. T. Demey, D. Golzio, Search strategies at the european patent office, World Patent Information 63 (2020).
[23] L. FuRen, C. K. Chen, L. SzuYin, A hybrid patent prior art retrieval approach using claim structure and description, 2014, pp. 231–248.
[24] J. Rossi, M. Wirth, E. Kanoulas, Query generation for patent retrieval with keyword extraction based on syntactic features, 2019. arXiv:1906.07591.


[25] H. Mirisaee, E. Gaussier, C. Lagnier, A. Guerraz, Terminology-based text embedding for computing document similarities on technical content, 2019. arXiv:1906.01874.
[26] K. Riesen, H. Bunke, Graph classification based on vector space embedding, International Journal of Pattern Recognition and Artificial Intelligence 23 (2009) 1053–1081.
[27] Y. M. G. Costa, D. Bertolini, A. S. B. Jr., G. D. C. Cavalcanti, L. E. S. Oliveira, The dissimilarity approach: a review, Artificial Intelligence Review 53 (2020) 2783–2808.
[28] J. H. Lau, T. Baldwin, An empirical evaluation of doc2vec with practical insights into document embedding generation, 2016. arXiv:1607.05368.
[29] N. Dingwall, C. Potts, Mittens: An extension of glove for learning domain-specialized representations, 2018. arXiv:1803.09901.
[30] E. Kawin, Unsupervised random walk sentence embeddings: A strong but simple baseline, in: Proceedings of the Third Workshop on Representation Learning for NLP, 2018, pp. 91–100. URL: https://aclanthology.org/W18-3012. doi:10.18653/v1/W18-3012.
[31] K. Frerich, M. Bukowski, S. Geisler, R. Farkas, On the potential of taxonomic graphs to improve applicability and performance for the classification of biomedical patents, Applied Sciences 11 (2023).
[32] H. Michael, T. Kiessling, M. Moeller, A view of entrepreneurship and innovation from the economist “for all seasons”: Joseph s. schumpeter, Journal of Management History 16 (2010) 527–531.

