EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents


   A Novel Approach for Patent Similarity Measurement Based on
                       Sequence Alignment
Xin An1
School of Economics & Management, Beijing Forestry University, Beijing, P.R. China
anxin@bjfu.edu.cn

Jinghong Li2
School of Economics & Management, Beijing Forestry University, Beijing, P.R. China
724298617@qq.com

Shuo Xu3*
Research Base of Beijing Modern Manufacturing Development, College of Economics and Management, Beijing University of Technology, Beijing, P.R. China
xushuo@bjut.edu.cn

Liang Chen4
Institute of Scientific and Technical Information of China, Beijing, P.R. China
25565853@qq.com

Sainan Pi5
School of Economics & Management, Beijing Forestry University, Beijing, P.R. China
silencepipi@bjfu.edu.cn

* Corresponding author


ABSTRACT                                                                            Nowadays, Subject-Action-Object (SAO) semantic analysis
                                                                              [2, 4, 9, 17] is the most widely used method to measure patent
Patent similarity measurement, as one of the fundamental building             similarity, which stresses the key concepts and functional
blocks for patent analysis, not only can derive technical                     relations. By function, it means “the action changing a feature of
intelligence efficiently, but also can detect the risk of infringement        any object” [18]. That is to say, SAO structure explicitly describes
and evaluate whether the invention meets the criteria of novelty              a relation between components in the patent documents. However,
and innovation. However, traditional approaches implicitly make               on closer examination, one can see that traditional SAO semantic
several assumptions, such as bag of words in each component,                  analysis [2, 4, 9, 17] has several shortcomings. First, the semantic
semantic direction irrelevance and so on. In order to relax these             direction of each SAO structure and the word order in each
assumptions, this study proposes a novel approach on the basis of             component of a SAO structure are not taken into account. Second,
sequence alignment, which takes semantic direction of each                    intuitively, each SAO structure carries a different amount of
sequence structure and the word order information of each                     domain-specific information. To say it in another way, the
component into consideration. Meanwhile, an algorithm for                     importance of each SAO structure should be different [13]. But
calculating the global importance of each sequence structure is put           the SAO semantic analysis usually assigns equal weight to each
forward. Finally, to verify the effectiveness and performance of              SAO structure. Last but not least, the SAO semantic analysis only
the improved semantic analysis, a case study is conducted on the              focuses on the functional relations, but ignores the valuable
thin film head subfield in the field of hard disk drive. Extensive            technology intelligence underlying the non-functional relations,
experimental results show that our approach is significantly more             which are based on prepositions [1].
accurate and is not sensitive to several core parameters.                           In order to overcome these issues, this article proposes an
                                                                              improved semantic analysis approach for assessing patent
KEYWORDS                                                                      similarity on the basis of sequence alignment. Different from
Patent similarity measurement, Semantic analysis, Entities and                previous studies, the sequence structures are used in this paper. A
semantic relations, Sequence alignment                                        sequence structure can be explained as an “Entity(1) – Relation –
                                                                              Entity(2)” sequence. This type of structure embraces the functional
1 Introduction                                                                and non-functional relations. For example, the phrases, “…the
                                                                              seed film acting as a stop layer…” and “…planar layers on
According to many authoritative surveys, patents cover more than
                                                                              opposing sides of a pole piece…”, reflecting the form and spatial
90% of the latest technical information in the world, of which 80%
                                                                              relation respectively, will generate two sequence structures as
would not be published in other forms [5]. Thus, patent analysis
                                                                              “[seed film] (E) – form(R) – [stop layer] (E)” and “[planar layers](E)
is increasingly vital for mining technical intelligence. Patent
                                                                              – spatial(R) – [pole piece] (E)”. It is worth mentioning that the
similarity measurement, as one of the fundamental building blocks for
                                                                              “sequence” emphasizes two aspects in this study: the semantic
patent analysis, not only can derive technical intelligence
                                                                              direction of these functional and non-functional structures and the
efficiently, but also can detect the risk of infringement and
                                                                              word order of each entity. Meanwhile, an algorithm for
evaluate whether the invention meets the criteria of novelty and
                                                                              calculating the global importance of each sequence structure is put
innovation [13].
                                                                              forward.


      Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


2 Related Work                                                               concepts, and IC represents the information content value of the
                                                                             concepts.
Before delving into more specifics, discussion of the literature
                                                                                  Note that a word may express different meaning (concept) in
pertinent to patent similarity measurement is in order.
                                                                             different context, viz. polysemy. This paper uses the concepts
                                                                             corresponding to the highest similarity between two words. At
2.1 Patent Similarity Measurement based on SAO
                                                                             length, given that the synset of word1 and word2 in WordNet is
    structures                                                               Syn1 and Syn2 respectively, the similarity of two words can be
Some researchers utilized SAO structures based on semantic                   defined as follows.
similarity to evaluate the risk of patent infringement [2, 9],                    \[ sim(word_1, word_2) = \max_{c_i \in Syn_1} \max_{c_j \in Syn_2} sim(c_i, c_j) \]   (2)
identified the evolving technological trend for R&D planning

[17], built a technology tree for technology planning [4] and so
on. But in these approaches, each SAO structure is assigned the              3 Methodology
same weight. As an improvement, Wang et al. [13] have                        As shown in Figure 1, our research framework consists of four
constructed a DWSAO indicator through assigning different                    phases. The first is to extract sequence structures (functional and
weights to SAO structures for measuring patent similarity.                   non-functional semantic relations) from patent documents through
However, it neglects the influence of the number of SAO                      natural language processing (NLP) techniques and tools. At the
structures of patents, which may result in the phenomenon that               second phase, the similarity between sequence structures is
patents with high similarity values are actually not similar.                measured, which takes semantic direction of each sequence
Besides, it is not a symmetrical indicator.                                  structure and the word order information of each component into
     In addition, previous methods implicitly omit the word order            consideration. The third phase is to calculate the global
information of each component in a SAO structure. As we all                  importance of each sequence structure based on the TV_LinkA
know, the meaning of a phrase may be varied when the words are               algorithm [16]. Finally, the similarity between patents is assessed
permutated. For example, the phrases “car gasoline” and “gasoline            with the well-known optimal transportation problem solver [10,
car” both consist of the same words but in different orders. The             14]. These phases are described in more detail in the following
former is a kind of fuel while the latter is a kind of car, so               subsections.
they should not be seen as the same thing.
     Finally, just as An et al. [1] mentioned, the SAO analysis
only focuses on functional relations between the components, but
ignores the valuable technology intelligence in the form of non-
functional relations. They proposed an approach based on
preposition semantic network where prepositions aid in revealing
the relations between keywords related to technologies and
applied it to mine intelligence information in the patents. Thus,
prepositional semantic analysis can be viewed as
complementary to SAO semantic analysis. This study integrates
functional and non-functional relations, which are collectively
                                                                             Figure 1: The overall procedure for measuring patent
referred to as sequence structures.
                                                                             similarity.

2.2 WordNet
                                                                             3.1 Sequence structures extraction
In order to calculate lexical semantic similarity, WordNet is
usually chosen as the source of word relations. WordNet is a                 Recently, Chen et al. [3] have proposed a promising patent
lexical database which groups English concepts into sets of                  information extraction framework, where two deep-learning
synonyms called “synsets” and constructs the hierarchical                    models are respectively used for entity identification and semantic
structure to connect “synsets” by means of hypernym/hyponym                  relation extraction. This framework can be used here to extract the
relations. Just because of this property, WordNet is commonly                sequence structures mentioned in the patent documents. For more
used to calculate the semantic similarity of concepts. In this paper,        elaborate and detailed descriptions, we refer the readers to Chen et
the information-content (IC) based approach is used, which                   al. [3].
measures semantic similarity between concepts based on the
notion of IC that is calculated in accordance to the probability of          3.2 Similarity between sequence structures
encountering a concept [6, 7, 12]. The IC-based approach can be              After extracting sequence structures, each patent can be
formally defined as follows [6]:                                             represented by a collection of different number of sequence
                      \[ sim(c_1, c_2) = \frac{2 \times IC(LCS)}{IC(c_1) + IC(c_2)} \]   (1)        structures. In this way, patent similarity calculation problem can
                                                                                   be transformed to compute the similarity between the collections
      Here, 𝑠𝑖𝑚(𝑐1 , 𝑐2 ) is the similarity between two concepts 𝑐1          of sequence structures. Before this, this subsection illustrates how
and 𝑐2 . LCS is the Least Common Subsumer (hypernym) of two                  to calculate the semantic similarity between two sequence
                                                                             structures, as shown in Figure 2. Since each sequence structure




consists of three components: E(1) (Entity(1)), R (relation) and E(2)                                                         that Wang et al. [13] omits the word order information. Figure 3
(Entity(2)), the key is how to align the components from different                                                            (b)-(c) illustrates the alignment of words in the entities “car
structures and even the words in each component.                                                                              gasoline” and “gasoline car” based on our approach, in which the
                                                                                                                              symbol “_” denotes a gap. When a word corresponds with “_”, the
                                                                                                                              resulting similarity is regarded as zero. Thus, the similarity
                                                                                                                              between two entities is the average of the similarity of the aligned
                                                                                                                              words, that is, 0.3333. Compared to Wang et al. [13], this result
                                                                                                                              seems to be more realistic and credible.
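
The word alignment in Example 1 can be reproduced with a minimal sketch of Needleman-Wunsch global alignment. The function names, the gap penalty of -0.1, and the `toy_word_sim` function (1 for identical words, 0 otherwise) are illustrative assumptions; in the paper the word similarity comes from WordNet, so this is not the authors' implementation.

```python
import math
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # None marks a gap ("_")


def toy_word_sim(w1: str, w2: str) -> float:
    """Hypothetical stand-in for the WordNet-based word similarity."""
    return 1.0 if w1 == w2 else 0.0


def align_words(a: List[str], b: List[str],
                sim: Callable[[str, str], float] = toy_word_sim,
                gap: float = -0.1) -> List[Pair]:
    """Globally align two word sequences with Needleman-Wunsch."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and math.isclose(
                score[i][j], score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),
                abs_tol=1e-12):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or math.isclose(
                score[i][j], score[i - 1][j] + gap, abs_tol=1e-12)):
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def entity_similarity(a: List[str], b: List[str],
                      sim: Callable[[str, str], float] = toy_word_sim,
                      gap: float = -0.1) -> float:
    """Average similarity over aligned positions; a gap contributes zero."""
    pairs = align_words(a, b, sim, gap)
    if not pairs:
        return 0.0
    total = sum(sim(x, y) for x, y in pairs if x is not None and y is not None)
    return total / len(pairs)


print(entity_similarity(["car", "gasoline"], ["gasoline", "car"]))
```

Under these assumptions the alignment has three positions, of which only one pairs identical words, so the averaged similarity is 1/3, matching the 0.3333 reported in Example 1.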


Figure 2: The overall procedure for calculating the similarity
between sequence structures.                                                                                                  Figure 3: The correspondence of words in the entities “car
                                                                                                                              gasoline” and “gasoline car”.
     As for the alignment of components, we argue that the
semantic direction between E(1) and E(2), which can be judged by                                                                    To more understandably show the procedure of calculating
R (relation), is also very important. According to the relation                                                               the similarity between two sequences, Figure 4 illustrates an
types, we can define corresponding semantic directions to match                                                               example.
the components. In this study, the similarity between two                                                                           Example 2. One sequence structure is “[insulating
sequence structures is defined as the average of the similarity of                                                            material](E)-partof(R)-[planar layers] (E)” that means “insulating
the matched components as follows.                                                                                            material” is a whole and “planar layers” is part of it, and another
\[
sim(ERE_i, ERE_j) =
\begin{cases}
\dfrac{sim(E_i^{(1)}, E_j^{(1)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(2)})}{3}, & E_i^{(1)} \text{ matches } E_j^{(1)} \text{ and } E_i^{(2)} \text{ matches } E_j^{(2)} \\
\dfrac{sim(E_i^{(1)}, E_j^{(2)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(1)})}{3}, & E_i^{(1)} \text{ matches } E_j^{(2)} \text{ and } E_i^{(2)} \text{ matches } E_j^{(1)}
\end{cases}
\]   (3)
                                                                                                                              is “[seed film](E)-form(R)-[stop layer](E)” that means “stop layer” is
                                                                                                                              a whole or a product and “seed film” is part of it or the material
                                                                                                                              making of it. We can define the semantic direction of the former
                                                                                                                              as “insulating material ← planar layers”, and the latter as “seed
     Here, 𝑠𝑖𝑚(𝐸𝑅𝐸𝑖 , 𝐸𝑅𝐸𝑗 ) , ranging from 0 to 1, represents the
similarity between EREi and EREj. The larger this index is, the                                                               film → stop layer”. Hence, “insulating material” and “stop layer”
greater the similarity between the sequence structures is.                                                                    are the homogeneous components which can be considered to
$sim(E_i^{(1)}, E_j^{(1)})$, $sim(R_i, R_j)$, $sim(E_i^{(2)}, E_j^{(2)})$, $sim(E_i^{(1)}, E_j^{(2)})$ and
$sim(E_i^{(2)}, E_j^{(1)})$ denote the similarity between the matched
components of EREi and EREj.
                                                                                                                              match, so do “planar layers” and “seed film”. After matching the
                                                                                                                              components, we use the Needleman-Wunsch algorithm to align
                                                                                                                              words and then calculate the similarity between the aligned
                                                                                                                              components. The similarity between two sequence structures is
     Of course, there exist undirected and bidirectional relations.
                                                                                                                              the average of the similarity of the aligned components.
As for these two cases, we cannot assert whether E(1) of one
sequence structure matches with E(1) or E(2) of another. In this
situation, Eq. (4) is used to calculate the similarity between two
sequence structures.
\[
sim(ERE_i, ERE_j) = \max \left\{
\frac{sim(E_i^{(1)}, E_j^{(1)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(2)})}{3},\;
\frac{sim(E_i^{(1)}, E_j^{(2)}) + sim(R_i, R_j) + sim(E_i^{(2)}, E_j^{(1)})}{3}
\right\}
\]   (4)
                                                                                                                              Figure 4: The procedure of calculating the similarity of
      Now there remains how to align the words in the matched                                                                 example 2.
component. Here, the Needleman-Wunsch algorithm [8, 15] is
utilized to construct the correspondences of words in the
matched components. As a comparison, the method of Wang et al. [13]
                                                                                                                              3.3 Weight estimation of sequence structures of
is considered, which adopts the alignment form of Cartesian                                                                       each patent
product. That is, each word from one component is aligned to                                                                  Based on the concept that each sequence structure carries a different
each word from another component. If the similarity between                                                                   amount of domain-specific information, this paper introduces a
aligned words is greater than a threshold, these two words are                                                                new method to calculate the global importance of each component
deemed to be matched.                                                                                                         of sequence structures based on TV_LinkA algorithm [16]. First,
      Example 1. Consider two entities “car gasoline” and                                                                     the network 𝒢(𝒱, ℰ) is constructed, where 𝒱 is the set of nodes
“gasoline car”. Following Wang et al. [13], one can generate the                                                              which consist of abstracts, sentences and components (entities and
correspondences of words as shown in Figure 3 (a) and the                                                                     relations), and ℰ is the set of edges. Each abstract links to the
similarity between these two entities is 1.0000. It is counter-                                                               sentences which originate from it, and each sentence links to
intuitive for the two entities to have a high degree of similarity,                                                           the components which are extracted from it. Second, the values of
since the former is a kind of fuel while the latter is a type of                                                              sentence and component nodes are preset to 1. Third, set the
car. In our opinion, the main reason for this counter-intuitive similarity is




appropriate number of iterations. For each iteration, the value of           performance of our method. If a method can better identify these
each component node is updated to the sum of the values of the               84 pairs of patents, its performance should be better.
sentence nodes connected to it and the updated values are                          Before comparing the sequence structures, we should judge
standardized by the L2 norm. The value of each sentence node is              the semantic direction according to the type of semantic
updated in the same way. These steps are repeated until the node             relation between the components E(1) and E(2) in a sequence
values are stable. Finally, given that a term                                structure so that they can be correctly matched to the E(1) and E(2) of
occurring a few times in domain-relevant sentences is more likely            another sequence structure. As shown in Table 1, we have
to be domain specific than another occurring many times in some              defined 4 types of semantic directions. If the sequence structures
general sentences, the resulting value of each node is multiplied            are both single-direction, we can match the components E(1) and
by its inverse document frequency (IDF).                                     E(2) between two sequence structures and apply Eq. (3) to
     After that, we can obtain the global importance of each                 calculate the similarity, Eq. (4) otherwise.
component in all patent documents. Thus, the importance of each
sequence structure is the average of the importance of the
corresponding components. To let the weights lie between 0 and 1, the
weights of all the sequence structures in the same patent are
normalized so that their sum is equal to 1.

3.4 Patent similarity assessment
To make full use of all the information in the similarity matrix, the
patent similarity measurement problem is transformed into the well-known
optimal transportation problem [10, 14]. As shown in Figure 5, the
patent distance matrix, obtained as 1 minus the patent similarity
matrix, and the weight vectors are fed to an optimal transportation
problem solver to obtain the shortest distance between two patents. The
similarity of the two patents of interest is equal to 1 minus the
shortest distance.

Table 1: The semantic directions of each relation type.

     No.   Relation Type         Semantic Direction
      1    spatial relation      Undirected
      2    part-of               E(1) ← E(2)
      3    causative relation    E(1) ← E(2)
      4    operation             E(1) ← E(2)
      5    made-of               E(1) ← E(2)
      6    instance-of           E(1) → E(2)
      7    attribution           E(1) ← E(2)
      8    generate              E(1) ← E(2)
      9    purpose               E(1) ← E(2)
     10    in-manner-of          E(1) ← E(2)
     11    alias                 Bidirectional
     12    formation             E(1) → E(2)
     13    comparison            Undirected
     14    measurement           E(1) ← E(2)
     15    others                Undirected
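
The way Table 1 drives the choice between Eq. (3) and Eq. (4) can be illustrated with a short sketch. The direction labels are transcribed from Table 1; everything else is an assumption made for illustration: the function names, the toy entity similarity (exact string match), and the toy relation similarity (1 if the relation types are equal, else 0). It is also assumed that two single-direction structures match straight (E(1) with E(1)) when their directions agree and crossed (E(1) with E(2)) when they differ, as in Example 2.

```python
from typing import Callable, Tuple

Structure = Tuple[str, str, str]  # (E1, relation type, E2)

# Semantic direction of each relation type, transcribed from Table 1.
# "<-" stands for E(1) <- E(2), "->" for E(1) -> E(2).
DIRECTION = {
    "spatial relation": "undirected", "part-of": "<-",
    "causative relation": "<-", "operation": "<-", "made-of": "<-",
    "instance-of": "->", "attribution": "<-", "generate": "<-",
    "purpose": "<-", "in-manner-of": "<-", "alias": "bidirectional",
    "formation": "->", "comparison": "undirected", "measurement": "<-",
    "others": "undirected",
}


def toy_entity_sim(e1: str, e2: str) -> float:
    """Hypothetical placeholder for the aligned-word entity similarity."""
    return 1.0 if e1 == e2 else 0.0


def toy_relation_sim(r1: str, r2: str) -> float:
    """Hypothetical placeholder: relations are similar only if identical."""
    return 1.0 if r1 == r2 else 0.0


def structure_similarity(
        s1: Structure, s2: Structure,
        entity_sim: Callable[[str, str], float] = toy_entity_sim,
        relation_sim: Callable[[str, str], float] = toy_relation_sim) -> float:
    """Similarity of two Entity-Relation-Entity structures, Eq. (3) or (4)."""
    e1a, ra, e2a = s1
    e1b, rb, e2b = s2
    rel = relation_sim(ra, rb)
    straight = (entity_sim(e1a, e1b) + rel + entity_sim(e2a, e2b)) / 3
    crossed = (entity_sim(e1a, e2b) + rel + entity_sim(e2a, e1b)) / 3
    da, db = DIRECTION[ra], DIRECTION[rb]
    if da in ("<-", "->") and db in ("<-", "->"):
        # Both single-direction: Eq. (3), the case chosen by the directions.
        return straight if da == db else crossed
    # Undirected or bidirectional relation involved: Eq. (4).
    return max(straight, crossed)


print(structure_similarity(("A", "part-of", "B"), ("B", "formation", "A")))
```

In the printed call the two relations point in opposite directions, so the crossed case of Eq. (3) applies and the entities match each other while the relation types differ.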

Figure 5: The procedure of calculating the similarity between
two patents.

4 Case Study

4.1 Dataset
To evaluate the performance of our methodology, an annotated
corpus1 by [3] is used in this work. This dataset comes from the thin
film head subfield in the field of hard disk drives. It contains 1,010
patent documents. Note that this dataset contains 84 pairs of
patents that come from the same patent family. That is, the two
patents in each pair have the same abstract and an identical collection
of sequence structures, so they should be more similar to each other
than to other patents. These patents can be used to assess the effectiveness and

4.2 Experiment Setup
In this paper, we use WordNet as the source of word relations to
calculate the semantic similarity of words. Unfortunately, some
words in the dataset are not included in WordNet. To solve this
problem, we apply the “gestalt pattern matching” algorithm [11]
as a supplement, which computes the similarity of two strings as
twice the number of matching characters divided by the total number
of characters in the two strings.
      Our methodology has two parameters that must be preset by
the user. The first is the number of iterations used when
calculating the weight of each sequence structure, and the second
is the gap penalty in the Needleman-Wunsch algorithm.
      As for the number of iterations, one can determine whether the
weights are stable by observing their trend over several
iterations. In our experiment, the weights of components
gradually stabilize after 4 iterations. Thus, the number of
iterations is fixed to 10 in this article, which leaves a margin
beyond the point of stabilization.
      As for the gap penalty, to assess its impact on patent
similarity, we compare multiple values: -0.05, -0.1, -0.15, -0.2
and -0.3. We find that, whichever value is chosen, the word
alignment, patent similarity matrix and patent similarity are not
affected. Hence, the gap penalty is set to -0.05 in this paper.
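
The two ingredients of this setup, a string-based fallback similarity and a global alignment with a gap penalty, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the WordNet-based similarity is omitted, and Python's `difflib.SequenceMatcher` is used as a stand-in for gestalt pattern matching (its `ratio()` returns 2*M/T, where M is the number of matching characters and T the total length of both strings).

```python
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Fallback string similarity: 2 * matches / (len(a) + len(b)).

    Assumption: difflib's matcher stands in for the algorithm of [11].
    """
    return SequenceMatcher(None, a, b).ratio()


def needleman_wunsch(seq_a, seq_b, sim=gestalt_similarity, gap=-0.05):
    """Globally align two word sequences (hypothetical helper).

    Returns the alignment score and the aligned pairs; a gap is
    represented by None.
    """
    n, m = len(seq_a), len(seq_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1]),
                score[i - 1][j] + gap,   # gap in seq_b
                score[i][j - 1] + gap,   # gap in seq_a
            )

    # Trace back from the bottom-right cell to recover the alignment.
    pairs, i, j, eps = [], n, m, 1e-12
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(
            score[i][j]
            - (score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1]))
        ) < eps:
            pairs.append((seq_a[i - 1], seq_b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and abs(score[i][j] - (score[i - 1][j] + gap)) < eps:
            pairs.append((seq_a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, seq_b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


score, pairs = needleman_wunsch(
    ["magnetic", "head"], ["magnetic", "recording", "head"]
)
```

In this toy example the alignment pairs "magnetic" with "magnetic" and "head" with "head", and places "recording" against a gap. Because every candidate alignment of these two sequences needs at least one gap, changing the gap penalty among the values listed above does not change which alignment is chosen here; whether the same holds on the full dataset is the empirical finding reported in the text, not something this sketch demonstrates.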


1 https://github.com/awesome-patent-mining/TFH_Annotated_Dataset
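
The evaluation in Section 4.3 counts how many of the 84 patent-family pairs are recovered among each patent's Top-k most similar patents. A minimal sketch of that count is given below. The function name and the toy data are hypothetical, and one detail is an assumption: a pair is treated as covered when either member appears in the other's Top-k list, which the paper does not state explicitly.

```python
def family_pairs_covered(similarity, family_pairs, k):
    """Count family pairs recovered in the Top-k lists (hypothetical helper).

    similarity   -- square matrix (list of lists) of patent similarities
    family_pairs -- iterable of (a, b) index pairs from the same family
    k            -- number of most similar patents kept per patent
    """
    def top_k(i):
        # Rank all other patents by their similarity to patent i.
        others = [j for j in range(len(similarity)) if j != i]
        others.sort(key=lambda j: similarity[i][j], reverse=True)
        return set(others[:k])

    # Assumption: a pair counts if either member retrieves the other.
    return sum(
        1 for a, b in family_pairs if b in top_k(a) or a in top_k(b)
    )


# Toy data: four patents, two family pairs (0, 1) and (2, 3).
toy_similarity = [
    [1.00, 0.90, 0.20, 0.10],
    [0.90, 1.00, 0.30, 0.40],
    [0.20, 0.30, 1.00, 0.25],
    [0.10, 0.40, 0.25, 1.00],
]
toy_pairs = [(0, 1), (2, 3)]
covered_at_1 = family_pairs_covered(toy_similarity, toy_pairs, k=1)
covered_at_2 = family_pairs_covered(toy_similarity, toy_pairs, k=2)
```

On the toy matrix only the pair (0, 1) is covered at k = 1, while both pairs are covered at k = 2, mirroring how coverage grows from @1 to @5 in the reported results.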




4.3 Experimental results and discussions
To verify the effectiveness and performance of our approach, its
results are compared with the results of Wang et al. [13].
     Figure 6 shows the results of these two approaches. Each
patent is compared with all other patents, and its Top 1 (@1), Top 2
(@2), Top 3 (@3), Top 4 (@4) and Top 5 (@5) most similar patents
are selected to form 5 collections; we then count how many of the
84 pairs of patents are covered by each collection. If we select the
Top 1 most similar patent for each patent, our method recovers 54
pairs of patents that come from the same patent family when the
weights are determined by the weighting algorithm (Section 3.3),
and 58 pairs when all weights are equal. The DWSAO analysis, by
contrast, recognizes none of them. If the Top 2 collection is
considered, our weighted and non-weighted versions contain 70
pairs and 78 pairs respectively, while the DWSAO analysis
identifies only 2 pairs. When we enlarge the collection to the Top 5
most similar patents, the weighted version identifies 81 pairs and
the non-weighted version recognizes all 84 pairs, while only 3
pairs are identified by the DWSAO analysis.

Figure 6: The performance of our approach and DWSAO
method.

     On this task, our patent similarity measurement is clearly
more accurate than the DWSAO approach. In addition, an advantage
of the improved semantic analysis is that the results are not
sensitive to several core parameters. However, the method with
learned weights does not perform as well as the method with equal
weights. In our opinion, the main reason is that the weighting
algorithm considers the importance of each sequence structure in
the global context, not its importance in the local context (i.e.,
each patent). In the near future, a local weighting method will be
investigated.

5 Conclusion
This study proposes an improved semantic analysis for assessing
patent similarity on the basis of entities and semantic relations
(functional and non-functional relations), which takes the semantic
direction of each sequence structure and the word order
information of each component into consideration. Meanwhile, we
introduce a new method to calculate the global importance of each
sequence structure. To verify the effectiveness and performance of
the improved semantic analysis, a case study on patent similarity
measurement in the thin film head subfield of the hard disk drive
field was conducted. The experimental results demonstrate that our
patent similarity measurement is more accurate than the DWSAO
approach. Another advantage of the improved semantic analysis is
that the results are not sensitive to several core parameters.
However, the method with learned weights does not perform as well
as the method with equal weights. In our opinion, the main reason is
that the weighting process considers the importance of each
sequence structure in the global context, not its importance in the
local context (i.e., each patent). In the near future, a local
weighting method will be investigated.

ACKNOWLEDGMENTS
This research received financial support from the Social Science
Foundation of Beijing Municipality (grant number 17GLB074) and
the Natural Science Foundation of Guangdong Province (grant
number 2018A030313695).

REFERENCES
[1] An, J., Kim, K., Mortara, L., & Lee, S. (2018). Deriving technology intelligence
     from patents: Preposition-based semantic analysis. Journal of Informetrics,
     12(1), 217-236. doi:10.1016/j.joi.2018.01.001
[2] Bergmann, I., Butzke, D., Walter, L., Fuerste, J. P., Moehrle, M. G., &
     Erdmann, V. A. (2008). Evaluating the risk of patent infringement by means of
     semantic patent analysis: the case of DNA chips. R&D Management, 38(5),
     550-562.
[3] Chen, L., Xu, S., Zhu, L., Zhang, J., Lei, X., & Yang, G. (2020). A deep
     learning based method for extracting semantic information from patent
     documents. Scientometrics.
[4] Choi, S., Park, H., Kang, D., Lee, J. Y., & Kim, K. (2012). An SAO-based text
     mining approach to building a technology tree for technology planning. Expert
     Systems with Applications, 39(13), 11443-11455.
     doi:10.1016/j.eswa.2012.04.014
[5] Zha, X., & Chen, M. (2010). Study on early warning of competitive technical
     intelligence based on the patent map. Journal of Computers, 5(2).
     doi:10.4304/jcp.5.2.274-281
[6] Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings
     of the 15th International Conference on Machine Learning, pp. 296-304.
[7] Jiang, J. J., & Conrath, D. W. (1997). Semantic Similarity Based on Corpus
     Statistics and Lexical Taxonomy. arXiv: Computation and Language.
[8] Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the
     search for similarities in the amino acid sequence of two proteins. Journal of
     Molecular Biology, 48(3), 443-453.
[9] Park, H., Yoon, J., & Kim, K. (2012). Identifying patent infringement using
     SAO based semantic technological similarities. Scientometrics, 90(2), 515-529.
[10] Rachev, S. T. (1998). In L. Ruschendorf (Ed.), Mass transportation problems:
     Volume I: Theory (probability and its applications). New York, NY: Springer.
[11] Ratcliff, J. W., & Metzener, D. E. (1988). Pattern matching: The gestalt
     approach. Dr. Dobb's Journal, 13(7), 46.
[12] Resnik, P. (1995). Using information content to evaluate semantic similarity in
     a taxonomy. International Joint Conference on Artificial Intelligence, 448-453.
[13] Wang, X., Ren, H., Chen, Y., Liu, Y., Qiao, Y., & Huang, Y. (2019).
     Measuring patent similarity with SAO semantic analysis. Scientometrics,
     121(1), 1-23. doi:10.1007/s11192-019-03191-z
[14] Xu, S., Zhai, D., Wang, F., An, X., Pang, H., & Sun, Y. (2019). A novel method
     for topic linkages between scientific publications and patents. Journal of the
     Association for Information Science and Technology, 70(9), 1026-1042.
     doi:10.1002/asi.24175
[15] Xu, S., Zhu, L., Qiao, X., & Xue, C. (2009). A novel approach for measuring
     Chinese terms semantic similarity based on pairwise sequence alignment. In
     Proceedings of the 5th International Conference on Semantics, Knowledge and
     Grid (pp. 92-98). IEEE.
[16] Yang, Y., Lu, Q., & Zhao, T. (2010). A delimiter-based general approach for
     Chinese term extraction. Journal of the American Society for Information
     Science and Technology, 61(1), 111-125. doi:10.1002/asi.21221
[17] Yoon, J., & Kim, K. (2011). Identifying rapidly evolving technological trends
     for R&D planning using SAO-based semantic patent networks. Scientometrics,
     88(1), 213-228. doi:10.1007/s11192-011-0383-0
[18] Savransky, S. D. (2000). Engineering of creativity: Introduction to TRIZ
     methodology of inventive problem solving. Boca Raton, FL: CRC Press.

