=Paper=
{{Paper
|id=Vol-3604/paper2
|storemode=property
|title=Patent Classification on Search-Optimized Graph-Based Representations
|pdfUrl=https://ceur-ws.org/Vol-3604/paper2.pdf
|volume=Vol-3604
|authors=Jarkko Lagus,Ekaterina Kotliarova,Sebastian Björkqvist
|dblpUrl=https://dblp.org/rec/conf/patentsemtech/LagusKB23
}}
== Patent Classification on Search-Optimized Graph-Based Representations ==
Jarkko Lagus, Ekaterina Kotliarova and Sebastian Björkqvist
IPRally Technologies Oy, Helsinki, Finland

Abstract: Patent documents can be effectively represented using embeddings derived from graphs. These graph-based representations capture the intricate relationships and contextual information within the documents, and the resulting document representations can be further fine-tuned for specific tasks. In this paper, we address the fundamental question of whether search-optimized graph-based document embeddings can be used directly for classification. Traditionally, each distinct task has required its own training pipeline and storage mechanism, increasing complexity and resource consumption. If the same representations can be used effectively for both search and classification, the process can be streamlined and the need to maintain multiple sets of embeddings eliminated. Our results provide evidence that embeddings optimized for a search task can be employed directly for classification, significantly improving efficiency and resource utilization. By repurposing one set of optimized embeddings for both search and classification, we achieve data efficiency and reduce computational overhead without sacrificing classification accuracy. As a result, we present an efficient classification method that removes the complexity of maintaining separate training pipelines and storing multiple representations.

Keywords: classification, patents, document embeddings, patent search

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, July 27th, 2023, Taipei, Taiwan.

== 1. Introduction ==

The categorization of patent documents plays a crucial role in various aspects of strategic decision-making, such as competitor monitoring, portfolio management, and patent landscaping. Document classification itself is a foundational task in natural language processing, and a vast amount of research has been done using both traditional machine learning and deep learning-based approaches [1]. Specifically, in the domain of patent classification, methods such as convolutional neural networks [2] and transformers [3, 4, 5] have been used. Performing document classification for patents manually can be very time-consuming and often requires domain expertise. This means that the amount of labeled data available for training may be small.

As patent documents are already categorized by the patent offices using the International or Cooperative Patent Classification (IPC/CPC) standards, it may be tempting to map these classes directly to the classification task of interest. These classes, however, rarely correlate with the actual business tasks, so a simple mapping from them to the classes of interest often does not work [6].

Due to the discrete nature of certain metrics, direct optimization becomes challenging. Consequently, in various machine learning tasks, a common approach is to solve the main objective by optimizing a substitute target instead. An illustrative case is ranking, where instead of directly solving the discrete ranking problem, we turn it into a problem of optimizing pairwise distances. Inspired by this concept, we investigate the approach of performing classification directly on document embeddings that have been optimized for a search task.

The work presented here is based on the hypothesis that graph-based embeddings optimized for a search task contain rich enough information to be directly applied to a classification task with no additional fine-tuning steps; only training a lightweight classification model on top of the embeddings is needed. This makes classification very efficient, as such models scale well to larger datasets and, at the usual problem scale, can be trained in a few seconds. This enables online training of new classification models on the fly, allowing for quick verification of results and multiple iterations if needed.
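To make the overall idea concrete, the following minimal sketch (not the authors' code) trains a lightweight classifier on top of precomputed embeddings; the random arrays are placeholders whose shapes loosely mirror the 150-dimensional graph embeddings and dataset sizes reported later in the paper.

```python
# Minimal sketch: classification directly on precomputed, search-optimized
# document embeddings. All data here is a random placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(3768, 150)            # precomputed document embeddings
y = np.random.randint(0, 2, size=3768)   # labels for one binary task

clf = LogisticRegression(max_iter=1000).fit(X, y)  # trains in well under a second
probabilities = clf.predict_proba(X[:5])[:, 1]     # positive-class probabilities
```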
== 2. Methodology ==

Representations for text documents can be made in various ways. In this work, our main focus is to investigate the usability of a search-optimized graph-based embedding method, where the patent document is first parsed into an intermediate graph representation that is then turned into an embedding. This is then compared to other common document embedding methods.

=== 2.1. Graph-based representations optimized for search ===

In contrast to traditional methods, such as word embedding or transformer-based approaches, where the whole document is directly encoded into a vector format without task-specific regularization, the graph format adds prior information about the relations between the elements in the document. The idea of the graph is to describe all the relevant technical features of a patent in a concise form that is easily understandable by humans and efficient to process by machines. An example of a patent claim converted to a graph can be seen in Figure 1.

Figure 1: An example of a patent claim describing a snowthrower and the corresponding graph created from the claim. When used in the downstream classification task, these graphs are further encoded as d-dimensional vectors using a graph neural network model.

The details of how the graphs and embeddings are created are described in [7]. In short, the process is the following:

1. Turn the text of a patent document into a graph using a specialized parser, resulting in a collection of nodes and edges.
2. Embed the graph into a vector space using a graph neural network model trained to perform prior art searches for patents.

The parser that converts text to graphs uses the spaCy [8] library to do a linguistic analysis of the text and to detect all nouns and noun chunks in it. The nouns describe the features of the invention and become the nodes of the graph. In Figure 1, examples of nouns and noun chunks are snowthrower, motor, and handle device. After this, the parser detects, using hand-crafted rules, words that indicate relationships between the features of the invention (e.g. comprising, having, containing). These words create the edges of the graph. The endpoints of the edges are found using the output of the earlier linguistic analysis. For instance, in Figure 1 the term comprising results in, among others, an edge between snowthrower and motor. A toy approximation of this parsing step is sketched below.
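The following sketch is a deliberately naive approximation of the text-to-graph step, assuming only spaCy's built-in noun chunking; the production parser described in [7] relies on extensive hand-crafted rules that are not reproduced here, so the output only loosely resembles the graph in Figure 1.

```python
# Toy approximation (not the parser of [7]): noun chunks become nodes, and
# relation words such as "comprising" create an edge between the nearest
# noun chunks on either side of the relation word.
import spacy  # assumes: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
RELATION_WORDS = {"comprising", "having", "containing"}

def claim_to_graph(text):
    doc = nlp(text)
    chunks = list(doc.noun_chunks)          # candidate feature nodes
    nodes = [chunk.text for chunk in chunks]
    edges = []
    for token in doc:
        if token.lower_ in RELATION_WORDS:
            left = [c for c in chunks if c.end <= token.i]    # chunks before the word
            right = [c for c in chunks if c.start > token.i]  # chunks after the word
            if left and right:
                edges.append((left[-1].text, token.lower_, right[0].text))
    return nodes, edges

nodes, edges = claim_to_graph("A snowthrower comprising a motor and a handle device.")
# nodes ~ ['A snowthrower', 'a motor', 'a handle device']
# edges ~ [('A snowthrower', 'comprising', 'a motor')]
```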
This results in a model that and efficient to process by machines. An example of a is useful for searching for prior art for new inventions. patent claim converted to a graph can be seen in Figure The embeddings used for the later classification stage are 1. created from the description graph of the patent docu- The details of how the graphs and embeddings are ment, the description graph including both the claims created are described in [7]. In short, the process is the and the description text of the document. following: 1. Turn the text of a patent document into a graph 2.2. Other document embedding models using a specialized parser, resulting in a collection In order to comparatively measure the effectiveness of of nodes and edges. our embedding method, we conduct the experiments us- 2. Embed the graph into a vector space using a graph ing a few additional models to provide meaningful base- neural network model trained to perform prior lines. For the baseline evaluations, we create document art searches for patents. embeddings using five different methods: TF-IDF em- The parser that converts text to graphs uses the spaCy beddings, two different GloVe [9] embeddings and two [8] library to do a linguistic analysis of the text and to different BERT-based [10, 4, 11] embeddings (see Table 34 Embedding model Dimensionality Dataset Labels Train size Test size Ours [7] 150 Qubit [6] 2 1,124 282 TF-IDF ≈ 33, 000 Mechanical eng. 10 3,768 943 GloVe (Stanford) [9] 300 GloVe (patents) 300 Table 2 BERT (base uncased) [10] 768 Dataset statistics for the datasets used for training and evalu- BERT (patents) [11, 4] 1,024 ation. In both datasets only one document per patent family is preserved to avoid overrepresenting certain families. Table 1 The set of different embeddings used in the experiments. BERT (patents) is the large BERT and GloVe (patents) is the standard GloVe model trained with patent data. using stratified 5-fold cross-validation. In both, binary and multi-label cases, only one threshold is selected. For the multi-label case, the threshold that maximizes the micro-averaged F1 score of all classifiers is chosen. 1 for more details). All the embedding models chosen For the experiments where we limit the data amount, represent conceptually different ways of forming the doc- we first randomly sample 𝑝 percent of data points (with ument embeddings. The embeddings are created using 𝑝 varying from 0.5 to 75) and then follow the same pro- the full text of the patent document, including both the cedure as with the full data case. The sampling is done claims and the description of the document. so that all the models are trained using the same fixed For TF-IDF embeddings we use scikit-learn [12] subset. When training on a subset of data, we repeat the library. To form the document embeddings out of the training process 𝑛 times in order to reduce the amount GloVe embeddings, we use the spaCy [8] library. For the of noise caused by poor train-validation split, where 𝑛 BERT models we use HuggingFace [13] library. Because varies from 2 for the largest subsets to 10 for the smallest of the limitations of input layer size and the length of subsets. patent documents, to form the BERT-based document embeddings, we split the documents into chunks of 100 2.3.2. Model evaluation tokens, and embed each chunk individually. After this, we extract all the separate embeddings and form a mean Evaluations are done using a separate holdout test set vector representation out of these. independent of training data. 
=== 2.3. Classification models ===

As one of the goals is to minimize the training cost of the classification model, we employ simple classification models instead of heavy deep learning models. The only requirement we impose on the model is the ability to output a probability estimate for the input sample belonging to a specific class. For the classification, we use ready-made implementations from the scikit-learn [12] library. The specific models chosen are the basic logistic regression and k-nearest-neighbors classifiers with their default parameters.

==== 2.3.1. Model training ====

We train each model using a training set separated from the full dataset. The input for the models is the document embedding and the output is a probability for each label. In the case of the binary dataset, we train one classifier. In the case of the multi-label dataset, we train one binary classifier for each class following the one-versus-rest strategy, leading to a collection of m separate binary classifiers.

For the experiments with full data, we train one classifier for each dataset-model pair. As the outputs are probabilities, we need to find the optimal cut-off threshold that maximizes the F1 score. This threshold is selected using stratified 5-fold cross-validation. In both the binary and the multi-label case, only one threshold is selected; for the multi-label case, we choose the threshold that maximizes the micro-averaged F1 score over all classifiers.

For the experiments where we limit the amount of data, we first randomly sample p percent of the data points (with p varying from 0.5 to 75) and then follow the same procedure as in the full-data case. The sampling is done so that all the models are trained on the same fixed subset. When training on a subset of the data, we repeat the training process n times in order to reduce the noise caused by a poor train-validation split, where n varies from 2 for the largest subsets to 10 for the smallest subsets.
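A sketch of how the threshold search could look in the binary case, using the stratified 5-fold cross-validation described above; the use of out-of-fold probabilities and the candidate grid are illustrative assumptions:

```python
# Sketch: pick the probability cut-off that maximizes F1, estimated with
# out-of-fold predictions from stratified 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def select_threshold(X, y):
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=folds, method="predict_proba")[:, 1]
    candidates = np.linspace(0.05, 0.95, 19)  # illustrative grid of cut-offs
    return max(candidates, key=lambda t: f1_score(y, proba >= t))
```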
==== 2.3.2. Model evaluation ====

Evaluations are done using a separate holdout test set that is independent of the training data. For evaluation, we calculate the standard F1 scores; in the case of the multi-label dataset, micro averaging is used. We conduct the evaluation on two different datasets, one binary and one multi-label. The binary dataset is the gold-standard Qubit patent dataset [6] and the multi-label dataset is a proprietary dataset from the mechanical engineering patent domain (see Table 2 for details).

The same holdout test set is used for all evaluations, both for the experiments with the full data and for the experiments with subsets of the data. To convert the predicted probabilities into binary predictions we use the optimal threshold selected during the training phase.

== 3. Experiments ==

We experiment with how different choices for embedding the documents (see Table 1 for the list of methods) affect performance. Our main interests are classification accuracy (measured using the F1 score), sample efficiency, and training time. When measuring the training time, we do not include the time required to create the document embeddings or the hyperparameter search, but assume that the embeddings are readily available and the optimal hyperparameters are known.

Table 3: Evaluation results for models trained on the full train set for the Qubit and mechanical engineering datasets for all embedding types and classification models.

  Qubit dataset (binary)
  Embedding type       Model  F1     Time (s)
  BERT (base uncased)  knn    0.854  0.049
  BERT (patents)       knn    0.856  0.065
  GloVe (Stanford)     knn    0.873  0.003
  GloVe (patents)      knn    0.851  0.003
  Ours                 knn    0.860  0.011
  TF-IDF               knn    0.860  1.216
  BERT (base uncased)  lr     0.860  0.135
  BERT (patents)       lr     0.912  0.184
  GloVe (Stanford)     lr     0.844  0.035
  GloVe (patents)      lr     0.842  0.021
  Ours                 lr     0.865  0.021
  TF-IDF               lr     0.868  1.792

  Mechanical engineering dataset (multi-label)
  Embedding type       Model  F1     Time (s)
  BERT (base uncased)  knn    0.655  1.215
  BERT (patents)       knn    0.664  1.561
  GloVe (Stanford)     knn    0.691  0.041
  GloVe (patents)      knn    0.645  0.040
  Ours                 knn    0.775  0.261
  TF-IDF               knn    0.698  53.909
  BERT (base uncased)  lr     0.719  4.448
  BERT (patents)       lr     0.770  5.734
  GloVe (Stanford)     lr     0.681  0.595
  GloVe (patents)      lr     0.717  0.750
  Ours                 lr     0.770  0.374
  TF-IDF               lr     0.752  80.881

=== 3.1. Experiments on embedding performance ===

To measure the overall embedding performance, we look at two factors, overall accuracy measured by the F1 score and the required training time, following the process described in Section 2.3.1. The results are summarized in Table 3.

The F1 scores on the Qubit dataset show much variability between models and embedding methods: for instance, the GloVe (Stanford) embeddings perform the best when using the k-nearest-neighbors (knn in the figures) model but second worst when using the logistic regression (lr in the figures) model. This suggests that the results on the Qubit dataset do not give much information about which model or embedding method works best. For the multi-label dataset, however, the results are more consistent, with our approach reaching the top performance with both models.

From the training times, we can see a direct correlation between the training time and the embedding size. The TF-IDF embeddings are an extreme case of this, requiring over ten times as much training time as any other embedding type. The main reason for the poor training speed with TF-IDF is, however, that the models used do not support sparse training.

=== 3.2. Experiments on sample efficiency ===

To experiment with how well the models perform when data is scarce, i.e. how much data is actually needed to reach reasonable performance, we limited the amount of training data to smaller subsets of specific percentages. The same test set was used here as in the previous experiment on the full data.

Figure 2: The effects of the amount of training data on the Qubit dataset on model performance over different embeddings.

From Figures 2 and 3, we can see that most models start to plateau already when around 30% of the full data is included. On the Qubit dataset, the same lack of clear separation is present when using smaller subsets of the data as was seen in the full-data case: the curves fluctuate over each other, and no clear distinction can be seen between the models.

In the multi-label case, however, clear differences show up: when using 0.5% of the data there is almost a 20-percentage-point difference between the best model (Ours) and the worst (TF-IDF with lr and BERT (patents) with knn). The performance difference between the models decreases as the number of samples increases, but the rankings of the different models mostly stay the same regardless of the amount of data used, with our method reaching the highest scores at virtually all subset sizes.

Figure 3: The effects of the amount of training data on the multi-label mechanical engineering patent dataset on model performance over different embeddings.
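For reference, a sketch of the subset-sampling protocol from Section 2.3.1 that underlies Figures 2 and 3; function and variable names are illustrative, and a real run would need to guard against subsets containing a single class at the smallest percentages:

```python
# Sketch of the sample-efficiency protocol: draw the same fixed p% subset
# for every embedding type (shared seed), train, and average over n repeats.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def subset_score(X_train, y_train, X_test, y_test, p, n_repeats, seed=0):
    scores = []
    for repeat in range(n_repeats):
        rng = np.random.default_rng(seed + repeat)  # same seeds for all models
        size = max(2, int(len(y_train) * p / 100))
        idx = rng.choice(len(y_train), size=size, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        scores.append(f1_score(y_test, clf.predict(X_test)))
    return float(np.mean(scores))
```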
== 4. Conclusions ==

In this paper, we showed that patent classification can be done efficiently on rich graph embeddings optimized for a search task. We evaluated the performance on both a binary and a multi-label dataset and demonstrated that search-optimized embeddings work well with a very limited amount of labeled samples in the multi-label case. In the binary dataset case, the results were inconclusive. We showed that the training of the classification models can be done in less than a second, enabling users to train classifiers in an online fashion. Due to the limited number of datasets available, nothing conclusive can be said about the generalization capabilities of the method, but we believe the result generalizes to any rich-enough embeddings optimized for a search task. Further investigations are, however, needed to say anything conclusive.

== References ==

[1] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, D. Brown, Text classification algorithms: A survey, Information 10 (2019) 150.

[2] S. Li, J. Hu, Y. Cui, J. Hu, DeepPatent: patent classification with convolutional neural networks and word embedding, Scientometrics 117 (2018) 721–744.

[3] J.-S. Lee, J. Hsiang, PatentBERT: Patent classification with fine-tuning a pre-trained BERT model, arXiv preprint arXiv:1906.02124 (2019).

[4] R. Srebrovic, J. Yonamine, Leveraging the BERT algorithm for patents with TensorFlow and BigQuery, 2020. URL: https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf.

[5] H. Bekamiri, D. S. Hain, R. Jurowetzki, PatentSBERTa: a deep NLP based hybrid model for patent distance and classification using augmented SBERT, arXiv preprint arXiv:2103.11933 (2021).

[6] S. Harris, A. Trippe, D. Challis, N. Swycher, Construction and evaluation of gold standards for patent classification—a case study on quantum computing, World Patent Information 61 (2020) 101961.

[7] S. Björkqvist, J. Kallio, Building a graph-based patent search engine, in: 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'23), to appear, 2023. doi:10.1145/3539618.3591842.

[8] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength natural language processing in Python, Zenodo, 2020.

[9] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.

[11] F. Cariaggi, BERT for Patents, 2023. URL: https://huggingface.co/anferico/bert-for-patents.

[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[13] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45.