=Paper=
{{Paper
|id=Vol-3604/paper2
|storemode=property
|title=Patent Classification on Search-Optimized Graph-Based Representations
|pdfUrl=https://ceur-ws.org/Vol-3604/paper2.pdf
|volume=Vol-3604
|authors=Jarkko Lagus,Ekaterina Kotliarova,Sebastian Björkqvist
|dblpUrl=https://dblp.org/rec/conf/patentsemtech/LagusKB23
}}
== Patent Classification on Search-Optimized Graph-Based Representations ==
Jarkko Lagus, Ekaterina Kotliarova and Sebastian Björkqvist
IPRally Technologies Oy, Helsinki, Finland

Abstract: Patent documents can be effectively represented using embeddings derived from graphs. These graph-based representations capture the intricate relationships and contextual information within the documents, and the resulting document representations can be further fine-tuned for specific tasks. In this paper, we address the fundamental question of whether search-optimized graph-based document embeddings can be used directly for classification. Traditionally, each distinct task has required its own training pipeline and storage mechanism, increasing complexity and resource consumption. If the same representations can be used effectively for both search and classification, the process can be streamlined and the need to maintain multiple sets of embeddings eliminated. Our results provide evidence that embeddings optimized for a search task can be employed directly for classification, significantly improving efficiency and resource utilization. By repurposing one set of optimized embeddings for both search and classification, we achieve data efficiency and reduce computational overhead without sacrificing classification accuracy. As a result, we present an efficient classification method that removes the complexity of maintaining separate training pipelines and storing multiple representations.

Keywords: classification, patents, document embeddings, patent search

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, July 27th, 2023, Taipei, Taiwan.

== 1. Introduction ==

The categorization of patent documents plays a crucial role in various aspects of strategic decision-making, such as competitor monitoring, portfolio management, and patent landscaping. Document classification itself is a foundational task in natural language processing, and a vast amount of research has been done using both traditional machine learning and deep learning-based approaches [1]. Specifically, in the domain of patent classification, methods such as convolutional neural networks [2] and transformers [3, 4, 5] have been used. Performing document classification for patents manually can be very time-consuming and often requires domain expertise. This means that the amount of labeled data available for training may be small.

As patent documents are already categorized by the patent offices using the International or Cooperative Patent Classification (IPC/CPC) standards, it may be tempting to map these classes directly to the classification task of interest. These classes, however, rarely correlate with the actual business tasks, so a simple mapping from them to the classes of interest often does not work [6].

Due to the discrete nature of certain metrics, direct optimization becomes challenging. Consequently, in various machine learning tasks, a common approach is to solve the main objective by optimizing a substitute target instead. An illustrative case is ranking, where instead of directly solving the discrete ranking problem, we turn it into a problem of optimizing pairwise distances. Inspired by this concept, we investigate the approach of performing classification directly on document embeddings that have been optimized for a search task.

The work presented here is based on the hypothesis that graph-based embeddings optimized for a search task contain rich enough information to be directly applied to a classification task with no additional fine-tuning steps; only training a lightweight classification model on top of the embeddings is needed. This makes classification very efficient, as such models scale well to larger datasets and, at the usual problem scale, can be trained in a few seconds. This enables online training of new classification models on the fly, allowing for quick verification of results and multiple iterations if needed.
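To make the overall idea concrete, the following minimal sketch (not the authors' code) trains a lightweight classifier on top of precomputed embeddings; the random arrays are placeholders whose shapes loosely mirror the 150-dimensional graph embeddings and dataset sizes reported later in the paper.

```python
# Minimal sketch: classification directly on precomputed, search-optimized
# document embeddings. All data here is a random placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(3768, 150)            # precomputed document embeddings
y = np.random.randint(0, 2, size=3768)   # labels for one binary task

clf = LogisticRegression(max_iter=1000).fit(X, y)  # trains in well under a second
probabilities = clf.predict_proba(X[:5])[:, 1]     # positive-class probabilities
```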
== 2. Methodology ==

Representations for text documents can be made in various ways. In this work, our main focus is to investigate the usability of a search-optimized graph-based embedding method, where the patent document is first parsed into an intermediate graph representation that is then turned into an embedding. This is then compared to other common document embedding methods.

=== 2.1. Graph-based representations optimized for search ===

In contrast to traditional methods, such as word embedding or transformer-based approaches, where the whole document is directly encoded into a vector format without task-specific regularization, the graph format adds prior information about the relations between the elements in the document. The idea of the graph is to describe all the relevant technical features of a patent in a concise form that is easily understandable by humans and efficient to process by machines. An example of a patent claim converted to a graph can be seen in Figure 1.

Figure 1: An example of a patent claim describing a snowthrower and the corresponding graph created from the claim. When used in the downstream classification task, these graphs are further encoded as d-dimensional vectors using a graph neural network model.

The details of how the graphs and embeddings are created are described in [7]. In short, the process is the following:

1. Turn the text of a patent document into a graph using a specialized parser, resulting in a collection of nodes and edges.
2. Embed the graph into a vector space using a graph neural network model trained to perform prior art searches for patents.

The parser that converts text to graphs uses the spaCy [8] library to do a linguistic analysis of the text and to detect all nouns and noun chunks in it. The nouns describe the features of the invention and become the nodes of the graph. In Figure 1, examples of nouns and noun chunks are snowthrower, motor, and handle device. After this, the parser detects, using hand-crafted rules, words that indicate relationships between the features of the invention (e.g. comprising, having, containing). These words create the edges of the graph. The endpoints of the edges are found using the output of the earlier linguistic analysis. For instance, in Figure 1 the term comprising results in, among others, an edge between snowthrower and motor. A toy approximation of this parsing step is sketched below.
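The following sketch is a deliberately naive approximation of the text-to-graph step, assuming only spaCy's built-in noun chunking; the production parser described in [7] relies on extensive hand-crafted rules that are not reproduced here, so the output only loosely resembles the graph in Figure 1.

```python
# Toy approximation (not the parser of [7]): noun chunks become nodes, and
# relation words such as "comprising" create an edge between the nearest
# noun chunks on either side of the relation word.
import spacy  # assumes: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
RELATION_WORDS = {"comprising", "having", "containing"}

def claim_to_graph(text):
    doc = nlp(text)
    chunks = list(doc.noun_chunks)          # candidate feature nodes
    nodes = [chunk.text for chunk in chunks]
    edges = []
    for token in doc:
        if token.lower_ in RELATION_WORDS:
            left = [c for c in chunks if c.end <= token.i]    # chunks before the word
            right = [c for c in chunks if c.start > token.i]  # chunks after the word
            if left and right:
                edges.append((left[-1].text, token.lower_, right[0].text))
    return nodes, edges

nodes, edges = claim_to_graph("A snowthrower comprising a motor and a handle device.")
# nodes ~ ['A snowthrower', 'a motor', 'a handle device']
# edges ~ [('A snowthrower', 'comprising', 'a motor')]
```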
This results in a model that and efficient to process by machines. An example of a is useful for searching for prior art for new inventions. patent claim converted to a graph can be seen in Figure The embeddings used for the later classification stage are 1. created from the description graph of the patent docu- The details of how the graphs and embeddings are ment, the description graph including both the claims created are described in [7]. In short, the process is the and the description text of the document. following: 1. Turn the text of a patent document into a graph 2.2. Other document embedding models using a specialized parser, resulting in a collection In order to comparatively measure the effectiveness of of nodes and edges. our embedding method, we conduct the experiments us- 2. Embed the graph into a vector space using a graph ing a few additional models to provide meaningful base- neural network model trained to perform prior lines. For the baseline evaluations, we create document art searches for patents. embeddings using five different methods: TF-IDF em- The parser that converts text to graphs uses the spaCy beddings, two different GloVe [9] embeddings and two [8] library to do a linguistic analysis of the text and to different BERT-based [10, 4, 11] embeddings (see Table 34 Embedding model Dimensionality Dataset Labels Train size Test size Ours [7] 150 Qubit [6] 2 1,124 282 TF-IDF ≈ 33, 000 Mechanical eng. 10 3,768 943 GloVe (Stanford) [9] 300 GloVe (patents) 300 Table 2 BERT (base uncased) [10] 768 Dataset statistics for the datasets used for training and evalu- BERT (patents) [11, 4] 1,024 ation. In both datasets only one document per patent family is preserved to avoid overrepresenting certain families. Table 1 The set of different embeddings used in the experiments. BERT (patents) is the large BERT and GloVe (patents) is the standard GloVe model trained with patent data. using stratified 5-fold cross-validation. In both, binary and multi-label cases, only one threshold is selected. For the multi-label case, the threshold that maximizes the micro-averaged F1 score of all classifiers is chosen. 1 for more details). All the embedding models chosen For the experiments where we limit the data amount, represent conceptually different ways of forming the doc- we first randomly sample 𝑝 percent of data points (with ument embeddings. The embeddings are created using 𝑝 varying from 0.5 to 75) and then follow the same pro- the full text of the patent document, including both the cedure as with the full data case. The sampling is done claims and the description of the document. so that all the models are trained using the same fixed For TF-IDF embeddings we use scikit-learn [12] subset. When training on a subset of data, we repeat the library. To form the document embeddings out of the training process 𝑛 times in order to reduce the amount GloVe embeddings, we use the spaCy [8] library. For the of noise caused by poor train-validation split, where 𝑛 BERT models we use HuggingFace [13] library. Because varies from 2 for the largest subsets to 10 for the smallest of the limitations of input layer size and the length of subsets. patent documents, to form the BERT-based document embeddings, we split the documents into chunks of 100 2.3.2. Model evaluation tokens, and embed each chunk individually. After this, we extract all the separate embeddings and form a mean Evaluations are done using a separate holdout test set vector representation out of these. independent of training data. 
=== 2.3. Classification models ===

As one of the goals is to minimize the training cost of the classification model, we employ simple classification models instead of heavy deep learning models. The only requirement we impose on the model is the ability to output a probability estimate for the input sample belonging to a specific class. For the classification, we use ready-made implementations from the scikit-learn [12] library. The specific models chosen are the basic logistic regression and k-nearest-neighbors classifiers with their default parameters.

==== 2.3.1. Model training ====

We train each model using a training set separated from the full dataset. The input for the models is the document embedding and the output is a probability for each label. In the case of the binary dataset, we train one classifier. In the case of the multi-label dataset, we train one binary classifier for each class following the one-versus-rest strategy, leading to a collection of m separate binary classifiers.

For the experiments with full data, we train one classifier for each dataset-model pair. As the outputs are probabilities, we need to find the optimal cut-off threshold that maximizes the F1 score. This threshold is selected using stratified 5-fold cross-validation. In both the binary and the multi-label case, only one threshold is selected; for the multi-label case, we choose the threshold that maximizes the micro-averaged F1 score over all classifiers.

For the experiments where we limit the amount of data, we first randomly sample p percent of the data points (with p varying from 0.5 to 75) and then follow the same procedure as in the full-data case. The sampling is done so that all the models are trained on the same fixed subset. When training on a subset of the data, we repeat the training process n times in order to reduce the noise caused by a poor train-validation split, where n varies from 2 for the largest subsets to 10 for the smallest subsets.
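A sketch of how the threshold search could look in the binary case, using the stratified 5-fold cross-validation described above; the use of out-of-fold probabilities and the candidate grid are illustrative assumptions:

```python
# Sketch: pick the probability cut-off that maximizes F1, estimated with
# out-of-fold predictions from stratified 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def select_threshold(X, y):
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=folds, method="predict_proba")[:, 1]
    candidates = np.linspace(0.05, 0.95, 19)  # illustrative grid of cut-offs
    return max(candidates, key=lambda t: f1_score(y, proba >= t))
```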
==== 2.3.2. Model evaluation ====

Evaluations are done using a separate holdout test set that is independent of the training data. For evaluation, we calculate the standard F1 scores; in the case of the multi-label dataset, micro averaging is used. We conduct the evaluation on two different datasets, one binary and one multi-label. The binary dataset is the gold-standard Qubit patent dataset [6] and the multi-label dataset is a proprietary dataset from the mechanical engineering patent domain (see Table 2 for details).

The same holdout test set is used for all evaluations, both for the experiments with the full data and for the experiments with subsets of the data. To convert the predicted probabilities into binary predictions we use the optimal threshold selected during the training phase.

== 3. Experiments ==

We experiment with how different choices for embedding the documents (see Table 1 for the list of methods) affect performance. Our main interests are classification accuracy (measured using the F1 score), sample efficiency, and training time. When measuring the training time, we do not include the time required to create the document embeddings or the hyperparameter search, but assume that the embeddings are readily available and the optimal hyperparameters are known.

Table 3: Evaluation results for models trained on the full train set for the Qubit and mechanical engineering datasets for all embedding types and classification models.

  Qubit dataset (binary)
  Embedding type       Model  F1     Time (s)
  BERT (base uncased)  knn    0.854  0.049
  BERT (patents)       knn    0.856  0.065
  GloVe (Stanford)     knn    0.873  0.003
  GloVe (patents)      knn    0.851  0.003
  Ours                 knn    0.860  0.011
  TF-IDF               knn    0.860  1.216
  BERT (base uncased)  lr     0.860  0.135
  BERT (patents)       lr     0.912  0.184
  GloVe (Stanford)     lr     0.844  0.035
  GloVe (patents)      lr     0.842  0.021
  Ours                 lr     0.865  0.021
  TF-IDF               lr     0.868  1.792

  Mechanical engineering dataset (multi-label)
  Embedding type       Model  F1     Time (s)
  BERT (base uncased)  knn    0.655  1.215
  BERT (patents)       knn    0.664  1.561
  GloVe (Stanford)     knn    0.691  0.041
  GloVe (patents)      knn    0.645  0.040
  Ours                 knn    0.775  0.261
  TF-IDF               knn    0.698  53.909
  BERT (base uncased)  lr     0.719  4.448
  BERT (patents)       lr     0.770  5.734
  GloVe (Stanford)     lr     0.681  0.595
  GloVe (patents)      lr     0.717  0.750
  Ours                 lr     0.770  0.374
  TF-IDF               lr     0.752  80.881

=== 3.1. Experiments on embedding performance ===

To measure the overall embedding performance, we look at two factors, overall accuracy measured by the F1 score and the required training time, following the process described in Section 2.3.1. The results are summarized in Table 3.

The F1 scores on the Qubit dataset show much variability between models and embedding methods: for instance, the GloVe (Stanford) embeddings perform the best when using the k-nearest-neighbors (knn in the figures) model but second worst when using the logistic regression (lr in the figures) model. This suggests that the results on the Qubit dataset do not give much information about which model or embedding method works best. For the multi-label dataset, however, the results are more consistent, with our approach reaching the top performance with both models.

From the training times, we can see a direct correlation between the training time and the embedding size. The TF-IDF embeddings are an extreme case of this, requiring over ten times as much training time as any other embedding type. The main reason for the poor training speed with TF-IDF is, however, that the models used do not support sparse training.

=== 3.2. Experiments on sample efficiency ===

To experiment with how well the models perform when data is scarce, i.e. how much data is actually needed to reach reasonable performance, we limited the amount of training data to smaller subsets of specific percentages. The same test set was used here as in the previous experiment on the full data.

Figure 2: The effects of the amount of training data on the Qubit dataset on model performance over different embeddings.

From Figures 2 and 3, we can see that most models start to plateau already when around 30% of the full data is included. On the Qubit dataset, the same lack of clear separation is present when using smaller subsets of the data as was seen in the full-data case: the curves fluctuate over each other, and no clear distinction can be seen between the models.

In the multi-label case, however, clear differences show up: when using 0.5% of the data there is almost a 20-percentage-point difference between the best model (Ours) and the worst (TF-IDF with lr and BERT (patents) with knn). The performance difference between the models decreases as the number of samples increases, but the rankings of the different models mostly stay the same regardless of the amount of data used, with our method reaching the highest scores at virtually all subset sizes.

Figure 3: The effects of the amount of training data on the multi-label mechanical engineering patent dataset on model performance over different embeddings.
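For reference, a sketch of the subset-sampling protocol from Section 2.3.1 that underlies Figures 2 and 3; function and variable names are illustrative, and a real run would need to guard against subsets containing a single class at the smallest percentages:

```python
# Sketch of the sample-efficiency protocol: draw the same fixed p% subset
# for every embedding type (shared seed), train, and average over n repeats.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def subset_score(X_train, y_train, X_test, y_test, p, n_repeats, seed=0):
    scores = []
    for repeat in range(n_repeats):
        rng = np.random.default_rng(seed + repeat)  # same seeds for all models
        size = max(2, int(len(y_train) * p / 100))
        idx = rng.choice(len(y_train), size=size, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        scores.append(f1_score(y_test, clf.predict(X_test)))
    return float(np.mean(scores))
```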
== 4. Conclusions ==

In this paper, we showed that patent classification can be done efficiently on rich graph embeddings optimized for a search task. We evaluated the performance on both a binary and a multi-label dataset and demonstrated that search-optimized embeddings work well with a very limited amount of labeled samples in the multi-label case. In the binary dataset case, the results were inconclusive. We showed that the training of the classification models can be done in less than a second, enabling users to train classifiers in an online fashion. Due to the limited number of datasets available, nothing conclusive can be said about the generalization capabilities of the method, but we believe the result generalizes to any rich-enough embeddings optimized for a search task. Further investigations are, however, needed to say anything conclusive.

== References ==

[1] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, D. Brown, Text classification algorithms: A survey, Information 10 (2019) 150.

[2] S. Li, J. Hu, Y. Cui, J. Hu, DeepPatent: patent classification with convolutional neural networks and word embedding, Scientometrics 117 (2018) 721–744.

[3] J.-S. Lee, J. Hsiang, PatentBERT: Patent classification with fine-tuning a pre-trained BERT model, arXiv preprint arXiv:1906.02124 (2019).

[4] R. Srebrovic, J. Yonamine, Leveraging the BERT algorithm for patents with TensorFlow and BigQuery, 2020. URL: https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf.

[5] H. Bekamiri, D. S. Hain, R. Jurowetzki, PatentSBERTa: a deep NLP based hybrid model for patent distance and classification using augmented SBERT, arXiv preprint arXiv:2103.11933 (2021).

[6] S. Harris, A. Trippe, D. Challis, N. Swycher, Construction and evaluation of gold standards for patent classification—a case study on quantum computing, World Patent Information 61 (2020) 101961.

[7] S. Björkqvist, J. Kallio, Building a graph-based patent search engine, in: 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'23), to appear, 2023. doi:10.1145/3539618.3591842.

[8] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength natural language processing in Python, Zenodo, 2020.

[9] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.

[11] F. Cariaggi, BERT for Patents, 2023. URL: https://huggingface.co/anferico/bert-for-patents.

[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[13] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45.